Abstract
This paper presents a variable transformation strategy for enriching the information content of variables and defining the project target in real-world data mining applications based on relational databases with data at different grains. In a real solution for assessing school quality from official school survey data and student test data, variables at the student and teacher grains had to become features of the schools they belonged to. The formal problem was how to summarize the relevant information content of the attribute distributions in a few summarizing concepts (features). Instead of the typical lowest-order moments of the distribution, the proposed transformations, based on the distribution histogram, produced a weighted score for each input variable. Following the CRISP-DM method, the problem was precisely defined as a binary decision problem on a granularly transformed student grade. The proposed granular transformation embedded additional human expert knowledge in the input variables at the school level. Logistic regression produced a classification score for good schools, and AUC_ROC and Max_KS assessed the performance of that score on statistically independent datasets. A 10-fold cross-validation experimental procedure showed that this domain-driven data mining approach produced a statistically significant improvement, at a 0.99 confidence level, over the usual approach based on the central tendency of the distribution.
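As a rough illustration of the histogram-based idea, the sketch below aggregates a categorical student-grain variable into one weighted score per school, and computes a maximum Kolmogorov-Smirnov gap for a binary classification score. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `histogram_score` and `max_ks`, the category labels, and the weight values are all hypothetical, and Max_KS is assumed here to mean the standard two-sample KS statistic between the score distributions of the two classes.

```python
from collections import Counter, defaultdict


def histogram_score(records, weights):
    """Aggregate student-grain categories into one score per school.

    records: iterable of (school_id, category) pairs at the student grain.
    weights: dict mapping category -> weight (hypothetical values standing
             in for expert-assigned weights).
    Returns a dict school_id -> sum of weight * relative frequency.
    """
    counts = defaultdict(Counter)
    for school_id, category in records:
        counts[school_id][category] += 1

    scores = {}
    for school_id, hist in counts.items():
        total = sum(hist.values())
        # Weighted sum over the normalized histogram of the school.
        scores[school_id] = sum(weights[c] * n / total for c, n in hist.items())
    return scores


def max_ks(scores, labels):
    """Largest gap between the cumulative score distributions of the classes.

    scores: list of classifier scores; labels: list of 0/1 class labels.
    Assumes both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ordered = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (score, label) in enumerate(ordered):
        if label:
            cum_pos += 1
        else:
            cum_neg += 1
        # Compare the cumulative distributions only after the last tied score.
        if i + 1 == len(ordered) or ordered[i + 1][0] != score:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best


# Hypothetical student-grain data: (school, parental education level).
records = [
    ("A", "high"), ("A", "high"), ("A", "low"), ("A", "mid"),
    ("B", "low"), ("B", "low"), ("B", "mid"), ("B", "high"),
]
# Hypothetical weights; in practice these would come from domain knowledge.
weights = {"low": -1.0, "mid": 0.0, "high": 2.0}

school_scores = histogram_score(records, weights)
print(school_scores)
print(max_ks([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
```

Unlike a mean or variance, the weighted histogram lets each region of the distribution contribute differently to the school-level feature, which is where expert knowledge can enter through the choice of weights.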