Variable Transformation for Granularity Change in Hierarchical Databases in Actual Data Mining Solutions
  • Keywords: Granularity transformation; Relational data mining; School quality assessment; Educational decision support system; CRISP-DM; Domain-driven data mining; Logistic regression; Ten-fold cross-validation
  • Journal: Lecture Notes in Computer Science
  • Publication year: 2015
  • Volume: 9375
  • Issue: 1
  • Pages: 146-155
  • Full-text size: 196 KB
  • Author and affiliation: Paulo J. L. Adeodato, Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
  • Series title: Intelligent Data Engineering and Automated Learning – IDEAL 2015
  • ISBN:978-3-319-24834-9
  • Publication category: Computer Science
  • Publication subjects: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN:1611-3349
Abstract
This paper presents a variable transformation strategy for enriching the information content of variables and for defining the project target in real-world data mining applications built on relational databases whose data sit at different grains. In a deployed solution for assessing school quality from official school survey data and student test data, variables at the student and teacher grains had to become features of the schools they belonged to. The formal problem was how to summarize the relevant information content of the attribute distributions in a few summarizing concepts (features). Instead of the typical lowest-order distribution moments, the proposed transformations, based on the distribution histogram, produced a weighted score for the input variables. Following the CRISP-DM method, the problem was precisely defined as a binary decision problem on a granularly transformed student grade. The proposed granular transformation embedded additional human expert knowledge in the input variables at the school level. Logistic regression produced a classification score for good schools, and the AUC_ROC and Max_KS metrics assessed that score's performance on statistically independent datasets. A 10-fold cross-validation experimental procedure showed that this domain-driven data mining approach produced a statistically significant improvement, at a 0.99 confidence level, over the usual approach based on the central tendency of the distribution.
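
The abstract does not spell out the transformation itself, so the following is only a minimal Python sketch of the general idea, not the author's method. It rolls student-level grades up to one feature per school by weighting the relative frequencies of a histogram, and it computes a Max_KS value using the standard definition (the largest gap between the cumulative score distributions of the two classes). The function names, the bin cut points, and the bin weights are all hypothetical; in the paper the weighting is said to carry expert knowledge, which is not reproduced here.

```python
from bisect import bisect_right
from collections import defaultdict


def histogram_weighted_score(values, bin_edges, bin_weights):
    """Summarize a distribution as a weighted sum of histogram relative frequencies.

    bin_edges are interior cut points (assumed, not from the paper), so there
    must be exactly len(bin_edges) + 1 weights, one per bin.
    """
    if len(bin_weights) != len(bin_edges) + 1:
        raise ValueError("need exactly one weight per histogram bin")
    if not values:
        raise ValueError("cannot summarize an empty group")
    counts = [0] * len(bin_weights)
    for v in values:
        # A value equal to a cut point falls into the bin to its right.
        counts[bisect_right(bin_edges, v)] += 1
    n = len(values)
    return sum(w * c / n for w, c in zip(bin_weights, counts))


def to_school_grain(student_rows, bin_edges, bin_weights):
    """Turn (school_id, grade) rows at the student grain into one score per school."""
    grades_by_school = defaultdict(list)
    for school_id, grade in student_rows:
        grades_by_school[school_id].append(grade)
    return {
        school_id: histogram_weighted_score(grades, bin_edges, bin_weights)
        for school_id, grades in grades_by_school.items()
    }


def max_ks(scores, labels):
    """Largest absolute gap between the score CDFs of class 1 and class 0."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all tied scores before comparing the two CDFs.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            i += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best


# Hypothetical toy data: grades on a 0-10 scale for two schools.
rows = [("A", 3.0), ("A", 8.0), ("A", 9.0), ("A", 6.0),
        ("B", 2.0), ("B", 4.0), ("B", 5.0), ("B", 7.0)]
# Hypothetical weighting: penalize low grades, reward high grades.
school_scores = to_school_grain(rows, bin_edges=[5.0, 7.0],
                                bin_weights=[-1.0, 0.0, 1.0])
```

With these toy values, school A scores 0.25 and school B scores -0.25. Unlike a plain mean, the weights let an analyst decide how much each region of the distribution should count, which is one way a school-level feature could absorb domain knowledge.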
