Research on Attribute Reduction and Classification Algorithms in Data Mining
Abstract
Data mining is the process of extracting implicit, potentially useful information from large volumes of data. As a relatively new data-analysis technology, it has been widely applied in fields such as finance, insurance, government, education, transportation, and national defense. Rough set theory, proposed by the Polish mathematician Z. Pawlak in 1982, is a mathematical tool for handling vague and uncertain knowledge. Building on rough set theory, this dissertation focuses on two core problems in data mining: attribute reduction and classification. Taking information systems as the object of study, it examines the theory and methods of attribute reduction algorithms under the classical rough set model for complete information systems, points out their shortcomings, and proposes an improved rough-set-based attribute reduction algorithm. It then analyzes the traditional decision tree algorithm through examples, identifies its problems, and proposes an improved algorithm: WMAS, a decision tree construction algorithm based on weighted mean attribute significance. The main research results are as follows:
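Rough set theory approximates a target concept by the equivalence classes of objects that are indiscernible on a chosen attribute set. The idea can be sketched as follows; this is a toy illustration, not the thesis's code, and the information system, object names, and attribute names are invented:

```python
def partition(objects, attrs):
    """Group objects into equivalence classes by their values on attrs."""
    blocks = {}
    for name, row in objects.items():
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, set()).add(name)
    return list(blocks.values())

def approximations(objects, attrs, target):
    """Pawlak lower/upper approximations of the set `target` w.r.t. attrs."""
    lower, upper = set(), set()
    for block in partition(objects, attrs):
        if block <= target:      # block certainly inside the concept
            lower |= block
        if block & target:       # block possibly inside the concept
            upper |= block
    return lower, upper

# Invented complete information system: 4 objects, 2 condition attributes.
U = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 0},
    "x4": {"a": 1, "b": 1},
}
X = {"x1", "x3"}                 # target concept
low, up = approximations(U, ["a", "b"], X)
```

Here x1 and x2 are indiscernible on {a, b}, so x1 can only belong to the upper approximation of X, while x3 sits in the lower approximation.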
     1. Based on a study of the attribute significance measures used in various heuristic attribute reduction algorithms, the concept of weighted mean attribute significance is proposed. This measure considers both an attribute's importance to the decision classification and its importance among the condition attributes.
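The classical dependency-based significance that such measures build on can be sketched as below. The exact weighting formula of the thesis is not reproduced here; the toy decision table is invented for illustration:

```python
def partition(objects, attrs):
    """Equivalence classes of objects indiscernible on attrs."""
    blocks = {}
    for name, row in objects.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(name)
    return list(blocks.values())

def dependency(objects, cond_attrs, dec_attr):
    """gamma(C, D): fraction of objects whose condition class is
    consistent, i.e. carries a single decision value."""
    pos = set()
    for block in partition(objects, cond_attrs):
        if len({objects[x][dec_attr] for x in block}) == 1:
            pos |= block
    return len(pos) / len(objects)

def significance(objects, cond_attrs, dec_attr, a):
    """Classical significance: drop in dependency when a is removed."""
    rest = [c for c in cond_attrs if c != a]
    return (dependency(objects, cond_attrs, dec_attr)
            - dependency(objects, rest, dec_attr))

# Invented decision table: the decision d happens to equal condition a,
# so a is indispensable and b is redundant.
U = {
    "x1": {"a": 0, "b": 0, "d": 0},
    "x2": {"a": 0, "b": 1, "d": 0},
    "x3": {"a": 1, "b": 0, "d": 1},
    "x4": {"a": 1, "b": 1, "d": 1},
}
```

On this table, removing a drops the dependency from 1 to 0, while removing b leaves it unchanged.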
     2. Efficient attribute reduction has long been an important topic in rough set research. Since finding the optimal reduct has been proved to be an NP-hard problem, current research concentrates on sub-optimal solutions. This dissertation first reviews the classical rough-set reduction algorithms and then presents an improved attribute reduction algorithm that considers not only attribute significance but also the information content of each attribute. The algorithm obtains a reduct of the information system without computing the core, which decreases the amount of computation and increases speed.
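A greedy forward-selection reduction in this spirit, which adds the attribute that most increases the dependency until it matches that of the full attribute set and never computes the core, might look like the following sketch. The thesis's actual scoring also weighs each attribute's information content, which this toy version omits, and the decision table is invented:

```python
def partition(objects, attrs):
    """Equivalence classes of objects indiscernible on attrs."""
    blocks = {}
    for name, row in objects.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(name)
    return list(blocks.values())

def dependency(objects, cond_attrs, dec_attr):
    """gamma(C, D): fraction of objects in a consistent condition class."""
    pos = set()
    for block in partition(objects, cond_attrs):
        if len({objects[x][dec_attr] for x in block}) == 1:
            pos |= block
    return len(pos) / len(objects)

def greedy_reduct(objects, cond_attrs, dec_attr):
    """Grow a reduct greedily: at each step add the attribute that
    raises gamma the most, stopping once the dependency of the full
    condition set is reached. No core computation is needed."""
    target = dependency(objects, cond_attrs, dec_attr)
    red = []
    while dependency(objects, red, dec_attr) < target:
        best = max((a for a in cond_attrs if a not in red),
                   key=lambda a: dependency(objects, red + [a], dec_attr))
        red.append(best)
    return red

# Invented decision table where d equals a, so {a} alone is a reduct.
U = {
    "x1": {"a": 0, "b": 0, "d": 0},
    "x2": {"a": 0, "b": 1, "d": 0},
    "x3": {"a": 1, "b": 0, "d": 1},
    "x4": {"a": 1, "b": 1, "d": 1},
}
```

Because the greedy loop stops as soon as the dependency target is met, the redundant attribute b is never added.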
     3. A study of classical decision tree construction based on information entropy shows that it suffers from two main problems: some subtrees appear repeatedly within a tree, and some attributes are tested multiple times along a single path of the tree. To overcome these defects, an attribute selection criterion based on weighted mean attribute significance is proposed, and the decision tree construction algorithm WMAS built on this criterion is implemented. The method reduces complexity and improves classification accuracy.
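For comparison, the classical entropy-based split selection that WMAS replaces can be sketched as below. The training rows and attribute names are invented, and the WMAS criterion itself is only indicated in a comment, since its exact formula is not reproduced here:

```python
import math
from collections import Counter

def entropy(rows, dec):
    """Shannon entropy of the decision column over the given rows."""
    counts = Counter(r[dec] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, dec):
    """Entropy reduction obtained by splitting the rows on attr."""
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / n * entropy(subset, dec)
    return entropy(rows, dec) - remainder

def best_split(rows, attrs, dec):
    # ID3 picks the attribute with the highest information gain; the
    # WMAS algorithm would instead rank the candidates by their
    # weighted mean attribute significance.
    return max(attrs, key=lambda a: info_gain(rows, a, dec))

# Invented toy training set: "outlook" fully determines "play",
# while "windy" carries no information about it.
rows = [
    {"outlook": "sunny", "windy": 0, "play": 0},
    {"outlook": "sunny", "windy": 1, "play": 0},
    {"outlook": "rain",  "windy": 0, "play": 1},
    {"outlook": "rain",  "windy": 1, "play": 1},
]
```

Swapping the ranking function is the only change the tree builder needs, which is why replacing the criterion leaves the rest of the construction procedure intact.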
     The proposed algorithms are verified through worked examples and experiments.

Keywords: Weighted Mean Attribute Significance; Attribute Reduction; Decision Tree
