Research and Optimization of Decision Tree Algorithms in Data Mining
Abstract
Data classification is an important method in data mining. By analyzing a training set, classification produces a precise description or model of each class; such class descriptions can then be used to classify future data and have broad application prospects. Commonly used classification-rule mining methods include decision tree methods, Bayesian classification, genetic algorithms, and rough set theory.
     Among these methods, the decision tree algorithm is simple to describe and easy to translate into classification rules, but it often fails to find the globally optimal solution. The genetic algorithm can handle highly complex problems with large search spaces, multiple peaks, and non-linearity, but it suffers from premature convergence to local minima. This thesis therefore proposes a classification decision tree method based on a hybrid genetic and simulated annealing algorithm (the GSDA algorithm). GSDA introduces the genetic algorithm into existing classification decision tree mining, yielding a new algorithm based on hybrid genetic simulated annealing. For tree encoding, it replaces the commonly used binary encoding with direct encoding of the decision tree, improving computational accuracy. At the same time, GSDA adopts the idea of hybrid optimization to remedy the local-optimum problem of common algorithms. A corresponding fitness function and a suitable pruning operation are also proposed, so that the mined rules are more accurate and the overall algorithm is simpler and easier to understand.
     In the subsequent preliminary experiments, four different databases are used: the weather database, the Cleveland database, the Heart Disease database, and the Breast Cancer-W database. The experimental results of the GSDA algorithm are compared with those of the classic ID3 algorithm, and better results are obtained.
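The hybrid genetic/simulated annealing idea described above can be illustrated with a minimal sketch. This is not the thesis's actual implementation: the fitness function, mutation operator, and parameter names here are illustrative stand-ins, and a full GSDA-style algorithm would apply such an annealing-style acceptance step to decision-tree individuals inside a GA generation loop.

```python
import math
import random

def anneal_accept(curr_fit, new_fit, temperature):
    """Metropolis criterion: always accept improvements; accept a
    worse candidate with probability exp(delta / T), which lets the
    search escape local optima (the weakness of a plain greedy GA)."""
    if new_fit >= curr_fit:
        return True
    return random.random() < math.exp((new_fit - curr_fit) / temperature)

def hybrid_search(init, fitness, mutate, t0=1.0, cooling=0.95, steps=200):
    """Simulated-annealing-style refinement of a single candidate;
    the temperature decays geometrically so early exploration gives
    way to late-stage greedy hill climbing."""
    curr, temp = init, t0
    for _ in range(steps):
        cand = mutate(curr)
        if anneal_accept(fitness(curr), fitness(cand), temp):
            curr = cand
        temp *= cooling  # gradually stop accepting worse moves
    return curr
```

For example, maximizing the toy fitness f(x) = -(x - 3)^2 with a Gaussian perturbation as the mutation operator drives the candidate toward the global optimum at x = 3.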
Data classification has become one of the important research areas of data mining. Classification generates a precise description or model of a predetermined set of data classes or concepts by analyzing the features of a group of training data objects. These models can then be used to classify future data objects, which gives the method good application prospects. The most popular classification methods at present include decision tree methods, Bayesian classification, genetic algorithms, and rough set theory.
     Among these methods, the decision tree algorithm is simple in description and easy to translate into classification rules; however, it can hardly find the globally optimal solution. Although the genetic algorithm can solve problems with huge search spaces, multiple peaks, and non-linearity, it has the drawback of premature convergence to local optima. Therefore, a classification rule mining method called GSDA, based on a hybrid genetic and simulated annealing algorithm, is proposed. This algorithm introduces a direct tree encoding method to improve accuracy. Meanwhile, it adopts hybrid optimization to overcome the problem of local optima. We also improve the fitness function and the pruning operation, so that the mined rules are more accurate and the algorithm is simpler and easier to understand. All of these points are examined in the following experiments.
     We use four different databases (the weather database, the Cleveland database, the Heart Disease database, and the Breast Cancer-W database) to compare the results of the GSDA algorithm with those of the classic ID3 algorithm. The results show that the GSDA algorithm performs better than the ID3 algorithm.
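For context on the baseline, ID3 grows a tree by splitting on the attribute with the greatest information gain at each node. A minimal sketch of that criterion (standard textbook definitions, not code from the thesis):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label sequence, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting `rows` on one attribute;
    ID3 picks the attribute that maximizes this at each node."""
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return base - remainder
```

On the classic 14-instance weather dataset (9 "play" / 5 "don't play"), the gain of the outlook attribute works out to about 0.247 bits, which is why ID3 chooses it as the root split.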