Research on the Improvement and Application of a Decision Tree Classification Algorithm
Abstract
As research on data mining theory has deepened, data mining techniques have come to be applied ever more widely and maturely across many industries. Among the many data mining techniques and methods, the decision tree method is one of the most important for data classification and prediction. It is an instance-based inductive learning algorithm: from a set of unordered, irregular instances it infers classification rules in the form of a decision tree, which can then be used to predict unknown data.
ID3 is the most commonly used algorithm for constructing decision trees and is widely applied in data classification and prediction. In practical applications, however, ID3 shows a number of shortcomings. This thesis therefore focuses on the ID3 algorithm, analyzes the advantages and disadvantages of ID3 and its existing improvements, and proposes optimization schemes that give the algorithm better classification performance. The optimizations concern two aspects:
First, the heuristic function of ID3 is simplified. By approximating the information gain formula, the complex logarithmic operations are eliminated, yielding a simplified heuristic function that is general and applies to the multi-class case. The simplified algorithm selects as the test attribute the attribute for which the simplified measure is smallest; since no logarithms need to be evaluated and only elementary operations that a computer handles easily remain, the cost of selecting the best attribute is reduced and the algorithm runs more efficiently.
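For reference, the quantities being simplified are the standard ID3 entropy and information gain shown below. Because Info(S) is fixed at a given node, maximizing Gain(A) is equivalent to minimizing Info_A(S), which is why a simplified criterion can be minimized rather than maximized. The thesis's exact log-free formula is not reproduced in this record; the second display only illustrates the usual first-order approximation that removes the logarithm, as a sketch rather than the author's derivation.

```latex
% Standard ID3 criterion for a sample set S with classes C_1,...,C_m,
% where attribute A splits S into subsets S_1,...,S_v.
\[
\mathrm{Info}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad
\mathrm{Info}_A(S) = \sum_{j=1}^{v} \frac{|S_j|}{|S|}\,\mathrm{Info}(S_j), \qquad
\mathrm{Gain}(A) = \mathrm{Info}(S) - \mathrm{Info}_A(S).
\]
% Illustrative log-free simplification (an assumption, not the thesis's formula):
% the first-order approximation -p\ln p \approx p(1-p) turns the entropy into a
% Gini-style impurity that uses only elementary arithmetic.
\[
-\,p_i \log_2 p_i \;\approx\; \frac{p_i\,(1-p_i)}{\ln 2}
\quad\Longrightarrow\quad
\mathrm{Info}(S) \;\approx\; \frac{1}{\ln 2}\sum_{i=1}^{m} p_i\,(1-p_i).
\]
```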
Second, the multi-value bias of ID3 is addressed. A weight function is introduced to overcome, at its root, ID3's bias toward attributes with many values. The core idea is a monotone weight function of the number of values an attribute takes: different attributes are automatically assigned different weights, which balances the number of attribute values against the information gain and yields a new criterion for selecting the best attribute. Example analysis and comparison with the original algorithm show that the improved ID3 selects more reasonable test attributes, so the rules extracted from the resulting decision tree better match practical needs.
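The exact weight function is not given in this record, so the following is only a minimal Python sketch of the stated idea: compute the ordinary information gain and scale it by a weight that decreases monotonically with the number k of distinct values of the attribute, so that many-valued attributes are penalized. The choice w(k) = 1/log2(k+1) and the helper names are hypothetical; the thesis's function may differ.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy Info(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Standard ID3 information gain of the attribute at attr_index,
    together with the number of distinct values that attribute takes."""
    parent = entropy(labels)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(label)
    n = len(labels)
    children = sum(len(part) / n * entropy(part) for part in parts.values())
    return parent - children, len(parts)

def weighted_gain(rows, labels, attr_index):
    """Gain scaled by a monotone weight of the value count (hypothetical w)."""
    gain, k = info_gain(rows, labels, attr_index)
    weight = 1.0 / math.log2(k + 1)  # decreases as k grows: penalizes many-valued attributes
    return weight * gain

def best_attribute(rows, labels, attr_indices):
    """Select the attribute with the largest weighted gain as the test attribute."""
    return max(attr_indices, key=lambda i: weighted_gain(rows, labels, i))
```

The effect is similar in spirit to the split-information denominator of C4.5's gain ratio, which also counteracts the bias toward many-valued attributes.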
Finally, the optimized ID3 algorithm is applied to a student tuition-renewal decision problem. Following the student classification workflow, a student profile table and a student feedback table are merged into a single data set that serves as the mining sample set for the optimized algorithm; a decision tree is built from it and knowledge rules are extracted. The rules mined from large amounts of student data can help managers make more accurate judgments and decisions and thereby improve the company's performance.
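To make the rule-extraction step concrete, the toy example below fits a tree on a made-up student table and prints its decision rules. The attribute names and values are hypothetical, and scikit-learn's CART-based DecisionTreeClassifier (with entropy splitting) is used only as a stand-in for the modified ID3 algorithm described above.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical student records: attendance, satisfaction, grade_trend -> renewed?
rows = [
    ["high", "satisfied",    "up",   "yes"],
    ["high", "satisfied",    "flat", "yes"],
    ["low",  "satisfied",    "down", "no"],
    ["low",  "dissatisfied", "down", "no"],
    ["high", "dissatisfied", "up",   "yes"],
    ["low",  "dissatisfied", "flat", "no"],
]
X_raw = [r[:3] for r in rows]
y = [r[3] for r in rows]

# Encode the categorical attributes as integers for scikit-learn.
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)

# Entropy-based splitting is the closest built-in analogue to ID3.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Print the fitted tree as IF-THEN style rules.
print(export_text(tree, feature_names=["attendance", "satisfaction", "grade_trend"]))
```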