Research on Optimization of the ID3 Algorithm and Its Application in a Component Library
Abstract
With the rapid development of information technology and the diversification of data-acquisition methods, every industry continuously accumulates massive amounts of data. Faced with this vast sea of data, how to make better use of these resources and uncover the information and knowledge hidden behind them has become a problem of wide concern in the business domain. Driven by these practical needs, data mining technology emerged and has flourished in every field of social life. Among the many data mining techniques and methods, the decision tree method for data classification is an important research topic in the data mining field.

The ID3 (Iterative Dichotomiser 3) algorithm is one of the most commonly used decision tree methods and, owing to its many strengths, is widely applied in machine learning. In practical applications, however, ID3 has been found to have many shortcomings. This thesis therefore studies the ID3 algorithm in depth, analyzes the advantages and disadvantages of ID3 and its improved variants, and proposes reasonable optimization schemes in two areas, "simplifying the heuristic function of ID3" and "resolving ID3's bias toward multi-valued attributes," in order to improve the algorithm. First, the thesis simplifies ID3's attribute-selection criterion by approximation, eliminating its costly logarithm operations and arriving at a simplified heuristic function that is general and applicable to multi-class problems. The simplified ID3 algorithm selects as the test attribute the one that minimizes this simplified measure; because the computation avoids logarithms and involves only basic arithmetic operations that computers handle easily, the cost of selecting the optimal attribute is reduced and the algorithm's execution efficiency improves. Second, the thesis introduces the concept of a balance function to overcome ID3's multi-value bias at its root. The core idea is to introduce a monotone balance function based on the number of attribute values, which balances the number of values against the information gain and yields a new criterion for selecting the optimal attribute. Case analysis and algorithm comparison show that the improved ID3 algorithm selects more reasonable test attributes, so the rules extracted from the resulting decision tree better match users' actual needs.
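The two optimizations described above can be sketched in code. The thesis does not reproduce its exact formulas here, so the log-free surrogate (a pairwise-product form that a first-order approximation of the logarithm commonly yields) and the balance function log2(m) + 1 are illustrative assumptions, not the thesis's exact definitions:

```python
import math
from collections import Counter

def information_gain(rows, attr, label):
    """Standard ID3 criterion: entropy reduction from splitting on attr.
    rows is a list of dicts; attr and label are keys into each row."""
    def entropy(items):
        n = len(items)
        counts = Counter(r[label] for r in items)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())
    n = len(rows)
    split = {}
    for r in rows:
        split.setdefault(r[attr], []).append(r)
    cond = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(rows) - cond

def simplified_criterion(rows, attr, label):
    """Log-free surrogate for conditional entropy, to be MINIMIZED.
    This multi-class pairwise-product form (sum over branches of the
    pairwise products of class counts, divided by branch size) is an
    illustrative stand-in for the thesis's simplified heuristic."""
    split = {}
    for r in rows:
        split.setdefault(r[attr], []).append(r)
    total = 0.0
    for part in split.values():
        counts = list(Counter(r[label] for r in part).values())
        pairs = sum(counts[i] * counts[j]
                    for i in range(len(counts))
                    for j in range(i + 1, len(counts)))
        total += pairs / len(part)
    return total

def balanced_gain(rows, attr, label):
    """Damp the bias toward many-valued attributes by dividing the gain
    by a monotone function of the number of values m; log2(m) + 1 is an
    illustrative choice of balance function."""
    m = len({r[attr] for r in rows})
    return information_gain(rows, attr, label) / (math.log2(m) + 1)

# Toy sample: "outlook" separates the classes better than "windy".
rows = [
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "rain",     "windy": "yes", "play": "no"},
    {"outlook": "rain",     "windy": "no",  "play": "yes"},
]
```

Note that the simplified criterion ranks attributes the same way as information gain here while using only multiplication and division, which is the efficiency argument the thesis makes.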
Finally, the thesis demonstrates the application of the optimized ID3 algorithm in a component library through a case study. Following the algorithm's application workflow in the component library, a new data set formed by integrating the component information table with the user feedback table serves as the mining sample set for the optimized ID3 algorithm; a decision tree is built from it, and component-reuse rules are extracted. The knowledge rules mined from behind a large number of components can help component reusers better understand and select components, saving users' decision time.
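The workflow above (mine a decision tree from the merged tables, then read each root-to-leaf path off as an IF-THEN reuse rule) can be sketched as follows. The field names ("type", "rating", "reused") are hypothetical stand-ins for the thesis's component and feedback tables, and the tree builder is plain ID3 rather than the thesis's optimized variant:

```python
import math
from collections import Counter

def entropy(rows, label):
    n = len(rows)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(r[label] for r in rows).values())

def best_attribute(rows, attrs, label):
    """Pick the attribute with maximal information gain (plain ID3)."""
    def gain(a):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r)
        cond = sum(len(g) / len(rows) * entropy(g, label)
                   for g in groups.values())
        return entropy(rows, label) - cond
    return max(attrs, key=gain)

def build_tree(rows, attrs, label):
    labels = {r[label] for r in rows}
    if len(labels) == 1:
        return labels.pop()                                   # pure leaf
    if not attrs:                                             # majority leaf
        return Counter(r[label] for r in rows).most_common(1)[0][0]
    a = best_attribute(rows, attrs, label)
    groups = {}
    for r in rows:
        groups.setdefault(r[a], []).append(r)
    return {(a, v): build_tree(g, [x for x in attrs if x != a], label)
            for v, g in groups.items()}

def extract_rules(tree, path=()):
    """Each root-to-leaf path becomes one IF-THEN reuse rule."""
    if not isinstance(tree, dict):
        cond = " AND ".join(f"{a} = {v}" for a, v in path) or "TRUE"
        return [f"IF {cond} THEN {tree}"]
    rules = []
    for (a, v), sub in tree.items():
        rules += extract_rules(sub, path + ((a, v),))
    return rules

# Hypothetical sample set: component info joined with user feedback.
samples = [
    {"type": "GUI",  "rating": "high", "reused": "yes"},
    {"type": "GUI",  "rating": "low",  "reused": "no"},
    {"type": "math", "rating": "high", "reused": "yes"},
    {"type": "math", "rating": "low",  "reused": "yes"},
]
tree = build_tree(samples, ["type", "rating"], "reused")
rules = extract_rules(tree)
```

A reuser consulting such rules can see at a glance which combinations of component attributes and feedback predict successful reuse, which is the decision-time saving the thesis describes.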
