Construction and Deployment of Decision-Tree-Based Ensemble Classifiers
Abstract
Decision trees are among the most widely used data mining methods; research on them has focused on prediction accuracy, processing efficiency, and dimensionality reduction, and the ability to learn incrementally is also a key feature of decision tree algorithms. SURPASS is such an efficient incremental algorithm: it can process large datasets that exceed main memory, but its efficiency still degrades when it handles massive data. In addition, decision trees select the best split attribute with an impurity measure; when the dataset is large, several attributes may tie for best at each split step, which opens the possibility of building a forest of decision trees over a single dataset. A traditional single classifier cannot meet the demand for high prediction accuracy, and changes in how data are generated, stored, and used also drive continual improvement of classifiers. Some researchers have found that traditional classifiers carry mutually complementary information, which can be exploited to improve classification performance.
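To make the impurity-based split selection and the possibility of ties concrete, the following is a minimal sketch in Python (not taken from the thesis; the dataset and function names are illustrative). It scores two candidate attributes by their Gini impurity reduction; both attributes tie for best, which is exactly the situation that makes growing several different trees, a forest, possible.

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(rows, labels, attr):
    # Impurity reduction achieved by splitting on one categorical attribute.
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    after = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - after

# Toy data in which attributes "a" and "b" tie as the best split:
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 1}]
labels = ["+", "+", "-", "-"]
for attr in ("a", "b"):
    print(attr, gini_gain(rows, labels, attr))  # both print 0.5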
     To address the efficiency problem of SURPASS, this thesis draws on information theory to propose an index based on the amount of information. At each split step of tree construction, the index value is computed for every candidate attribute, and the algorithm can take an attribute with a larger index value as the best split attribute, reducing accesses to data on disk and thereby improving running efficiency. Experimental results show that this method is effective. To give the index a theoretical basis, the thesis derives it by a differential method, obtains two ways of computing it, and points out its advantage in running efficiency. Finally, the thesis implements a random forest with SURPASS as the base classifier and verifies the properties of the random forest through experiments.
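The abstract does not state the formula of the information-amount index, so the sketch below rests on a loud assumption: it treats the index as the Shannon mutual information between a candidate attribute and the class label and picks the attribute with the largest value, which mirrors the selection rule described above but is not the thesis' actual index or SURPASS code. All names are hypothetical.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy H(Y) = -sum p_i * log2(p_i) over the class proportions.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_index(rows, labels, attr):
    # Assumed form of the index: H(Y) - H(Y | attr), i.e. mutual information.
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - cond

def pick_split(rows, labels, attrs):
    # Take the attribute with the largest index value; computing the index
    # from in-memory class counts is what would spare repeated disk scans
    # for an out-of-core learner.
    return max(attrs, key=lambda a: info_index(rows, labels, a))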
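SURPASS itself is not reproduced here, so the skeleton below only illustrates the ensemble mechanics the abstract names: each base tree is trained on a bootstrap sample and predictions are combined by majority vote. The factory make_tree is a placeholder (in the thesis' setting it would wrap SURPASS), not the thesis implementation.

import random
from collections import Counter

class RandomForest:
    # Minimal bagging ensemble with majority voting. make_tree is any
    # factory returning an object with fit(rows, labels) and predict(row).
    def __init__(self, make_tree, n_trees=10, seed=0):
        self.make_tree = make_tree
        self.n_trees = n_trees
        self.rng = random.Random(seed)
        self.trees = []

    def fit(self, rows, labels):
        n = len(rows)
        for _ in range(self.n_trees):
            # Bootstrap sample: draw n indices with replacement.
            idx = [self.rng.randrange(n) for _ in range(n)]
            tree = self.make_tree()
            tree.fit([rows[i] for i in idx], [labels[i] for i in idx])
            self.trees.append(tree)

    def predict(self, row):
        # Majority vote over the base trees' predictions.
        votes = Counter(t.predict(row) for t in self.trees)
        return votes.most_common(1)[0][0]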