Research on an Ensemble KNN Classifier Based on Distance Learning
Abstract
In recent years, data mining has attracted great attention in the information industry, chiefly because vast amounts of widely applicable data are available and there is a pressing need to turn these data into useful information and knowledge. Such information and knowledge can be applied in many fields, including business management, production control, market analysis, engineering design, and scientific discovery.
     This thesis focuses on one branch of data mining, classification. It combines an ensemble method with an improved classification algorithm to design an ensemble KNN classifier based on distance learning. The classifier first filters the attributes of the data set: the information gain of every attribute is computed on the training set, and attributes whose information gain falls below a given threshold are discarded as irrelevant. Component classifiers are then built with Bagging: on one hand, bootstrap sampling draws random samples from the filtered training set to build multiple component classifiers; on the other hand, the input attributes of each component classifier are in turn randomly pruned. Perturbing the input attributes in this way preserves the accuracy of the component classifiers while increasing the diversity among them. Each component classifier then applies a distance-learning KNN algorithm whose distance metric is learned by Neighbourhood Components Analysis (NCA). Finally, majority voting combines the component results into the final decision.
     Experimental results show that the new classifier achieves substantially higher accuracy than either a plain ensemble KNN classifier or a plain distance-learning KNN classifier.
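
     As a rough, illustrative sketch of this pipeline (not the thesis's own implementation), the Python code below wires the four steps together: an information-gain filter, Bagging with bootstrap sampling plus random attribute subsets, NCA-based distance learning inside each component KNN, and majority voting. It assumes scikit-learn (which provides NeighborhoodComponentsAnalysis and KNeighborsClassifier) and NumPy; the names information_gain, info_gain_filter, and EnsembleNcaKnn, and all parameter defaults, are hypothetical choices for illustration. Class labels are assumed to be integer-encoded, and the information-gain step assumes discrete attribute values.

     import numpy as np
     from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
     from sklearn.pipeline import make_pipeline

     def information_gain(x, y):
         """Information gain of one discrete attribute x with respect to labels y."""
         def entropy(labels):
             _, counts = np.unique(labels, return_counts=True)
             p = counts / counts.sum()
             return -np.sum(p * np.log2(p))
         h_y = entropy(y)
         h_y_given_x = 0.0
         for v in np.unique(x):          # weighted entropy of y within each value of x
             mask = x == v
             h_y_given_x += mask.mean() * entropy(y[mask])
         return h_y - h_y_given_x

     def info_gain_filter(X, y, threshold):
         """Indices of attributes whose information gain reaches the threshold."""
         gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
         return np.where(gains >= threshold)[0]

     class EnsembleNcaKnn:
         """Bagged NCA+KNN: bootstrap samples, random attribute subsets, majority vote."""

         def __init__(self, n_estimators=10, feature_fraction=0.7, n_neighbors=5, seed=0):
             self.n_estimators = n_estimators
             self.feature_fraction = feature_fraction
             self.n_neighbors = n_neighbors
             self.rng = np.random.default_rng(seed)

         def fit(self, X, y):
             n, d = X.shape
             k = max(1, int(self.feature_fraction * d))
             self.learners_ = []
             for _ in range(self.n_estimators):
                 rows = self.rng.integers(0, n, size=n)            # bootstrap sample
                 cols = self.rng.choice(d, size=k, replace=False)  # attribute perturbation
                 model = make_pipeline(
                     NeighborhoodComponentsAnalysis(random_state=0),   # learn the metric
                     KNeighborsClassifier(n_neighbors=self.n_neighbors),
                 )
                 model.fit(X[np.ix_(rows, cols)], y[rows])
                 self.learners_.append((cols, model))
             return self

         def predict(self, X):
             votes = np.array([m.predict(X[:, cols]) for cols, m in self.learners_])
             # majority vote across component learners, one test sample at a time;
             # labels are assumed integer-encoded so np.bincount applies
             return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

     A typical call, under the same assumptions, would filter first and then train: keep = info_gain_filter(X_train, y_train, threshold=0.01); clf = EnsembleNcaKnn(n_estimators=10).fit(X_train[:, keep], y_train); y_pred = clf.predict(X_test[:, keep]).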
