Research on Covering Algorithms Based on Ensemble Learning
Abstract
Classification and clustering are two important data mining techniques. Classification builds rules or models from data that share the same class labels, so that new data can be classified correctly by these rules or models. Clustering groups the records of an unlabeled data set by similarity, so that objects within a group are highly similar while objects in different groups are not.
     A constructive neural network is a new type of neural network that divides the network's function into several independent functional modules, so that the whole network can be built up layer by layer. Compared with traditional neural networks, constructive neural networks are relatively simple to build at large scale, easy to understand, internally modular, simple to design, and amenable to parallel processing; they show great promise for handling massive data, overcoming the complex structure and slow training of traditional networks, and extending the application domains of neural networks. The covering-based constructive neural network was derived from the geometric interpretation of the neuron model. Its core is the domain covering algorithm, which first constructs, step by step, "spherical regions" in the projection domain of the sample set, each containing only samples of a single class, and then combines the spherical regions that share a class label into a unified output.
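The construction can be illustrated with a minimal Python sketch. The names build_covers and predict are hypothetical; the projection of samples onto a sphere that precedes covering in the actual algorithm is omitted, and setting each radius just inside the nearest sample of another class is one common simplification rather than the thesis's exact rule:

```python
import numpy as np

def build_covers(X, y):
    """Greedily construct 'spherical regions' that each contain only one class."""
    covers = []                        # each cover: (center, radius, label)
    uncovered = list(range(len(X)))
    while uncovered:
        i = uncovered[0]
        center, label = X[i], y[i]
        # radius: just inside the nearest sample of a different class
        # (assumes at least two classes and no duplicate points across classes)
        d_min = min(np.linalg.norm(X[j] - center)
                    for j in range(len(X)) if y[j] != label)
        radius = 0.999 * d_min
        covers.append((center, radius, label))
        # drop every same-class sample that the new cover contains
        uncovered = [j for j in uncovered
                     if y[j] != label or np.linalg.norm(X[j] - center) > radius]
    return covers

def predict(covers, x):
    """Label of the first cover containing x, or None -- a 'rejected sample'."""
    for center, radius, label in covers:
        if np.linalg.norm(x - center) <= radius:
            return label
    return None
```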
     Ensemble learning uses multiple learners to solve the same problem, which can significantly improve the generalization ability and stability of a learning system. Because the traditional covering algorithm cannot learn from incremental samples, this thesis proposes a covering incremental learning algorithm based on ensemble learning: it strengthens the learning of newly added samples by assigning them larger sample weights, and it provides a corresponding algorithm for each kind of incremental scenario, so that the covering algorithm can successfully learn from incremental samples. The traditional domain covering algorithm produces too many "rejected samples" because it constructs too many spherical regions, and the cross covering algorithm generalizes poorly because its construction depends too heavily on the training samples. To address both problems, this thesis proposes a covering algorithm based on ensemble learning, which greatly reduces the number of rejected samples and significantly improves generalization.
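A sketch of one way an ensemble of covering classifiers can reduce rejections, reusing build_covers and predict from the sketch above (X and y are NumPy arrays). Bootstrap resampling and rejection-aware majority voting are illustrative assumptions, not the thesis's exact weighting scheme:

```python
import numpy as np
from collections import Counter

def train_ensemble(X, y, n_members=10, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap resample
        members.append(build_covers(X[idx], y[idx]))
    return members

def ensemble_predict(members, x):
    votes = [predict(covers, x) for covers in members]
    votes = [v for v in votes if v is not None]       # ignore rejections
    if not votes:                                     # rejected only if every
        return None                                   # member rejects x
    return Counter(votes).most_common(1)[0][0]
```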
     The covering clustering algorithm applies the traditional domain covering algorithm to cluster analysis, exploiting the local aggregation of the data to be clustered; it clusters quickly and its parameters are relatively simple to set. This thesis uses the covering clustering algorithm to find initial centers for the K-means algorithm. The improved algorithm not only reduces the number of K-means iterations significantly but also helps K-means reach its best clustering result. To address the unsatisfactory clustering quality of the covering clustering algorithm, this thesis draws on the characteristics of the covering algorithm itself to propose a new cluster label matching method based on "center matching", and on that basis proposes a covering clustering algorithm based on ensemble learning, which improves the clustering quality of the covering algorithm.
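The seeding idea can be sketched with scikit-learn, whose KMeans accepts an explicit array of initial centers; cover_centers below is a hypothetical stand-in for the covering clustering step, which in the thesis supplies the centers of the constructed covers:

```python
import numpy as np
from sklearn.cluster import KMeans

def cover_centers(X, k, seed=0):
    # placeholder: the thesis derives these centers from the covering
    # clustering algorithm; here k distinct samples stand in for them
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]

X = np.random.rand(500, 2)
centers = cover_centers(X, k=3)
# an explicit init array with n_init=1 replaces random initialization
km = KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
print(km.n_iter_, km.inertia_)   # fewer iterations expected than with random init
```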
     The classification or clustering result of a covering algorithm is a set of "spherical regions", so measuring the diversity of covering classifiers or clusterers reduces to measuring the diversity of their spherical regions, and each spherical region is determined by its center and radius. Starting from this observation, this thesis proposes a diversity measure based on center similarity to realize selective ensemble learning for covering classification and clustering; the improved algorithm greatly reduces the number of individual learners needed in the ensemble.
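A sketch of how such a center-based diversity measure and the resulting greedy selective ensemble might look, with each learner represented as a list of (center, radius, label) covers; the mean nearest-center distance used here is an assumed formula, not necessarily the thesis's definition:

```python
import numpy as np

def center_diversity(covers_a, covers_b):
    """Mean nearest-center distance between two cover sets (assumed formula)."""
    A = np.array([c for c, _, _ in covers_a])
    B = np.array([c for c, _, _ in covers_b])
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def select_diverse(members, m):
    """Greedily keep the m learners that are mutually most diverse."""
    chosen = [0]
    while len(chosen) < m:
        rest = [i for i in range(len(members)) if i not in chosen]
        best = max(rest, key=lambda i: min(center_diversity(members[i], members[j])
                                           for j in chosen))
        chosen.append(best)
    return [members[i] for i in chosen]
```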