基于成对约束的半监督聚类算法研究及其并行化实现

英文题名：Semi-Supervised Clustering Algorithm Based on Pairwise Constraints and Its Parallel Implementation
作者：林超
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：半监督聚类 ; 成对约束 ; 并行计算 ; MapReduce
英文关键词：Semi-supervised clustering ; Pairwise constraints ; Parallel computing ; MapReduce
学位年度：2013
导师：杨燕
学科代码：081203
学位授予单位：西南交通大学
论文提交日期：2013-05-01

摘要

作为数据挖掘领域中的一种重要方法,聚类分析能够发现数据对象自然的分布结构。它根据数据对象之间的相似性,把数据对象分割成簇,并保证同一簇内中数据的相似性尽可能大,而不同簇间数据的相似性尽可能小。从机器学习的角度来看,聚类分析是一种无监督的学习方法,它按照一定的优化准则对数据进行分割,对数据的分析不需要知道其相关的背景知识。但是,现实生活中我们对数据的信息并不是一无所知,并且我们发现通过这些少量的已知信息能够找到数据对象标识或相互之间的约束信息。半监督聚类就是在传统的无监督聚类算法中引入先验知识来指导聚类过程,提高聚类结果精度。
     本文选择引入成对约束作为先验知识来协助指导聚类过程,分别建立了Must-Link和Cannot-Link约束组,用以描述两个样本数据间的关系。其中,Must-Link代表两个样本数据必须被分配到同一划分,而Cannot-Link则代表两个样本数据必须被分配到不同的划分。详细介绍了基于成对约束的半监督聚类算法Cop-Kmeans,对算法比较常见的约束违反的问题,提出了全新的改进方法,在解决约束违反的同时,算法的运行时间效率也优于传统的改进方案。此外,针对成对约束自身特征可能给聚类性能带来的不良影响,进一步提出了相应的改进方案,能够最大限度的削弱这种不良影响,从而能够在一定程度上改善聚类结果精度。
     考虑到当聚类对象是一个大数据集或者高维数据类型时,传统的单机串行聚类算法无论是在内存或者运算能力都无法满足实际需求。本文选择运用“云计算”思想,采用并行处理方式处理大规模的数据集。我们利用MapReduce计算模型对改进的半监督聚类算法进行并行化实现,并在Hadoop搭建的并行处理平台上处理大数据集。实验结果表明,采用并行计算方式能够显著提高聚类效率。
As an important method in the field of data mining, cluster analysis is able to find the natural distribution structure of the data objects. It is a process that divides objects into the similar class according to their attribute. The goal of the cluster is that the similarity of objects from the same group is larger than the similarity of objects from the different group. From the perspective of machine learning, clustering analysis is an unsupervised learning method, and we don't need any background knowledge when analyze on data objects. However, we can always get some information of the data objects to be analyzed, and we find that a small amount of known information can help find the data object identifier or constraint information between two instances. By adding prior knowledge to the traditional unsupervised clustering algorithm and guide the whole clustering process, then we get semi-supervised clustering algorithm with a high accuracy of clustering result.
     In this thesis, we select pairwise constraints to help guide the clustering process. Generally, pairwise constraints contain two parts:Must-link and Cannot-link, they describe the relationship between two samples of data. Wherein, Must-link represents two samples must be assigned to the same cluster, while Cannot-link represents two samples of data must be assigned to the different cluster. This thesis also introduces the semi-supervised clustering algorithm Cop-Kmeans in details, which is based on pairwise constraints. We put forward a new and improved method to solve the constraint violation exists in the Cop-Kmeans, the efficiency of the algorithm is also better than the traditional improvement program. In addition, we find the pairwise constraints may have an adverse effect on clustering performance, so we further propose a corresponding improved program. It is possible to weaken such adverse effects, and improve the accuracy of the clustering result to a certain extent.
     Since the traditional serial clustering algorithm can not meet the requirements either in memory or computing speed when clustering object is a type of large data sets or high-dimensional data, and inspired the idea of "cloud computing", the thesis deals with large-scale data sets in a parallel way. We use Hadoop to set up a parallel processing platform, and parallelize the proposed algorithm according to the MapReduce computing model, so that it can efficiently handle large data sets. Experiments show that parallel computing model can significantly improve the efficiency of clustering.

引文

[1]管仁初.半监督聚类算法的研究与应用：[博士学位论文].吉林：吉林大学计算机科学与技术系,2010.
    [2]Chapelle O, Scholkopf B, Zien A. Semi-supervised learning[M]. Cambridge Mass MIT Press,2006.
    [3]卢加磊,半监督学习中协同训练与多视图方法的比较及改进：[硕士学位论文].山东：中国海洋大学计算机科学与技术系,2010.
    [4]李国杰.信息服务网格一第三代.计算机世界,2011,40.
    [5]刘鹏.云计算[M].电子工业出版社,2010.
    [6]王鹏.云计算的关键技术与应用实例MI.人民邮电出版社,2010.
    [7]Wang L Z, Gregor V, Marcel K, Tao J. Cloud Computing:a Perspective Study [J]. Service Oriented Cyber infrastructure Lab, Rochester Inst.2008,12:1-10.
    [8]Armbrust M, Et A. Above the clouds:A Berkeley view of cloud computing. EECS Department,2009,2:3-5.
    [9]Wagstaff K, Cardie C. Constrained K-means Clustering with Background Knowledge. Proceedings of the Eighteenth International Conference on Machine Learning,2001:577-584.
    [10]Huang H, Cheng Y, and Zhao R L. A Semi-supervised Clustering Algorithm Based on Must-Link Set. Proceedings of the 4th international conference,2008: 492-499
    [11]Basu S, Banerjee A, Mooney R J. Semi-supervised clustering by seeding. Proceedings of the 19th International Conference on Machine Learning, San Fransisco,2002:19-26.
    [12]Gao Y Qi, H You D, Liu H. SEMI-SUPERVISED K-MEANS CLUSTERING FOR MULTI-TYPE RELATIONAL DATA. Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, 2008:326-330.
    [13]汪军,王传玉,周鸣争.半监督的改进K-均值聚类算法.计算机工程与应用,2009,45(28)：137-139.
    [14]刘涛,尹红健.基于半监督学习的K-均值聚类算法研究.计算机应用研究,2010,27(3)：913-916.
    [15]袁利永,王基一.一种改进的半监督K-means聚类算法.计算机工程与应用,2011,33(6)：138-143.
    [16]Zhan L Z, Xu H, Chen X G. A Semi-supervised Text Clustering Approach Based on K-means Algorithm.2011 International Conference on Engineering and Business Management,2011:2616-2620.
    [17]尹学松,胡恩良,陈松灿.基于成对约束的判别型半监督聚类分析.软件学报,2008,19(11)：2791-2802.
    [18]Hamasuna Y, Endo Y, Miyamoto S. Semi-supervised Fuzzy c-Means Clustering Using Cluster wise Tolerance Based Pairwise Constraints.2010 IEEE International Conference on Granular Computing,2010:186-191.
    [19]Tan W, Yang Y, Li T. An improved COP-Kmeans algorithm for solving constraint violation. Proc. International FLINS Conference on Foundations and Applications of Computational Intelligence,2010:690-696.
    [20]Zhang G C, Yang M, Wei S. Semi-supervised Robust NRFCM for image segmentation with pairwise constraints.2010 International Conference on Artificial Intelligence and Computational Intelligence,2010:525-560.
    [21]贺杨成,王士同,江南.成对约束的属性加权半监督模糊核聚类算法.计算机工程与应用,2011,47(24)：136-138.
    [22]潘俊,孔繁胜,王瑞琴.加权成对约束投影半监督聚类.浙江大学学报(工学版),2011,45(5)：934-940.
    [23]钟将,刘龙海,梁传伟.基于成对约束的主动半监督文本聚类.计算机工程,2011,37(13)：193-186.
    [24]Rutayisire T, Yang Y, Lin C, Zhang J Y. A Modified Cop-Kmeans Algorithm Based on Sequenced Cannot-Link Set. Proc.6th International Conference on Rough Sets and Knowledge Technology (RSKT2011), LNAI6954, Springer, 2011:217-225.
    [25]Barroso L, Dean J, Holzle U. Web search for a planet:The Google cluster architecture[J]. IEEE Micro,2003,23(2):22-28
    [26]Zhao W Z, Ma H F, He Q. Parallel K-means Clustering Based on MapReduce. Proceedings of the 1st International Conference on Cloud Computing, Beijing, 2009:674-679.
    [27]Bhavani R, Sudha G S, Radhika K. A Novel Parallel Hybrid K-means-DE-ACO Clustering Approach for Genomic Clustering using MapReduce.2011 World Congress on Information and Communication Technologies,2011:132-137.
    [28]Chang J, Luo J, Huang J Z, Feng S Z, Fan J P. Minimum Spanning Tree Based Classification Model for Massive Data with MapReduce implementation. Data Mining Workshops(ICDMW),2010 IEEE International Conference,2010: 129-137.
    [29]Alina E, Sun Q J, Benjamin M. Fast Clustering using MapReduce. Proceedings of the 17th ACM SIGKDD international conference, USA,2011:681-689.
    [30]李雪梅,王立宏,宋宜斌.近邻传播半监督聚类算法的并行计算.Computer Engineering and Applications,2011:149-152.
    [31]Robson L F C, Caetano T, Agma J M T, Julio L, Kang U, Christos F. Clustering very large multi-dimensional datasets with MapReduce. Proceedings of the 17th ACM SIGKDD international conference,2011:690-698.
    [32]陈建斌.高维聚类知识发现关键技术研究及应用.北京：电子工业出版社, 2009.
    [33]Xu R, Wunsch D. Survey of clustering algorithms. IEEE transactions on neural networks,2005.12(3):645-678
    [34]Han J W, Kamber M,范明(译),孟小峰(译).数据挖掘概念与技术,北京：机械工业出版社,2007.
    [35]杨燕,靳藩等.聚类有效性评价综述[J].计算机应用研究,2008,25(6)：1630-1632.
    [36]赵恒,张高煌,胡海虹.新的基于数据几何结构的聚类有效性函数[J].系统程与电子技术,2007,29(1)：96-98.
    [37]Basu S, Banjeree A, Mooney R J. Active semi-supervision for pairwise constrained clustering. Proceedings of the SIAM International Conference on Data Mining, Florida,2004:333-344.
    [38]邓超,郭茂组.基于Tri-Trainning和数据剪辑的半监督聚类算法.软件学报,2008.19(3)：663-672.
    [39]Xing E P, Ng A Y, Jordan M I, Et A. Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems,2003,15:505-512.
    [40]Cohn D, Caruana R, McCallum A. Semi-supervised clustering with user feedback[R]. Technical Report TR2003-1892, Cornell University,2003: http://citeseer.ist.psu.edu/cohn03semisupervised.htm
    [41]Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similaritymeasures[C]. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), Washington, D.C,2003:39-48.
    [42]MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkele,1967:281-297.
    [43]Wagstaff K. Intelligent clustering with instance-level constraints. Cornell University,2002.
    [44]Edmund B, Nightingale E B, Chen P M, Et A. Speculative execution in a distribute file system[A]. Proceedings of the twentieth ACM symposium on Operating systems principles[C],2005:191-205
    [45]Tom White. Hadoop权威指南-中文版[M].北京：清华大学出版社,2010.
    [46]Dhruba B. The Hadoop Distributed File System:Architecture and design [R]. lucene.apache.org/hadoop,2008.
    [47]陆嘉恒.Hadoop实战.北京：机械工业出版社,2012.
    [48]赵卫中,马慧芳,傅燕翔,史忠值.基于云计算平台Hadoop的并行K-means聚类算法设计研究.计算机科学,2011,38(10)：168-176.
    [49]周丽娟,王慧,王文伯,张宁.面向海量数据的并行Kmeans.华中科技大学学报(自然科学版),2012,40：150-152.
    [50]江小平,李成华,向文,张新访,颜海涛.K-means聚类算法的Map-Reduce并行化实现.华中科技大学学报(自然科学版),2011,39(1)：120-124.
    [51]Xu YF, Zhang Y, Ma R. K-Means algorithm based on Cloud Computing.2012 Fifth International Symposium on Computational Intelligence and Design,2012: 363-365.
    [52]王辉,张望,范明.基于集群环境的K-Means聚类算法的并行化.华中科技大学学报(自然科学版),2008,29：42-47.
    [53]http://archive.ics.uci.edu/ml/datasets.html.
    [54]http://strehl.com/download/x8d5k.txt.
    [55]Xu X, Jager J, Kriegel H. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery,1999,3:263-290.
    [56]陈建斌.高维聚类知识发现关键技术研究及应用.北京：电子工业出版社,2009.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700