Research on a Spectral Clustering Algorithm in a Distributed Environment
Abstract
Cluster analysis is one of the most basic human cognitive activities and a classic problem in machine learning. Clustering groups objects into classes according to certain of their attributes, so that inter-class similarity is as small as possible and intra-class similarity as large as possible. The k-means algorithm is a classic and widely used center-based clustering method. When the data set has a convex, roughly spherical distribution, k-means performs well and yields good clustering results. When the sample space is non-convex, however, k-means often fails; moreover, because it finds a solution by iterative optimization, it can become trapped in a local optimum.
     To cluster sample spaces of arbitrary shape and converge to a global optimum, an unsupervised method known as spectral clustering has emerged in recent years, overcoming the tendency of k-means to get stuck in local optima. Spectral clustering can identify clusters of arbitrary shape, does not fall into local optima, and applies well to practical problems. On massive data sets, however, it is limited by both memory consumption and running time. To make the algorithm usable at that scale, we port it to a distributed environment and apply two different sparsification methods to the similarity matrix to reduce memory use.
     This thesis focuses on implementing an efficient spectral clustering algorithm in a distributed environment. The main contributions are: 1. Sparsifying the similarity matrix and comparing two sparsification methods, the t-nearest-neighbor method and the Nyström method, from several angles in order to select the better one; experiments show that the t-nearest-neighbor method yields better clustering results. 2. Using t-nearest-neighbor sparsification of the similarity matrix, building the required distributed environment with the MPI model and Google's MapReduce system, and porting the spectral clustering algorithm onto this platform for validation. The results show that the algorithm makes full use of the resources of each node, improves running efficiency, and scales well, providing a good solution for processing massive data.
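The pipeline described above can be sketched in a few lines. This is a minimal single-machine illustration, not the thesis's distributed implementation: the function name, the Gaussian similarity with bandwidth `sigma`, and the dense eigensolver are all assumptions for readability (a large-scale run would replace the dense distance matrix and `eigh` with block-wise distance computation and a parallel Lanczos solver such as PARPACK).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.cluster.vq import kmeans2

def tnn_spectral_clustering(X, k, t=10, sigma=1.0):
    """Spectral clustering with a t-nearest-neighbor sparsified
    similarity matrix (illustrative sketch; names are assumptions)."""
    n = X.shape[0]
    # Dense pairwise squared distances; a distributed version would
    # compute these block-wise across nodes instead of all at once.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    rows, cols, vals = [], [], []
    for i in range(n):
        # Keep only the t nearest neighbors of point i (excluding i).
        for j in np.argsort(d2[i])[1:t + 1]:
            rows.append(i)
            cols.append(j)
            vals.append(np.exp(-d2[i, j] / (2 * sigma ** 2)))
    W = csr_matrix((vals, (rows, cols)), shape=(n, n))
    W = W.maximum(W.T)                 # symmetrize the t-NN graph
    L = laplacian(W, normed=True)      # normalized graph Laplacian
    # Embed each point using the k smallest eigenvectors of L.
    _, vecs = np.linalg.eigh(L.toarray())
    U = vecs[:, :k]
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    _, labels = kmeans2(U, k, minit="++", seed=0)
    return labels
```

Note that k-means reappears as the final step, but only on the k-dimensional spectral embedding, where clusters of arbitrary original shape have been mapped to near-spherical groups.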
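The MapReduce decomposition of the t-nearest-neighbor step can likewise be illustrated with a small simulation. The split is hypothetical but follows the standard map/reduce contract: each map task owns a contiguous block of rows and emits the t nearest neighbors of its rows; the reduce step merges the per-block outputs into the sparse pattern of the similarity matrix. In a real Hadoop or MPI deployment each block would be an input split processed on a different node; here the tasks simply run in a loop.

```python
import heapq
import numpy as np

def map_task(row_block, block_start, X, t):
    """Map: emit (row index, t nearest neighbors) for each owned row."""
    out = []
    for local_i, x in enumerate(row_block):
        i = block_start + local_i
        d2 = ((X - x) ** 2).sum(axis=1)
        d2[i] = np.inf                  # exclude the point itself
        nn = heapq.nsmallest(t, range(len(X)), key=lambda j: d2[j])
        out.append((i, [(j, float(d2[j])) for j in nn]))
    return out

def reduce_task(mapped):
    """Reduce: merge per-block outputs into one adjacency dict,
    i.e. the sparsity pattern of the t-NN similarity matrix."""
    adjacency = {}
    for block in mapped:
        for i, nbrs in block:
            adjacency[i] = nbrs
    return adjacency

def tnn_graph_mapreduce(X, t, n_blocks=4):
    # Split rows into contiguous blocks; in a real MapReduce job each
    # block would be dispatched to a separate worker node.
    bounds = np.linspace(0, len(X), n_blocks + 1).astype(int)
    mapped = [map_task(X[a:b], a, X, t)
              for a, b in zip(bounds, bounds[1:])]
    return reduce_task(mapped)
```

Because each map task only needs its own row block plus read access to the data set, the work divides evenly across nodes, which is the source of the scalability the abstract reports.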
