Adaptive Spectral Clustering Based on Shared Nearest Neighbors
Abstract
Spectral clustering, a relatively recent clustering algorithm, has attracted wide attention in pattern recognition. It makes no assumptions about the global structure of the data; instead, it obtains the globally optimal solution of a relaxed, continuous clustering criterion directly from the eigendecomposition of the graph Laplacian. It can therefore cluster data of arbitrary shape and converges to the global optimum of the relaxed problem. Because spectral clustering operates directly on the Laplacian derived from the similarity matrix, the definition of similarity is critical to its performance.
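To make the pipeline concrete, the following is a minimal sketch of normalized spectral clustering with a global Gaussian kernel, in the style of the standard Ng–Jordan–Weiss formulation; it is illustrative only, and the choice of `sigma` and the final k-means step are conventional defaults, not the thesis's exact experimental setup:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Normalized spectral clustering with a global Gaussian kernel."""
    # Gaussian-kernel similarity matrix (single global scale sigma).
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # The eigenvectors of the k smallest eigenvalues span the relaxed
    # (continuous-domain) solution of the graph-partitioning criterion.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # Cluster the rows of the spectral embedding with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```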
This thesis first introduces the mathematical background of spectral clustering and explains its principles from two perspectives: graph partitioning and random walks. It then analyzes in detail the Gaussian kernel, the similarity function most commonly used in spectral clustering, together with existing improved similarity measures. The analysis shows that when two pairs of points are equidistant and the points have similar neighborhoods, a pair within the same cluster should receive higher similarity than a pair spanning two clusters; yet neither the Gaussian kernel nor the locally scaled similarity of self-tuning spectral clustering satisfies this clustering assumption. Building on a summary of the strengths and weaknesses of existing measures, the thesis proposes an adaptive Gaussian kernel based on shared nearest neighbors. It characterizes local density by the shared nearest neighbors of two points, thereby capturing latent cluster-structure information, and combines this information with a self-tuned Gaussian kernel so that two points with many data points distributed between them receive higher similarity. The new similarity matrix satisfies both clustering assumptions and exhibits a clear block-diagonal structure; the corresponding algorithm is called adaptive spectral clustering based on shared nearest neighbors. Finally, the algorithm is compared with classical spectral clustering and self-tuning spectral clustering on several challenging synthetic data sets and four real UCI data sets. The experimental results show clear performance gains over both baselines: the method effectively identifies the intrinsic relationships among data points and produces correct clusterings.
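The proposed similarity can be sketched roughly as follows. This is an illustrative reconstruction from the abstract's description, not the thesis's exact formula: each point's local scale is taken from its k-th nearest neighbor (as in self-tuning spectral clustering), and the kernel value is then amplified by the number of shared k-nearest neighbors, so that pairs with many common neighbors, i.e. with dense data between them, become more similar:

```python
import numpy as np
from scipy.spatial.distance import cdist

def snn_adaptive_similarity(X, k=7):
    """Sketch of a shared-nearest-neighbor adaptive Gaussian kernel
    (illustrative reconstruction, not the thesis's exact formulation)."""
    D = cdist(X, X)
    order = np.argsort(D, axis=1)
    # Self-tuning local scale: distance to each point's k-th nearest
    # neighbor (order[:, 0] is the point itself, at distance 0).
    sigma = D[np.arange(len(X)), order[:, k]]
    W = np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :]))
    # Boolean k-nearest-neighbor membership matrix.
    knn = np.zeros_like(W, dtype=bool)
    np.put_along_axis(knn, order[:, 1:k + 1], True, axis=1)
    # snn[i, j] = number of neighbors shared by points i and j.
    snn = (knn[:, None, :] & knn[None, :, :]).sum(axis=2)
    # Amplify similarity when two points share many nearest neighbors.
    return W * (1 + snn / k)
```

The amplification factor `1 + snn / k` is one simple way to inject the shared-neighbor density information; the resulting matrix stays symmetric because the distance matrix, the scale product, and the shared-neighbor count are all symmetric.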
Spectral clustering has become one of the most popular clustering algorithms in recent years. Traditional methods such as k-means often assume that the data follow a particular distribution, but spectral clustering makes no such assumption. It directly computes the leading eigenvectors of the Laplacian matrix to obtain an approximate solution to a graph-partitioning problem, so it applies to a wide range of data sets, especially non-convex ones. Because spectral clustering operates directly on the Laplacian derived from the similarity matrix, the similarity measure is crucial to its performance.
This paper first introduces the mathematical objects used in spectral clustering and explains why spectral clustering works from two viewpoints: graph partitioning and random walks. It then analyzes the most commonly used similarity measure, the Gaussian kernel function, along with other similarity measures used in spectral clustering. The clustering assumption states that, at the same Euclidean distance, two points in the same cluster should be more similar than two points in different clusters. Neither the traditional Gaussian kernel nor the locally scaled measure of Self-Tuning spectral clustering satisfies this assumption. After analyzing the strengths and weaknesses of these measures, we propose a novel similarity measure: an adaptive Gaussian kernel function based on shared nearest neighbors. It detects the intrinsic cluster structure embedded in a data set by exploiting the local-density information carried by shared nearest neighbors, assigning higher similarity to two points that have many common neighbors. This measure satisfies the clustering assumption, and the resulting affinity matrix has a clear block structure. The corresponding method is called adaptive spectral clustering based on shared nearest neighbors. Finally, experiments on several synthetic data sets and four real data sets from the UCI repository show that the proposed algorithm achieves considerable improvements over both classical spectral clustering and Self-Tuning spectral clustering: it reveals the relationships between data points and finds the true clusters.
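A quick numeric check illustrates why the plain Gaussian kernel cannot satisfy the clustering assumption: its value depends only on the distance between two points, so two equidistant pairs are indistinguishable no matter how the surrounding data are distributed (the coordinates below are arbitrary illustrative values):

```python
import numpy as np

def gaussian_similarity(a, b, sigma=1.0):
    """Global Gaussian kernel: a function of distance alone."""
    d2 = np.sum((np.asarray(a) - np.asarray(b)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

# Two pairs at the same Euclidean distance 2.0: imagine one pair lying
# inside a dense cluster and the other straddling a gap between clusters.
within = gaussian_similarity([0.0, 0.0], [2.0, 0.0])
between = gaussian_similarity([5.0, 0.0], [7.0, 0.0])
assert within == between  # the kernel cannot tell the two cases apart
```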