Research on Key Techniques for Document Cluster Ensembles (文本聚类集成关键技术研究)
Abstract
Clustering analysis is one of the most important research topics in data mining and pattern recognition, and has been widely applied to data compression, information retrieval, speech recognition, character recognition, image segmentation, document clustering, and related areas; it is also attracting growing attention in biology, geology, geography, marketing, and anomaly detection. Thousands of clustering algorithms have been proposed, yet no single algorithm can reliably recognize clusters of different sizes, shapes, and densities, possibly contaminated by noise. Document data are high-dimensional and sparse, which makes many clustering algorithms ill-suited to document clustering; moreover, the sheer scale of document collections places stringent efficiency demands on any clustering algorithm. As an important extension of conventional clustering, cluster ensemble techniques offer advantages that conventional algorithms lack, and cluster ensembles have become a research hotspot in machine learning. Taking document clustering as the application background, this dissertation studies the key problems of document cluster ensembles and makes the following contributions:
     (1) Motivated by the strengths of spectral clustering, spectral algorithms based on matrix perturbation theory and on spectral graph theory are brought into the document cluster ensemble problem. To address the high computational complexity of spectral clustering, an algebraic transformation first converts the eigenvalue decomposition of a large matrix into an equivalent singular value decomposition, and then into the eigenvalue decomposition of a small matrix; two spectral document cluster ensemble algorithms, SMSA and TMSA, are designed on this basis. Experiments on real-world document sets show that the algebraic transformation is feasible: running time drops sharply while the results before and after the transformation remain nearly identical, and both SMSA and TMSA outperform graph-partitioning-based cluster ensemble algorithms, making them effective solutions to the document cluster ensemble problem.
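     To make the scale reduction concrete, the following is a minimal sketch of this kind of algebraic transformation, under the assumption (an illustration, not necessarily the thesis's exact matrices) that the large n-by-n similarity matrix to be decomposed factors as S = HHᵀ for a document-hyperedge incidence matrix H ∈ ℝ^{n×m} with m ≪ n:

     \[
     (H^\top H)\, v_i = \lambda_i v_i \quad (m \times m \ \text{eigenproblem}), \qquad
     u_i = \tfrac{1}{\sqrt{\lambda_i}} H v_i \;\Longrightarrow\;
     (H H^\top)\, u_i = \tfrac{1}{\sqrt{\lambda_i}} H (H^\top H)\, v_i = \lambda_i u_i .
     \]

     The leading eigenvectors of the large matrix HHᵀ (equivalently, the left singular vectors of H) are thus recovered from an m-by-m eigenproblem plus one matrix-vector product per eigenvector, reducing the cost from O(n³) to roughly O(m³ + nmk) for k eigenvectors.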
     (2) Starting from the key idea of spectral clustering, the search for a "best" subspace, the low-dimensional embeddings of documents and hyperedges are derived simultaneously, yielding two subspace-similarity-based cluster ensemble algorithms, SSICA and SSDCA. Experiments on real-world document sets show that both outperform graph-partitioning-based ensemble algorithms, with SSICA slightly ahead of SSDCA in clustering quality. SSICA is further generalized into a low-dimensional-embedding method for document cluster ensembles: different spectral clustering algorithms first produce low-dimensional embeddings of the hyperedges; the embeddings of the documents are then obtained indirectly by composing mappings; finally, the documents are clustered with the simple K-means algorithm according to their coordinates in the low-dimensional space, as sketched below. Experiments show that this method outperforms other common graph-partitioning-based ensemble methods and effectively solves the document cluster ensemble problem.
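     A minimal sketch of such an embedding pipeline follows; the function name ensemble_embed_kmeans, the binary incidence construction, and the use of a plain eigendecomposition for the hyperedge embedding are illustrative assumptions rather than the exact SSICA formulation.

        import numpy as np
        from sklearn.cluster import KMeans

        def ensemble_embed_kmeans(labelings, k, dim=10, seed=0):
            """Cluster documents from an ensemble of integer labelings via a
            low-dimensional embedding of the hyperedges (illustrative sketch)."""
            # Binary document-by-hyperedge incidence matrix H: one column per cluster.
            H = np.hstack([np.eye(int(lab.max()) + 1)[lab] for lab in labelings])
            # Spectral embedding of the hyperedges from the small m-by-m similarity.
            evals, V = np.linalg.eigh(H.T @ H)       # eigenvalues in ascending order
            Y = V[:, -dim:]                # top-dim eigenvectors span the "best" subspace
            X = H @ Y                      # composition of mappings: document embedding
            X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12   # unit-length rows
            return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

     The final K-means step runs in the dim-dimensional space rather than the original high-dimensional term space, which is what keeps it cheap.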
     (3) Non-negative matrix factorization (NMF) is introduced into the document cluster ensemble problem, yielding the BNMF algorithm. Because NMF converges slowly and tends to converge to poor local optima, K-means is used to initialize NMF, yielding the NMFK algorithm; further, to remove the instability caused by the random initialization of K-means, the minimum-maximum principle is used to choose the initial centroids, yielding the NMFKMMP algorithm. Experiments on real-world document sets show that initializing NMF with K-means is effective: NMFK produces better and more stable results than BNMF and runs considerably faster; NMFKMMP is efficient and outperforms other common cluster ensemble algorithms, effectively solving the document cluster ensemble problem.
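     The following hedged sketch shows K-means-initialized NMF on a nonnegative consensus matrix, in the spirit of NMFK; the factor construction and the use of scikit-learn's NMF are assumptions for illustration, and NMFKMMP would additionally seed K-means with deterministic farthest-apart ("min-max") centers instead of random ones.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.decomposition import NMF

        def nmfk_consensus(A, k, seed=0):
            """Consensus clustering by NMF initialized from a K-means partition.
            A: nonnegative n-by-n co-association matrix built from the ensemble."""
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(A)
            W0 = np.eye(k)[km.labels_] + 0.01          # smoothed n-by-k indicator matrix
            H0 = W0.T @ A / W0.sum(axis=0)[:, None]    # k-by-n rows ~ cluster centroids
            model = NMF(n_components=k, init='custom', max_iter=300)
            W = model.fit_transform(A, W=W0, H=H0)     # start from the K-means factors
            return W.argmax(axis=1)                    # hard assignment per document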
     (4) The spherical K-means (SKM) algorithm cannot effectively identify non-spherical clusters and therefore tends to produce low-quality ensemble members. To further raise clustering quality, the key idea of CHAMELEON, a "divide-and-merge" (DM) strategy, is introduced into the member generation phase: SKM with the DM strategy is run r times, each run producing a relatively large number of document sub-clusters that are then merged by Ward's method according to their similarity, yielding r ensemble members; these members are then combined with the cluster ensemble algorithms designed in this dissertation. Experiments on real-world document sets show that, except for the graph-partitioning-based ensemble algorithms, the four hierarchical-clustering-based ensemble algorithms and the spectral, low-dimensional-embedding, and NMF-based ensemble algorithms designed here all achieve higher average normalized mutual information (NMI) after adopting the DM strategy, indicating that DM effectively improves the clustering quality of cluster ensemble algorithms.
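     A hedged sketch of generating one DM ensemble member follows; spherical K-means is approximated here by K-means on L2-normalized vectors, and oversample * k is an assumed over-segmentation level, not the thesis's exact setting.

        from sklearn.cluster import KMeans, AgglomerativeClustering
        from sklearn.preprocessing import normalize

        def dm_member(X, k, oversample=5, seed=None):
            """One divide-and-merge ensemble member: over-cluster, then Ward-merge."""
            Xn = normalize(X)                           # unit-length document vectors
            sub = KMeans(n_clusters=oversample * k, n_init=5,
                         random_state=seed).fit(Xn)     # "divide": many small sub-clusters
            centers = normalize(sub.cluster_centers_)
            merged = AgglomerativeClustering(
                n_clusters=k, linkage='ward').fit_predict(centers)   # "merge" step
            return merged[sub.labels_]    # each document inherits its sub-cluster's label

     Each member produced this way can be scored against reference labels with sklearn.metrics.normalized_mutual_info_score, the NMI measure reported above.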
