Research on Clustering Methods for High-Dimensional Data
Abstract
In recent years, with the rapid development of information technology and the sharp increase in the size of datasets, effectively analyzing these massive amounts of data and extracting useful information has become a central and challenging research topic. Cluster analysis, an unsupervised machine learning method, has attracted growing attention and developed rapidly; it has been widely applied in important areas such as bioinformatics, Internet technology, and image analysis. Most clustering algorithms achieve good performance on low-dimensional data, but suffer from the "curse of dimensionality" when dealing with high-dimensional data. As dimensionality increases, the data becomes sparse, distances among objects tend to become indistinguishable, and noisy and redundant features multiply, all of which can greatly reduce the effectiveness of clustering algorithms. Research on clustering methods for high-dimensional data has therefore become a difficult and important topic in machine learning. Meanwhile, many clustering algorithms impose strong constraints on the data, such as restrictions on the number and shape of clusters, which are often not satisfied in real-world tasks, so designing effective nonparametric clustering techniques is equally important.
     To address the problems associated with clustering high-dimensional data, this thesis combines techniques such as ensemble learning and Boosting, and proposes several novel clustering methods. The main contributions of the thesis are summarized as follows:
     (1) Ensemble clustering aims to generate a stable and robust clustering through the consolidation of multiple base clusterings. In recent years many ensemble clustering methods have been proposed, most of which treat each base clustering and each object as equally important. Some approaches make use of weights associated with clusters, or with clusterings, when assembling the different base clusterings, but little effort has been put towards incorporating weighted objects into the consensus process. To fill this gap, we propose Weighted-Object Ensemble Clustering (WOEC). WOEC first estimates how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and then embeds the corresponding information as weights associated with the objects. We propose three different consensus techniques to leverage the weighted objects, all of which reduce the ensemble clustering problem to a graph partitioning one. Extensive experimental results demonstrate that WOEC outperforms state-of-the-art consensus clustering methods and is robust to parameter settings.
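     The abstract does not spell out the three consensus techniques, but the core weighting idea can be made concrete. The Python sketch below is illustrative only: it assumes an entropy-style ambiguity weight and an off-the-shelf spectral partitioner as stand-ins for the constructions in the thesis. The co-association matrix records how often two objects are co-clustered across the base clusterings; objects whose entries hover near 0.5 are treated as hard to cluster and receive larger weights, which then modulate the similarity graph before partitioning.

```python
# A minimal sketch of the weighted-object idea; the weighting scheme and
# the way weights enter the graph are illustrative assumptions, not the
# thesis' actual consensus techniques.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def co_association(base_labels):
    """S[i, j] = fraction of base clusterings that put objects i and j together."""
    m, n = base_labels.shape
    S = np.zeros((n, n))
    for labels in base_labels:
        S += labels[:, None] == labels[None, :]
    return S / m

def object_weights(S):
    """Hypothetical weight: entries near 0.5 mean the base clusterings
    disagree about the pairs involving this object, so it is hard to cluster."""
    ambiguity = 1.0 - 2.0 * np.abs(S - 0.5)  # 1 when S == 0.5, 0 when S is 0 or 1
    return ambiguity.mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Base clusterings: k-means runs with different random seeds.
base = np.array([KMeans(n_clusters=3, n_init=5, random_state=r).fit_predict(X)
                 for r in range(20)])
S = co_association(base)
w = object_weights(S)
# Emphasize edges incident to hard objects (one illustrative choice), then
# reduce consensus clustering to partitioning the weighted similarity graph.
W = S * (1.0 + np.sqrt(np.outer(w, w)))
np.fill_diagonal(W, W.max())
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(W)
```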
     (2) The mean shift algorithm is a nonparametric clustering technique that makes no assumptions on the number of clusters or on their shapes. It achieves this by performing kernel density estimation and iteratively locating the local maxima of the kernel mixture: each point is repeatedly moved in the direction of greatest density increase within its neighborhood until it converges, and the set of points that converge to the same mode defines a cluster. While appealing, the performance of the mean shift algorithm deteriorates significantly on high-dimensional data due to the sparsity of the input space, and noisy features can further mislead the mean shift procedure.
     We extend the mean shift algorithm to overcome these limitations while maintaining its desirable properties, resulting in the Weighted Adaptive Mean Shift clustering algorithm (WAMS). WAMS first estimates the relevant subspace for each data point and then embeds this information within the mean shift procedure, thus avoiding computing distances in the full-dimensional input space. The resulting approach achieves the best of both worlds: it handles high-dimensional data and noisy features effectively while preserving a nonparametric nature. WAMS can also be combined with random sampling to speed up clustering on large-scale data without sacrificing accuracy. Extensive experimental results on both synthetic and real-world data demonstrate the effectiveness of the proposed method.
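     The abstract states only that a relevant subspace is estimated per point and used inside the shift; the estimator itself is not given here. The hypothetical fragment below illustrates one way such information could enter the loop above: weight each dimension by the inverse of the local variance around a point, and use the resulting weighted distance, instead of the full-dimensional Euclidean distance, when selecting the neighborhood.

```python
# Hypothetical subspace weighting in the spirit of WAMS; the thesis' actual
# estimator is not reproduced in the abstract, so every formula here is an
# illustrative assumption.
import numpy as np

def subspace_weights(X, x, bandwidth, eps=1e-8):
    """Dimensions along which x's neighborhood is tight are deemed relevant."""
    mask = np.linalg.norm(X - x, axis=1) <= bandwidth
    if mask.sum() < 2:                       # too few neighbors to estimate spread
        return np.full(X.shape[1], 1.0 / X.shape[1])
    w = 1.0 / (X[mask].var(axis=0) + eps)    # inverse local variance per dimension
    return w / w.sum()

def weighted_shift(X, x, bandwidth):
    """One shift step using the weighted (subspace-aware) distance."""
    w = subspace_weights(X, x, bandwidth)
    d = np.sqrt(((X - x) ** 2 * w).sum(axis=1))
    mask = d <= bandwidth
    return X[mask].mean(axis=0) if mask.any() else x
```

     The random-sampling speedup the abstract mentions would then layer on top of this, for example by running the procedure from a subsample of seed points only.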
     (3) Mean shift is a nonparametric clustering technique that does not require the number of clusters as input and can find clusters of arbitrary shapes. While appealing, its performance is sensitive to the selection of the bandwidth, and it can fail to capture the correct clustering structure when multiple modes exist within one cluster. DBSCAN is another popular and efficient density-based clustering algorithm, but it is also sensitive to its parameters and typically merges overlapping clusters. To address these issues, we propose Boosted Mean Shift Clustering (BMSC). BMSC partitions the data across a grid and applies mean shift locally in each cell of the grid, with each cell providing a number of intermediate modes (iModes). A mode-boosting technique is proposed to iteratively select points in denser regions, and DBSCAN is then used to partition the obtained iModes. Complexity analysis shows that BMSC has the potential to handle large-scale data, and extensive experimental results on both synthetic and real benchmark data demonstrate its effectiveness and robustness to parameter settings.
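     The abstract fixes the pipeline (grid, local mean shift, iModes, DBSCAN over the iModes) but not its internals, so the sketch below wires those stages together with scikit-learn's MeanShift and DBSCAN. The one-dimensional grid, the omission of the mode-boosting iterations, and all parameter values are simplifying assumptions.

```python
# Schematic BMSC-style pipeline assembled from off-the-shelf components;
# the grid scheme and parameters are assumptions, and the mode-boosting
# step described in the abstract is omitted for brevity.
import numpy as np
from sklearn.cluster import MeanShift, DBSCAN

def bmsc_sketch(X, n_cells=4, bandwidth=1.0, eps=0.5, min_samples=2):
    # 1. Partition the data across a grid (one-dimensional here, for brevity).
    order = np.argsort(X[:, 0])
    cells = np.array_split(X[order], n_cells)
    # 2. Run mean shift locally in each cell; the local modes are the iModes.
    imodes = np.vstack([MeanShift(bandwidth=bandwidth).fit(c).cluster_centers_
                        for c in cells if len(c) > 0])
    # 3. Partition the iModes with DBSCAN (label -1 marks noise iModes), then
    #    assign every point the label of its nearest iMode.
    imode_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(imodes)
    nearest = np.argmin(((X[:, None, :] - imodes[None, :, :]) ** 2).sum(-1), axis=1)
    return imode_labels[nearest]
```

     Running mean shift locally on small cells rather than on the full dataset is also where the scalability noted in the complexity analysis would come from: DBSCAN then only has to cluster the much smaller set of iModes.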
