Research on Clustering Algorithms for Multi-Dimensional Mixed-Attribute Data
Abstract
In most applications, data sets contain both numeric and categorical attribute types, so clustering mixed-attribute data is currently an active research topic. This thesis studies this problem:
     1. A new weighted clustering algorithm, IWKM, is proposed. Experimental results on real data sets show that IWKM achieves higher clustering accuracy than traditional clustering algorithms.
     2. A weighted fuzzy K-prototypes algorithm, WFK-prototypes, is proposed. This algorithm accounts for the fuzziness of data objects' cluster memberships. Experimental results on real data sets show that WFK-prototypes outperforms traditional clustering algorithms in clustering accuracy.
     3. Building on the KH algorithm framework, a fuzzy clustering algorithm, IKH, is proposed that eliminates the drawbacks of the fuzzifier (fuzzy coefficient). Experiments applying IKH to data sets from the UCI machine learning repository show that its clustering accuracy is higher than that of traditional clustering algorithms.
     4. A new cluster-center initialization method, DDCI, is proposed. Experiments on real data sets from the UCI machine learning repository show that DDCI yields higher clustering accuracy and more stable results than the traditional random initialization method.
Based on the intrinsic characteristics or similarity of objects, organizing those objects into sensible groups is one of the most fundamental modes of learning and understanding. Cluster analysis is the study of approaches and algorithms that partition or allocate objects into sensible groups. With the rapid development of information technology and of data collection and storage devices, almost all aspects of human society produce and store large amounts of data, and the volume and variety of data continue to grow rapidly. For example, worldwide businesses generate large volumes of data, including sales transactions and stock trading records; scientific and engineering practice generates data from remote sensing, process measurement, scientific experiments, engineering observation, and environmental testing; and social media such as blogs, podcasts, Wikipedia, forums, social networks, micro-blogs, and Twitter have become increasingly important data sources.
     The availability and explosive growth of data have inspired the emergence and development of data mining, or knowledge discovery, which extracts knowledge from data automatically or conveniently. Clustering analysis is an important technique in data mining and knowledge discovery; its purpose is to explore the potential structure hidden in data. The technique is widely used in customer segmentation, web search, privacy protection, bioinformatics, and other areas.
     Traditional clustering algorithms are mainly designed for data objects with only numeric or only categorical attributes. However, more and more research suggests that real data sets are mostly described by both numeric and categorical attributes. Since these two attribute types differ greatly in value range, characteristics, and distribution, many researchers believe that traditional clustering algorithms designed for purely numeric or purely categorical data may not be suitable for processing mixed-attribute data. Designing algorithms for data with both numeric and categorical attributes is therefore one of the most attractive research issues in cluster analysis. In this thesis, we investigate this issue. Our research work comprises the following four aspects:
     1) Based on the W-k-means framework, a new clustering algorithm (IWKM) is proposed. In IWKM, a distribution centroid is first introduced to represent the center of a cluster over categorical attributes; the distribution centroid and the mean are then combined to represent the center of a cluster with mixed numeric and categorical data; and a new dissimilarity measure, which takes into account the influence of different attributes in the clustering process, is used to evaluate the distance between data objects and cluster centers. In addition, IWKM uses the weighting strategy of the W-k-means framework to assess the influence of each attribute. The performance of the proposed method is demonstrated by a series of experiments on real-world data sets in comparison with traditional clustering algorithms.
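The distribution-centroid idea above can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: it assumes the categorical center is the frequency distribution of category values in a cluster, and that the object-to-center distance combines weighted squared numeric differences with, per categorical attribute, one minus the frequency of the object's own category.

```python
from collections import Counter

def distribution_centroid(values):
    """Represent a cluster center for one categorical attribute as the
    relative frequency of each category among the cluster's members."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def mixed_distance(obj_num, obj_cat, center_num, center_cat, w_num, w_cat):
    """Weighted dissimilarity between an object and a mixed-type cluster
    center (a sketch): weighted squared differences on numeric attributes,
    plus weighted (1 - frequency of the object's category) on categorical
    attributes, where center_cat holds one frequency dict per attribute."""
    d = sum(w * (x - c) ** 2 for w, x, c in zip(w_num, obj_num, center_num))
    d += sum(w * (1.0 - freq.get(x, 0.0))
             for w, x, freq in zip(w_cat, obj_cat, center_cat))
    return d
```

A rarer category thus contributes a larger distance, so the centroid retains distribution information that a single "mode" value would discard.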
     2) A weighted fuzzy k-prototypes algorithm (WFK-prototypes) is proposed. This algorithm introduces the ideas of fuzzy sets and fuzzy clustering to handle the fuzzy nature of data objects; it integrates the fuzzy centroid with the mean to represent the center of a cluster with mixed numeric and categorical data, a representation that captures the distribution information of both numeric and categorical attribute values; and it uses the co-occurrence of attribute values to calculate the impact of each attribute in the clustering process. The performance of the proposed method is demonstrated by a series of experiments on real-world data sets in comparison with traditional clustering algorithms.
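The fuzzy centroid and the membership update it pairs with can be sketched as below. This is an assumption-laden illustration: the fuzzy centroid weights each category by the membership mass (membership raised to the fuzzifier m) of the objects holding it, and the membership update shown is the standard fuzzy-c-means rule, which the thesis's WFK-prototypes adapts to its own dissimilarity.

```python
def fuzzy_centroid(values, memberships, m=2.0):
    """Fuzzy centroid for one categorical attribute: each category's share
    is the normalized membership^m mass of the objects holding it."""
    mass = {}
    for v, u in zip(values, memberships):
        mass[v] = mass.get(v, 0.0) + u ** m
    total = sum(mass.values())
    return {v: w / total for v, w in mass.items()}

def fcm_membership(dists, m=2.0):
    """Standard fuzzy-c-means membership update from one object's distances
    to all cluster centers (assumes all distances are nonzero)."""
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** exp for dj in dists) for di in dists]
```

For example, with distances 1.0 and 3.0 to two centers (m = 2), the object receives memberships 0.9 and 0.1, and those memberships in turn skew the fuzzy centroid toward the categories of strongly assigned objects.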
     3) An improved KH algorithm (IKH) is proposed to deal with the issue of clustering mixed data. In fuzzy clustering designed for mixed data, every data object influences every cluster, no matter how far it is from the cluster center. By adopting the KH framework, IKH avoids this deficiency. In IKH, we first combine means and fuzzy centroids to represent the centers of clusters with mixed attributes, and then use a new dissimilarity measure, with a new normalization factor, to assess the distance between data objects and mixed-attribute cluster centers. The performance of the proposed method is demonstrated by a series of experiments on real-world data sets in comparison with traditional clustering algorithms.
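The deficiency described above, and how a KH-style (Klawonn-Hoeppner) fuzzifier removes it, can be sketched as follows. This is an assumption: the sketch uses the published fuzzifier function g(u) = alpha*u^2 + (1-alpha)*u rather than the thesis's exact IKH update. Minimizing the weighted objective sum_i g(u_i)*d_i under the constraint sum_i u_i = 1 gives u_i = (lambda/d_i - (1-alpha)) / (2*alpha), clipped at zero, so clusters far from an object receive exactly zero membership instead of the always-positive membership of standard fuzzy clustering.

```python
def kh_memberships(dists, alpha=0.5):
    """KH-style membership update for one object (a sketch): iteratively
    drop clusters whose closed-form membership would be negative, then
    renormalize over the remaining 'active' clusters."""
    eps = 1e-12
    d = [max(x, eps) for x in dists]  # guard against zero distances
    active = list(range(len(d)))
    while True:
        # lambda chosen so active memberships sum to one
        lam = (2 * alpha + len(active) * (1 - alpha)) / sum(1.0 / d[i] for i in active)
        u = {i: (lam / d[i] - (1 - alpha)) / (2 * alpha) for i in active}
        neg = [i for i in active if u[i] < 0]
        if not neg:
            break
        active = [i for i in active if i not in neg]
    out = [0.0] * len(d)
    for i, ui in u.items():
        out[i] = ui
    return out
```

With distances 1.0 and 100.0, standard fuzzy c-means (m = 2) would still assign the far cluster a small positive membership (about 1e-4), so the far object would still pull on that cluster's center; the KH-style update assigns it exactly zero.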
     4) A new method for initializing cluster centers (DDCI) for mixed data is proposed. In partition algorithms, the clustering result is dramatically influenced by the initial placement of the cluster centers. Many existing works deal with this issue for numeric or categorical data; however, as far as we know, all partition algorithms designed for mixed data initialize cluster centers randomly, which makes the clustering outcome unstable and non-repeatable. To address the initialization issue for mixed data, we propose DDCI, which combines the ideas of density and distance: for mixed numeric and categorical data, the notion of density is introduced to evaluate the coherence of each data object within the data set, and density and distance are then combined to select the initial cluster centers. The performance of the proposed method is demonstrated by a series of experiments on real-world data sets in comparison with traditional clustering algorithms.
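The density-and-distance selection can be sketched as below. The sketch is illustrative only: the density definition (sum of 1/(1+dist) to all other objects) is an assumption, not the thesis's, and it is shown on numeric points; for mixed data the distance call would be replaced by a mixed dissimilarity measure.

```python
import math

def density_distance_init(points, k):
    """DDCI-style initialization sketch: the first center is the densest
    object; each further center maximizes density * distance-to-nearest-
    chosen-center, favoring dense objects far from existing centers."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    density = [sum(1.0 / (1.0 + dist[i][j]) for j in range(n) if j != i)
               for i in range(n)]
    centers = [max(range(n), key=lambda i: density[i])]
    while len(centers) < k:
        score = lambda i: density[i] * min(dist[i][c] for c in centers)
        centers.append(max((i for i in range(n) if i not in centers), key=score))
    return centers
```

Because the selection is deterministic given the data, repeated runs pick the same initial centers, which is exactly the repeatability that random initialization lacks.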
