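A Weighted Algorithm for Determining the Number of Clusters in Mixed Data (一种带权的混合数据聚类个数确定算法)

English title: A WEIGHTED CLUSTERING NUMBER DETERMINING ALGORITHM FOR MIXED DATA
Authors: Li Shunyong (李顺勇); Zhang Miaomiao (张苗苗)
English authors: Li Shunyong; Zhang Miaomiao; School of Mathematical Sciences, Shanxi University
Keywords: number of clusters; mixed data; attribute weight; validity index
English keywords: The number of clustering; Mixed data; Attribute weight; Validity index
Journal code: JYRJ
English journal title: Computer Applications and Software
Affiliation: School of Mathematical Sciences, Shanxi University
Publication date: 2019-01-12
Publisher: Computer Applications and Software (计算机应用与软件)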
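Year: 2019
Volume: 36
Funding: National Natural Science Foundation of China (61573229); Shanxi Province Basic Research Program (201701D121004); Shanxi Province Research Project for Returned Overseas Scholars (2017-020); Shanxi Province Higher Education Teaching Reform and Innovation Project (J2017002)
Language: Chinese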
File name: JYRJ201901051
Page count: 7
Issue: 01
CN: 31-1260/TP
Pages: 290-296

Abstract

Determining the number of clusters is an unavoidable problem when clustering mixed data. Building on the Liang k-prototype algorithm, this paper introduces attribute weights and redefines three quantities for mixed data: the sum of between-cluster entropies in the absence of a cluster (SBAE_M), the validity index (CUM), and the dissimilarity measure. On this basis, it proposes a weighted algorithm for determining the number of clusters in mixed data. The basic idea is as follows: the new k-prototype algorithm clusters the mixed data; CUM and SBAE_M are computed for the clustering result; the worst cluster is removed and its objects are reassigned using the new dissimilarity measure; and the number of clusters at which CUM reaches its maximum is taken as the number of clusters. Experiments on five UCI data sets verify the effectiveness of the algorithm.
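
The reassignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the dissimilarity is taken to be a weighted squared Euclidean distance on numeric attributes plus a weighted simple-matching term on categorical attributes (in the style of k-prototype), and the function names, the `weights` list, and the toy data are hypothetical. CUM and SBAE_M are not reproduced here because their definitions are not stated in this record; the index of the worst cluster is simply passed in.

```python
def weighted_mixed_dissimilarity(obj, proto, num_idx, cat_idx, weights):
    """Assumed weighted dissimilarity between an object and a cluster prototype.

    Numeric attributes contribute weight * squared difference; categorical
    attributes contribute their weight when the values differ. This is an
    illustrative stand-in, not the measure defined in the paper.
    """
    numeric = sum(weights[j] * (obj[j] - proto[j]) ** 2 for j in num_idx)
    categorical = sum(weights[j] for j in cat_idx if obj[j] != proto[j])
    return numeric + categorical


def drop_cluster_and_reassign(data, labels, prototypes, worst,
                              num_idx, cat_idx, weights):
    """Remove cluster `worst` and move its objects to the nearest remaining prototype."""
    remaining = [k for k in prototypes if k != worst]
    new_labels = list(labels)
    for i, obj in enumerate(data):
        if labels[i] == worst:
            new_labels[i] = min(
                remaining,
                key=lambda k: weighted_mixed_dissimilarity(
                    obj, prototypes[k], num_idx, cat_idx, weights),
            )
    return new_labels


# Toy data (hypothetical): one numeric attribute and one categorical attribute.
data = [(1.0, "a"), (1.2, "a"), (5.0, "b"), (5.1, "b"), (3.0, "a")]
labels = [0, 0, 1, 1, 2]
prototypes = {0: (1.1, "a"), 1: (5.05, "b"), 2: (3.0, "a")}
weights = [0.5, 0.5]  # assumed attribute weights

# Cluster 2 is treated as the worst cluster here purely for the example.
print(drop_cluster_and_reassign(data, labels, prototypes, 2, [0], [1], weights))
```

In a full procedure along the lines of the abstract, this step would be repeated while recording the validity index for each number of clusters, and the number of clusters with the largest index value would be selected.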
