Abstract
In fuzzy c-means-type clustering algorithms, the fuzzifier controls the degree to which clusters may overlap, but it has the side effect that all data objects tend to influence all clusters. To address this issue, Klawonn and Höppner proposed replacing the fuzzifier with a fuzzy function (the KH algorithm); however, their method was designed only for numeric data. In many real-world applications, data objects are described by both numeric and categorical attributes. For such mixed data, this paper proposes a novel weighted fuzzy clustering algorithm based on fuzzy centroids (FWFC). First, the mean is combined with the fuzzy centroid to represent cluster centers under mixed attributes. Then, a measure that accounts for the role of each attribute in the clustering process is used to evaluate the dissimilarity between data objects and cluster centers. Finally, the algorithm framework is given. The proposed algorithm was tested in a series of experiments on three mixed-attribute datasets, and the results show that it outperforms traditional clustering algorithms.
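The abstract does not give the formulas, so the following is only a minimal sketch of how a mixed-attribute cluster center and an attribute-weighted dissimilarity could be computed. It assumes the numeric part of a center is a membership-weighted mean, the categorical part is a fuzzy centroid in the sense of reference [14] (a weight distribution over the categories of each attribute), and the categorical distance is one minus the weight of the matching category. The names `mixed_center`, `dissimilarity`, and `attr_weights` are hypothetical, and the attribute weights are taken as given rather than learned.

```python
def mixed_center(X_num, X_cat, memberships):
    """Build one cluster center from numeric rows, categorical rows and memberships.

    Assumption: `memberships` are already the (transformed) membership
    degrees of each object in this cluster; they are normalized to sum to 1.
    """
    total = sum(memberships)
    w = [u / total for u in memberships]
    # Numeric part: membership-weighted mean of every numeric attribute.
    mean = [sum(wi * row[j] for wi, row in zip(w, X_num))
            for j in range(len(X_num[0]))]
    # Categorical part: for each attribute, a distribution over its categories.
    centroid = []
    for column in zip(*X_cat):
        dist = {}
        for value, wi in zip(column, w):
            dist[value] = dist.get(value, 0.0) + wi
        centroid.append(dist)
    return mean, centroid


def dissimilarity(x_num, x_cat, center, attr_weights):
    """Attribute-weighted dissimilarity between one object and one center.

    Assumption: squared difference for numeric attributes, and
    1 - (centroid weight of the object's category) for categorical ones.
    `attr_weights` lists numeric attribute weights first, then categorical.
    """
    mean, centroid = center
    d = 0.0
    for j, (value, m) in enumerate(zip(x_num, mean)):
        d += attr_weights[j] * (value - m) ** 2
    offset = len(mean)
    for j, (value, dist) in enumerate(zip(x_cat, centroid)):
        d += attr_weights[offset + j] * (1.0 - dist.get(value, 0.0))
    return d


# Two objects with one numeric and one categorical attribute each.
center = mixed_center([[1.0], [3.0]], [["a"], ["b"]], [0.75, 0.25])
distance = dissimilarity([1.5], ["a"], center, [1.0, 1.0])
```

In this toy call the object matches the numeric mean exactly, so its dissimilarity comes only from the categorical attribute. Raising or lowering an entry of `attr_weights` changes how strongly that attribute contributes.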
References
[1]CELEBI M E,KINGRAVI H A,VELA P A.A comparative study of efficient initialization methods for the k-means clustering algorithm[J].Expert Systems with Applications,2013,40(1):200-210.
[2]BORDOGNA G,PASI G.A quality driven hierarchical data divisive soft clustering for information retrieval[J].Knowledge-Based Systems,2012,26:9-19.
[3]LI T,CORCHADO J M,SUN S,et al.Clustering for filtering:Multi-object detection and estimation using multiple/massive sensors[J].Information Sciences,2017,388-389:172-190.
[4]VERMA H,AGRAWAL R K,SHARAN A.An improved intuitionistic fuzzy c-means clustering algorithm incorporating local information for brain image segmentation[J].Applied Soft Computing,2016,46:543-557.
[5]SAEED F,SALIM N,ABDO A.Information theory and voting based consensus clustering for combining multiple clusterings of chemical structures[J].Molecular Informatics,2013,32(7):591-598.
[6]HUANG Z.Extensions to the k-means algorithm for clustering large data sets with categorical values[J].Data Mining and Knowledge Discovery,1998,2(3):283-304.
[7]ZHANG X,MEI C,CHEN D,et al.Feature selection in mixed data:A method using a novel fuzzy rough set-based information entropy[J].Pattern Recognition,2016,56(1):1-15.
[8]HUANG Z.Clustering large data sets with mixed numeric and categorical values[C]∥Proceedings of the first Pacific-Asia Conference on Knowledge Discovery and Data Mining.1997:21-34.
[9]LI C,BISWAS G.Unsupervised learning with mixed numeric and nominal data[J].IEEE Transactions on Knowledge and Data Engineering,2002,14(4):673-690.
[10]FOSS A,MARKATOU M,RAY B,et al.A semiparametric method for clustering mixed data[J].Machine Learning,2016,105(3):419-458.
[11]BAI L,LIANG J Y,DANG C,et al.A cluster centers initialization method for clustering categorical data[J].Expert Systems with Applications,2012,39(9):8022-8029.
[12]PANG T J,LIANG J Y.Clustering Ensemble Algorithm for Large-scale Mixed Data Based on Sampling[J].Computer Science,2016,43(9):209-212.(in Chinese)
[13]PANG T J,ZHAO X W.Algorithm to Determine Number of Clusters for Mixed Data Based on Prior Information[J].Computer Science,2016,43(2):101-104.(in Chinese)
[14]KIM D W,LEE K H,LEE D.Fuzzy clustering of categorical data using fuzzy centroids[J].Pattern Recognition Letters,2004,25(11):1263-1271.
[15]AHMAD A,DEY L.Algorithm for fuzzy clustering of mixed data with numeric and categorical attributes[M]∥Distributed Computing and Internet Technology.Berlin:Springer Berlin Heidelberg,2005:561-572.
[16]LEE M,PEDRYCZ W.The fuzzy c-means algorithm with fuzzy p-mode prototypes for clustering objects having mixed features[J].Fuzzy Sets and Systems,2009,160(24):3590-3600.
[17]CHATZIS S P.A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional[J].Expert Systems with Applications,2011,38(7):8684-8689.
[18]KLAWONN F,HÖPPNER F.What Is Fuzzy about Fuzzy Clustering?Understanding and Improving the Concept of the Fuzzifier[M]∥Advances in Intelligent Data Analysis V.Berlin:Springer Berlin Heidelberg,2003:254-264.
[19]AHMAD A,DEY L.A k-mean clustering algorithm for mixed numeric and categorical data[J].Data & Knowledge Engineering,2007,63(2):503-527.
[20]WITTEN I H,FRANK E.Data Mining:Practical Machine Learning Tools and Techniques with Java Implementations[M].San Francisco:Morgan Kaufmann Publishers,1999.
[21]HUANG Z X,NG M K.A fuzzy k-modes algorithm for clustering categorical data[J].IEEE Transactions on Fuzzy Systems,1999,7(4):446-452.
1)http://archive.ics.uci.edu/ml/datasets.html