Improved cluster center initialization method for clustering categorical data

Title: Improved cluster center initialization method for clustering categorical data
Authors: WANG Sijie (王思杰); TANG Yan (唐雁)
Affiliation: College of Computer & Information Science, Southwest University (西南大学计算机与信息科学学院)
Keywords: fuzzy K-modes algorithm; distance; density; initial cluster center; outlier detection
Journal code: JSJY
Journal: Journal of Computer Applications
Publication date: 2018-06-30
Publisher: 计算机应用 (Journal of Computer Applications)
Year: 2018
Volume: 38
Funding: Fundamental Research Funds for the Central Universities (XDJK2015C110)
Language: Chinese
Record ID: JSJY2018S1018
Page count: 4
Issue: S1
CN: 51-1307/TP
Pages: 78-81

Abstract

The fuzzy K-modes algorithm is an effective clustering method for categorical data, but its performance depends heavily on the choice of initial cluster centers. To address this sensitivity, an improved initial center selection method based on distance and outlier detection is proposed. First, the weight given to distance in the initial center selection process is increased so that the selected initial centers are more widely distributed. Then, a distance-based outlier detection technique is used to further screen the initial centers, preventing outliers from being chosen as initial centers. Comparative experiments show that the improved method raises the success rate of initial center selection for categorical data and achieves relatively high accuracy.
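
The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names, not the authors' method. Everything specific here is an assumption made for illustration: the simple-matching (Hamming) dissimilarity, the density definition, the k-nearest-neighbor outlier score, the exponent `alpha` used to give distance more weight, and the names `hamming_matrix` and `select_initial_centers`.

```python
import numpy as np


def hamming_matrix(X):
    """Pairwise simple-matching dissimilarity: count of mismatched attributes."""
    X = np.asarray(X)
    return (X[:, None, :] != X[None, :, :]).sum(axis=2)


def select_initial_centers(X, k, alpha=2.0, n_neighbors=3, outlier_frac=0.1):
    """Pick k row indices of X as initial centers (illustrative only).

    alpha > 1 raises the weight of distance relative to density (assumption).
    The top outlier_frac of objects by outlier score are never chosen.
    """
    X = np.asarray(X)
    n, m = X.shape
    D = hamming_matrix(X).astype(float) / m  # normalized to [0, 1]

    # Density (assumed definition): mean similarity to all other objects.
    density = 1.0 - D.sum(axis=1) / (n - 1)

    # Distance-based outlier score (assumed definition): mean distance to the
    # n_neighbors nearest other objects; column 0 is the zero self-distance.
    nearest = np.sort(D, axis=1)[:, 1:n_neighbors + 1]
    score = nearest.mean(axis=1)
    n_out = int(np.floor(outlier_frac * n))
    candidate = np.ones(n, dtype=bool)
    candidate[np.argsort(-score, kind="stable")[:n_out]] = False

    # First center: the densest object that is not flagged as an outlier.
    centers = [int(np.argmax(np.where(candidate, density, -np.inf)))]

    # Remaining centers: far from the chosen centers and still dense.
    while len(centers) < k:
        min_dist = D[:, centers].min(axis=1)
        merit = np.where(candidate, (min_dist ** alpha) * density, -np.inf)
        centers.append(int(np.argmax(merit)))
    return centers


# Toy data: two groups of three objects plus one object unlike all others.
X = [
    ["a", "x", "p"], ["a", "x", "p"], ["a", "x", "q"],
    ["b", "y", "r"], ["b", "y", "r"], ["b", "y", "s"],
    ["c", "z", "t"],
]
centers = select_initial_centers(X, k=2, n_neighbors=2, outlier_frac=0.15)
print(centers)
```

On this toy data the two selected indices fall in different groups and the isolated last row is excluded. The returned rows could then seed a fuzzy K-modes run in place of random initial centers.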
