Improvement of FCM algorithm based on Spark

Original title (Chinese): 基于Spark的模糊C均值算法改进
Authors: XIA Xing (夏邢); XUE Tao (薛涛); LI Ting (李婷)
Keywords: fuzzy C-means; Canopy algorithm; Mahalanobis distance; Spark; parallelization
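Chinese journal title: XBFZ
English journal title: Journal of Xi'an Polytechnic University
Institution: School of Computer Science, Xi'an Polytechnic University
Publication date: 2019-03-05 13:16
Publisher: Journal of Xi'an Polytechnic University
Year: 2019
Issue: v.32; No.155
Funding: General Project of the Natural Science Basic Research Program of Shaanxi Province (2018JQ6103)
Language: Chinese
Pages: XBFZ201901019
Page count: 6
CN: 01
ISSN: 61-1471/N
Classification code: 104-109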

Abstract

The fuzzy C-means (FCM) algorithm is a clustering algorithm widely used in big data analysis. Because both the clustering results and the clustering speed of FCM depend heavily on the initial cluster centers, this paper proposes an improved algorithm, Canopy-FCMBM. First, the Canopy algorithm is used to generate the cluster centers and the number of clusters, and these results serve as the initial cluster centers of FCM. This addresses two problems: the number of clusters is hard to determine, and randomly chosen initial centers tend to produce locally optimal solutions. Second, because the data are multidimensional and unevenly distributed, the distance measure in the FCM objective function is changed from the Euclidean distance to the Mahalanobis distance. Finally, Canopy-FCMBM is parallelized with the Spark programming model to improve execution efficiency. Experimental results show that, compared with the traditional FCM algorithm, the Spark-based Canopy-FCMBM algorithm increases clustering accuracy by 12.7% and clustering speed by 1.35 times, giving a better clustering result.
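The pipeline summarized in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' implementation: the function names, the single Canopy threshold `t2`, the use of canopy seeds as initial centers, and the fixed, precomputed inverse covariance matrix `inv_cov` are all assumptions made for illustration, and Spark itself is not used. The per-point membership computation corresponds to what a map step could do, and the per-cluster weighted sums to what a reduce step could do.

```python
import math

def canopy_centers(points, t2):
    """Pick canopy seeds: each seed drops all remaining points within t2 of it.
    (Assumption: seeds are used directly as initial centers.)"""
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        centers.append(seed)
        remaining = [p for p in remaining if math.dist(seed, p) > t2]
    return centers

def mahalanobis_sq(x, v, inv_cov):
    """Squared Mahalanobis distance (x - v)^T * inv_cov * (x - v).
    inv_cov is assumed to be a precomputed inverse covariance matrix."""
    d = [a - b for a, b in zip(x, v)]
    n = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))

def memberships(x, centers, inv_cov, m=2.0):
    """FCM membership of one point in every cluster (a per-point, map-like step)."""
    d2 = [mahalanobis_sq(x, v, inv_cov) for v in centers]
    if 0.0 in d2:
        # The point lies exactly on one or more centers: split membership among them.
        hits = d2.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in d2]
    power = 1.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in d2) for di in d2]

def update_centers(points, u, m=2.0):
    """Weighted-mean center update (the per-cluster sums are reduce-like)."""
    dim = len(points[0])
    new_centers = []
    for i in range(len(u[0])):
        w = [row[i] ** m for row in u]
        total = sum(w)
        new_centers.append(tuple(
            sum(wj * p[k] for wj, p in zip(w, points)) / total
            for k in range(dim)
        ))
    return new_centers

def canopy_fcm(points, t2, inv_cov, m=2.0, iterations=20):
    """Canopy seeding followed by a fixed number of FCM iterations."""
    centers = canopy_centers(points, t2)
    for _ in range(iterations):
        u = [memberships(p, centers, inv_cov, m) for p in points]
        centers = update_centers(points, u, m)
    # Final memberships with respect to the last set of centers.
    u = [memberships(p, centers, inv_cov, m) for p in points]
    return centers, u

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    identity = [[1.0, 0.0], [0.0, 1.0]]  # reduces to squared Euclidean distance
    centers, u = canopy_fcm(data, t2=3.0, inv_cov=identity)
    print(centers)
```

With the identity matrix as `inv_cov` the distance reduces to the squared Euclidean distance, so replacing it with the inverse covariance of the data is what changes the metric. The number of clusters is never passed in; it falls out of how many canopy seeds survive for the chosen `t2`.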

References

[1] FAN M, TIAN Z, ZHAO W. Unified framework of the FCM-type clustering algorithm and its kernel version[J]. Electronic Design Engineering, 2013, 21(4): 134-136. (in Chinese)
[2] ZHANG J, WANG X D, XUE H. K-means clustering algorithm based on flower pollination algorithm[J]. Basic Sciences Journal of Textile Universities, 2016, 29(4): 563-569. (in Chinese)
[3] ESTEVES R M, RONG C. Using Mahout for clustering Wikipedia's latest articles: A comparison between K-means and fuzzy C-means in the cloud[C]//IEEE Third International Conference on Cloud Computing Technology & Science. USA: IEEE, 2012: 565-569.
[4] YU Q, DING Z. Improved Canopy-FCM algorithm based on MapReduce[C]//International Congress on Image & Signal Processing. USA: IEEE, 2017: 1975-1979.
[5] DAI W, YU C J, JIANG Z L. An improved hybrid Canopy-Fuzzy C-Means clustering algorithm based on MapReduce model[J]. Journal of Computing Science and Engineering. USA: IEEE, 2016, 10(1): 1-8.
[6] WANG G L, ZHOU G L, SACHURILA, et al. Parallel fuzzy C-means clustering algorithm in Spark[J]. Journal of Computer Applications, 2016, 36(2): 342-347. (in Chinese)
[7] FENG Q P, LI X Y. Traffic state recognition based on MapReduce and clustering algorithm[J]. Information Technology, 2017(5): 1-6. (in Chinese)
[8] LI Q, ZHANG X, ZHANG P K, et al. Modified Canopy-Kmeans parallel algorithm based on density peaks[J]. Communications Technology, 2018, 51(2): 312-317. (in Chinese)
[9] SHENG L, ZOU K Q, DENG G N. An initialization method for fuzzy c-means clustering algorithm based on grid and density[J]. Computer Applications and Software, 2008, 25(3): 22-23. (in Chinese)
[10] ZU Z W, LI Q. KM-FCM: A fuzzy clustering optimization algorithm based on Mahalanobis distance[J]. Journal of Hebei University of Science and Technology, 2018, 39(2): 159-165. (in Chinese)
[11] LIANG P. Research on fuzzy C-means clustering algorithm based on Spark[D]. Harbin: Harbin Institute of Technology, 2014: 30-36. (in Chinese)
[12] XIONG Y J, LIU W G, OU P J. New optimized fuzzy c-means clustering algorithm[J]. Computer Engineering and Applications, 2015, 51(11): 124-128. (in Chinese)
[13] GAO X B, PEI J H, XIE W X. A study of weighting exponent m in a fuzzy c-means algorithm[J]. Acta Electronica Sinica, 2000, 28(4): 80-83. (in Chinese)
[14] ZU Z W, LI Q. Research on validity index of Mahalanobis distance fuzzy clustering[J]. Journal of Shaanxi University of Technology (Natural Science Edition), 2018, 34(2): 33-38. (in Chinese)
[15] YU C J, ZHANG R. Research of FCM algorithm based on Canopy clustering algorithm under cloud environment[J]. Computer Science, 2014, 41(S2): 316-319. (in Chinese)
[16] GUO W X, XUE T, LI T. Analysis of student score and graduation destination based on Hadoop's Canopy-K-means parallel algorithm[J]. Journal of Xi'an Polytechnic University, 2018, 32(6): 705-712. (in Chinese)
[17] JIANG W, YANG T, SHOU Y H, et al. Improved evidential fuzzy C-means method[J]. Journal of Systems Engineering and Electronics, 2018, 29(1): 187-195.
[18] WANG X J, XU F T, SHAN G J. Improvement of fuzzy C-means clustering algorithm[J]. Microcomputer & Its Applications, 2010, 29(12): 42-44, 48. (in Chinese)
[19] MA Y C, WANG X F. Parallel implementation and optimization of K-means clustering based on Spark[J]. Fujian Computer, 2017, 33(11): 1-4. (in Chinese)
