K-Means Algorithm Based on the Flink Platform

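English title: Application of K-Means algorithm based on Flink platform
Authors: 蔡鲲鹏; 李澄非; 田果
Authors (English): CAI Kun-peng; LI Cheng-fei; TIAN Guo (School of Information Engineering, Wuyi University)
Keywords: K-Means; clustering; kernel density; Flink
Keywords (English): K-Means; clustering; kernel density; Flink platform
Journal code: HDZJ
Journal (English): Information Technology
Institution: School of Information Engineering, Wuyi University
Publication date: 2019-03-20
Publisher: 信息技术 (Information Technology)
Year: 2019
Language: Chinese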
Filename: HDZJ201903018
Page count: 4
Issue: 03
CN: 23-1557/TN
Pages: 83-86

Abstract

To address the sensitivity of the K-Means algorithm to the choice of K in data clustering, the algorithm is improved with kernel density, exploiting the characteristics of the density distribution. To meet the demands of big data processing, the algorithm is parallelized on the Flink platform. The improved algorithm is evaluated by comparing precision and sum of squared errors (SSE), as well as speedup and task execution time under serial and cluster processing. The results show that, combined with the parallel computing advantages of the Flink platform, the algorithm is stable and efficient when processing massive datasets.
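
The abstract does not spell out the procedure, so the following is only a minimal serial sketch of one common way kernel density can replace a hand-picked K: estimate a Gaussian kernel density at every point, take well-separated high-density points as initial centers (their count becomes K), then run ordinary K-Means and report the SSE. The bandwidth `h`, the separation `min_sep`, the threshold `min_density_ratio`, and all function names are hypothetical choices for illustration, not the authors' method, and nothing here uses Flink.

```python
import math


def kernel_density(points, h):
    """Unnormalised Gaussian kernel density at each point (O(n^2) pairwise)."""
    return [
        sum(math.exp(-math.dist(p, q) ** 2 / (2.0 * h * h)) for q in points)
        for p in points
    ]


def pick_initial_centers(points, h, min_sep, min_density_ratio=0.5):
    """Assumed heuristic: visit points from densest to sparsest and keep a
    point as a center if it is dense enough and far from every kept center."""
    dens = kernel_density(points, h)
    threshold = min_density_ratio * max(dens)
    order = sorted(range(len(points)), key=lambda i: dens[i], reverse=True)
    centers = []
    for i in order:
        if dens[i] < threshold:
            break  # remaining points are sparser still
        if all(math.dist(points[i], c) >= min_sep for c in centers):
            centers.append(points[i])
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: mean of each cluster (empty clusters keep their center).
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centers.append(c)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Toy data: two tight groups, so the density step should find two centers.
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
          (10.0, 10.0), (10.2, 9.9), (9.8, 10.1), (10.1, 10.2)]
init = pick_initial_centers(points, h=1.0, min_sep=3.0)
centers, labels = kmeans(points, init)
print(len(centers), labels, sse(points, centers, labels))
```

In a parallel setting the assignment step is the natural part to distribute, since each point's nearest center can be computed independently; the speedup reported in such comparisons is conventionally the serial run time divided by the parallel run time.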

References

[1] Arlia D, Coppola M. Experiments in Parallel Clustering with DBSCAN[C]. Euro-Par 2001 Parallel Processing, 2001: 326-331.
[2] Ankerst M, Breunig M M, Kriegel H P, et al. OPTICS: ordering points to identify the clustering structure[C]. SIGMOD'99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data. Philadelphia, Pennsylvania, 1999: 49-60.
[3] Hinneburg A, Gabriel H H. DENCLUE 2.0: Fast Clustering Based on Kernel Density Estimation[C]. Advances in Intelligent Data Analysis VII, 2007: 70-80.
[4] Gan Wenyan, Li Deyi. Hierarchical clustering algorithm based on kernel density estimation[J]. Journal of System Simulation, 2004(2): 302-305, 309. (in Chinese)