Abstract
To address the memory and CPU processing-speed limitations that the traditional AGNES algorithm encounters on large volumes of data, this paper proposes a parallel AGNES algorithm based on the MapReduce framework and presents its main design. The Map phase initializes the clusters; the Reduce phase computes the inter-cluster distances and merges the closest clusters. To account for correlations between attributes, Mahalanobis distance replaces Euclidean distance in the inter-cluster distance calculation. Finally, datasets of different sizes are used to evaluate the speedup and scalability of the improved algorithm. The experimental results show that the MapReduce-based parallel AGNES algorithm is suitable for analyzing and mining large volumes of data.
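The two phases described in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the function names (`map_phase`, `reduce_phase`, `mahalanobis`, and the helpers) are hypothetical, the two phases are plain Python functions rather than real MapReduce jobs, and two choices that the abstract does not specify are assumed here, namely that cluster distance is measured between centroids and that the covariance matrix is estimated once from the whole dataset.

```python
import math
from itertools import combinations


def covariance(points):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def invert(m):
    """Gauss-Jordan inverse with partial pivoting; assumes m is non-singular."""
    d = len(m)
    aug = [list(row) + [float(i == j) for j in range(d)] for i, row in enumerate(m)]
    for c in range(d):
        pivot = max(range(c, d), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(d):
            if r != c:
                factor = aug[r][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    return [row[d:] for row in aug]


def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T * inv_cov * (x - y)); Euclidean when inv_cov is the identity."""
    diff = [a - b for a, b in zip(x, y)]
    d = len(diff)
    q = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(d) for j in range(d))
    return math.sqrt(max(q, 0.0))  # clamp tiny negative rounding errors


def centroid(cluster):
    n = len(cluster)
    return [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]


def map_phase(points):
    """Map (assumed form): emit every point as its own singleton cluster."""
    return [[p] for p in points]


def reduce_phase(clusters, k, inv_cov):
    """Reduce (assumed form): merge the closest pair until k clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: mahalanobis(centroid(clusters[ij[0]]),
                                       centroid(clusters[ij[1]]), inv_cov),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Made-up data: the first attribute is on a much larger scale than the second.
    data = [(1000.0, 1.0), (1200.0, 1.1), (900.0, 5.0), (1100.0, 5.2)]
    inv_cov = invert(covariance(data))
    for cluster in reduce_phase(map_phase(data), 2, inv_cov):
        print(cluster)
```

Passing the inverse covariance matrix as a parameter keeps the distance function independent of how the covariance is estimated; supplying the identity matrix reduces it to Euclidean distance, which makes the two metrics easy to compare on the same data.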