Parallel Agnes Algorithm Based on MapReduce

Authors: ZHANG Guo-guang; GONG Xiu-gang; YU Xu-dong; FENG Shao-wen
Keywords: MapReduce; parallel Agnes; massive data; Mahalanobis distance
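Chinese journal title: KJSJ
English journal title: Science & Technology Vision
Institution: School of Computer Science and Technology, Shandong University of Technology
Publication date: 2018-04-05
Publisher: Science & Technology Vision (科技视界)
Year: 2018
Issue: No.232
Language: Chinese
Page: KJSJ201810051
Page count: 3
CN: 10
ISSN: 31-2065/N
Classification number: 118-120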

Abstract

To address the memory and CPU processing-speed limits that the traditional Agnes algorithm runs into on massive data, this paper proposes a parallel Agnes algorithm based on the MapReduce framework and describes its main design. The Map phase initializes the clusters; the Reduce phase computes the distances between clusters and merges the closest pair. To account for correlations among attributes, the Mahalanobis distance replaces the Euclidean distance when computing inter-cluster distances. Finally, data sets of different sizes are used to evaluate the speedup and scalability of the improved algorithm. The experimental results show that the parallel Agnes algorithm based on MapReduce is suitable for analyzing and mining massive data.
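
The division of work described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not state the linkage criterion or how the covariance matrix is obtained, so the sketch assumes centroid linkage and one global sample covariance matrix computed up front. All function names (`map_phase`, `reduce_phase`, `agnes`, and the helpers) are hypothetical, and plain Python lists stand in for the key/value streams of a real MapReduce job.

```python
from itertools import combinations
from math import sqrt


def covariance(points):
    """Sample covariance matrix of a list of equal-length tuples."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def invert(m):
    """Gauss-Jordan inverse with partial pivoting; assumes m is non-singular."""
    d = len(m)
    # Augment m with the identity matrix, then reduce the left half to identity.
    a = [list(row) + [float(i == j) for j in range(d)] for i, row in enumerate(m)]
    for c in range(d):
        pivot = max(range(c, d), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        a[c] = [v / a[c][c] for v in a[c]]
        for r in range(d):
            if r != c:
                a[r] = [vr - a[r][c] * vc for vr, vc in zip(a[r], a[c])]
    return [row[d:] for row in a]


def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T S^-1 (x - y)) for a given inverse covariance S^-1."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    d = len(diff)
    quad = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(d) for j in range(d))
    return sqrt(max(0.0, quad))  # guard against tiny negative rounding error


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def map_phase(points):
    """Map: initialize the clustering with one singleton cluster per point."""
    return [[tuple(p)] for p in points]


def reduce_phase(clusters, inv_cov):
    """Reduce: compute inter-cluster distances and merge the closest pair.

    Assumption: clusters are compared by the Mahalanobis distance between
    their centroids (centroid linkage).
    """
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: mahalanobis(centroid(clusters[ij[0]]),
                                          centroid(clusters[ij[1]]), inv_cov))
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


def agnes(points, k):
    """Run one Map step, then repeat the Reduce step until k clusters remain."""
    inv_cov = invert(covariance(points))  # assumption: one global covariance
    clusters = map_phase(points)
    while len(clusters) > k:
        clusters = reduce_phase(clusters, inv_cov)
    return clusters


if __name__ == "__main__":
    # Two attributes on very different scales; four tight pairs of points.
    data = [(0.0, 0.0), (0.2, 2.0), (10.0, 0.0), (10.2, 2.0),
            (0.0, 100.0), (0.2, 102.0), (10.0, 100.0), (10.2, 102.0)]
    for cluster in agnes(data, 4):
        print(cluster)
```

Because the distance is scaled by the inverse covariance matrix, the second attribute in the sample data does not dominate the result even though its numeric range is ten times larger. In an actual MapReduce deployment each merge round would typically be a separate job, which this sketch does not model.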

