基于Spark的FP_Growth算法的并行与优化

英文篇名：Parallelization and optimization of FP_Growth algorithm based on Spark
作者：石陆魁 ; 张欣 ; 师胜利
英文作者：SHI Lukui;ZHANG Xin;SHI Shengli;School of Computer Science and Software, Hebei University of Technology;School of Information Technology, Hebei Normal University;
关键词：FP_Growth算法 ; 频繁项集挖掘 ; 负载均衡 ; 链头表结构 ; Spark
英文关键词：FP_Growth algorithm;;frequent itemset mining;;load balance;;head table;;Spark
中文刊名：JSGG
英文刊名：Computer Engineering and Applications
机构：河北工业大学计算机科学与软件学院;河北师范大学信息技术学院;
出版日期：2017-08-11 13:05
出版单位：计算机工程与应用
年：2018
期：v.54;No.908
基金：河北省自然科学基金(No.F2016202144,No.F2017202145)
语种：中文;
页：JSGG201813008
页数：8
CN：13
分类号：58-64+116

摘要

PFP_Growth算法是FP_Growth算法在Hadoop平台上基于MapReduce的并行化,该算法在分组过程中没有考虑负载均衡问题,导致各个节点完成任务时间不一致,甚至相差很大,从而降低了算法的执行效率。为了提高算法的执行效率,提出了一种基于Spark的RPFP算法,该算法对PFP_Growth算法在均衡分组和降低时间复杂度两方面进行优化,通过把负载大的项放在负载总和最小的组里面实现均衡分组,通过在链头表结构中加入一张哈希表达到快速访问元素地址的目的,从而降低时间复杂度。实验结果表明,RPFP通过优化PFP算法,有效提高了频繁项集的挖掘效率。
PFP_Growth algorithm is the parallelization of FP_Growth algorithm on the Hadoop platform based on MapReduce. The algorithm does not consider the balance of the load while grouping the transaction set, which causes the time inconsistency of different nodes to accomplish the tasks and even a bigger difference. Thus, it reduces the efficiency of the algorithm. To improve the efficiency of the algorithm, this paper proposes a Spark-based RPFP algorithm, which optimizes PFP_Growth algorithm through balancing the groups and reducing the time complexity. To balance the group,the large load is placed into the group with the smallest total load. The address of the element is fast accessed by adding a Hash table to the head table, which reduces the time complexity. Experimental results show that RPFP algorithm effectively improves the mining efficiency of the frequent itemsets.

引文

[1]Han Jiawei,Pei Jian,Yin Yiwen.Mining frequent patterns without candidate generation[C]//Proceedings of the 2000ACM SIGMOD International Conference on Management of Data,2000:1-12.
    [2]Liu Zhenyu,Xu Weixiang,Liu Xurnin.Efficiently using matrix in mining maximum frequent itemset[C]//Proceedings of International Conference on Knowledge Discovery and Data Mining,2010:50-54.
    [3]Essalmi H,El Far M,Mohajir M E,et al.A novel approach for mining frequent itemsets:Apriori Min[C]//Proceedings of 2016 4th IEEE International Colloquium on Information Science and Technology(Ci St),2016:24-26.
    [4]刘步中.基于频繁项集挖掘算法的改进与研究[J].计算机应用研究,2012,29(2):475-477.
    [5]Otey M E,Wang C,Parthasaratlty S,et al.Mining frequent itemsets in distributed and dynamic databases[C]//Proceedings of IEEE International Conference on Data Mining,2003:617-620.
    [6]Kaosar M G,Xu Zhuojia,Yi X.Distributed association rule mining with minimum communication overhead[C]//Proceedings of Eighth Australasian Data Mining Conference,2009:17-23.
    [7]Li H,Wang Y,Zhang D,et al.PFP:Parallel FP_Growth for query recommendation[C]//Proceedings of the 2008ACM Conference on Recommender Systems,2008:107-114.
    [8]Chen Xingshu,Zhang Shuai,Tong Tao,et al.FP-Growth Algorithm based on boolean matrix and Map Reduce[J].Journal of South China University of Technology,2014,42(1):135-141.
    [9]江雨燕,李平.基于PFP-Growth算法的海量频繁项集挖掘[J].计算机技术与发展,2013(9):63-65.
    [10]Qiu Hongjian,Gu Rong,Yuan Chunfeng,et al.YAFIM:A parallel frequent itemset mining algorithm with Spark[C]//Proceedings of Parallel&Distributed Processing Symposium Workshops,2014:1664-1671.
    [11]Zheng Jianhua,Zhang Liangjie,Zhu Rong,et al.Parallel matrix multiplication algorithm based on vector linear combination using Map Reduce[C]//Proceedings of IEEE Ninth World Congress on Services,2013.
    [12]陈明洁.分布式频繁项集挖掘算法[J].计算机应用与软件,2015(10):63-66.
    [13]Zaharia M.Architecture for fast and general data processing on large clusters,Technical Report No.UCBIEECS-2014-12[R].2014.
    [14]Zaharia M,Chowdhury M,Das T,et al.Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation,2012.
    [15]Joy R,Sherly K K.Parallel frequent itemset mining with spark RDD framework for disease prediction[C]//Proceedings of International Conference on Circuit,2016:1-5.
    [16]Zhang F,Liu M,Gui F,et al.A distributed frequent itemset mining algorithm using spark for big data analytics[J].Cluster Computing,2015,18(4):1493-1501.
    [17]曹博,倪建成,李淋淋,等.基于Spark的并行频繁模式挖掘算法[J].计算机工程与应用,2016,52(20):86-91.
    [18]方向,张功萱.基于Spark的PFP-Growth并行算法优化实现[J].现代电子技术,2016,39(8):9-13.
    [19]Frequent Itemset Mining Dataset Repository[EB/OL].(2012-10-21).http://fimi.ua.ac.be/data/.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700