基于Spark的并行FP-Growth算法优化及实现

英文篇名：Optimization and implementation of parallel FP-Growth algorithm based on Spark
作者：顾军华 ; 武君艳 ; 许馨匀 ; 谢志坚 ; 张素琪
英文作者：GU Junhua;WU Junyan;XU Xinyun;XIE Zhijian;ZHANG Suqi;School of Artificial Intelligence and Data Science,Hebei University of Technology;Hebei Province Key Laboratory of Big Data Computing ( Hebei University of Technology);School of Information Engineering,Tianjin University of Commerce;
关键词：大数据平台 ; 关联规则 ; 频繁项集 ; 频繁模式增长算法 ; Spark
英文关键词：big data platform;;association rule;;frequent itemset;;Frequent Pattern-Growth(FP-Growth) algorithm;;Spark
中文刊名：JSJY
英文刊名：Journal of Computer Applications
机构：河北工业大学人工智能与数据科学学院;河北省大数据计算重点实验室(河北工业大学);天津商业大学信息工程学院;
出版日期：2018-07-19 10:39
出版单位：计算机应用
年：2018
期：v.38;No.339
基金：河北省科技计划项目(17210305D);; 天津市科技计划项目(16ZXHLSF0023);天津市科技计划项目(15ZXHLGX00130);; 天津市自然科学基金资助项目(15JCQNJC00600)~~
语种：中文;
页：JSJY201811005
页数：6
CN：11
ISSN：51-1307/TP
分类号：23-28

摘要

为了进一步提高在Spark平台上的频繁模式增长(FP-Growth)算法执行效率,提出一种新的基于Spark的并行FP-Growth算法——BFPG。首先,从频繁模式树(FP-Tree)规模大小和分区计算量对F-List分组策略进行改进,保证每个分区负载总和近似相等;然后,通过创建列表P-List对数据集划分策略进行优化,减少遍历次数,降低时间复杂度。实验结果表明,BFPG算法提高了并行FP-Growth算法挖掘效率,且算法具有良好的扩展性。
In order to further improve the execution efficiency of Frequent Pattern-Growth(FP-Growth) algorithm on Spark platform,a new parallel FP-Growth algorithm based on Spark,named BFPG(Better Frequent Pattern-Growth),was presented. Firstly,the grouping strategy F-List was improved in the size of the Frequent Pattern-Tree(FP-Tree) and the amount of partition calculation to ensure that the load sum of each partition was approximately equal. Secondly,the data set partitioning strategy was optimized by creating a list P-List,and then the time complexity was reduced by reducing the traversal times. The experimental results show that the BFPG algorithm improves the mining efficiency of the parallel FP-Growth algorithm,and the algorithm has good scalability.

引文

[1]张素琪,梁志刚,胡利娟,等.改进的多维关联规则算法研究及应用[J].计算机工程与科学,2012,34(9):174-179.(ZHANGS Q,LIANG Z G,HU L J,et al.Research and application of improved multi-dimensional association rule algorithm[J].Computer Engineering and Science,2012,34(9):174-179.)
    [2]HAN J,PEI J,YIN Y.Mining frequent patterns without candidate generation[C]//Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.New York:ACM,2000:1-12.
    [3]ESSALMI H,FAR M E,MOHAJIR M E,et al.A novel approach for mining frequent itemsets:Apriori Min[C]//Proceedings of the2016 4th IEEE International Colloquium on Information Science and Technology.Piscataway,NJ:IEEE,2017:286-289.
    [4]崔妍,包志强.关联规则挖掘综述[J].计算机应用研究,2016,33(2):331-334.(CUI Y,BAO Z Q.Association rule mining overview[J].Application Research of Computers,2016,33(2):331-334.)
    [5]马强,杨金民.基于MapReduce的频繁项集并行挖掘算法[J].计算机应用与软件,2015,33(9):13-17.(MA Q,YANG J M.Parallel mining algorithm of frequent item set based on MapReduce[J].Computer Applications and Software,2015,33(9):13-17.)
    [6]LI H,WANG Y,ZHANG D,et al.PFP:parallel FP-growth for query recommendation[C]//Proceedings of the 2008 ACM Conference on Recommender Systems.New York:ACM,2008:107-114.
    [7]朱文飞,齐建东,洪剑珂.Hadoop下负载均衡的频繁项集挖掘算法研究[J].计算机应用与软件,2016,33(5):36-39.(ZHUW F,QI J D,HONG J K.research on load balanced frequent itemsets mining algorithm based on Hadoop[J].Computer Applications and Software,2016,33(5):36-39.)
    [8]章志刚,吉根林.一种基于FP-Growth的频繁项目集并行挖掘算法[J].计算机工程与应用,2014,50(2):103-106.(ZHANG ZG,JI G L.Parallel algorithm for mining frequent item sets based on FP-Growth[J].Journal of Computer Engineering and Application,2014,50(2):103-106.)
    [9]ZAHNG F,LIU M,GUI F,et al.A distributed frequent itemset mining algorithm using Spark for big data analytics[J].Cluster Computing,2015,18(4):1493-1501.
    [10]DENG L L,LOU Y,et al.Improvement and research of FP-Growth algorithm based on distributed spark[C]//Proceedings of the 2015 International Conference on Cloud Computing and Big Data.Piscataway,NJ:IEEE,2015:105-108.
    [11]方向,张功萱.基于Spark的PFP-Growth并行算法优化实现[J].现代电子技术,2016,39(8):9-13.(FANG X,ZHANGG X.PFR-Growth parallel algorithm optimization based on Spark[J].Modern Electronics Technique,2016,39(8):9-13.)
    [12]LI C,HUANG X.Research on FP-Growth algorithm for massive telecommunication network alarm data based on Spark[C]//Proceedings of the 2016 IEEE International Conference on Software Engineering and Service Science.Piscataway,NJ:IEEE,2017:875-879.
    [13]张稳,罗可.一种基于Spark框架的并行FP-Growth挖掘算法[J].计算机工程与科学,2017,33(8):1403-1409.(ZHANGW,LUO K.A parallel FP-Growth mining algorithm based on spark framework[J].Computer Engineering and Science,2017,33(8):1403-1409.)
    [14]陆可,江雨燕,杜萍萍,等.基于Spark的并行FP-Growth算法优化与实现[J].计算机应用与软件,2017,34(9):273-277.(LU K,JIANG Y Y,DU P P,et al.Parallel FP-Growth algorithm optimization and implementation based on Spark[J].Computer Applications and Software,2017,34(9):273-277.)
    [15]ZHOU L,WANG X.Research of the FP-Growth algorithm based on cloud environments[J].Journal of Software,2014,9(3):676-682.
    [16]高权,万晓东.基于负载均衡的并行FP-Growth算法[J].计算机工程,2018,39(6):37-72.(GAO Q,WAN X D.Load-balanced parallel FP-Growth algorithm[J].Computer Engineering,2018,39(6):37-72.)
    [17]Frequent itemset mining dataset repository[EB/OL].[2012-10-21].http://fimi.ua.ac.be/data/.This work is partially supported by the Science-Technology Program of Hebei Province(17210305D),the Science-Technology Program of Tianjin(16ZXHLSF0023),the Natural Science Foundation of Tianjin(15JCQNJC00600),the Science-Technology Program of Tianjin(15ZXHLGX00130)GU Junhua,born in 1966,Ph.D.,professor.His research interests include data mining,intelligent information processsing,information acquisition and integration,intelligent computing and optimization,function and information display,software engineering,project management.WU Junyan,born in 1994,M.S.candidate.Her research interests include data mining,computer simulation,machine learning.XU Xinyun,born in 1995,M.S.candidate.Her research interests include sentiment analysis,natural language processing,deep learning.XIE Zhijian,born in 1995,M.S.candidate.His research interests include machine learning,data mining.ZHANG Suqi,born in 1980,Ph.D.,lecturer.Her research interests include machine learning,data mining.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700