基于项编码的分布式频繁项集挖掘算法

英文篇名：Novel distributed itemset mining algorithm based on item encoding
作者：郑静益 ; 邓晓衡
英文作者：Zheng Jingyi;Deng Xiaoheng;College of Software,Central South University;
关键词：频繁项集挖掘 ; Apriori算法 ; 大数据 ; 分布式计算
英文关键词：frequent itemset mining;;Apriori algorithm;;big data;;distributed computation
中文刊名：JSYJ
英文刊名：Application Research of Computers
机构：中南大学软件学院;
出版日期：2018-03-14 17:30
出版单位：计算机应用研究
年：2019
期：v.36;No.330
基金：中南大学研究生科研创新项目(2017zzts612)
语种：中文;
页：JSYJ201904023
页数：6
CN：04
ISSN：51-1196/TP
分类号：105-109+113

摘要

Apriori算法是解决频繁项集挖掘最常用的算法之一,但多轮迭代扫描完整数据集的计算方式,严重影响算法效率且难以并行化处理。随着数据规模的持续增大,这一问题日益严重。针对这一问题,提出了一种基于项编码和Spark计算框架的Apriori并行化处理方法——IEBDA算法,利用项编码完整保存项集信息,在不重复扫描完整数据集的情况下完成频繁项集挖掘,同时利用Spark的广播变量实现并行化处理。与其他分布式Apriori算法在不同规模的数据集上进行性能比较,发现IEBDA算法从第一轮迭代后加速效果明显。结果表明,该算法可以提高大数据环境下多轮迭代的频繁项集挖掘效率。
Apriori is one of the most widely used algorithm to discover frequent patterns. However,scanning the entire dataset in each iteration makes this algorithm inefficient and hard to be in parallel. With the size of datasets gets larger continuously,this problem is becoming more and more serious. Therefore,this paper proposed a novel algorithm called IEBDA. This algorithm was a kind of parallelization of Apriori based on item encoding and Spark framework. Saving information of each itemset by item encoding so that it could finish frequent itemset mining without scanning the whole dataset repeatedly. The broadcast variables of Spark enabled this algorithm to be in parallel. Compared with other distributed Apriori algorithms on datasets with different sizes,the acceleration of mining after the first iteration was obvious. The results show that this algorithm efficiently improves the multi-iteratively frequent itemset mining in big data environment.

引文

[1]Jiao Yabin.Research of an improved Apriori algorithm in data mining association rules[J].International Journal of Computer&Communication Engineering,2013,2(1):25-27.
    [2]Borgelt C,Kruse R.Induction of association rules:Apriori implementation[C]//Proc of Computational Statistics.Berlin:Springer-Verlag,2002:937-944.
    [3]Ye Yanbin,Chiang C C.A parallel Apriori algorithm for frequent itemsets mining[C]//Proc of International Conference on Software Engineering Research,Management and Applications.Washington DC:IEEE Computer Society,2006:87-94.
    [4]Lin Xueyan.MR-Apriori:association rules algorithm based on MapReduce[C]//Proc of IEEE International Conference on Software Engineering and Service Science.Piscataway,NJ:IEEE Press,2014:141-144.
    [5]Li Ning,Zeng Li,He Qing,et al.Parallel implementation of Apriori algorithm based on MapReduce[C]//Proc of ACIS International Conference on Software Engineering,Artificial Intelligence,Networking and Parallel&Distributed Computing.Piscataway,NJ:IEEE Press,2012:236-241.
    [6]Yu Kunming,Lee Minggong,Huang Yuanshao,et al.An efficient frequent patterns mining algorithm based on MapReduce framework[C]//Proc of International Conference on Software Intelligence Technologies and Applications&International Conference on Frontiers of Internet of Things.[S.l.]:IET,2015:1-5.
    [7]Yang Xinyue,Liu Zhen,Fu Yan.MapReduce as a programming model for association rules algorithm on Hadoop[C]//Proc of International Conference on Information Sciences and Interaction Sciences.Piscataway,NJ:IEEE Press,2010:99-102.
    [8]Dharavath R,Kumar V,Kumar C,et al.An Apriori-based vertical fragmentation technique for heterogeneous distributed database transactions[M]//Intelligent Computing,Networking,and Informatics.Berlin:Springer,2014:687-695.
    [9]田卫东,许静文.基于模糊等价类的频繁项集精简表示方法[J].计算机应用研究,2016,33(7):1936-1940.(Tian Weidong,Xu Jingwen.Concise representation of frequent itemset based on fuzzy equivalence[J].Application Research of Computers,2016,33(7):1936-1940.)
    [10]唐颖峰,陈世平.一种基于后缀项表的并行闭频繁项集挖掘算法[J].计算机应用研究,2014,31(2):373-377.(Tang Yingfeng,Chen Shiping.Parallel closed frequent itemset mining algorithm with postfix-table[J].Application Research of Computers,2014,31(2):373-377.)
    [11]谢志明,王鹏.基于MapReduce架构的并行矩阵Apriori算法[J].计算机应用研究,2017,34(2):401-404.(Xie Zhiming,Wang Peng.Parallel matrix Apriori algorithm based on MapReduce architecture[J].Application Research of Computers,2017,34(2):401-404.)
    [12]Qiu Hongjian,Gu Rong,Yuan Chunfeng,et al.YAFIM:a parallel frequent itemset mining algorithm with Spark[C]//Proc of IEEE International Parallel&Distributed Processing Symposium Workshops.Washington DC:IEEE Computer Society,2014:1664-1671.
    [13]Yang Shaosong,Xu Guoyan,Wang Zhijian,et al.The parallel improved Apriori algorithm research based on Spark[C]//Proc of the9th International Conference on Frontier of Computer Science and Technology.Piscataway,NJ:IEEE Press,2015:354-359.
    [14]Sethi K K,Ramesh D.HFIM:a Spark-based hybrid frequent itemset mining algorithm for big data processing[J].Journal of Supercomputing,2017,73(8):3652-3668.
    [15]Rathee S,Kaul M,Kashyap A.R-Apriori:an efficient apriori based algorithm on Spark[C]//Proc of Workshop in Information and Knowledge Management.New York:ACM Press,2015:27-34.
    [16]Gui Feng,Ma Yunlong,Zhang Feng,et al.A distributed frequent itemset mining algorithm based on Spark[C]//Proc of International Conference on Computer Supported Cooperative Work in Design.Piscataway,NJ:IEEE Press,2015:271-275.
    [17]Pang Shaoning,Kasabov N.Encoding and decoding the knowledge of association rules over SVM classification trees[J].Knowledge&Information Systems,2009,19(1):79-105.
    [18]邢长征,安维国,王星.垂直数据格式挖掘频繁项集算法的改进[J].计算机工程与科学,2017,39(7):1365-1370.(Xing Changzheng,An Weiguo,Wang Xing.An improved frequent itemsets mining algorithm based on vertical data format[J].Computer Engineering and Science,2017,39(7):1365-1370.)
    [19]Bell J.Apache Spark[M]//Machine Learning:Hands-on for Developers and Technical Professionals.Hokebon:Wiley,2015:275-314.
    [20]Zaharia M,Chowdhury M,Das T,et al.Resilient distributed datasets:a fault-tolerant abstraction for in-memory cluster computing[C]//Proc of USENIX Conference on Networked Systems Design and Implementation.[S.l.]:USENIX Association,2012.
    [21]Pudi V.Data mining:concepts and techniques[M].Oxford:Oxford University Press,2009.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700