基于分布式的频繁闭合模式挖掘算法研究

英文题名：The Research on the Distributed Algorithm of Mining Frequent Closed Patterns
作者：张敏
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：数据挖掘 ; 关联规则 ; 分布式算法 ; 频繁闭合模式
英文关键词：Data Mining ; Association Rules ; DistributedAlgorithm ; Frequent Closed Patterns
学位年度：2011
导师：杨君锐
学科代码：081203
学位授予单位：西安科技大学

摘要

关联规则挖掘是数据挖掘研究中的热点问题之一,其目的是发现数据库中数据项之间存在的潜在联系。关联规则挖掘的重点任务是频繁模式挖掘。然而,由于频繁模式挖掘的复杂性,业界提出了频繁闭合模式挖掘问题。频繁闭合模式可以唯一地确定所有频繁模式完全集以及它们的准确支持度,且其规模远远小于频繁模式。在单处理机上的频繁闭合模式挖掘算法研究方面,人们已经取得了许多成果。但随着分布式环境的日益普遍,使得传统串行算法的挖掘技术已无法解决分布式下的挖掘问题,因此,研究高性能的分布式频繁闭合模式挖掘算法显得尤为重要。
     本文在对典型关联规则挖掘算法进行较深入研究的基础上,将分布式思想引入关联规则挖掘中,提出了两种分布式频繁闭合模式挖掘算法,主要内容有以下两部分:
     第一部分提出了一种基于分布式的频繁闭合模式挖掘算法-PFCI_Miner。算法采用任务分布的主从方式,其中主处理器通过发送文中提出的前缀路径表(PrePthx)将挖掘任务合理划分,而从处理器借助提出的存储树(Trac-tree)挖掘局部频繁闭合模式,最后由主处理器挖掘出全局频繁闭合模式集。另外,采用星形的拓扑结构,使数据通信只存在于主处理器与从处理器之间,而各从处理器之间无数据通信且不需要同步。实验结果表明,PFCI_Miner算法具有较好的效率。
     第二部分针对数据流及分布式算法的特点,提出了一种数据流下的分布式频繁闭合模式挖掘算法DSFC_Miner。该算法采用分段思想,挖掘每个数据流分段的临界频繁闭合模式,并创建相应的局部FCI_DS树保存临界频繁闭合模式。最后通过合并局部FCI_DS树,在允许误差范围内挖掘得到当前数据流中的频繁闭合模式集。实验结果表明该算法是可行的。
Mining association rules is one of the most important problems in data mining, whichcould describe the potential relationships between items in the magnanimous data. The miningof association rules focuses on the frequent patterns. Because of the complexity of frequentpatterns, mining frequent closed patterns have been proposed to improve the miningefficiency. The set of frequent closed patterns is far smaller than the set of frequent patternson scale. The set of frequent closed patterns still contain enough information of the frequentpatterns and its accurate support. People have made many achievements in the research offrequent closed patterns on a single processor. But as the distributed environment has becomemore common and the traditional serial algorithms can not solve the mining problems underdistributed one, it is very important to desigh the high-performance distributed miningalgorithms.
     This thesis analyzes the performance of typical algorithms of association rules, and theirvirtues and disadvantages. For the shortages of the traditional algorithms, two algorithmsbased on distributed for mining frequent closed patterns are presented. The major workincludes the following two parts::
     In the first part, in view of the characteristic of mining association rules and distributedenvironment, one efficient algorithm(PFCI_Miner) based on distributed for mining frequentclosed patterns is presented. The algorithm uses the Master-Slave structure to implement taskdistribution. The Master-processor assigns the task efficiently by sending Prefix PathTable(PrePthx) which is presented in the paper, and the Slave-processors mine local frequentclosed patterns with the help of the proposed store tree(Trac-tree). Finally the main processorfinds out the global frequent closed patterns. The algorithm uses star-like topology in order tomake all data communications only between the Master-processor and the Slave-processors. There is no communication and synchronization among all Slave-processors. Theexperimental results show the efficiency of the PFCI_Miner.
     In the second part, according to the features of data streams, a distributed algorithmDSFC_Miner for mining frequent closed patterns from data streams is proposed. In addition,the method, in which data streams are partitioned, is adopted in the algorithm. The algorithmgets critical frequent closed patterns from each data stream section, and creates correspondinglocal FCI_DS tree to store the critical frequent closed patterns. Though introducing error, thepresent global frequent closed patterns can effectively be mined. The experimental resultssuggest that DSFC_Miner algorithm is fast and effective

引文

[1] Han J, Kamber M著.范明,盂小峰等译.数据挖掘:概念与技术[M].北京:机械工业出版社, 2001
    [2] Soman K.P, Shyam Diwakar, V Ajay著.范明,牛常勇译.数据挖掘基础教程[M].北京:机械工业出版社, 2009
    [3] Agrawal R, Imielinshi T, Swami A. Mining Association Rules between Sets of Items inLarge Database[C]. In: Proc. of the ACM SIGMOD International Conference onManagement of Data, Washington, 1993, 22(2): 207~216
    [4] Weiss S.M, Kulikowski C.A. Computer systems that learn: Classification andprediction methods from static, neural nets, machine learning, and expert systems[C].San Mateo, CA:Morgan Kaufman, 1991
    [5] Agrawal R, Gehrke J.D, Gunopulos, et al. Automatic Subspace Clustering of HighDimensional Data for Data Mining Applications[C]. In: Proc. of the ACM SIGMODRecord, 1998, 27(2): 94~105
    [6] Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules[C]. In: Proc. ofthe 20th International Conference on VLDB, Santigo, 1994, 487~499
    [7] Park J.S, Chen M.S, Yu P.S. An Effeetive Hash-Based Algorithm for MiningAssociation Rules[C]. In: Proc. of the ACM SIGMOD International Conference onManagement of Data, SanJose, CA, 1995, 12(9): 175~186
    [8] Savasere E, Omieeinski S, Navathe. An Effieient Algorithm for Mining AssociationRules[C]. In: Proc. of the 21st International Conference on Very Large Databases,Zurich, 1995, 25(18): 432~434
    [9] Toivonen H. Sampling Large Databases for Association Rules[C]. In: Proc. of the 22thInternational Conference on VLDB, Bombay, India, 1996, 134~145
    [10] Han J, Jian P, Yiwen Y. Mining Frequent Patterns Without Candidate Generation[C].In: Proc. of the 2000 ACM SIGMOD International Conference Management of Data.Dallas, 2000, 1~12
    [11] Brin S, Motwani R, Uiiman J, et al. Dynamic Itemset Counting and Implication Rulesfor Market Basket Data[C]. In: Proc. of the ACM SIGMOD International Conferenceon Management of Data. Tucson, Arizona, 1997, 255~264
    [12] D.W Cheng, J Han, V.T Ng, et al. Maintenance of Discovered Association Rule inLarge Database: An Increamental Update Technique[C]. In: Proc. of the ICDE' 96.1996, 106~114
    [13]易彤,徐宝文,吴方君.一种基于FP树的挖掘关联规则的增量更新算法[J].计算机学报, 2004, 27(5): 703~710
    [14]冯玉才,冯剑琳.关联规则的增量式更新算法[J].软件学报, 1998, 9(4): 301~306
    [15]王云岚,李增智,屈科文.基于候选项集个数上阶的增量式关联规则更新算法[J].电子学报, 2004, 32(5): 731~734
    [16] Agrawal R, Sharfer J. Parallel Mining of Association Rules[J]. IEEE Transactions onKonwledge and Data Engineering. 1996, 8(6): 962~969
    [17]方英武,张广鹏,吴德伟等.分布式数据挖掘计算过程[J].电子科技大学学报,2003, 32(1): 80~84
    [18]宋宝莉,覃征.分布式数据库关联规则更新[J].西安交通大学学报, 2007, 41(4):416~420
    [19] Han J, Fu Yj. Discovery of Multiple-Level Association Rules from Large Databases[C].In: Proc. Of the 1995 Int'l. Conference on Very Large DataBases, Zhrich, Switzerland,1995: 420~431
    [20] Srikant R, Agrawal R. Mining generalized association rules[C]. In: Proc. of the 21thInt'l Conference on Very Large Databases, Zhrich, Switzerland, 1995: 407~419
    [21] Manku G, Motwani R. Approximate Frequency Counts Over Data Streams[C]. In:Proc. of the 28th International Conference on VLDB, Hong Kong, 2002, 346~357
    [22] Giannella C, Han J, Pei J, et al. Mining Frequent Patterns in Data Streams at MultipeTime Granularities[M]. Next Generation Data Mining, 2002, 191~211
    [23] Zaki M.J, Parthasarathy S, Ogihara M, et al. New algorithm for mining fast discoveryof association rules[C]. In: Proc. of the 3rd Int' l. Conference on Knowledge Discoveryand Data Mining, 1997, 283~286
    [24] Pasquier N, Bastide Y, Taouil R, et al. Discovering frequent closed itemsets forassociation rules. In: Proc. of the 7th International Conference on Database Theory.1999: 398~416
    [25] Pei J, Han J, Mao R. CLOSET: An efficient algorithm for mining frequent closeditemsets. In: Proc. of the 2000 ACM SIGMOD International Workshop on Data Miningand Knowledge Discovery. Dallas:ACM Press, 2000: 21~30
    [26] Wang J.Y, Han J, Pei J. CLOSET+: searching for the best strategies for mining frequentclosed itemsets[C]. In: Proc. of the SIGKDD' 03,Auguest 24-24, Washington, DC USA
    [27] Zaki M.J, Hsiao C.J. CHARM: An efficient algorithm for closed itemset mining. In:Proc. of the 2nd SIAM International Conference on Data Mining. Arlington: SIAM,2002: 12~28
    [28] C Lucchese, S Orlando, R Perego. Fast and memory efficient mining of frequent closeditemsets[J]. IEEE Trans on Knowledge and Data Engineering, 2006, 18 (1): 21~36
    [29] Uno T, Kiyomi M, Arimura H. LCMver2: Efficient mining algorithms for frequent/closed/maximal itemsets[C]. In: Proc. of the IEEE ICDM2004 Workshop FIMI 2004.Brighton, 2004
    [30]刘君强,孙晓莹,潘云鹤等.挖掘闭合模式的高性能算法[J].软件学报, 2004, 15(1):94~102
    [31]宋威,杨炳儒,徐章艳等.一种改进的频繁闭项集挖掘算法[J].计算机研究与发展, 2007, 45(2): 278~286
    [32]陶利民,黄林鹏. Cherry:一种无须子集检查的闭合频繁集挖掘算法[J].软件学报,2008, 19(2): 379~388
    [33] Yun Chi, Haixun Wang, Philip S, et al. Moment: Maintaining closed frequent itemsetsover a stream sliding window[C]. In: Proc. of the 4th IEEE International Conference onData Mining. 2004: 59~66
    [34]刘旭,毛国君,孙岳等.数据流中频繁闭项集的近似挖掘算法[J].电子学报, 2007,35(5): 900~905
    [35]刘学军,徐宏炳,董逸生等.基于滑动窗口的数据流闭合频繁模式的挖掘[J].计算机研究与发展, 2006, 43(10): 1738~1743
    [36] Liu Chun, Zheng Zheng, Cai Kai yuan, et al. Online mining frequent closed itemsetsover data stream[J]. Journal of Beijing University of Aeronautics AND Astronautics,2008, 34(8): 969~972
    [37]胡为成,王本年,程传流.基于DSCFCI_tree的带项目约束的数据流频繁闭合模式挖掘算法[J].中国科学技术大学学报, 2009, 39(11): 1195~1201
    [38]都志辉.高性能并行编程技术: MPI并行程序设计[M].北京:清华大学大学出版社, 2001
    [39] Park J.S, Chen M, Yu P.S. Efficient Parallel Data Mining for Mining AssociationRules[C]. In: Proc. of the ACM International Conference on Information andKnowledge Management, 1995, 31~36
    [40]杨明,孙志挥,吉根林.快速挖掘全局频繁项目集[J].计算机研究与发展, 2003,40(4): 620~625
    [41]宋宝莉,覃征.分布式频繁项目集的快速挖掘方法[J].西安交通大学学报, 2006,40(8): 923~927
    [42]李梅花,王黎明,许红涛.利用抽样和元学习的分布式关联规则挖掘算法[J].计算机应用, 2006, 26(4): 872~877
    [43]王治和,景永霞,杜辉.分布式关联规则挖掘研究[J].南京师大学学报(自然科学版). 2010, 33(4): 114~118
    [44] David Wai-Lok Cheung, Vincent Ng, Ada Wai-Chee Fu, et al. Efficient Mining ofAssociation Rules in Distributed Database[C]. In: Proc. of the IEEE Transaction onKnowledge and Data Engineering, 1996, 8(1): 911~922
    [45]缪裕青,伊东.分布式存储结构的频繁闭合模式挖掘并行算法[J].微电子学与计算机. 2007, 24(10): 161~167
    [46] Grahne G, Zhu J. Efficiently using prefix-trees in mining frequent itemsets[C]. In: Proc.of the IEEE ICDM Workshop on Frequent Itemset Mining Implementation. Melbourne,Florida, USA, 2003
    [47]王黎明,张卓.基于iceberg概念格并置集成的闭频繁项集挖掘算法[J].计算机研究与发展, 2007, 44(7): 1184~1290
    [48] Jiangming Lin, Chunhua Ju, Dongsheng Liu. An Efficient Mining Model for GlobalFrequent Closed Itemsets[C]. In: Proc. of the 2009 Second International Symposium onElectronic Commerce and Security. 2009: 278~282
    [49] Tang P, Ning L, Wu N. Domain and data partitioning for parallel mining of frequentclosed itemsets[C]. In: Proc. of the 43th Annual Southeast Regional Conference,Ken-nesaw, GA. USA, 2005: 250~252
    [50]琚春华,倪栋君.基于元学习的分布式挖掘频繁闭合模式算法研究[J].计算机应用研究. 2009, 26(1): 41~46
    [51]吴磊,陈鹏.基于并行计算的关联规则挖掘优化算法[J].计算机应用, 2005, 25(9):1989~1991
    [52]侯燕,王水利.基于近似等深柱状图的数据流并行聚集算法[J].解放军理工大学学报, 2008, 9(1): 29~33
    [53]刘学军,董逸生,王永利等.挖掘数据流中的频繁模式[J].计算机研究与发展,2005, 42(12): 2192~2198
    [54]程转流,胡为成,胡学钢.基于DSFCI-tree的分布式数据流频繁闭合模式挖掘[J].微电子学与计算机, 2007, 24(9): 120~125
    [55] Manjhi A, Shkapenynk V, Dhamdhere K, et al. Finding (Recently) Frequent Items inDistributed Data Streams[C]. In: Proc. of the 21st International Conference on DataEngineering. Washington: IEEE Computer Society, 2005, 767~778
    [56]肖颖,毛国君.分布式数据流中挖掘频繁项算法的研究[J].微计算机信息. 2010,26(10): 144~146
    [57]任家东,李可,冯佳音等.在分布式数据流中查找近期频繁项方法的研究[J].计算机科学, 2008, 35(3): 206~208

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700