序列模式挖掘的并行算法研究

英文题名：Research on Parallel Algorithm for Sequential Pattern Mining
作者：王宇
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：序列模式 ; 并行 ; 概念格 ; 搜索空间
英文关键词：sequential pattern ; parallel ; concept lattice ; search space
学位年度：2007
导师：周丽娟
学科代码：081203
学位授予单位：哈尔滨理工大学
论文提交日期：2007-03-01

摘要

序列模式挖掘是指从序列数据库中发现相对于时间或者其它顺序的高频率序列。其最初动机是想通过在带有交易时间属性的交易数据库中发现频繁项目序列以发现一个时间段内的客户购买行为规律。近年来序列模式挖掘已经成为数据挖掘的一个重要方向,其应用范围已不局限于交易数据库,在DNA分析等尖端科学研究领域、Web访问等新型应用数据源等众多方面都得到了针对性研究。
     文中结合R.Agrawal和R.Srikant提出的序列模式挖掘的有关定义和R.Wille提出的概念格理论,提出了频繁概念的定义;根据序列模式并行挖掘的需要提出了搜索空间划分理论,其中包括搜索空间、子搜索空间、等价子搜索空间和最大子搜索空间的定义。
     序列模式挖掘的数据有如下特点:数据量大,数据分布存储。已有的大部分序列模式挖掘算法没有综合考虑到数据的上述特点。本文针对序列模式挖掘数据的这些特点,结合并行理论,提出了一个分布式并行算法SPP(Sequential Pattern Parallel)。本算法遵循模式缩减的原则,利用分治策略实现并行操作,在每台处理器上运用搜索空间划分理论和频繁概念构造频繁项集,运用图深度优先搜索方法构造频繁序列。算法只需扫描数据库两遍,不需要生成候选序列,大大减少了数据库访问时间,提高了序列模式挖掘的效率。不过本文采用是静态负载平衡,还有待改进。
     基于自己设计的随机数据生成程序和不同的消息结构,本文在具体的并行环境中模拟了算法SPP,并在单机中实现了AprioriAll算法。实验证明,相比于AprioriAll,SPP算法具有良好的加速比和效率。
Sequential pattern mining is the mining of frequent seqences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field has not been confined to the transanction database and has extended to new data sources such as Web and advanced science fields such as DNA analysis.
     Combining with the relevant definitions put forward by R.Agrawal and R.Srikant and Concept Lattice Theory brought forward by R.Wille, this paper proposes the term of frequent concept. To fulfill the needs of sequential pattern parallel mining, there proposes search space partition theory which includes the definitions of search space, sub search space, equivalence sub search space and maximum search space.
     The data of sequential pattern mining has characteristics as follows: mass data amount and distributed storage. Most existing sequential pattern mining algorithms havn’t considered the above-mentioned characteristics synthetically. Contraposing the traits mentioned above and combining the paralle theory, this paper put forward a new distributed parallel algorithm SPP(Sequential Pattern Parallel). The algorithm abides by the principal of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent itemsets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn’t generate the candidated sequences, which abates the access time and improves the minging efficiency. But this algorithm adopts the static loading balancing, so it is also needed to improve.
     Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll on a computer. The experiments demonstrated that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.

引文

1 Bamshad Mobasher,Honghua Dai,Tao Luo,Miki Nakagawa.Using Sequential and Non-Sequential Patterns in Predictive Web Usage Mining Tasks . In Proc. Of the 2002 IEEE International Conference on Data Mining(ICD’02),2002:1~4
    2 Shintani , Kitsuregawa . Mining algortithms for sequential patterns in parallel:hash based approach[C].In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining,Berlin:Springer-Verlag,1998:283~294
    3 J. Mohammed.Parallel sequence mining on smp machines [C].In workshop On Large-Scale Paralle KDD System(in conjunction 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining),San Diego:AAAI Press,1999:57~65
    4 V. Guralnik,N. Garg,Vipin Kuma.Parallel tree projection algorithm for sequence mining[C].In EuroPar2001,Manchestr:Springer-Verlag,2001:310~320
    5 R. Agrawal,R. Srikant.Mining Sequential Patterns.In Proceedings of the Eleventh International Conference on Data Engineering,Taipei,1995:1~9
    6 R. Srikant,R. Agrawal.Mining Sequential Patterns: Generalizations and Performance Improvements . Reasearch Report RJ9994 , IBM Almaden Research Center,San Jose,California,December 1995:2~21
    7 H. Pinto.Mutli-Dimensional Sequential Pattern Mining.In Proc. 2002 Int. Conf. On Information and Knowledge Management(CIKM’01),Atlanta,GA,2002:1~8
    8 J. Han, M. Kamber.Data mining: Concepts and techniques. San Mateo,CA:Morgan Kaufmann,2002:2~15
    9 X. Yan,J. Han,R. Afshar.Clospan: Mining closed sequential patterns in large databases[A] . Proc of SIAM Int Conf on Data Mining[C] , San Francisco,CA:ACM/SIAM Press, 2003,166~175
    10 李天瑞, 潘无名. 序列模式的性质研究.复旦学报(自然科学版), 2004,43(5):1~3
    11 孙莹.序列模式发现中的关键问题的研究与实现.合肥工业大学硕士学位论文,2005:46~57
    12 J. Pei.Mining sequential patterns with constraints in large databases[A].Proc of Int Conf on Information and Knowledge Management[C].Washington DC: ACM Press,2002:18~25
    13 J. Han,J. Wang,Y. Lu,etc.Mining top-k frequent closed patterns without minimum support[A].Proc 2002 Int Conf on Data Mining[C],Maebashi, Japan:IEEE Press, 2002:211~218
    14 M. Garofalakis,R. Rastogi.SPIRIT: Sequential Pattern Mining with regular expression constraints.In VLDB’99,San Francisco,CA, Sept.1999:3~9
    15 M. Seno,G.. Karypis.SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint . In ICDM’02 , Maebashi,Japan,Dec. 2002:2~10
    16 F. Masseglia , F. Cathala , P.Poncelet . The PSP Approach for Mining Sequential Patterns.In Proceedings of the 1998 European symposium on Principles of Data Mining and Knowledge Discovery,1998:176~184
    17 M. Zaki.SPADE: An efficient algorithm for mining frequent sequences. Machine Learning,2001,11(5):31~60
    18 J. Han,J. Pei,etc.FreeSpan: Frequent pattern-projected sequential pattern mining.Proceedings of tht 6th ACM SIGKDD international conference on Knowledge discovery and data mining,2000:2~12
    19 J. Han,J. Pei,etc.PrefixSpan: Mining Sequential Patterns efficiently by Prefix-projected Pattern Growth.Proceedings of 2001 International Conference on Data Engineering,2001:1~10
    20 周斌,吴泉源,高洪奎.序列模式挖掘的增量式算法的设计原则.计算机研究与发展,2000,37(10):1~5
    21 欧阳为民,蔡庆生.发现广义序贯模式的增量式更新技术.软件学报, 2003,9(10):1~4
    22 沈慧丽.基于增量式更新的序列模式挖掘模型 JZX_MINER 的设计研究.上海海事大学硕士学位论文,2004:26~38
    23 F. Masseglia.Incremental Mining of Sequential Patterns in Large Databases (PS).Actes des 16ièmes Journées Bases de Données Avancées(BDA’00),Blois, France,2000:9~30
    24 邹翔,张巍.大型数据库中的高效序列模式增量式更新算法.南京大学学报,2003,39(2):1~7
    25 Q. Zheng, K. Xu,W. Lv,S. Ma.The Algorithms of Updating Sequential Patterns . The Second SIAM Data mining’2002: workshop HPDM , Washington,USA,2002:42~45
    26 金灿.序列模式的增量式挖掘算法研究.华中师范大学硕士学位论文,2004:84~98
    27 段军玲.可增量挖掘序列模式的日志挖掘器的研究.武汉大学硕士学位论文,2003:2~4
    28 肖轶.从万维网日志中挖掘访问序列模式的算法研究.华中科技大学硕士学位论文,2005:1~3
    29 范习辉,张焰.序列模式挖掘在电力系统警报信息处理中的应用.电力系统自动化,2005,29(13):1~3
    30 钱昱,郑诚.基于序列模式的异常检测.微机发展,2004,14(9):1~3
    31 张学红,闫五四等.基于序列模式挖掘的网管告警系统.电信科学,2006,(11):1~4
    32 李学干.计算机体系结构.3.西安电子科技大学出版社,2000:226~228
    33 李小梅.并行计算模型及其算法设计.数值计算与计算机应用,2004,(3):2
    34 黄铠,徐志伟.可扩展并行计算——技术、结构与编程.陆鑫达等.机械工业出版社,2000:8~16
    35 陈国良等.并行算法实践.高等教育出版社,2004:68~73 146~149
    36 张林波,迟学斌.并行计算导论.清华大学出版社,2006:164~171
    37 张翠莲,刘方爱,王亚楠.基于 MPI 的并行程序设计.计算机技术与发展,2006,16(8):1
    38 李金宝,李建中.微型计算机机群并行计算系统的通信处理器的设计与实现.黑龙江大学自然科学学报,2001,18(3):44~48
    39 雷英杰,霍红卫.典型并行算法的实现性能分析.空军工程大学学报,2003,4(5):68~73
    40 Barry Wilkinson.并行程序设计.陆鑫达.2.机械工业出版社,2005:4~5
    41 陈国良.并行计算——结构·算法·编程.高等教育出版社,2003:123~136
    42 R. While.Restructuring lattice theory: an approach based on hierarchies ofconcepts.In I.Rival(Eds.),Ordered Sets,1982:445~470
    43 吴绍春,吴耿锋,慰赵春.一个基于高性能集群系统的并行数据挖掘平台模型.计算机工程与应用,2004,(13):2~3

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700