基于树结构的精简序列模式挖掘算法研究

英文题名：Research on Condensed Sequential Pattern Mining Based on Tree Structure
作者：解玉洁
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：位置信息表 ; 序列扩展匹配 ; WAP-Tree ; 投影树 ; 漏洞特征提取
英文关键词：Positional information table ; Sequence extension matching ; WAP-Tree ; Projection technology ; Bug feature discover
学位年度：2010
导师：任家东
学科代码：081202
学位授予单位：燕山大学
论文提交日期：2010-10-01

摘要

现有的序列模式挖掘算法能有效地在大型数据库中挖掘出完整的序列模式集,然而在很多实际应用中,用户更希望找出感兴趣的、更简洁的模式,而不是所有的模式。本文主要研究了如何挖掘精简序列模式,如何有效的增量挖掘精简序列模式,以及如何精确的挖掘重复间隔精简序列模式等问题,这些问题的研究在顾客购物分析,交易分析,Web页面的访问模式预测,DNA序列分析,软件行为模式分析中具有重要的意义。
     本文首先设计了一种基于改进前缀树的最大序列模式挖掘算法CSMS,算法利用纵向、横向结合搜索位置信息表的序列扩展匹配方法找到潜在最大序列模式,同时,把每个找到的潜在最大序列模式存储在改进的前缀树PStree中,最后通过对PStree进行剪枝,得到由最大序列模式组成的前缀树MPStree。该算法具有较好的时间效率和扩展性。
     其次,提出了一种基于重复链接WAP-Tree结构的闭合重复间隔序列模式挖掘算法MRCGP,算法首先为频繁1项集构建一个位置信息表,然后通过搜索位置信息表找到所有由不同项组成的2序列模式,最后构建一个重复链接WAP-Tree维护所有的频繁项集,通过逐步挖掘已存在模式的投影树,得到所有的闭合重复间隔序列模式集。该算法的性能优于CloGSgrow。
     最后,设计了一个面向软件漏洞特征提取的闭合序列模式挖掘算法MSPT和更新算法UMSPT。算法MSPT首先搜索半频繁和频繁2模式,然后为半频繁和频繁项构建一个漏洞序列树,利用投影技术,逐步找到半闭合和闭合序列模式。算法UMSPT插入新的序列到漏洞序列树,搜索树中新插入的分支找到新序列中的闭合和半闭合模式。最后通过检查已存在模式的包含关系以及支持度信息得到更新数据库中的所有半闭合和闭合序列模式集。
     本文使用现实数据集进行挖掘,通过实验对本文所提出的CSMS算法、MRCGP算法、MSPT算法以及UMSPT算法进行验证。
The existing sequential pattern mining algorithm can efficiently mine the complete set of sequential pattern in a large database. However,in many application, the user may want to find the more succinct patterns, rather than all of the patterns. The paper mainly study on how to mine compress sequential patterns,how to mine compress sequential patterns incrementally, and how to mine compress repetitive gapper sequtnial pattern, the study of those patterns have an import significance in analysis of customer purchasing lists, analysis of web log access, analysis of DNA sequences, credit card usage histories traces, program execution traces, and the analysis of any other sequences corrlete to the time.
     Firstly, an efficiently maximal sequential patterns mining algorithm CSMS is presented. This algorithm makes use of the method of matching sequences extension based on a positional table, at the same time, we build a PStree to maintain the candidate sequential patterns, we obatian the final maximal sequential patterns through pruning for the PStree. The algorithm CSMS has better time efficiency and scalability.
     Secondly, a closed repetitive gapped sequential pattern mining algorithm based on repetitive linked WAP-Tree is proposed. The algorithm build a postitional table for all the frequent items, then searching the postitional table for all the 2 sequential patterns which are composed of the different items, algorithm also construct a repetitive linked WAP-Tree, through mining the project tree of the existing patterns in RLWAP-Tree gradually, we can get all of the closed repetitive gapped sequential patterns.
     At the end, a close sequential pattern mining algorithm for discovery the feature of software bugs MSPT and an incremental algorithm UMSPT is proposed. Algorithm MSPT searches for the semi-frequent and frequent 2 patterns based on the positional information of the items, than an BStree is constructed to maintain the semi-frequent an frequent items, through the technology of projection, we can find the semi-closed and closed sequential patterns. Algortithm UMSPT inserts the new bugs sequences to the BStree, through searching the new branch to find the new semi-closed and closed sequential patterns. At last, all of the final semi-closed and closed sequential patterns can be obtained by checking the inclusion relation of the existing patterns. Algorithm MSPT and UMSPT has better time efficiency.

引文

1周斌,吴泉源.序列模式挖掘的一种渐进式算法.计算机学报, 1999, 22(10):882-887
    2 Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. Pro of the ACM SIGMOD Conference on Management of data. Washing-won, 1993:207-216
    3 S.Sava, Omiecnskie, S.Navathe. An Efficient Algorithm for Mining Association Rules in Large Database. Proc. of the 21st Int’l Conf. on Very Large Data Bases (VLDB), Zurich, Switzerland, Morgan Kaufmann, 1995:432-443
    4黄名选,严小卫,张师超.基于矩阵加权关联规则挖掘的伪相关反馈查询扩展.软件学报, 2009, 20(7):1854-1865
    5 J.Han, M.Kamber.数据挖掘概念与技术.范明,孟小峰等译.北京:机械工业出版社, 2001:278-284
    6 Jing Lu, Malcolm Keech, Weiru Chen. Concurrency in Web Access Patterns Mining. In Word Academy of Science, Engineering and Technology, 2009:856-865
    7 Wenjia Wang, Phuong Thanh CaoThai. Novel position-coded methods for mining web access patterns. IEEE International Conference on Intelligence and Security Informatics, 2008:194-196
    8 Jian Pei, Jiawei Han, Behzad Mortazavi-asl, Hua Zhu. Mining Access patterns Patterns Efficiently from Web logs. Proc. 2000 Paci c-Asia Conf. on Konwledge Discovery and Data Mining. 2000:396-407
    9 R.Agrawal, R.Srikant. Mining Sequential Patterns. In Proc. of the 11th Int. Conf. on Data Engineering, Taipei, Taiwan, 1995:3-14
    10朱杨勇,熊贇. DNA序列数据挖掘技术.软件学报, 2007, 18(11):2766-2781
    11 Shuang Bai, Si-Xue Bai. The Maximal Frequent Pattern mining of DNA sequence. IEEE international conference on Granular Computiong, 2009:23-26
    12 R.Agrawal, R.Srikant. Fast Algorithms for Mining Association Rules. Proc. of the 20th Int’l Conf. on Very Large Data Bases (VLDB), Santiago, Chile, 1994:487-499
    13 R.Srikant, R.Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology, Avignon, France, 1996:3-17
    14 F.Masseglia, F.Cathala, P.Poncelet. The PSP Approach for Mining Sequential Patterns. In Proc. of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, Nantes, France, 1998:176-184
    15 M.Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001, 42(1):31-60
    16 J.Ayres, J.Gehrke, T.Yiu, et al. Sequential Pattern Mining Using a Bitmap Representation. In Proc. of the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002:429-435
    17 Z.Yang, Y.Wang, M.Kitsuregawa. LAPIN-SPAM: An Improve Algorithm for Mining Sequential Pattern. Pro.of the 2th Inl. Conf. on Data Engineering Workshops, Tokyo, Japan, 2005:1222-1225
    18 J.Han, J.Pei, B.Mortazavi-Asl, et al. FreeSpan: Frequent Pattern-projected Sequential Pattern Mining. In Proc.of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, 2000:355-359
    19 J.Pei, J.Han, H.Pinto, et al. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-projected Pattern Growth. In Proc. of the 17th Int. Conf. on Data Engineering, Heidelberg, Germany, 2001:215-224
    20 C.Cho, Y.Wu, C.Arbee. Effective Database Transformation and Efficient Support Computation for Mining Sequential Patterns. Pro. of the 10th Int. Conf.on Database Systems for Advanced Applications, Beijing, China, 2005:163-174
    21 Z.Yang, Y.Wang, M.Kitsuregawa. LAPIN: Effective Sequential mining Algorithm by Last Position Induction. Technical Report, Tokyo University,2005:236-242
    22 C.Wu, J.Koh, P.An. Improved Sequential Mining Using an Extended Bitmap Representation. Pro.of the 16th Inl. Conf.on Database and Expert System Application,copenhagen, Denmark,2005:776-785
    23 C.Antunes, A.Oliveira. Sequential Pattern mining Algorithms: Trade-offs between Speed and Memory.J.Kok, T.Washio.Proc.of the 2nd Int. Workshop on Mining Graphs, Trees and Sequences, Pisa, Italy, 2004:1-12
    24 Yan X, Han J W, Afshar R. CloSpan: Mining Closed Sequential Patterns in Large Datasets. Proc of SIAM International Conference on Data Mining. Chicago: SDM, May 2003:166-177
    25 Jianyong Wang, J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. In Proc. Of 2004 Int Cof. On Data Engineering(ICDE’04), Boston, USA, March 2004:79-90
    26 Tzvetkov P, Yan X, Han J, Illinois Univ, Urbana IL. TSP: mining top-K closed sequential patterns. ICDM 2003. Third IEEE International Conference on Data Mining, 2003:347-354
    27 Afshar R. Mining frequent max and closed sequential patterns//School of Computing Science, Simon Freser University, Aug 2002:1-59
    28马传香,李庆华,简钟. MAXSeq:一个新的最大频繁序列模式挖掘算法.小型微型计算机系统, 2006, 27(6):1092-1096
    29 F.Masseglia, P.Poncelet, M.Teisseire. Incremental Mining of Sequential Patterns in Large Databases. Data and Knowledge Engineering, 2003, 46(1):97-121
    30 C.Lee, C.Lin, M.Chen. Sliding Window Filtering: An Efficient Method for Incremental Mining on a Time-variant Database. Information Systems, 2005, 30(3):227-244
    31 S.Parthasarathy, M.Zaki, M.Ogihara, et al. Incremental and Interactive Sequence Mining. In Proc. of the 8th Int. Conf on Information and Knowledge Management, Kansas, MO, USA, 1999:251-258
    32 H.Cheng, X.Yan, J.Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. In Proc. of the 10th Int. Conf. on Knowledge Discovery and Data Mining, Seattle, WA, USA, 2004:527-532
    33 Lei Chang, Tengjiao Wang, Dongqing Yang, Hua Luan. SeqStream: mining closedseqential patterns over stream sliding windows. ICDM '08. Eighth IEEE International Conference on Data Mining, 2008:83-92
    34 Raissi C., Poncelet P., Teisseire M.. SPEED: Mining Maximal Sequential Patterns over DataStreams. The 3rd International IEEE Conference on intelligent systems, 2006:546-552
    35 N.Pasquier, Y.Batide, R.Taouil, et al. Discovering Frequent Closed Itemsets for Association Rules. Proc. of the 6th Int’l Conf. on Database Theory (ICDT), Jerusalem, Israel, 1999:398-416
    36 Yan X, Han J W, Afshar R. CloSpan: Mining Closed Sequential Patterns in Large Datasets //Proc of SIAM International Conference on Data Mining. Chicago: SDM, May 2003:166-177
    37 Yuan D, Lee K, Cheng H. CISpan: Comprehensive Incremental Mining Algorithms of Closed Sequential Patterns for Multi-Versional Software Mining. Proc of the 2008 SIAM International Conference on Data Mining (SDM08). Georgia: SDM, April 2008:84-95
    38 Lin N P, Hao W H, Chen H J. Fast mining maximal sequential patterns. Proc of the
    7th WSEAS International Conference on Simulatio, Modelling and Optimization. USA:WSEAS, 2007:405-408
    39 García-Hernández R., Martínez-Trinidad F, Carrasco-Ochoa A. A New Algorithm for Fast Discovery of Maximal Sequential Patterns in a Document Collection. Proc for the Seventh International Conference on Computational Linguistics and text Processing. Berlin: Springer, 2006:514-523
    40 Lizhi Liu, Jun Liu. Mining maximal sequential patterns with layer coded Breadth-First linked WAP-tree. Computional Intelligence and Industrial Application, 2009:61-65
    41管河山,姜青山,王声瑞.基于点分布特征的多元时间序列模式匹配方法.软件学报, 2009, 20(1):67-79
    42 L.EunJu, K.WonYoung, R.JoonSuk, K.Ung-Mo. Efficient weighted mining of repetitive subsequences, Web Society, 2009. SWS’09, 1st IEEE Symposium, Aug2009:66-70
    43 D.Lo, S.-C.Khoo. SMArTIC: Toward building an accurate, robust and scalable specification miner. Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Portland, Oregon, USA, 2006:265-275
    44 D.Lo, S.Maoz, S.-C.Khoo. Mining modal scenario-based specifications from execution traces of reactive systems. Proceedings of the 22nd IEEE/ACM international conference on Automated software engineering, Atlanta, Georgia, USA, 2007:465-468
    45 D.Lo, S.-C.Khoo, C. Liu. Mining temporal rules from program execution traces. In proceedings of the 3rd International Workshop on Program Comprehension through Dynamic Analysis (PCODA'07), Vancouver, Canada, 2007:248-256
    46 B.Ding, D.Lo, J Han, S.-C.Khoo. Efficient mining of closed repetitive gapped sequential patterns from a sequence database. Proc of the 25th International Conference on Data Engineering (ICDE'09), Shanghai China, Mar,2009:1024-1035
    47 Yongxin Tong, Li Zhao, Dan Yu. Mining Compressed Repetitive Gapped Sequential Patterns Efficiently. The 5th International Conference on Advanced Data Mining and Applications (ADMA2009), 2009:652-660
    48 Jian Pei, Jiawei Han, Behzad Mortazavi-asl. Mining access patterns efficiently from web logs. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’00), Kyoto Japan, 2000:390-407
    49 C.Ezeife, Y.Lu. Mining web log sequential patterns with position coded pre-order linked wap-tree. International Journal of Data Mining and Knowledge Discovery (DMKD) Kluwer Publishers10(1), 2005:5-38
    50 C.I.Ezeife, Kashif Saeed, Dan Zhang. Mining very long sequences in large databases with PLWAPLong. Proceedings of the 2009 International Database Engineering & Applications Symposium, Cetraro - Calabria Italy, 2009:234-241
    51 C.I.Ezdife, Yi Liu. Fast incremental mining of web sequential patterns with PLWAP tree. Data Mining and Knowledge Discovery, 2009:376-416
    52 S.Chandra, R.A.Kan. Software security metric identification framework. Proceedings of the International Conference on Advances in Computing, Communication and Control. New York,2009:725-731
    53 Ken Ashcraft, Dawson Engler. Using Programmer-Written Compiler Extensions to Catch Security Holes. IEEE Symposium on Security and Privacy. 2002:143-159
    54 Frank Eichinger, Klemens Bohm, Matthias Huber. Improved software fault detection with graph mining. Proceedings of the 6th Int. Workshop on Mining and Learning with Graphs (MLG) at ICML, Helsinki, Finland, 2008
    55 Hong Cheng, David Lo, Yang Zhou. Identifying Bug Signatures Using Discriminative Graph Mining. Proceedings of the eighteenth international symposium on Software testing and analysis. Chicago, 2009:141-152
    56 Xu Yusheng, Zhang Lanhui, Ma Zhixin. Mining Sequential Pattern Using DF2Ls. Fuzzy Systems and Knowledge Discovery, 2008. FSKD '08. shandong, 2008:600-604
    57 S.K.Tanbeer, C.F.Ahmed, B-S.Jeong, Y-K Lee. Efficient single-pass frequent pattern mining using a prefix-tree. Information Sciences, vol. 179, New York, USA, 2009:559-583

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700