一种频繁子树挖掘算法在Web日志挖掘中的应用研究

英文题名：Research on the Application of a Frequent Sub-tree Algorithm in Web-log Mining
作者：刘振诚
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据挖掘 ; 频繁模式 ; 频繁子树 ; 序列数据库 ; 有序标签树
英文关键词：Data mining ; frequent pattern ; frequent sub-tree ; sequence database ; labeled ordered tree
学位年度：2007
导师：徐丽萍
学科代码：081202
学位授予单位：华中科技大学
论文提交日期：2007-01-01

摘要

随着互联网(Internet)的迅速发展,尤其是基于互联网的Web站点的广泛应用,Web已经成为目前世界上最丰富、最密集的信息来源。而日趋成熟的数据挖掘技术正好为Web数据的挖掘提供了技术基础。Web挖掘作为数据挖掘技术在Web数据分析与处理中的延伸,自然成为了当今数据挖掘领域中比较活跃的研究课题。
     Web挖掘技术主要包含了Web的内容挖掘、结构挖掘和使用挖掘。它们分别挖掘Web站点页面文件的内容、结构和用户对站点的使用信息。频繁模式挖掘是数据挖掘的核心任务之一,国内外学者在频繁项、序列模式挖掘方面已有较深入的研究。但是新兴的生物信息、数字图书馆、电子商务等领域提出了在复杂结构化数据中挖掘频繁子结构的要求。特别地,从有序标签树数据库挖掘频繁子树可为Web日志挖掘中的Web用户行为模式分析及Web用户分类、聚类等应用提供重要知识。
     频繁子树挖掘的一个重要研究方向是从标签树数据库中挖掘频繁子树。此前的研究表明,基于模式增长方法的序列模式挖掘算法在大型数据库上表现出较高的效率。可扩展频繁子树挖掘算法(SFTM)把模式增长方法运用到有序标签树数据库的频繁子树挖掘,并在此基础上改进了对搜索空间树的剪支方法。通过设计实现一个以频繁子树挖掘算法为核心的Web日志挖掘工具Webloger,把SFTM算法应用到Web日志数据的挖掘。在Webloger提供的框架下,把SFTM算法与一般算法分别在人工数据集和真实数据集上进行实验对比。实验结果表明SFTM算法是有效的,并且其搜索空间比一般算法有较好的收敛性,尤其在Web日志数据上较传统算法具有一定的优势。
With the rapid development of Internet, especially the popularity of Web sites, the World Wide Web has become the most abundant and mass information source all over the world. The sophisticate Data Mining technologies could properly satisfy the requirement of mining over Web data. Web Mining as the extension of Data Mining technologies to Web data analysis and process, naturally become one of the most active research topics.
     Web Mining technologies include Web Content Mining, Structure Mining and Usage Mining. They respectively mine in the content of Web pages, structure of inner-/inter-Web pages and Web user’s usage information. Frequent patterns mining is one of the premier tasks of Data Mining, researchers have dig into frequent items and sequential patterns mining. However lately, complex frequent structure mining technology is required by those rising fields like Bio-information, digital library and e-commerce. Particularly, mining frequent sub-trees in forest could provide important knowledge for user pattern analysis, Web user classification and clustering in Web-log mining.
     Mining frequent sub-trees in labeled tree database is an important study direction of frequent sub-tree mining. Previous study indicates that, sequential pattern mining algorithms based on pattern growth method have prominent performance. Scalable Frequent sub-Tree Mining algorithm (SFTM) uses pattern growth method in mining frequent sub-trees in labeled tree database, and improves the pruning method for the searching space. By designing and implementing a Web-log mining tool Webloger that based on frequent sub-tree mining algorithm, we apply the SFTM algorithm to Web-log mining. Under the architecture of Webloger, we compare SFTM with usual algorithms by experiments in generated dataset and real dataset respectively, the experiment result demonstrates that the SFTM algorithm is effective and efficient, and its searching space will shrink rapidly during the mining process, especially in the real Web-log data, it makes obviously advantage over conventional algorithms.

引文

[1] T. Miyahara, T. Shoudai, T. Uchida, et al. Discovery of frequent tree structured patterns in semistructured Web documents Knowledge Discovery and Data Mining. The 5th Pacific-Asia Conf., Hong Kong, 2001
    [2] M.J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Machine Learning, 42(2): 31~60
    [3] J. Pei, et al. H-Mine: Hyper-structure mining of frequent patterns in large database. The IEEE Int'l Conf. Data Mining, San Jose, California, USA, 2001
    [4] C.I. Ezeife and Yi Lu, Mining Web Log sequential Patterns with Position Coded Pre-Order Linked WAPtree, Kluwer Academic Publishers, June 2005, the International Journal of Data Mining and Knowledge Discovery (DMKD), 10(1): 5-38
    [5] Chen, M.S. Park, J.S. Yu, P.S. Data Mining for Path Traversal Patterns in a Web Environment. Proceedings of the 16th International Conference on Distributed Computing Systems, 1996(5): 27~30
    [6] H. Mannila, H. Toivonen. Discovering generalized episodes using minimal occurrences. In Proceedings of the Second Int'l Conference on Knowledge D iscovery and Data Mining. Portland, Oregon, 1996.146~151
    [7] T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. From user access patterns to dynamic hypertext linking. In Fifth International World Wide Web Conference, Paris, France, 1996
    [8] Web Mining: Information and Pattern Discovery on the World Wide Web by R.Cooley, J. Srivastava, B. Mobasher
    [9] A.G. Buchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, and J.G. Hughes: Navigation Pattern Discovery from Internet Data. In WEBKDD'99, Revised Papers, LNCS00, Springer-Verlag Berlin Heidelberg 2000
    [10] O. R. Zaiane, M. Xin, J. Han. Discovering Web access paterns and trends by applying OLAP and data mining technology on Web logs. 1998
    [11] F. Bonchi, F. Giannoti, C. Gozzi, G. Manco, M. Nanni, D. Pedreschi, C. Renso and S. Ruggieri: Web log data warehousing and mining for intelligent web Data & Knowledge Engineering, 39(2):165~189
    [12] F. Bonchi, F. Giannotti, C. Gozzi, G. Manco, M. Nanni, D. Pedreschi, C. Renso, S. Ruggieri. Web Log Data Warehousing and Mining for Intelligent Web Caching. Elsevier Science, 2001
    [13]韩家炜,孟小峰. Web挖掘研究.计算机研究与发展. 2001, 38(4): 405~414
    [14]涂承胜,陆玉昌. Web使用挖掘技术研究.小型微型计算机系统. 2004, 25(7): 1177~184
    [15]王继成,潘金贵,张福炎. Web文本挖掘技术研究.计算机研究与发展. 2000, 37(5): 80~85
    [16]邢东山,沈均毅,宋擒豹.集成Web使用挖掘和内容挖掘的用户浏览兴趣迁移模式挖掘算法.小型微型计算机系统. 2004, 25(7): 1170~1173
    [17]周斌,刘亚萍,吴泉源.一个面向电子商务的数据挖掘系统的设计与实现.计算机工程. 2000, 26(6): 18~20
    [18]陆丽娜,陈亚萍,杨麦顺,魏恒义.挖掘关联规则算法的优化处理.计算机工程与应用. 2000, 36(8): 99~102
    [19]王实,高文,李锦涛,谢辉.路径聚类:在Web站点中的知识发现.计算机研究与发展. 2001, 38(4): 482~486
    [20]杨怡玲,管旭东,尤晋元,基于页面内容和站点结构的页面聚类挖掘算法.软件学报. 2002, 13(3): 467~469
    [21] Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In 7th Int. Conf. WWW, Brisbane, Australia, April 1998
    [22] Page L., Brin S., Motwani R., Winograd T. The Page Rank citation ranking: Bringing order to the Web. Tech. Rep., Stanford Digital Library Technologies Project, 1998
    [23] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposiumon Discrete Algorithms, 1998
    [24] Chakrabarti S., Dom B., Gibson D., Kumar S.R., Raghavan P., Rajagopalan S., Tomkins A. Experiments in topic distillation. In ACM SIGIR workshop on Hypertext Information Retrievalon the Web, Melbourne, Australia, l998
    [25] R.Baraglia, P.Palmerini: SUGGEST: a web usage mining system. In Proceedings of the International Conference on Information Technology: Coding and Computing. 2002, 282~287
    [26] Mike Perkowitz, et al. Adaptive Web sites: http//: www.cs.washington.edu/research/ adaptive, 2004
    [27] Charu C Aggarwal, Philip S Yu. On disk caching of web objects in proxy servers. In CIKM 97, Las Vegas, Nevada, 1997. 238~245
    [28] Jose' Borges, Mark Levene: An Heuristic to Capture Longer User Web Navigation Patterns. EC-Web 2000. 155~164
    [29] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases(V LDB'94), Santiago, Chile, Sept 1994
    [30] R. Agrawal, R. Srikant. Mining Sequential Paterns. In Proceedings of the International Conference on Data Engineering (ICDE'95), Taipei, Taiwan, March 1995
    [31] S. Elo-Dean, M. Viveros. Data mining the IBM o.cial 1996 Olympics Web site. Technical report, IBM T.J. Watson Research Center, 1997
    [32] B. Liu, W.Hsu, Y.Ma. Association Rules with Multiple Minimum Supports.In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99, poster), SanDiego, CA, August 1999
    [33] B. Mobasher, H. Dai, T. Luo, M.Nakagawa. Effective Personalization Based on Association Rule Discovery from Web Usage Data. In Proceedings of the 3rd ACM Workshop on Web Information and Data Management(WIDMOI), Atlanta, Georgia, November 2001
    [34] M .Spiliopoulou, L. Faulstich. WUM: A Tool for Web Utilization Analysis. In Proceedings of EDBT Workshop at Web DB'98, LNCS 1590, Springer Verlag, 1999. 184~203
    [35] S. Schechter, M. Krishnan, M.D. Sm ith. Using Path Profiles to Predict HTTP Requests. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, April 1998
    [36] R. Sarukkai. Link prediction and path analysis using Markov Chains. In: proceedings of the 9th international World Wide Web conference on Computer networks, Amsterdam, Holland 2000
    [37] M. Deshpande, G. Karypis. Selective Markov Models for Predicting Web-Page Accesses.In Proceedings of the First International SIAM Conference on Data Mining, Chicago, April 2001
    [38]杨怡玲,管旭东,尤晋元:基于页面内容和站点结构的页面聚类挖掘算法.软件学报, 2002, 13(3): 467~469
    [39] Zhong Su, Qiang Yang, Hongjiang Zhang, Xiaowei Xu, Yuhen Hu. Correlation-based Document Clustering using Web Logs.
    [40]苏中,马少平,杨强,张宏江.基于Web-Log Mining的Web页面聚类,软件学报, 2002, 13(1): 99~104
    [41] Nasraoui O, Frigui H, Joshi A, et al. Mining Web Access Logs Using Relational Competitive Fuzzy Clustering. Procof the 8th Int'l Fuzzy Systems Association Congress. Taiwan, 1999
    [42]黄松,刘晓明,宋自林.基于归纳化会话的网络用户的聚类.计算机研究与发展, Oct. 2001, 38(10): 1224~1228
    [43]王实,高文,李锦涛,谢辉.路径聚类:在Web站点中的知识发现.计算机研究与发展, Apr.2001, 38(4): 482~486
    [44] B.Mobasher, H.Dal, M.Nakagawa T.Luo. Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization. Data Mining and Knowledge Discovery, 2002, 6(6): 1~8
    [45] R. Agrawal and R. Srikant. Fast algorithms for mining association rules, In Proc. of the 20th VLDB Conference, 1994. 487~499
    [46] J. Pei, J. W. Han, M.A. Behzad. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-projected Pattern Growth. In: Proceedings of 17th International Conference on Data Engineering, 2001, 215~224
    [47] K. Wang, H. Liu. Schema discovery for semistructured data. The 3rd Int'l Conf. Knowledge Discovery and Data Mining, Newport Beach, California, USA, 1997
    [48] T.Asai, K.Abe, et al. Efficient substructure discovery from large semi-structured data. The 2nd SIAM Int'l Conf. Data Mining (SDM2002) , Arlington, VA, USA, 2002
    [49] M.J. Zaki. Efficiently mining frequent trees in a forest. The 8th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002
    [50] Pan Jin, et al. Chopper: An efficient algorithm for tree mining The 20th National Database Conf1 (NDBC2003), Changsha, 2003
    [51] Zhu Yongtai, et al. ESPM: an algorithm to mine frequent sub-trees. Journal of Computer Research and Development, 2004, 1720~1726
    [52] R. Cole, et al. Tree pattern matching and subset matching in deterministic O(nlog3n) time. The 10th Annual ACM SIAM Sympon Discrete Algorithms, Baltimore, Maryland, 1999
    [53] E. Cohen, B. Krishnamurthy, J. Rexford. Improving end-to-end performance of the web using server volumes and proxy filters. In Proc. ACM SIGCOMM, 1998, 241~253
    [54] Mike Perkowitz, Oren Etzioni. Towards adaptive Web sites: Conceptual framework and case study. Artificial Intelligence 2000. 245~275

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700