Application and Research on Web Log Mining Based on Association Rule

Original title: 基于关联规则的web日志挖掘应用研究
Author: 孙赵平 (Sun Zhaoping)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: association rule; Apriori algorithm; BBS; Web log mining
Degree year: 2010
Advisor: 李龙澍 (Li Longshu)
Discipline code: 081202
Degree-granting institution: Anhui University (安徽大学)
Submission date: 2010-05-01

Abstract

As society becomes increasingly information-driven, the Internet is attracting more and more users of every kind around the world. They go online to search for topics and information that interest them and to take part in a wide range of interactive activities. Because Internet users are numerous and diverse, and because Internet data is massive and takes many forms, a powerful processing technology is urgently needed. Web mining emerged to meet this need and has become an important means of processing Internet information in support of higher-quality web services. The Internet generally uses a client/server architecture, and the back-end servers store large numbers of potentially valuable web log files. Web log mining can be applied to discover patterns in this log data in order to analyze how users browse a website, improve the link structure between pages and the network topology, raise system performance, and provide personalized services.
This thesis uses log data from AHUSKY (安研星空论坛, http://www.ahusky.cn), a small forum for graduate students, as the data source for mining. First, it introduces the background and significance of the topic and the state of research in China and abroad, and gives an overview of the origin, definition, process, methods, application areas, and future development of data mining. Second, it describes the classification, characteristics, process, techniques, and open challenges of web mining, then analyzes the data preprocessing procedure and points out the problems encountered in it. It then introduces the concepts of association rule mining and its classical algorithm, Apriori, describing the algorithm's idea, processing steps, and procedure in detail, and identifies the shortcomings of Apriori when applied to web log mining. To address them, it proposes an improvement strategy based on the website's access structure and on database compression, analyzes the resulting algorithm in detail, and verifies its advantages over the original. Finally, the classical and improved algorithms are compared experimentally on the forum log data under different conditions; the experiments show that the improved algorithm achieves considerably better time performance. The results of this web log mining work can be used to improve the forum's system performance and to provide users with more effective services.
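For readers unfamiliar with the classical Apriori algorithm named in the abstract, the following is a minimal Python sketch of its level-wise join, prune, and count steps applied to web sessions. It illustrates only the textbook algorithm, not the thesis's improved variant, whose details this record does not give. The function name `apriori`, the page names, and the sample sessions are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def apriori(sessions, min_support):
    """Return {frozenset(pages): support_count} for all frequent page sets."""
    transactions = [frozenset(s) for s in sessions]

    # Pass 1: count how many sessions contain each individual page.
    counts = {}
    for t in transactions:
        for page in t:
            key = frozenset([page])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge pairs of frequent (k-1)-itemsets whose union has k items.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset of a candidate must be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Count step: one full scan of the sessions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical sessions: each is the set of forum pages one visitor viewed.
sessions = [
    {"index", "board", "thread"},
    {"index", "board"},
    {"index", "thread"},
    {"board", "thread", "profile"},
    {"index", "board", "thread"},
]
frequent_sets = apriori(sessions, min_support=3)
```

The repeated full scan in the count step is the cost that strategies such as database compression aim to reduce; how the thesis does so is not specified in this record.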

