Research on Web Usage Mining Based on Association Principle

Original title: 基于关联原理的Web使用挖掘研究
Author: Fu Xiang (符翔)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Web log mining; data preprocessing; association principle; Apriori algorithm
Degree year: 2010
Advisor: Jin Ou (金瓯)
Subject code: 081203
Degree-granting institution: Central South University (中南大学)
Submission date: 2010-05-01

Abstract

With the spread and rapid growth of the Internet, e-commerce has drawn increasing attention from researchers, who hope to exploit the advantages of this new form of business for greater economic benefit. Web servers record users' browsing actions in log files. These logs provide a basis for improving a site's topology and therefore its performance, and they also make it possible to study how users browse a site so that more personalized services can be offered. This strong commercial demand gave rise to Web log mining, which makes research in this direction of considerable practical value.
     This thesis studies Web usage mining in some depth. It first surveys the basic theory and classification of Web mining and Web log mining, and describes the data sources and the content and format of log records. It then examines the log preprocessing stage, which comprises data cleaning, user identification, session identification, frame filtering, path completion, and transaction identification.
     Next, the thesis introduces the basic concepts of association rules (called the "association principle" in the title) and describes the classic association rule algorithm, Apriori. Its main contribution is an improvement to Apriori in which the transaction set is stored in a transaction matrix. The improved algorithm first removes the home page, which markedly reduces the dimension of the matrix, and then no longer needs to search candidate itemsets, which improves computational efficiency. Theoretical analysis and experiments show that the improved algorithm is effective and feasible. Association rules are then derived from the frequent itemsets, yielding rules that relate the pages recorded in the Web log. Finally, following the Web log mining workflow, a basic mining system is designed and implemented for the experiments. It consists of three parts: a data preprocessing module, a frequent pattern mining module, and an association rule mining module.
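The abstract does not give the details of the improved algorithm, so the following is only a minimal sketch of the general idea it describes: sessions are stored as a transaction matrix with the home page column dropped, itemset support is counted directly from the matrix columns, and association rules are then derived from the frequent itemsets. The function names, the bitmask representation of columns, the sample sessions, and the thresholds are all assumptions made for illustration and are not taken from the thesis.

```python
from itertools import combinations

def build_columns(sessions, home_page):
    """Transaction matrix stored column-wise: page -> bitmask of sessions containing it."""
    columns = {}
    for row, session in enumerate(sessions):
        for page in set(session):
            if page == home_page:
                continue  # dropping the home page removes one matrix column
            columns[page] = columns.get(page, 0) | (1 << row)
    return columns

def support_count(mask):
    """Number of sessions (set bits) in a column bitmask."""
    return bin(mask).count("1")

def frequent_itemsets(sessions, home_page, min_count):
    """Level-wise search; support comes from AND-ing columns, not from rescanning the log."""
    columns = build_columns(sessions, home_page)
    level = {(p,): m for p, m in columns.items() if support_count(m) >= min_count}
    result = {}
    while level:
        result.update({items: support_count(m) for items, m in level.items()})
        next_level = {}
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:  # join two k-itemsets that share a (k-1)-prefix
                mask = level[a] & columns[b[-1]]
                if support_count(mask) >= min_count:
                    next_level[a + (b[-1],)] = mask
        level = next_level
    return result

def association_rules(itemsets, min_conf):
    """Rules lhs -> rhs with confidence = support(lhs and rhs) / support(lhs)."""
    rules = []
    for items, count in itemsets.items():
        for r in range(1, len(items)):
            for lhs in combinations(items, r):
                rhs = tuple(x for x in items if x not in lhs)
                conf = count / itemsets[lhs]
                if conf >= min_conf:
                    rules.append((lhs, rhs, conf))
    return rules

# Hypothetical sessions, as they might look after preprocessing.
SESSIONS = [
    ["/index", "/a", "/b"],
    ["/index", "/a", "/b", "/c"],
    ["/index", "/a", "/c"],
    ["/index", "/b"],
]

if __name__ == "__main__":
    found = frequent_itemsets(SESSIONS, home_page="/index", min_count=2)
    for lhs, rhs, conf in association_rules(found, min_conf=0.9):
        print(lhs, "->", rhs, round(conf, 2))
```

Because the home page appears in nearly every session, it would join almost every frequent itemset without adding information, which is one plausible reason for removing it before building the matrix.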

