Research on Data Mining Based on Web Logs

Original title: 基于Web日志的数据挖掘研究
Author: Tian Haishan (田海山)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: data mining; Web mining; user clustering; Web page clustering; frequent access paths; Web page prefetching
Degree year: 2003
Advisor: Peng Yuqing (彭玉青)
Discipline code: 081203
Degree-granting institution: Hebei University of Technology (河北工业大学)
Submission date: 2003-01-01

Abstract

    In recent years the Internet has grown at a remarkable pace, and more and more institutions, groups, and individuals publish and look up information on it. Although the Internet holds a massive amount of data, the Web is unstructured and dynamic, and Web pages are far more complex than plain text documents, so finding the data one wants is like looking for a needle in a haystack. Websites cannot cluster their users and pages, and therefore cannot offer tailored services to particular users. There is also a gap between the topology of a website and the structure its visitors expect. Finally, some users have limited hardware resources and browse the Web on palmtop devices (such as Palm Pilot, Pocket PC, and Handspring); how to prefetch pages for them is another problem worth studying.
    How can these problems be solved? One approach is Web mining, which combines traditional data mining techniques with the Web. Web mining is the process of extracting interesting, potentially useful patterns and hidden information from Web documents and Web activity. It can be applied in many areas, such as structure mining for search engines, identifying authoritative pages, classifying Web documents, classifying Web logs, and intelligent querying.
    This thesis first introduces the definition, tasks, and classification of Web mining, together with its model and processing steps.
    It then proposes a data structure suited to Web log mining and a corresponding algorithm. The data structure is a user/page (User_URL) association matrix that records which users accessed which pages. The mining algorithm uses matrix clustering (Matrix Cluster) and supports user clustering, page clustering, identification of frequent access paths, and access prediction.
    The thesis closes by summarizing the remaining shortcomings of the work and by pointing out research directions, application prospects, and challenges for Web mining.
    Experiments show that the algorithm gives good results when applied to the Web logs of a campus network. In addition, applying it to an e-commerce website makes it possible to build an adaptive website that delivers personalized services to individual customers and ultimately gives merchants strong support for their decisions.
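
The user/page matrix idea can be illustrated with a small sketch. The abstract does not say what the matrix entries are or how the clustering proceeds, so everything below is an assumption made for illustration: entries are access counts, similarity is cosine similarity, grouping is a greedy single pass with a threshold, and all function names are hypothetical. It is not the thesis's Matrix Cluster algorithm.

```python
def build_user_url_matrix(records):
    """Build a user-by-URL access-count matrix from (user, url) pairs."""
    users = sorted({user for user, _ in records})
    urls = sorted({url for _, url in records})
    row = {user: i for i, user in enumerate(users)}
    col = {url: j for j, url in enumerate(urls)}
    matrix = [[0] * len(urls) for _ in users]
    for user, url in records:
        matrix[row[user]][col[url]] += 1
    return users, urls, matrix


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_vectors(labels, vectors, threshold):
    """Greedy single-pass grouping (an assumed stand-in for matrix clustering).

    Each vector joins the first cluster whose seed vector is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_labels)
    for label, vector in zip(labels, vectors):
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(label)
                break
        else:
            clusters.append((vector, [label]))
    return [members for _, members in clusters]


def transpose(matrix):
    """Swap rows and columns so that pages become the clustered objects."""
    return [list(column) for column in zip(*matrix)]


if __name__ == "__main__":
    # Toy access records; real input would come from parsed Web server logs.
    records = [
        ("u1", "/a"), ("u1", "/b"),
        ("u2", "/a"), ("u2", "/b"),
        ("u3", "/c"), ("u3", "/d"),
        ("u4", "/c"),
    ]
    users, urls, matrix = build_user_url_matrix(records)
    print(cluster_vectors(users, matrix, threshold=0.5))            # user clusters
    print(cluster_vectors(urls, transpose(matrix), threshold=0.5))  # page clusters
```

Rows of the matrix describe users and columns describe pages, so the same grouping routine clusters users when run on rows and pages when run on the transposed matrix. On the toy records it groups u1 with u2 and u3 with u4, and /a with /b and /c with /d.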

