Research and Application of Data Mining for Intelligent Website

Chinese title: 面向智能Web站点的数据挖掘技术研究及应用
Author: Shui Junfeng (水俊峰)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Intelligent Website; Data Mining; Data Preprocessing; Web Mining; Web Log Mining; Association Rule; Clustering Analysis; Sequential Pattern
Degree year: 2003
Advisor: Xia Hongxia (夏红霞)
Discipline code: 081203
Degree-granting institution: Wuhan University of Technology (武汉理工大学)
Submission date: 2003-02-01

Abstract

The growth of the Internet and e-commerce has driven the development of Web-oriented data mining. In e-commerce, data mining can be applied to Web data such as server log files to extract customer access information. Analyzing customers' access behavior, access frequency, and access time yields general knowledge about the behavior and patterns of customer groups. This knowledge can then be used to adjust page structure dynamically, improve service and marketing strategies, and present personalized interfaces, making e-commerce activities more targeted.
    Web mining makes it possible to understand the relationships among Web pages and the connection between how a Web site is organized and how its users access it. Web log mining, which targets Web server logs, has attracted particular attention from researchers. It reveals how users browse a site, identifies groups of users with similar browsing behavior, and groups pages that share characteristics according to how users visit them.
    Building on this, the thesis proposes to improve the quality of Web service by establishing an Intelligent Web Site (IWS) with Web log mining as its core technology. An IWS draws on the data available to it, including Web logs, documents, databases, and site structure, and applies data mining to extract user access patterns. Based on the user's current visit, it recommends in real time the content the user is likely to find interesting. In addition, the Web server uses site usage data to locate weaknesses in the site design and alerts the administrator to correct them.
    The thesis first presents the structure and component modules of the IWS, then studies several key data mining techniques and algorithms used by these modules, and finally implements a prototype system on that basis. It is organized as follows. Part 2 gives the design criteria and architecture of an intelligent site based on Web log mining and serves as an index to the rest of the thesis. Parts 3 to 5 are the core of the work and discuss the data mining techniques needed to design an intelligent Web site. Part 3 introduces Frame page filtering, a method that improves the results of Web log data preprocessing. Part 4 describes SLIC (Slope-Item Clustering), a fast and efficient algorithm for mining clustering patterns from Web log files. Part 5 proposes an enhanced algorithm for mining frequently accessed page groups in Web logs. Part 6 outlines the real-time recommendation module and the administrator module of an intelligent site based on Web log mining. Part 7 presents an experimental prototype system, IWS. The final chapter summarizes the research and suggests directions for further work.
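
As a rough illustration of the kind of Web log mining the abstract describes, the Python sketch below parses server log lines, drops non-page requests, splits each host's requests into sessions, and counts page pairs that are frequently visited together. It is not the thesis's Frame page filtering, SLIC, or enhanced page-group algorithm, none of which are specified in this record. The Common Log Format input, the suffix list, the 30-minute timeout, the sample data, and all function names are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from itertools import combinations

# Assumed input: Common Log Format
#   host ident user [time] "METHOD path PROTO" status size
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
SKIP_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed non-page resources
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout


def parse(lines):
    """Yield (host, timestamp, path) for successful page requests."""
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m["status"] != "200":
            continue  # skip malformed lines and failed requests
        if m["path"].lower().endswith(SKIP_SUFFIXES):
            continue  # skip embedded resources such as images
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        yield m["host"], ts, m["path"]


def sessionize(records):
    """Split each host's requests into sessions at gaps longer than SESSION_GAP."""
    by_host = defaultdict(list)
    for host, ts, path in records:
        by_host[host].append((ts, path))
    sessions = []
    for hits in by_host.values():
        hits.sort()  # chronological order within one host
        current, last = set(), None
        for ts, path in hits:
            if last is not None and ts - last > SESSION_GAP:
                sessions.append(current)
                current = set()
            current.add(path)
            last = ts
        sessions.append(current)
    return sessions


def frequent_page_pairs(sessions, min_support):
    """Return page pairs that co-occur in at least min_support sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(combinations(sorted(pages), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical sample data, invented for this example.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Feb/2003:10:00:00 +0800] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [01/Feb/2003:10:00:01 +0800] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.1 - - [01/Feb/2003:10:01:00 +0800] "GET /products.html HTTP/1.0" 200 2048',
    '10.0.0.2 - - [01/Feb/2003:10:05:00 +0800] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [01/Feb/2003:10:06:00 +0800] "GET /products.html HTTP/1.0" 200 2048',
    '10.0.0.1 - - [01/Feb/2003:12:00:00 +0800] "GET /index.html HTTP/1.0" 200 1043',
]

sessions = sessionize(parse(SAMPLE_LOG))
print(frequent_page_pairs(sessions, min_support=2))
```

Identifying users by host address alone is a simplification, since proxies and shared addresses blur the distinction between visitors. Counting pairs is only the second level of an Apriori-style search; larger page groups would need further candidate generation.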

