Research on Web Log Mining Technology and the Construction of Adaptive Web Sites

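English title: Web Log Mining and Adaptive Web Site Development
Author: 凌志泉 (Ling Zhiquan)
Degree level: Master's
Discipline: Management Science and Engineering
Chinese keywords: 数据挖掘 (data mining); Web日志挖掘 (Web log mining); 自适应Web站点 (adaptive Web site)
English keywords: Data mining; Web log mining; Adaptive Web site
Degree year: 2003
Advisor: 寇纪淞 (Kou Jisong)
Discipline code: 1201
Degree-granting institution: 天津大学 (Tianjin University)
Thesis submission date: 2003-05-01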

Abstract

As the Web grows rapidly in both application and scale, applying data mining techniques to the Web has become a highly challenging research direction. Discovering useful and important knowledge (including patterns, rules, and visualization structures) by mining Web server logs has become another important research and application area of data mining and knowledge discovery. This thesis presents a systematic study of Web log mining: by mining Web logs, it discovers user browsing patterns such as association rules, clustering information, and access paths, and applies them to the intelligent design of Web sites. The main work is as follows:
    1. The thesis introduces the basic concepts and classification of Web data mining, presents its basic principles and methods, and points out its applications.
    2. Organizing the structure of a Web server more sensibly requires analyzing users' browsing patterns through Web log mining, and the quality of the mining results depends on the data preprocessing stage. The thesis studies this stage in depth and proposes a data preprocessing model consisting of data cleaning, user identification, session identification, and path completion. The main task of each step is illustrated through a worked example.
    3. Starting from the user session file, which is the output of the preprocessing stage, the thesis proposes a Web log mining method that recognizes user browsing patterns based on an extended directed tree (Extended Oriented-Tree) model. A basic implementation of the method was built in the laboratory and tested on real Web log data.
    4. Recommendation is the core of Web personalization. The thesis proposes an automatic layered recommendation algorithm that uses page layering to choose the best matching granularity automatically and makes recommendations based on frequent navigation paths. Experimental results show that the algorithm greatly reduces the cost of online matching and can be applied successfully to Web log mining.
    5. The thesis presents an application of Web log mining, a user-adaptive Web site, and describes the implementation approach and main features of this system.
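
The preprocessing stages named in item 2 can be illustrated with a minimal Python sketch. It is not the model proposed in the thesis, whose details are not given in this abstract. The record layout, the suffix filter, the use of IP address plus user agent to identify a user, and the 30-minute session timeout are all assumptions chosen for illustration. Path completion is omitted because it needs the site's link structure.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (client IP, user agent, timestamp, URL).
RAW_LOG = [
    ("10.0.0.1", "Mozilla", "2003-05-01 10:00:00", "/index.html"),
    ("10.0.0.1", "Mozilla", "2003-05-01 10:00:01", "/logo.gif"),
    ("10.0.0.1", "Mozilla", "2003-05-01 10:05:00", "/news.html"),
    ("10.0.0.2", "Opera", "2003-05-01 10:06:00", "/index.html"),
    ("10.0.0.1", "Mozilla", "2003-05-01 11:30:00", "/index.html"),
]

# Assumed filter list and timeout; neither value comes from the thesis.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Data cleaning: drop requests for embedded resources such as images."""
    return [r for r in records if not r[3].lower().endswith(IGNORED_SUFFIXES)]


def identify_users(records):
    """User identification: group requests by the (IP, user agent) pair."""
    users = {}
    for ip, agent, ts, url in records:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        users.setdefault((ip, agent), []).append((when, url))
    return users


def identify_sessions(users, timeout=SESSION_TIMEOUT):
    """Session identification: split a user's requests at gaps longer than timeout."""
    sessions = []
    for user, visits in users.items():
        visits.sort()  # chronological order
        current = [visits[0][1]]
        for (prev_time, _), (time, url) in zip(visits, visits[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


def preprocess(records):
    """Run cleaning, user identification, and session identification in order."""
    return identify_sessions(identify_users(clean(records)))


if __name__ == "__main__":
    for user, pages in preprocess(RAW_LOG):
        print(user, pages)
```

The resulting list of per-user page sequences plays the role of the user session file that item 3 takes as its input.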

