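Research on Web-Based Data Mining (基于Web的数据挖掘研究)

English title: Study on Web Data Mining
Author: 张承明 (Zhang Chengming)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 数据挖掘 (data mining) ; Web挖掘 (Web mining) ; 浏览兴趣 (browsing interest) ; 个性化推荐 (personalized recommendation)
English keywords: data mining ; Web mining ; browsing interest ; personalized recommendation
Degree year: 2003
Supervisor: 孙忠林 (Sun Zhonglin)
Discipline code: 081203
Degree-granting institution: 山东科技大学 (Shandong University of Science and Technology)
Thesis submission date: 2003-05-01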

Abstract

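Data mining is a new information technology that has emerged in recent years with the development of database and artificial intelligence technology. It combines knowledge from databases, artificial intelligence, statistics and other disciplines, and tries to extract previously unknown, valid and practical knowledge from data. Data mining is closely related to statistics, database technology and knowledge discovery in databases (KDD), yet differs clearly from each of them. Its main research topics include generalization knowledge, association knowledge, classification knowledge, clustering knowledge, predictive knowledge and deviation knowledge. Mining is carried out with techniques such as association analysis, classification and cluster analysis, neural networks, decision trees and rule-based reasoning.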
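     Because information on the Web is vast in quantity, highly disordered and highly redundant, people still cannot obtain the information they need from it quickly and conveniently. Web mining applies traditional data mining techniques to the Web environment, and tries to discover implicit, previously unknown, potentially useful and non-trivial patterns in large collections of Web documents and in the data generated as users browse the Web. Web mining is divided into Web content mining, Web structure mining and Web usage mining. Web usage mining extracts interesting patterns from the data produced when users browse a website and seeks to understand users' browsing interests and behavior, so that the site structure can be improved or personalized services can be offered.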
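As a rough illustration of the raw material that Web usage mining starts from, the sketch below parses server log lines and groups requests into sessions. It is not the collection method of the thesis: the Common Log Format input, the 30-minute timeout, the sample addresses and the names `parse_line` and `sessionize` are all assumptions made for this example.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Assumed heuristic: a gap longer than this starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, url) for one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("host"), ts, m.group("url")


def sessionize(lines):
    """Group requests by host and split each host's requests on long gaps."""
    by_host = {}
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            host, ts, url = rec
            by_host.setdefault(host, []).append((ts, url))
    sessions = []
    for host, hits in by_host.items():
        hits.sort()
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > SESSION_TIMEOUT:
                sessions.append((host, [u for _, u in current]))
                current = []
            current.append(hit)
        sessions.append((host, [u for _, u in current]))
    return sessions


# Hypothetical sample data, using documentation-only IP addresses.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2002:13:55:36 +0800] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2002:13:57:02 +0800] "GET /news.html HTTP/1.0" 200 5120',
    '192.0.2.1 - - [10/Oct/2002:15:10:00 +0800] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2002:13:56:00 +0800] "GET /about.html HTTP/1.0" 200 1024',
]

for host, urls in sessionize(SAMPLE_LOG):
    print(host, urls)
```

Identifying a user by host address is only a heuristic: several users behind one proxy look like a single visitor, and pages served from a browser cache never reach the log. This is the kind of inaccuracy that motivates supplementing server logs with client-side data.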
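     This thesis studies two aspects of Web usage mining: data collection, and the measurement and representation of users' browsing interest. The main work is as follows: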
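     1. It analyzes the existing data collection methods for Web usage mining and points out their shortcomings, for example that the stateless nature of HTTP connections makes it difficult to obtain accurate user browsing information from Web logs. It proposes a method that combines server log files with client-side data to obtain user browsing information.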
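     2. Interest is a person's selective attitude toward objective things, and measuring users' browsing interest accurately is the foundation of Web usage mining. Focusing on Web usage mining, the thesis first analyzes the shortcomings of existing ways of measuring browsing interest: for example, the measures are too simple to separate the categories a user is interested in from those the user is not, and they ignore the effect of a page's information content on how long the user spends browsing it. On this basis, it proposes a method for measuring browsing interest based on users' browsing behavior.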
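     3. How to represent users' browsing interest effectively is one of the research directions in Web usage mining. After analyzing the existing representations, the thesis proposes a tree-structured representation of users' browsing interest.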
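The abstract does not describe the thesis's tree structure in detail, so the following is only one plausible sketch of a tree-shaped interest profile. It assumes that URL directories stand in for topic categories and that a node's interest is the sum of the scores beneath it. The class `InterestNode`, the function `build_tree` and the sample scores are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InterestNode:
    """One node of a hypothetical interest tree; `score` is interest recorded directly here."""
    name: str
    score: float = 0.0
    children: dict = field(default_factory=dict)

    def add(self, path, score):
        """Add `score` at the node reached by following `path` (a list of names)."""
        node = self
        for part in path:
            node = node.children.setdefault(part, InterestNode(part))
        node.score += score

    def total(self):
        """Interest of this node plus all of its descendants."""
        return self.score + sum(c.total() for c in self.children.values())

    def render(self, depth=0):
        """Return indented lines, children ordered by descending total interest."""
        lines = ["  " * depth + f"{self.name}: {self.total():.2f}"]
        for child in sorted(self.children.values(), key=lambda c: -c.total()):
            lines.extend(child.render(depth + 1))
        return lines


def build_tree(page_scores):
    """Build a tree from {url_path: interest_score}, using URL directories as categories."""
    root = InterestNode("/")
    for url, score in page_scores.items():
        parts = [p for p in url.split("/") if p]
        root.add(parts, score)
    return root


# Hypothetical per-page interest scores for one user.
tree = build_tree({
    "/sports/football/a.html": 3.0,
    "/sports/tennis/b.html": 1.5,
    "/news/c.html": 0.5,
})
print("\n".join(tree.render()))
```

A tree of this kind lets interest be read at several levels of detail: the `sports` node summarizes everything beneath it, while its children show where within that category the interest is concentrated.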
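     The behavior-based method for measuring and representing browsing interest proposed in this thesis addresses the shortcomings of earlier approaches in data collection, interest measurement and interest representation, and so better prepares the data for further mining.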
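To make concrete the point that page information content affects browsing time, here is a minimal sketch of a size-normalized interest score. The formula is an assumption for illustration, not the measure proposed in the thesis: it divides dwell time by page size in kilobytes and scales by the user's highest value. The names `reading_rate` and `interest_scores` and the sample visits are hypothetical.

```python
def reading_rate(dwell_seconds, page_bytes):
    """Seconds spent per kilobyte of page content (an assumed proxy for attention)."""
    kilobytes = max(page_bytes, 1) / 1024.0
    return dwell_seconds / kilobytes


def interest_scores(visits):
    """Map each URL to a score in [0, 1].

    `visits` is a list of (url, dwell_seconds, page_bytes) tuples for one user.
    Rates for repeated visits to the same URL are summed, then all totals are
    scaled by the largest one so the user's top page scores 1.0.
    """
    totals = {}
    for url, dwell, size in visits:
        totals[url] = totals.get(url, 0.0) + reading_rate(dwell, size)
    top = max(totals.values(), default=0.0)
    if top == 0.0:
        return {url: 0.0 for url in totals}
    return {url: rate / top for url, rate in totals.items()}


# Hypothetical visits: a 60 KB page read for 120 s and a 10 KB page read for 60 s.
scores = interest_scores([
    ("/long.html", 120, 61440),
    ("/short.html", 60, 10240),
])
print(scores)
```

Ranked by raw dwell time, `/long.html` would come first. After normalizing by size, `/short.html` ranks higher, because the user spent six seconds per kilobyte on it against two on the longer page.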
English abstract

Data mining is a fairly new information technology that has developed alongside database and artificial intelligence technology. It tries to extract previously unknown, valid and useful knowledge from data. On the one hand, data mining is closely related to database technology, statistics and KDD; on the other hand, it differs clearly from them. Data mining mainly studies generalization knowledge, association knowledge, classification knowledge, clustering knowledge, prediction knowledge and deviation knowledge. The techniques used in data mining include association analysis, classification and clustering.
    Because Web information is vast in quantity, highly disordered and highly redundant, people cannot get the information they need from the Web quickly and conveniently. Web mining applies traditional data mining technology to the Web, attempting to find implicit, previously unknown, non-trivial patterns of potential practical value in large collections of Web documents and in the data generated when users browse the Web. Web usage mining extracts interesting patterns from the data left by users' browsing and tries to understand their browsing interests and behavior, in order to improve a website's structure or provide personalized services.
    This thesis addresses data collection for Web usage mining and the measurement and representation of users' browsing interest. The main contributions are as follows:
    1. It analyzes current data collection methods for Web usage mining and points out their shortcomings; for example, because HTTP is stateless, it is difficult to obtain accurate information about users' browsing from Web logs. It proposes a method that combines server log files with client-side data to obtain users' browsing information.
    2. Interest is a person's selective attitude toward objective things, and measuring users' browsing interest accurately is the basis of Web usage mining. Within the field of Web usage mining, the thesis analyzes the shortcomings of existing ways of measuring browsing interest. For instance, measures that are too simple make it hard to distinguish what a user is interested in from what the user is not, and they do not consider how the amount of information on a page influences the user's browsing time. On this basis, it proposes a method for measuring browsing interest based on users' browsing behavior.
    3. One research direction in Web usage mining is how to represent users' browsing interest effectively. The thesis proposes a representation of browsing interest based on a tree structure.
    The proposed method for measuring and representing browsing interest from users' browsing behavior remedies the shortcomings of earlier measures and representations in data collection, interest measurement and interest representation, and better prepares for further mining work.

