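Research on Web-Based User Access Information Mining (基于Web的用户访问信息挖掘研究)

English title: Research on Web-Based User Access Information Mining
Author: Zhao Peng (赵朋)
Degree level: Master's
Discipline: Management Science and Engineering
Chinese keywords: 数据挖掘 ; Web挖掘 ; 日志挖掘 ; 神经网络 ; 关联规则 ; 数据库
English keywords: data mining ; web mining ; log mining ; neural network ; association rules ; database
Degree year: 2006
Advisor: Yang Bao'an (杨保安)
Discipline code: 1201
Degree-granting institution: Donghua University (东华大学)
Thesis submission date: 2005-12-01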

Abstract

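As a means of knowledge discovery, data mining is widely applied and is one of the most active areas of database research. Web mining applies traditional data mining techniques to the Web environment in order to extract information or knowledge from the Web. Within Web mining, the mining of Web-based user access information is the most widely applied branch, with applications in e-commerce, online advertising, intelligent recommendation systems, online marketing, and intelligent decision making. A sound mining model, together with the corresponding data representation and database design, is the key to successful Web access information mining, and this thesis studies these issues.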
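     Building on a study of the relevant theory and recent results in Web user access information mining, this thesis examines several problems and methods in the data preprocessing and pattern discovery stages, proposes some improved methods and algorithm implementations, and establishes data representations and a database design for the specific problems addressed. On this basis it proposes a database-backed Web user access information mining system and gives a preliminary implementation of several of its functional modules.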
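     Data preprocessing is the data preparation stage of Web mining. This thesis implements database-based data cleaning with SQL Server 2000 and proposes a method for removing Web spider (crawler) requests by character-pattern matching. For user identification it proposes an algorithm based on three attributes: cookie, IP address, and user agent. It also gives concrete algorithms for session identification and transaction identification, with transactions identified by maximal forward reference.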
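The abstract does not reproduce the thesis's own transaction identification algorithm, so the following is only a minimal sketch of the general maximal forward reference idea: a transaction is closed each time the user moves back to a page already on the current path. The function name and the representation of a session as a list of page identifiers are assumptions made for illustration.

```python
from typing import Hashable, List, Sequence


def maximal_forward_references(session: Sequence[Hashable]) -> List[List[Hashable]]:
    """Split one session's page sequence into maximal forward references.

    Illustrative sketch only; not the thesis's implementation.
    """
    transactions: List[List[Hashable]] = []
    path: List[Hashable] = []   # current forward path from the first page
    moving_forward = False      # True if the last step extended the path

    for page in session:
        if page in path:
            # Backward reference: the user returned to an earlier page.
            # The path walked so far is maximal only if we were moving forward.
            if moving_forward:
                transactions.append(list(path))
            # Drop everything after the revisited page.
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            # Forward reference: extend the current path.
            path.append(page)
            moving_forward = True

    # A session that ends while moving forward leaves one last transaction.
    if moving_forward and path:
        transactions.append(list(path))
    return transactions
```

For example, calling `maximal_forward_references(list("ABCDCBEGHGWAOUOV"))` splits that click sequence into the paths A-B-C-D, A-B-E-G-H, A-B-E-G-W, A-O-U, and A-O-V.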
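     Pattern discovery is the key stage of Web mining. First, this thesis creates a data representation of user access interest and generalizes page data using a concept hierarchy; from this it derives a data set suitable for a BP (backpropagation) neural network and applies the network to user classification, constructing a classifier. Second, building on a study of association rule and sequence algorithms, it proposes and implements an algorithm for frequent access paths. Finally, it implements in Matlab an algorithm that computes the page-category association matrix and performs statistical analysis, which enables statistical analysis and association rule mining at a higher conceptual level and offers good extensibility and ease of use.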
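The abstract does not define the page-category association matrix, and the thesis implements it in Matlab. As a rough illustration only, the sketch below assumes the matrix counts how many sessions contain each pair of page categories, with the diagonal holding per-category session counts, and that a dictionary maps each page to its category in the concept hierarchy. All names and this definition are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def category_association_matrix(
    sessions: Sequence[Sequence[str]],
    page_category: Dict[str, str],
) -> Tuple[List[str], List[List[int]]]:
    """Count, per pair of categories, the sessions in which both occur.

    Assumed definition for illustration; counts[i][i] is the number of
    sessions that touch category i at all.
    """
    categories = sorted(set(page_category.values()))
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]

    for session in sessions:
        # Generalize pages to categories; pages without a category are skipped.
        seen = {page_category[p] for p in session if p in page_category}
        for c in seen:
            counts[index[c]][index[c]] += 1
        for a, b in combinations(sorted(seen), 2):
            counts[index[a]][index[b]] += 1
            counts[index[b]][index[a]] += 1
    return categories, counts


def confidence(counts: List[List[int]], i: int, j: int) -> float:
    """Confidence of the rule 'category i -> category j' from the counts."""
    return counts[i][j] / counts[i][i] if counts[i][i] else 0.0
```

Working at the category level rather than the page level keeps the matrix small, which is one plausible reading of the "higher conceptual level" analysis the abstract mentions.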
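     Finally, on the basis of the preceding work, this thesis proposes a prototype of a database-backed Web user access information mining system and analyzes each of its modules. The prototype allows all operations to be carried out on the database, and the patterns and rules obtained are also stored in the database, …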

As a method of knowledge discovery, data mining has been widely applied and is one of the most active areas of database research. Web mining applies traditional data mining techniques to extract information and knowledge from the Web environment. Web usage mining is the most widely used form of Web mining, with applications in e-commerce, Internet advertising, intelligent recommendation systems, Internet marketing, and intelligent decision support. A good mining model is the key to successful Web usage mining, and this dissertation studies that problem.
     Based on a study of the theory and recent results in Web user access information mining, the dissertation improves and implements several methods and algorithms, and designs a database to represent the corresponding data. It then constructs a database-based Web user access information mining system model and implements several of its functional modules.
     Data preprocessing is the preparation stage of Web mining. The dissertation implements data cleaning in SQL Server 2000 and introduces a cleaning method based on character matching of crawlers. In the user identification phase, a method based on cookie, IP address, and agent is used. The dissertation gives concrete algorithms for session identification and transaction identification, the latter using the maximum forward path.
     Pattern discovery is the key to Web mining. The dissertation first constructs a data representation of user access interest, uses a concept hierarchy to generalize the page data, derives a data set suitable for BP networks, and uses a BP network to construct a classifier. It then introduces and implements a frequent access path algorithm based on association rules and sequential patterns. Finally, it develops a Matlab algorithm, extensible and practical, that calculates the relation matrix and performs statistical analysis.
     On the basis of the work above, the dissertation presents a database-based Web mining system model and describes and analyzes every module. In this model all operations are based on the database, and all discovered patterns are stored in the database so that they can be managed and applied easily. The dissertation applies Web user access information mining to Shanghai agriculture information and finds several useful patterns. The experimental data show that the Web user access information mining system is practical and effective.
     The dissertation uses SQL Server 2000 as the database system and SQL statements to implement data preprocessing; all functions are developed in C++ and Matlab. Web user access information mining is a widely used Web mining technique. It can reveal users' interests, improve site structure, provide customized services, support better marketing policy, and recommend content and predict user behavior. The model given in this dissertation is applicable, and the research has theoretical importance and practical value for Web user access information mining.
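The abstract names cookie, IP address, and agent as the attributes used for user identification but does not say how they are combined. The sketch below is one possible reading, under two assumptions that are not from the thesis: a cookie value, when present, takes precedence over the (IP, agent) pair, and sessions are split with a 30-minute inactivity timeout. The `Request` record and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Request(NamedTuple):
    timestamp: float         # seconds since some epoch
    ip: str
    agent: str
    cookie: Optional[str]    # None when the log records no cookie
    url: str


def user_key(req: Request) -> Tuple[str, ...]:
    """Assumed precedence: cookie if available, otherwise (IP, agent)."""
    if req.cookie:
        return ("cookie", req.cookie)
    return ("ip_agent", req.ip, req.agent)


def sessions_by_user(
    requests: Iterable[Request],
    timeout: float = 1800.0,  # assumed 30-minute inactivity threshold
) -> Dict[Tuple[str, ...], List[List[str]]]:
    """Group requests per identified user, then split into sessions."""
    per_user: Dict[Tuple[str, ...], List[Request]] = defaultdict(list)
    for req in sorted(requests, key=lambda r: r.timestamp):
        per_user[user_key(req)].append(req)

    result: Dict[Tuple[str, ...], List[List[str]]] = {}
    for key, reqs in per_user.items():
        sessions = [[reqs[0].url]]
        for prev, cur in zip(reqs, reqs[1:]):
            # A long gap between consecutive requests starts a new session.
            if cur.timestamp - prev.timestamp > timeout:
                sessions.append([])
            sessions[-1].append(cur.url)
        result[key] = sessions
    return result
```

Falling back to the (IP, agent) pair separates users behind the same proxy who use different browsers, but it cannot separate those with identical browsers, which is why a cookie is preferred here when one is logged.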
