Research on Personalized Intelligent Recommendation Technology Based on WUM

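Author: Zhou Yu (周宇)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: Web挖掘 ; Web日志挖掘 ; 关联规则 ; 最大前向访问路径 ; 浏览模式
English keywords: Web mining ; Web log mining ; Association rules ; Maximal forward traversal path ; Browsing patterns
Degree year: 2003
Advisor: Zhang Sen (张森)
Discipline code: 081203
Degree-granting institution: Zhejiang University of Technology (浙江工业大学)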

Abstract

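With the rapid development of the Internet and the maturing of WWW (World Wide Web) technology, which is penetrating every aspect of social life, the volume of available information resources keeps growing and their types keep multiplying, and the information people exchange is inevitably becoming electronic and massive. The huge, unorganized body of information, together with the wide distribution of information resources across the Internet, makes it harder for users to find the information that interests them, and users do not know how to discover the resources they need more effectively. Moreover, existing information publishing systems and search engines, because of their inherent shortcomings, cannot solve these two problems effectively.
     Web mining, which arose from combining traditional data mining techniques with the Web, opens a new way to address this problem. This thesis attempts to use Web mining to analyze massive Web access log data in depth, to mine users' personalized access transaction patterns, and on that basis to recommend information to users intelligently, so as to provide a personalized, proactive information service. The main work is as follows:
     (1) The origins and development background of data mining are analyzed, and the current state of data mining research in China and abroad is introduced.
     (2) The architecture of Web data mining is analyzed and studied in depth. Web data mining is surveyed, with the relevant definitions and classification given; mining techniques for Web logs and semi-structured data are discussed in detail; and the general process of Web log mining is described.
     (3) The general workflow of preprocessing for Web usage mining and the related definitions are discussed. Three transaction identification methods are proposed: one based on reference length, one based on maximal forward references, and one based on time windows.
     (4) Two methods for clustering user transaction patterns are discussed: clustering of navigation-content transaction patterns based on maximal forward access paths, and clustering of content transaction patterns. For these, a similarity measure between user transactions based on structure coefficients and a similarity measure based on common-ancestor and common-descendant similarity coefficients are proposed, respectively. Experimental results show that clustering navigation-content transaction patterns based on maximal forward access paths groups together user transactions with similar access paths and is therefore better suited to online personalized recommendation, whereas clustering content transaction patterns is better suited to cluster analysis of strongly related Web pages.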
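     (5) An online personalized intelligent information recommendation service based on Web usage pattern mining is studied. It is divided into an online part and an offline part. The offline part mines, from the site server's access log files, user transaction patterns suitable for online intelligent personalized recommendation, obtaining users' personalized patterns through association rule mining and through clustering of user transactions. The online part implements a personalized intelligent recommendation service based on association rule mining and one based on URL cluster patterns. The two recommendation methods are analyzed and compared, and their strengths and weaknesses are summarized. Experimental results show that the intelligent recommendation system is feasible and effective.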
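
Item (3) above names transaction identification based on maximal forward references. The abstract does not give the thesis's own procedure, so the following is only a minimal sketch of the commonly described formulation: a session is read page by page, a visit to a page already on the current path counts as a backward move, and each forward run that ends in a backward move (or in the end of the session) is emitted as one transaction. The function name `maximal_forward_references` and the representation of a session as a plain list of page identifiers are assumptions made for illustration.

```python
from typing import Hashable, List, Sequence


def maximal_forward_references(session: Sequence[Hashable]) -> List[List[Hashable]]:
    """Split one session's page sequence into maximal forward reference paths.

    Illustrative sketch only; not taken from the thesis.
    """
    paths: List[List[Hashable]] = []
    current: List[Hashable] = []   # path from the session's start to the current page
    moving_forward = False         # True if the previous step extended the path

    for page in session:
        if page in current:
            # Backward reference: the user returned to a page already on the path.
            if moving_forward:
                # The forward run just ended, so the path so far is maximal.
                paths.append(list(current))
            # Cut the path back to the revisited page.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            # Forward reference: extend the current path.
            current.append(page)
            moving_forward = True

    # A session that ends on a forward move leaves one last maximal path.
    if moving_forward and current:
        paths.append(list(current))
    return paths
```

For the visit sequence A B C D C B E G H G W A O U O V, this sketch yields the paths ABCD, ABEGH, ABEGW, AOU and AOV. Treating any revisit as a backward move is a simplification; a real preprocessing step would also have to deal with caching and missing log entries.
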
With the rapid growth of the Internet and the maturing of the WWW (World Wide Web), applications based on this technology are entering every aspect of society. The amount of usable information keeps growing, as does the variety of its types, and the information people exchange is inevitably becoming electronic. Because the information is vast and unorganized and resources are spread widely across the Internet, users find it difficult to locate the information they need. Furthermore, existing information access systems and search engines cannot solve these problems efficiently because of their inherent defects.
    Combining data mining with the Web offers a new way to address the problem. This thesis attempts an in-depth analysis of Web log data using Web data mining, in order to derive users' transaction patterns and provide intelligent personalized recommendation services. The contents of this dissertation are as follows:
    (1) We review the origin and background of data mining technology and introduce the current status of international and domestic research on data mining.
    (2) We analyze the architecture of Web data mining in depth, give an overview of Web data mining together with its definition and categories, and describe the general process of mining Web logs.
    (3) We introduce the general structure and definitions of the data preprocessing phase of Web log mining, and propose transaction identification methods based on reference length, maximal forward reference, and time windows, respectively.
    (4) We discuss clustering methods for two kinds of user transaction patterns: navigation-content transactions based on maximal forward references, and content-only transactions. In the former, the similarity measure between user transaction patterns attempts to incorporate the structure of the Web site and the URLs involved. In the latter, the similarity measure uses direct paths, common ancestors, and common descendants to cluster user transaction patterns for online personalized intelligent recommendation services.
    (5) We propose an intelligent service method for personalized recommendation based on users' transaction patterns and the user's current navigational activity. The overall process is divided into an offline part and an online part. Offline, Web mining tasks are run on the Web server's logs to produce a file of user transaction patterns. Online, candidate URLs for recommendation are determined by matching association rules in the aggregating tree, or URL clusters, against the current active session. The advantages and shortcomings of the two methods are discussed. The experiments demonstrate that our approach is applicable and effective.
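
The online step in item (5), matching mined association rules against the current active session, could look roughly like the sketch below. It is not the thesis's implementation: the abstract mentions an aggregating tree for storing rules, and a plain linear scan over a list stands in for it here. The `Rule` type, its fields, the `recommend` function, and the `top_n` parameter are all hypothetical names chosen for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical mined rule: pages in `antecedent` imply a visit to `consequent`."""
    antecedent: FrozenSet[str]
    consequent: str
    confidence: float


def recommend(rules: Iterable[Rule], active_session: Iterable[str], top_n: int = 3) -> List[str]:
    """Return up to top_n candidate URLs for the current session.

    A rule fires when every page in its antecedent has been visited in the
    active session. Already visited pages are never recommended.
    """
    visited = set(active_session)
    best: Dict[str, float] = {}  # candidate URL -> highest confidence seen
    for rule in rules:
        if rule.consequent in visited:
            continue
        if rule.antecedent <= visited:
            if rule.confidence > best.get(rule.consequent, 0.0):
                best[rule.consequent] = rule.confidence
    # Rank by confidence (descending); break ties by URL for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:top_n]]
```

With rules {/a, /b} -> /c (confidence 0.8) and {/a} -> /d (confidence 0.6), a session that has visited /a and /b would receive /c followed by /d. A linear scan is easy to check by hand but costs one pass over all rules per request, which is presumably the overhead an indexed structure such as the aggregating tree is meant to avoid.
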

