Research on a Web Page Prediction Model Based on Web-Log (基于Web-Log的网页预测模型研究)

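English title: Research on Web Prediction Model Based on Web-Log
Author: Liu Chaohui (刘超慧)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: Web日志挖掘 (Web log mining); Markov模型 (Markov model); 预测 (prediction); 浏览兴趣 (browsing interest); 关联规则 (association rules)
English keywords: Web mining; Markov model; prediction; association rules; browsing interest
Degree year: 2008
Advisor: An Jiancheng (安建成)
Discipline code: 081203
Degree-granting institution: Taiyuan University of Technology (太原理工大学)
Thesis submission date: 2008-05-01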

Abstract

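With the rapid growth of Internet information and users, effectively reducing user access latency and improving network quality of service has become a pressing problem, and caching and prefetching are effective ways to address it. However, as dynamic content and personalized services make up a growing share of the WWW, the improvement that caching brings to network performance is no longer significant. Prefetching is an effective complement to caching and the most effective way to break through the upper bound on caching performance, and it is increasingly becoming a hot topic in Web acceleration research.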
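     For Web page prediction, the Markov model is a simple and effective tool, but existing prediction methods suffer from a trade-off between prediction accuracy and prediction coverage, as well as high storage complexity. Improving Markov-model-based prediction of user browsing paths has therefore become a new topic in Web log mining. This thesis surveys domestic and international research on browsing-path prediction with Markov models, identifies the problems in existing prediction methods, proposes improvements, and studies how to improve Markov-model-based prediction methods.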
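     The thesis first introduces the origin, development, and current state of the Internet and the WWW, and presents the problems the Internet faces along with their solutions. It then describes the basic concepts and classification of Web data mining and the general methods and process of data preprocessing. It introduces a commonly used mining algorithm, the association rule algorithm, and proposes an improved algorithm to address its shortcomings.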
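As background for the association rule step, frequent itemsets are commonly mined level by level in the style of Apriori, which the English abstract names as the baseline. The sketch below is plain Apriori, not the thesis's improved algorithm, whose details are not given in this record. The function name `apriori`, the absolute `min_support` count, and the idea of treating the set of pages in one session as a transaction are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every itemset seen in at least
    `min_support` transactions (hypothetical helper, plain Apriori)."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that appears anywhere.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent
```

Each pass rescans all transactions, which is the cost that improved variants typically try to reduce.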
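     Next, the thesis proposes a new measure of user browsing interest preference. The traditional measure of a user's interest preference for Web pages cannot reflect the user's real browsing interest or the importance of a page. The new preference measure considers not only how frequently a page is browsed but also the page's access time and the size of the page itself, which makes up for the shortcomings of the traditional method. Experiments then demonstrate the effectiveness of the measure.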
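The abstract names the three inputs of the preference measure (visit frequency, access time, and page size) but does not give the formula. The sketch below shows one hypothetical way to combine them: relative visit frequency multiplied by the average viewing time per byte, scaled so the top page scores 1.0. The function name `preference_scores`, the record layout, and the combination rule are all assumptions, not the thesis's definition.

```python
from collections import Counter, defaultdict


def preference_scores(visits):
    """visits: list of (page, seconds_viewed, size_bytes) records.

    Hypothetical measure: relative frequency times mean seconds per byte,
    normalized so the most preferred page gets 1.0. Sizes must be positive.
    """
    freq = Counter()
    seconds_per_byte = defaultdict(float)
    for page, seconds, size in visits:
        freq[page] += 1
        # Dividing by size discounts long views that are due to a long page.
        seconds_per_byte[page] += seconds / size
    total = len(visits)
    raw = {
        page: (freq[page] / total) * (seconds_per_byte[page] / freq[page])
        for page in freq
    }
    top = max(raw.values())
    return {page: value / top for page, value in raw.items()}
```

Dividing viewing time by page size is what separates this from a pure frequency count: a large page viewed for a long time is not automatically ranked above a small page viewed for the same time.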
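     The author then proposes a two-step Markov prediction model, which mainly addresses the excessive space complexity of higher-order Markov models and their progressively declining coverage. Building on this, a hybrid Markov model is proposed, together with its theoretical support and the corresponding parameter estimation method. The time and space complexity are analyzed and compared, and the results show that the hybrid Markov model outperforms the second-order Markov model in both respects.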
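One plausible reading of the hybrid model is a weighted mix of a one-step transition table (previous page to next page) and a two-step table (the page two clicks back to the next page). The sketch below follows that reading. The names `step_counts` and `hybrid_predict` and the fixed `weight_one_step` are assumptions for illustration; the thesis estimates its parameters with a method not described in this record.

```python
from collections import Counter, defaultdict


def step_counts(sessions, step):
    """Count transitions from a page to the page visited `step` clicks later."""
    counts = defaultdict(Counter)
    for session in sessions:
        for i in range(len(session) - step):
            counts[session[i]][session[i + step]] += 1
    return counts


def normalize(counter):
    """Turn raw counts into probabilities."""
    total = sum(counter.values())
    return {page: n / total for page, n in counter.items()}


def hybrid_predict(sessions, history, weight_one_step=0.7):
    """Return the page with the highest mixed score, or None if unseen.

    weight_one_step is an assumed, hand-picked mixing weight.
    """
    one = step_counts(sessions, 1)
    two = step_counts(sessions, 2)
    scores = Counter()
    if history and history[-1] in one:
        for page, p in normalize(one[history[-1]]).items():
            scores[page] += weight_one_step * p
    if len(history) >= 2 and history[-2] in two:
        for page, p in normalize(two[history[-2]]).items():
            scores[page] += (1 - weight_one_step) * p
    return scores.most_common(1)[0][0] if scores else None
```

Under this reading, storage is two page-by-page tables, whereas a second-order model keys its table on every observed pair of pages, which is consistent with the space saving the abstract reports.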
     Finally, the proposed prefetching model is evaluated in experiments on real Web logs, and the experimental results are analyzed.
With the rapid growth of Web information and users, reducing user-perceived access latency and improving network quality of service has become a crucial problem, and Web prefetching and Web caching are the primary solutions. Web caching has been widely deployed at different points in the Internet. But as dynamic documents and personalized services increase, the performance of caching deteriorates significantly. As a result, Web prefetching, which is an efficient complement to Web caching and the most effective way to break the upper bound of caching performance, is becoming a hot topic in Web acceleration research.
     The Markov model is a simple and practical tool for Web prefetching, but existing prediction methods based on the Markov model still have shortcomings. How to improve these prediction methods has therefore become a new topic in Web log mining. This thesis analyzes current domestic and international research on using the Markov model for Web prediction, identifies problems in existing Markov-based prediction methods, and studies how to improve them.
     First, this thesis introduces the development and current state of the Internet and the WWW and presents the problems the Internet faces and the corresponding solutions. It then describes the concept and classification of Web data mining and the data preprocessing process for Web log mining. To overcome the drawbacks of the Apriori algorithm for mining frequent itemsets, the TIMV algorithm is proposed.
     Second, interest is a person's selective attitude toward objective things, and measuring a user's browsing interest accurately is the basis of Web pattern mining. This thesis analyzes the shortcomings of current ways of measuring and expressing user browsing interest. For instance, overly simple measures often make it hard to tell whether the user is interested in a page or not, and they ignore the influence of the amount of information on a page on the user's browsing time. A method based on users' browsing behavior is therefore proposed to measure their browsing interest.
     Then, a hybrid Markov prediction model is put forward based on the two-step Markov model, which addresses the problems of high memory demand and low applicability. In addition, the thesis gives the supporting theory and the method for obtaining the parameters.
     Finally, experiments are carried out with the prediction model and the experimental results are analyzed.
