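Research on a Dynamic Web Page Recommendation System Based on Web Mining

Author: 段利君 (Duan Lijun)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: web mining; web page recommendation system; sequential pattern; hit ratio; sliding window
Degree year: 2010
Advisor: 钟亦平 (Zhong Yiping)
Discipline code: 081203
Degree-granting institution: Fudan University (复旦大学)
Thesis submission date: 2010-05-22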

Abstract

Extracting user access patterns with Web mining techniques has significant practical value: it enables prefetching pages while a user browses, recommending products in e-commerce, and improving a website's organizational structure. In today's era of information explosion, however, both website content and user browsing behavior change constantly, which places new demands on the design of web page recommendation systems.
     To predict the page a user is likely to visit next, a recommendation system must look back at the pages already visited. Because sequential patterns take the page browsing sequence into account, this thesis builds on sequential pattern theory. Among sequential-pattern-based approaches to mining user browsing patterns, the most popular are those based on the Markov model and the PLSA model. The analysis in this thesis finds that both models fall short in adapting to rapid changes in website content and user browsing behavior.
     The thesis first reviews the state of research in the field, in China and abroad, and the general workflow of Web data mining. For Web log preprocessing, it presents a method for filtering log data. For web page clustering, it analyzes existing clustering methods and then proposes two URL-based methods for websites with a well-organized structure: one based on the distance between URLs and one based on a path tree. Because the URL-distance algorithm does not adapt to a dynamically growing page structure, the thesis mainly adopts the path-tree method. For sequential pattern mining, it analyzes the shortcomings of PLSA and proposes the RTA algorithm, which is based on the path tree. It then describes how to update the recommendation system, analyzes users' habits when visiting a website, and on that basis presents a design for the web page recommendation system.
     Finally, the thesis evaluates the recommendation system by hit ratio and reports how the number of recommended pages, the support, and the sliding window length relate to the hit ratio. The results are compared with experiments based on the PLSA algorithm and show that, under certain conditions, RTA outperforms PLSA.
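
The hit-ratio evaluation mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's RTA algorithm, whose details the abstract does not give. The recommender here is a plain frequency count over the last `window` pages, used only as a stand-in, and the function names, the sample sessions, and the definition of a hit (the page actually visited next appears among the top `top_n` recommendations) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_next_page_counts(sessions, window):
    """Count which page follows each context of up to `window` preceding pages.

    Stand-in model (assumption): a simple frequency table, not RTA or PLSA.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            context = tuple(session[max(0, i - window):i])
            counts[context][session[i]] += 1
    return counts


def recommend(counts, recent_pages, window, top_n):
    """Return up to `top_n` pages most often seen after the current window."""
    context = tuple(recent_pages[-window:])
    followers = counts.get(context, Counter())
    return [page for page, _ in followers.most_common(top_n)]


def hit_ratio(counts, test_sessions, window, top_n):
    """Fraction of next-page visits that appear in the recommendation list."""
    hits = 0
    total = 0
    for session in test_sessions:
        for i in range(1, len(session)):
            recs = recommend(counts, session[:i], window, top_n)
            total += 1
            if session[i] in recs:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical browsing sessions, invented for this example.
train_sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b", "/d"],
    ["/a", "/b", "/c"],
]
test_sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b", "/d"],
]

model = train_next_page_counts(train_sessions, window=2)
print(hit_ratio(model, test_sessions, window=2, top_n=1))  # 0.75
print(hit_ratio(model, test_sessions, window=2, top_n=2))  # 1.0
```

In this toy data, raising the number of recommended pages from 1 to 2 raises the hit ratio from 0.75 to 1.0, which shows the kind of relationship the experiments report; the sketch does not model support.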

