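Research on Personalized Recommendation Technology Based on XML and Association Rules

Original title: 基于XML及关联规则的个性化推荐技术研究
English title as recorded: The Research on Personalized Recommendation Technology Based on XML and Association Rule
Author: Wang Shuangming (王双明)
Thesis level: Master's
Discipline: Computer Software and Theory
Keywords: Personalized Recommendation; XML; Association Rule; Web Usage Mining
Degree year: 2010
Supervisor: Wang Chengliang (王成良)
Discipline code: 081202
Degree-granting institution: Chongqing University
Submission date: 2010-04-01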

Abstract

With the rapid growth of Internet applications, information overload leaves people facing more information than they can select from and digest, and the wide distribution of information resources makes it harder for users to find what interests them, so they easily become lost. A technique is urgently needed that can automatically discover, extract, and filter information when people search massive data sets for the data they want. Personalized recommendation technology frees people from the unbounded information resources of the network and a bewildering range of products, greatly reducing the time and effort users spend searching. It also shifts Web sites from a page-centered model to a user-centered one that offers personalized services, moving network services to a higher level.
     Existing personalized recommendation techniques fall short in collecting information about anonymous users and in the timeliness and accuracy of their recommendations. Building on a study of classical association rule mining algorithms, this thesis proposes a Web mining technique based on XML and association rules that analyzes and mines Web access logs to obtain users' frequent access patterns for a Web site, and applies association-rule-based personalized recommendation to improve the efficiency of site access.
     The main work of the thesis is as follows:
     ① The research background, current state, and practical significance of personalized recommendation technology are described and analyzed, together with the theoretical foundations of Web usage mining, and the basic principles of association rule mining are explained.
     ② XGMML and LOGML, both derived from XML, are used to represent and store Web access logs. Data preprocessing for Web log mining is carried out through data cleaning, user identification, session identification, path completion, and transaction identification.
     ③ Following an analysis of the Apriori and FP-growth algorithms, an improvement to FP-growth that uses MFIT is proposed. The improved algorithm reduces the search space for mining maximal frequent itemsets and the number of item comparisons needed for superset checking, which improves execution efficiency.
     ④ A prototype personalized recommendation system is designed and implemented. When pages are recommended from users' frequent access patterns, a page distance factor is computed to improve recommendation quality.
     The work is a practical improvement to the FP-growth algorithm for mining association rules and offers a useful reference for research on association rule mining algorithms. The study of user access patterns helps improve the quality of a site's information services and contributes to the development of intelligent information processing, so it has research value in both theory and practice.
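As a rough illustration of two ideas named in the abstract, session identification and association rules over page visits, the following Python sketch may help. It is not the thesis's XGMML/LOGML pipeline and not its improved FP-growth algorithm. The record layout, the 30-minute timeout, and the names `identify_sessions`, `support`, and `confidence` are all assumptions made here for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed threshold: a gap longer than this starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def identify_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, page) records into per-user sessions.

    user_key stands in for whatever user identification yields
    (for example IP address plus user agent); that choice is an assumption.
    """
    by_user = defaultdict(list)
    for user_key, ts, page in records:
        by_user[user_key].append((ts, page))

    sessions = []
    for user_key in sorted(by_user):
        visits = sorted(by_user[user_key])  # order each user's visits by time
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                # Long idle gap: close the current session, start a new one.
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def support(itemset, transactions):
    """Fraction of transactions that contain every page in itemset."""
    if not transactions:
        return 0.0
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent | consequent) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base
```

Treating each session as one transaction, a rule such as {"/index"} → {"/news"} would be kept only if its support and confidence both exceed user-chosen thresholds. A miner such as Apriori or FP-growth finds the frequent itemsets without enumerating every candidate the way repeated calls to this `support` function would.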
