The Research of Personalized Search Engine Based on Analysis of Click Data

Original title (Chinese): 基于点击数据分析的个性化搜索引擎研究
Author: Lin Jiguo (蔺继国)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Personalized Search Engine; Relevance Feedback; Collaborative Filtering; PageRank; Click Data
Degree year: 2010
Advisor: Xu Xishan (徐锡山)
Discipline code: 081202
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2010-11-01

Abstract

With the rapid worldwide growth of Internet technology, the Internet has become the main medium through which people publish, obtain, and exchange information, and the volume of online information is growing explosively. While people enjoy the convenience and the rich information resources the Internet brings, they also inevitably find it hard to locate useful information quickly. As a convenient entry point to online information, the search engine is increasingly used and relied upon.
However, a traditional search engine offers a single uniform entry point to all users and returns the same result list for the same query, regardless of who issued it. That list still contains a large number of pages, and the information a user actually cares about is often buried among redundant results. Personalized search emerged to understand each user's search intent more deeply and to provide different users with different, personalized services.
Research on personalized search nevertheless remains uneven in quality, and no commercial personalized search engine yet offers a personalized service that is genuinely impressive. Addressing the current state and problems of personalized search, this thesis studies personalized search in depth on the basis of user click data analysis. The main contributions are as follows:
(1) Existing research on personalized search is analyzed and compared, and the shortcomings of current personalized search engines are identified.
(2) A strategy for extracting implicit relevance feedback from click data is proposed; it has more practical value than explicit feedback methods.
(3) A personalized PageRank algorithm that adds a correction parameter is designed. Feeding the extracted implicit feedback into PageRank yields a personalized ranking of search results that is closer to the user's search needs.
(4) Collaborative filtering is applied to the personalized PageRank algorithm, so that the relevance feedback of other users in the same interest group improves the ranking quality of results for every member of that group.
(5) A user grouping method based on interest clustering is proposed, so that users are grouped sensibly and the system becomes less complicated to use.
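Contribution (3) can be illustrated with a minimal sketch. The abstract does not give the thesis's correction formula, so the code below is not the author's algorithm: it assumes the standard personalized PageRank formulation, in which the teleport vector is biased toward pages the user has clicked. The function name `personalized_pagerank`, the one-pseudo-count-per-click weighting, and the four-page link graph are all hypothetical choices made for illustration.

```python
def personalized_pagerank(links, clicks, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank with a click-biased teleport vector.

    links[i]  -- list of page indices that page i links to
    clicks[i] -- number of times the user clicked page i (assumed input)
    """
    n = len(links)
    # Assumption: uniform prior plus one pseudo-count per click, normalized.
    weights = [1.0 + c for c in clicks]
    total = sum(weights)
    pref = [w / total for w in weights]

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Teleport step: jump to a page chosen according to the preference vector.
        new_rank = [(1.0 - damping) * p for p in pref]
        for src, targets in enumerate(links):
            if targets:
                # Follow-a-link step: split rank evenly over outgoing links.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling page: redistribute its rank according to pref.
                for dst in range(n):
                    new_rank[dst] += damping * rank[src] * pref[dst]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-page web graph: page 0 links to 1 and 2, and so on.
links = [[1, 2], [2], [0, 3], [2]]
baseline = personalized_pagerank(links, clicks=[0, 0, 0, 0])
personal = personalized_pagerank(links, clicks=[0, 0, 0, 5])
print("baseline:", [round(r, 4) for r in baseline])
print("personal:", [round(r, 4) for r in personal])
```

With no clicks the preference vector is uniform and the result is ordinary PageRank. Clicks on page 3 shift teleport probability toward it, so its score rises relative to the baseline. Contribution (4) could be approximated in the same sketch by summing the click counts of all users in an interest group before building the preference vector, although the abstract does not say how the thesis combines them.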

