Intelligent and Personalized Research for Web Search Engine

Author: 徐静秋 (Xu Jingqiu)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Personalization; Information Retrieval; Search Engine; Query Expansion; Duplicate Web Page Deletion
Degree year: 2008
Advisor: 朱征宇 (Zhu Zhengyu)
Discipline code: 081202
Degree-granting institution: 重庆大学 (Chongqing University)
Submission date: 2008-04-01

Abstract

As the number of Web documents on the Internet grows rapidly, Web search research faces many new challenges. The vast majority of queries submitted to search engines are short and under-specified, and users who enter the same query may have completely different intentions. Currently, most mainstream Web search engines are built to serve all users alike, independent of the needs of any individual user, so users who submit the same query receive identical results. To improve search quality, personalized Web search has become a research focus in Web information retrieval. This thesis studies intelligent personalization for Web search engines. The approach not only exploits the strengths of popular search engines, such as fast response to queries and coverage of a huge amount of information resources, but also provides relevant search results for users with different interests and backgrounds.
     The main research contributions are as follows:
     ① Existing methods for computing term associations in the vector space model are analyzed in detail. To mine the associations between feature terms within each interest category of a user effectively, this thesis proposes a new term-association measure based on a new user interest model. The measure combines cosine similarity with term co-occurrence analysis, yielding a user-specific quantitative description of term associations that serves query expansion.
     ② By combining analysis of the user's browsing behavior with mining of the browsed content, a user query is accurately mapped to the relevant interest categories. A personalized query expansion algorithm is then built on the term associations computed within each interest category. When the user enters query keywords, the system automatically generates a few expansion terms that express the user's search intent more precisely and submits them together with the original keywords to a large search engine such as Yahoo or Google. This form of query expansion lets an ordinary search engine deliver personalized service: it returns different results to different users who enter the same keywords.
     ③ Replicas and near-replicas of documents are very common on the Web. Search results often contain many such pages, which increase the burden on users and lower the quality of the search service. This thesis proposes a content-analysis-based method for detecting similar pages, in particular replicas and near-replicas. To further improve Web search quality, the method is applied to detect and remove replicas and near-replicas among the top N documents returned by a search engine.
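The abstract does not say which content-analysis technique item ③ uses. As a rough illustration only, the sketch below substitutes a standard approach, word shingling with Jaccard similarity, to filter near-replicas from a ranked result list. The function names, the shingle size `k`, and the 0.8 threshold are assumptions made for this example and are not taken from the thesis.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8, k=3):
    """Keep each document unless it is a near-replica of one already kept.

    Documents are scanned in rank order, so the highest-ranked copy survives.
    The threshold is an illustrative assumption, not a value from the thesis.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


# Toy "top N" result list: the second entry differs from the first
# only in letter case and punctuation.
top_n = [
    "Personalized web search improves retrieval quality for every user.",
    "Personalized Web search improves retrieval quality for every user!",
    "Query expansion adds related terms to a short query.",
]
result = deduplicate(top_n)
```

Because only the top N results are compared, the pairwise comparison stays cheap for small N; a larger collection would likely need hashing or indexing on top of this.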
     In Chapter 5, experimental results demonstrate the effectiveness and feasibility of this work. The research has academic reference value and practical application value in the domain of personalized Web search.
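
The abstract does not give the formula that combines cosine similarity with co-occurrence analysis (item ①), nor the exact expansion procedure (item ②). The following is a minimal sketch under stated assumptions: each interest category is a list of tokenized pages, cosine similarity is taken over per-page term frequencies, co-occurrence is the share of pages containing both terms among those containing either, and the two are blended linearly with a hypothetical weight `alpha`. All names and parameters here are illustrative.

```python
import math
from collections import Counter


def term_vectors(docs):
    """Map each term to its frequency vector across the pages of one category."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({t for doc in docs for t in doc})
    return {t: [c[t] for c in counts] for t in vocab}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cooccurrence(u, v):
    """Share of pages containing both terms among pages containing either."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


def association(u, v, alpha=0.5):
    """Assumed blend: alpha * cosine + (1 - alpha) * co-occurrence."""
    return alpha * cosine(u, v) + (1 - alpha) * cooccurrence(u, v)


def expand_query(query_terms, category_docs, n_expansion=2, alpha=0.5):
    """Append the n_expansion terms most associated with the query terms."""
    vectors = term_vectors(category_docs)
    scores = {}
    for term, vec in vectors.items():
        if term in query_terms:
            continue
        scores[term] = sum(
            association(vectors[q], vec, alpha) for q in query_terms if q in vectors
        )
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return list(query_terms) + ranked[:n_expansion]


# Two toy interest categories belonging to two different users.
programming_docs = [
    ["java", "compiler", "class", "jvm"],
    ["java", "jvm", "bytecode"],
    ["python", "interpreter", "class"],
]
coffee_docs = [
    ["java", "coffee", "beans"],
    ["java", "coffee", "roast"],
    ["tea", "leaf"],
]
expanded = expand_query(["java"], programming_docs, n_expansion=2)
expanded_coffee = expand_query(["java"], coffee_docs, n_expansion=1)
```

With these toy profiles the same keyword `java` is expanded differently per user, and the expanded term list is what would be submitted to the external search engine.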
