Research on Search Personalization Based on a User Dictionary

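English title: Research of Search Engine Personalization Based on User Dictionary
Author: Luo Ying (罗颖)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 中文搜索引擎; 浏览历史; 用户词典; 词扩展策略
English keywords: Chinese search engine; browsing history; user dictionary; term-expansion strategy
Degree year: 2009
Advisor: Zhu Zhengyu (朱征宇)
Discipline code: 081202
Degree-granting institution: Chongqing University
Submission date: 2009-04-01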

Abstract

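The Internet is a bridge through which people acquire knowledge and exchange information. With its rapid growth in recent years, however, the volume of online information has increased exponentially, and users often cannot easily find what they need. A technology that makes fuller use of the information on the Internet is therefore urgently needed.
     Personalized Search has been a hot topic in information retrieval in recent years; it addresses the inability of current search engines to distinguish between users.
     To provide users with personalized retrieval services, this thesis adopts a series of personalization strategies based on users' browsing history, so that a search engine can distinguish between users and deliver genuinely user-oriented search. The main contributions are as follows:
     ① Making use of users' web browsing history, a strategy based on the classical TF-IDF algorithm first builds a personalized user dictionary for each user before that user's features are described. The user dictionary narrows the user description space and greatly reduces the time complexity of building the user profile. It also supports two-level vectors, which makes the user description richer.
     ② To optimize the user interest model, the thesis proposes a hyperlink-tag-based method for identifying and extracting the main text of web pages. The method accurately captures the core information a page expresses and effectively reduces the noise from content that contributes little to user interest, such as advertisements. In addition, a frequent-word processing strategy that incorporates clustering feedback removes from the user dictionary those common web words that strongly interfere with user interest, improving the descriptive accuracy of the dictionary and yielding a more precise user model.
     ③ The search engine model is modified with a user term-expansion algorithm that determines the category of the user's search term, computes the similarity between the search term and candidate keywords, and selects suitable, user-oriented expansion terms from the candidates to recommend to the user. The term-expansion strategy is integrated into the search model as a search component: when the user submits a keyword, the personalization strategy interprets the user's latent search intent from the interests learned in daily use and automatically adds several expansion terms reflecting the user's preferences to the query. This filters the results down to the information the user needs, personalizes the search engine, and can thereby improve retrieval efficiency.
     The thesis also treats current mainstream commercial search engines as modules within the personalization strategy, exploiting their high recall and fast response, and develops PSEplugin, a client-side search component installed on the user's machine, which has considerable application value and potential for wider adoption. Experiments carried out during the research demonstrate the effectiveness and practicality of PSEplugin and the related techniques in information retrieval.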
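
     As an illustration of the first contribution, the sketch below ranks the terms in a user's browsing history with a plain TF-IDF score and keeps the highest-scoring ones as a user dictionary. It is only a minimal sketch under stated assumptions: the function name `build_user_dictionary`, the `background_pages` reference collection, and the `top_k` cutoff are hypothetical, pages are assumed to be already word-segmented, and the thesis's own weighting, two-level vectors, and frequent-word filtering are not reproduced.

```python
import math
from collections import Counter


def build_user_dictionary(user_pages, background_pages, top_k=5):
    """Rank terms from a user's browsing history by TF-IDF and keep top_k.

    user_pages and background_pages are lists of pages, each page being a
    list of already-segmented terms (an assumption of this sketch).
    """
    corpus = user_pages + background_pages
    n_docs = len(corpus)
    # Document frequency: number of pages in the corpus containing each term.
    df = Counter(term for page in corpus for term in set(page))
    # Term frequency counted over the user's own pages only.
    tf = Counter(term for page in user_pages for term in page)
    total = sum(tf.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_k])


user_pages = [["python", "python", "search"], ["python", "index"]]
background_pages = [["search", "news"], ["search", "sports"]]
user_dictionary = build_user_dictionary(user_pages, background_pages, top_k=2)
```

     In this toy input, "search" occurs on most pages and therefore scores low, so it is left out of the two-term dictionary. Restricting later profile building to dictionary terms is what shrinks the description space.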
The network is a bridge through which people obtain knowledge and send messages. In recent years, however, along with the high-speed development of the Internet, the amount of information on the Internet has increased exponentially. As a result, Internet users often cannot easily find the information they need, and a technology that can make full use of the information on the Internet is urgently called for.
     Personalized Search has been a hot topic in information retrieval in recent years; it makes up for the inability of traditional search engines to distinguish between users. This thesis makes contributions in the following aspects:
     First, it makes reasonable use of Internet users' browsing history and uses a strategy based on the classical TF-IDF algorithm to establish a user dictionary (UD) before modeling. Adopting the UD not only decreases the time complexity but also supports a two-level vector description.
     Second, it puts forward a method for extracting the main text of a web document, which helps capture the users' key interests and so optimizes the user profiles. At the same time, a method that incorporates clustering feedback is used to remove frequently occurring web words.
     Finally, it uses a term-expansion algorithm to obtain appropriate terms that are submitted to the search engine together with the initial keyword. These terms represent the user's interests in information retrieval, so using them filters the search engine's results, personalizing the search engine and increasing its efficiency.
     The thesis also studies mainstream commercial search engines as a module of the approach, because they offer high recall and fast response. A client component, PSEplugin, was developed to implement the functions described above. Experiments show that PSEplugin and the related techniques are effective and practical.
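
     The term-expansion step can be pictured with the following minimal sketch. It assumes each user-dictionary term carries a sparse context vector and uses cosine similarity to pick expansion terms. The names `cosine` and `expand_query`, the `term_vectors` layout, and the `n_terms` and `threshold` parameters are hypothetical, and the abstract does not specify the similarity measure or the category-location step actually used.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_query(query, term_vectors, n_terms=2, threshold=0.1):
    """Return the query followed by up to n_terms similar dictionary terms."""
    if query not in term_vectors:
        return [query]  # unknown term: submit the query unchanged
    q_vec = term_vectors[query]
    scored = [
        (cosine(q_vec, vec), term)
        for term, vec in term_vectors.items()
        if term != query
    ]
    # Most similar first; drop candidates below the similarity threshold.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    chosen = [term for score, term in ranked if score >= threshold]
    return [query] + chosen[:n_terms]


# Toy context vectors for a user whose history is mostly about fruit.
term_vectors = {
    "apple": {"fruit": 1.0, "juice": 1.0},
    "orange": {"fruit": 1.0, "juice": 0.5},
    "pear": {"fruit": 1.0},
    "iphone": {"phone": 1.0},
}
expanded = expand_query("apple", term_vectors)
```

     For this user the query "apple" is expanded with fruit-related terms and not with "iphone", which is the kind of disambiguation the expansion terms are meant to give the underlying search engine.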

