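Research of Chinese Information Retrieval System and Document Reranking

Chinese title: 中文信息检索系统与文档重排技术研究
Author: Fang Fang (方芳)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Information Retrieval; Inverted Index; Vector Space Model; Query Expansion; Document Reranking
Degree year: 2010
Supervisors: Chen Jianxun (陈建勋); Liu Maofu (刘茂福)
Discipline code: 081203
Degree-granting institution: Wuhan University of Science and Technology (武汉科技大学)
Thesis submission date: 2010-04-20
Chair of the defense committee: Zhang Xiaolong (张晓龙)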

Abstract

As computer system performance improves, Internet content grows rapidly, and enterprises become increasingly computerized, Chinese-language information resources are expanding at great speed. This growth satisfies people's information needs, but it also makes it harder to find the required information quickly and accurately. Information retrieval technology has therefore become an active research topic.
     Information retrieval (IR) usually refers to text information retrieval. It covers the storage, organization, representation, querying, and access of information, and its core is the indexing and retrieval of text. The main IR techniques include index processing, query expansion, retrieval models, and document reranking; Chinese information retrieval additionally requires word segmentation.
     The work in this thesis falls into two parts. First, using the Chinese IR4QA subtask of NTCIR7 as the experimental setting, we design and implement a Chinese information retrieval system. At indexing time, the system segments the original text into words and builds an inverted index with words as the indexing units; the retrieval component uses the classical vector space model. To address the word mismatch problem, once the initial results have been retrieved, a query expansion method based on local co-occurrence is applied to obtain additional useful terms and form a new query. Experimental results show that query expansion clearly improves system performance, and evaluation with the official NTCIR7 evaluation tool shows that the system achieves relatively good retrieval performance. Second, we study document reranking for specific question types. Because users often browse only the top N results that a retrieval system returns, we combine the open resource Wikipedia with the characteristics of two question types, definition and biography, and bring in the Wikipedia pages related to a given question to rerank the initially retrieved documents. Experiments show that this method effectively improves the precision of the top-ranked documents.
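
The first part of the pipeline described above (word-level inverted index, vector space retrieval, then expansion from the initial results) can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes the documents are already segmented into tokens, uses plain TF-IDF weights with cosine similarity, and replaces the local co-occurrence method with a much simpler stand-in that adds the highest-weighted non-query terms found in the top-ranked documents. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}. docs maps doc_id -> token list."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def idf(index, n_docs, term):
    """Inverse document frequency; 0.0 for terms that occur in no document."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0


def doc_norms(index, n_docs):
    """Euclidean norm of each document's TF-IDF vector."""
    squares = defaultdict(float)
    for term, postings in index.items():
        w = idf(index, n_docs, term)
        for doc_id, tf in postings.items():
            squares[doc_id] += (tf * w) ** 2
    return {doc_id: math.sqrt(v) for doc_id, v in squares.items()}


def search(index, norms, n_docs, query_tokens):
    """Rank documents by cosine similarity between TF-IDF query and document vectors."""
    q_weights = {t: tf * idf(index, n_docs, t) for t, tf in Counter(query_tokens).items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0.0:
        return []
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        w = idf(index, n_docs, term)
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_w * tf * w
    ranked = [(d, s / (norms[d] * q_norm)) for d, s in dots.items() if norms.get(d, 0.0) > 0.0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


def expand_query(docs, index, n_docs, query_tokens, ranked, top_k=2, n_new=1):
    """Simplified stand-in for local co-occurrence expansion (an assumption, not the
    thesis method): add the n_new best TF-IDF terms from the top_k ranked documents."""
    query_set = set(query_tokens)
    weights = defaultdict(float)
    for doc_id, _score in ranked[:top_k]:
        for term, tf in Counter(docs[doc_id]).items():
            if term not in query_set:
                weights[term] += tf * idf(index, n_docs, term)
    best = sorted(weights, key=lambda t: (-weights[t], t))[:n_new]
    return list(query_tokens) + best


# Toy, pre-segmented documents (hypothetical data).
docs = {
    "d1": ["information", "retrieval", "inverted", "index"],
    "d2": ["vector", "space", "model", "retrieval"],
    "d3": ["wikipedia", "biography", "page"],
}
index = build_index(docs)
norms = doc_norms(index, len(docs))
query = ["inverted", "index"]
ranked = search(index, norms, len(docs), query)
expanded = expand_query(docs, index, len(docs), query, ranked)
print(ranked)
print(expanded)
```

For Chinese text, the token lists would come from a word segmenter run before indexing. The expanded query can be passed back to `search` for a second retrieval round, which is the usual way a pseudo-relevance feedback loop of this kind is closed.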

