Research on Chinese Web Information Retrieval Based on Statistical Language Models

Original title (Chinese): 基于统计语言模型的中文网页信息检索研究
Author: Li Zhen (李贞)
Degree level: Master's
Discipline: Information Science
Keywords: statistical language models; Chinese web information retrieval; data smoothing techniques; Chinese word segmentation
Degree year: 2012
Advisor: Li Jinhua (李进华)
Discipline code: 120502
Degree-granting institution: Central China Normal University (华中师范大学)
Thesis submission date: 2012-05-01

Abstract

With the rapid development of the Internet, the volume of information has grown exponentially and the channels for obtaining it have diversified, yet searching for information has become more complicated. People urgently need higher-level information processing technology to handle massive amounts of information and quickly retrieve what they need, in support of better decision-making and research. The spread and wide application of such technology owes much to advances in natural language processing. To address the information retrieval problem effectively, research on document content representation, retrieval models, matching strategies, and ranking algorithms has steadily increased. Retrieval models remain a focus of this research, and a succession of models and methods has appeared, including the Boolean model, the vector space model, and the probabilistic model. Statistical language models, proposed in recent years, combine natural language with statistics in the study of information retrieval. Resting on a solid mathematical foundation, they have become the dominant retrieval model and have produced a large body of research results.
Research on massive Chinese web page data, and on combining a Chinese word segmentation component with Lemur to build a retrieval system suited to Chinese, remains comparatively scarce. Based on the large-scale Chinese web corpus CWT200G, and with reference to the standard information retrieval procedures of TREC and SEWM, this thesis takes Lemur as its baseline platform and combines it with ICTCLAS, the Chinese lexical analysis system and word segmentation component from the Chinese Academy of Sciences, to form a simple information retrieval system for experiments. First, the thesis lays out its theoretical basis and introduces the key issues in studying Chinese web information retrieval models based on statistical language methods: statistical language models, data smoothing, Chinese word segmentation, and Chinese text indexing. It then briefly introduces the Chinese web corpus used for retrieval evaluation and the platforms and systems required for the experiments, and analyzes in detail how the data are processed. Finally, it uses experimental data to compare traditional retrieval models, such as the vector space model and the probabilistic model, with the statistical language model on topic retrieval over the Chinese web corpus. Within the statistical language model experiments, it also runs the Simplified Jelinek-Mercer, Dirichlet Prior, and Absolute Discounting smoothing methods and compares the performance of the three in information retrieval.
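
The three smoothing methods named in the abstract can be illustrated with a minimal query-likelihood sketch. This is not the thesis's Lemur configuration: the function name, the toy documents, and the parameter defaults (`lam`, `mu`, `delta`) are assumptions chosen for illustration, and the formulas follow the standard textbook definitions of these methods. Documents and queries are assumed to be already segmented into word lists, as a segmenter such as ICTCLAS would produce.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, method="dirichlet",
                         lam=0.5, mu=1000.0, delta=0.7):
    """Return log P(query | doc) under a smoothed unigram document model.

    query, doc, collection: lists of tokens (doc must be non-empty).
    lam, mu, delta: illustrative smoothing parameters, not tuned values.
    """
    doc_tf = Counter(doc)            # term counts in the document
    col_tf = Counter(collection)     # term counts in the whole collection
    doc_len = len(doc)
    col_len = len(collection)
    unique_terms = len(doc_tf)       # number of distinct terms in the document

    score = 0.0
    for w in query:
        p_coll = col_tf[w] / col_len  # collection (background) model P(w | C)
        if p_coll == 0.0:
            continue                  # term unseen everywhere: skip it
        c = doc_tf[w]
        if method == "jm":            # Jelinek-Mercer: linear interpolation
            p = (1.0 - lam) * c / doc_len + lam * p_coll
        elif method == "dirichlet":   # Dirichlet prior: pseudo-counts mu * P(w | C)
            p = (c + mu * p_coll) / (doc_len + mu)
        elif method == "absolute":    # Absolute discounting: subtract delta per seen term
            p = (max(c - delta, 0.0) / doc_len
                 + delta * unique_terms / doc_len * p_coll)
        else:
            raise ValueError(f"unknown smoothing method: {method}")
        score += math.log(p)
    return score


# Toy, pre-segmented documents (hypothetical data).
doc_a = ["统计", "语言", "模型", "检索"]
doc_b = ["中文", "分词", "系统"]
collection = doc_a + doc_b
query = ["语言", "模型"]

for name in ("jm", "dirichlet", "absolute"):
    print(name,
          query_log_likelihood(query, doc_a, collection, method=name),
          query_log_likelihood(query, doc_b, collection, method=name))
```

In each method the document that actually contains the query terms scores higher, while the document that lacks them still receives a non-zero probability from the collection model, which is the purpose of smoothing.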

