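     Research on massive Chinese web data, and on combining a Chinese word segmentation component with Lemur to build an information retrieval system suited to Chinese, remains relatively scarce. Building on the large-scale Chinese web corpus CWT200G and following the standard information retrieval procedures of TREC and SWEM, this paper takes Lemur as the baseline working platform and combines it with ICTCLAS, the Chinese lexical analysis system and word segmentation component from the Chinese Academy of Sciences, to form a simple information retrieval system for experiments. First, the paper lays out its theoretical basis and introduces the key issues in studying Chinese web retrieval models based on statistical language methods: statistical language models, data smoothing, Chinese word segmentation, and Chinese text indexing. Next, it briefly introduces the Chinese web corpus used for retrieval evaluation and the platforms and systems the experiments require, and analyzes in detail how the data is processed. Finally, it uses experimental data to compare traditional retrieval models, such as the vector space model and the probabilistic model, with the statistical language model on topic retrieval over the Chinese web corpus. Within the statistical language model experiments, it also tests the Simplified Jelinek-Mercer, Dirichlet Prior, and Absolute Discounting smoothing methods and compares the retrieval performance of the three.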
With the rapid development of the Internet, information has grown exponentially and the ways of accessing it have become more diverse, but searching for information has become more complicated. High-level information processing technology is urgently needed to handle these vast amounts of information and to retrieve what is needed quickly, so that people can make better decisions and do better research. The spread and wide application of information processing technology owes much to the development of natural language processing. To solve the information retrieval problem effectively, research on document content, retrieval models, matching strategies, and ranking algorithms has steadily increased. Retrieval models remain a hot topic in information retrieval research, and a variety of models and methods have emerged, such as the Boolean model, the vector space model, and the probabilistic model. In recent years in particular, statistical language models have been proposed. Combining natural language with statistics on a strong mathematical foundation, they have become dominant among information retrieval models and have been studied extensively.
     Building on the large-scale Chinese web corpus CWT200G and with reference to the standard information retrieval procedures of TREC and SWEM, this paper combines the Lemur working platform with a word segmentation component, ICTCLAS (the Chinese lexical analysis system from the Chinese Academy of Sciences), to build a simple information retrieval system for experiments. First, the paper describes its theoretical basis and the key issues in studying Chinese web information retrieval based on statistical language models: statistical language models, data smoothing, Chinese word segmentation, and Chinese text indexing. It then briefly introduces the Chinese web page corpus used for information retrieval evaluation and the experimental platforms and systems required, and analyzes in detail how the data is processed. Finally, it compares experimentally the topic retrieval performance of traditional information retrieval models, such as the vector space model and the probabilistic model, with that of the statistical language model on the Chinese web page corpus. In the statistical language model experiments, it also runs the Simplified Jelinek-Mercer, Dirichlet Prior, and Absolute Discounting smoothing methods and compares the performance of the three in information retrieval.
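The three smoothing methods named above can be illustrated with a minimal Python sketch of query-likelihood ranking. It is not the paper's Lemur configuration: the function names, the toy documents, and the default values of `lam`, `mu`, and `delta` are assumptions chosen for illustration, the formulas are the commonly cited textbook forms of each method, and the documents are assumed to be already segmented into words (for example by a segmenter such as ICTCLAS) and non-empty.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model p(w|C) over the whole collection."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(w, doc_counts, doc_len, p_c, lam=0.5):
    # Linear interpolation of the document model with the collection model.
    return (1 - lam) * doc_counts[w] / doc_len + lam * p_c[w]


def dirichlet_prior(w, doc_counts, doc_len, p_c, mu=2000.0):
    # Collection model acts as a prior worth mu pseudo-counts.
    return (doc_counts[w] + mu * p_c[w]) / (doc_len + mu)


def absolute_discounting(w, doc_counts, doc_len, p_c, delta=0.7):
    # Subtract delta from every seen count; give the freed mass to p(w|C).
    unique_terms = len(doc_counts)
    discounted = max(doc_counts[w] - delta, 0) / doc_len
    return discounted + delta * unique_terms / doc_len * p_c[w]


def query_log_likelihood(query, doc, p_c, smooth, **params):
    """log p(q|d) under a unigram model; query words unseen in C are skipped."""
    doc_counts = Counter(doc)
    doc_len = len(doc)
    return sum(
        math.log(smooth(w, doc_counts, doc_len, p_c, **params))
        for w in query
        if w in p_c
    )


def rank(query, docs, smooth, **params):
    """Return document indices sorted from most to least likely."""
    p_c = collection_model(docs)
    scores = [query_log_likelihood(query, d, p_c, smooth, **params) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy, pre-segmented documents (hypothetical data, not from CWT200G).
docs = [
    ["信息", "检索", "模型"],
    ["语言", "模型", "平滑", "模型"],
    ["中文", "分词", "检索"],
]
query = ["检索", "模型"]

for smooth in (jelinek_mercer, dirichlet_prior, absolute_discounting):
    print(smooth.__name__, rank(query, docs, smooth))
```

Each smoothing function returns a proper probability distribution over the collection vocabulary, so the methods differ only in how much probability mass they move from seen words to unseen ones. Comparing them on a real corpus would additionally require tuning `lam`, `mu`, and `delta`, which this sketch does not attempt.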
