Research on Multilingual Information Retrieval Based on Latent Interlingua Semantics
Abstract
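As the Internet has grown, the variety of languages used in online resources and the differences in the languages that users know have inevitably created language barriers for people searching the web. For example, a Chinese user may want to find information in English but may not know English well enough to state the need accurately in English. Multilingual Information Retrieval (MLIR) emerged to meet this need: the user submits a query in his or her native language, the search engine searches databases in several languages, and it returns documents in every language that can answer the user's question. The main difficulty lies in the uncertainty of expression and of semantic correspondence between languages.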
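     Dictionary-based methods and machine translation systems were for a time the most actively studied techniques for multilingual information retrieval. However, retrieval that relies only on a translation model has difficulty handling the many-to-many problem in word translation and the out-of-vocabulary word problem. Building multilingual retrieval models at the semantic (concept) level with the help of parallel corpora is a new trend in current research.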
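     Using the latent interlingua semantic correspondences between languages to map the word space onto an abstract concept space avoids the semantic drift caused by translating directly into the target language, and it partially solves the problems of lexical ambiguity and out-of-vocabulary words. On this basis, this paper applies extended Partial Least Squares theory to propose a multilingual information retrieval model based on latent interlingua semantics: the parallel documents of a bilingual corpus are analyzed and modeled within a unified framework, the latent interlingua semantic correspondences between the languages are extracted, and retrieval is performed in the latent interlingua semantic space, thereby achieving multilingual information retrieval.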
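     Main work of this paper:
     1. We analyze and study in depth the translation ambiguity problem in multilingual information retrieval based on dictionary translation. To address it, we apply extended Partial Least Squares theory and propose a semantic correspondence model that considers both sides of bilingual parallel documents at the same time;
     2. Using a self-built Chinese-English parallel corpus and the English-French parallel corpus provided by the University of Montreal, we analyze and model the parallel documents to build Chinese-English and English-French cross-language information retrieval models, and, with English as the pivot language, a Chinese-French cross-language information retrieval model;
     3. We run cross-language information retrieval experiments in three languages (Chinese, English and French) on TREC5&9 and on the AP&SDA data sets of TREC3, and compare the results with monolingual information retrieval models. The experimental results show that the proposed model performs well.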
With the rapid development of the Internet, the diversity of languages among online resources and the differences in the languages that users command inevitably create a language barrier when users retrieve information from the Internet. For example, a Chinese user may want to find information in English, yet his English may not be good enough to express his information need accurately in that language. Multilingual Information Retrieval (MLIR) arose to solve this problem: the user submits a query in his native language, the search engine retrieves information from multilingual databases, and it returns documents in any language that answer the user's question. The main difficulty of MLIR lies in the uncertainty of how languages express meaning and of the semantic correspondences among languages.
     Dictionary-based approaches and machine translation systems were once the focus of research on multilingual information retrieval. However, relying on a translation model alone makes it difficult to handle the many-to-many nature of lexical translation and the problem of unknown (out-of-vocabulary) words. A new trend in research on multilingual information retrieval is to construct retrieval models at the semantic (concept) level by using parallel corpora.
     By making use of the latent interlingua semantic correspondences between languages, the word space can be mapped to an abstract concept space. This avoids the semantic deviation caused by direct translation into the target language and partially solves the problems of lexical ambiguity and unknown words. Therefore, in this paper, we propose a multilingual information retrieval model based on latent interlingua semantics by applying the theory of extended Partial Least Squares: we analyze and model the parallel documents of a bilingual corpus within a unified framework, extract the latent interlingua semantic correspondences between the languages, and retrieve information in the resulting latent interlingua semantic space, which makes multilingual information retrieval possible.
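     The idea of retrieving in a shared concept space can be illustrated with a minimal sketch. It is not the extended Partial Least Squares model proposed in this paper, whose details the abstract does not give. It assumes a plain PLS-SVD variant instead: take the singular value decomposition of the cross-product matrix of two aligned document-term matrices and use the paired singular vectors as term-to-concept projections. The function names, the toy vocabulary, the term counts, the choice k=2, and the omission of mean-centering are all illustrative assumptions.

```python
import numpy as np

def fit_latent_space(X_src, Y_tgt, k):
    """Return (P_src, P_tgt): term-to-concept projections for each language.

    X_src: (n_docs, n_src_terms) term weights of the source-language side.
    Y_tgt: (n_docs, n_tgt_terms) term weights of the aligned target-language side.
    Assumption: plain PLS-SVD without centering, not the paper's extended PLS.
    """
    # Cross-product matrix linking source terms to target terms via shared documents.
    C = X_src.T @ Y_tgt
    # Paired singular vectors: directions along which the two languages co-vary most.
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], Vt[:k].T

def rank_documents(query, docs, P_query, P_docs):
    """Rank docs in one language against a query in the other by cosine similarity."""
    q = query @ P_query          # query mapped into the concept space
    D = docs @ P_docs            # documents mapped into the same space
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims

# Toy parallel corpus: row i of X_en and X_fr describes the same document.
# English terms (columns): cat, dog, stock, market
X_en = np.array([[1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 1, 1],
                 [0, 0, 0, 1],
                 [0, 1, 0, 0]], dtype=float)
# French terms (columns): chat, chien, bourse, marché
X_fr = X_en.copy()  # identical counts only because the toy data are hand-made

P_en, P_fr = fit_latent_space(X_en, X_fr, k=2)

query_en = np.array([0, 1, 0, 0], dtype=float)    # English query: "dog"
docs_fr = np.array([[1, 0, 0, 0],                 # French document: "chat"
                    [0, 0, 1, 1]], dtype=float)   # French document: "bourse marché"
order, sims = rank_documents(query_en, docs_fr, P_en, P_fr)
print(order, np.round(sims, 3))
```

     In this toy setting the English query "dog" ranks the French document containing only "chat" above the finance document, although no translation of "dog" occurs in it: the match happens at the concept level rather than the word level.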
     The main contributions of this paper are as follows:
     Firstly, we analyze in depth the problem of translation ambiguity in multilingual information retrieval based on dictionary translation. To address this problem, we apply the theory of extended Partial Least Squares and propose a semantic correspondence model that takes both sides of bilingual parallel documents into account simultaneously;
     Secondly, using a Chinese-English parallel corpus built by ourselves and an English-French parallel corpus from the University of Montreal, we analyze and model the parallel documents to construct a Chinese-English and an English-French cross-language information retrieval model; then, taking English as a transition (pivot) language, we construct a Chinese-French cross-language information retrieval model;
     Thirdly, we carry out cross-language information retrieval experiments in three languages (Chinese, English and French) on TREC5&9 and TREC3's AP&SDA data sets and compare the results with monolingual information retrieval models. The experimental results show that the proposed model performs well.
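     One way to picture the use of English as a transition language is to chain two bilingual models. The abstract does not say how the Chinese-French model is actually built, so the sketch below is only an assumed construction: a Chinese query is mapped into the Chinese-English concept space, expanded back into pseudo-English term weights, and then mapped into the English-French concept space, where it is compared with French documents. It again uses plain PLS-SVD rather than the paper's extended Partial Least Squares, it assumes both corpora index English terms in the same column order, and all names and numbers are made up for illustration.

```python
import numpy as np

def fit_latent_space(X, Y, k):
    """Paired term-to-concept projections from aligned document-term matrices."""
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, :k], Vt[:k].T

def cosine(a, b):
    """Cosine similarity with a small constant to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy aligned corpora (rows are parallel documents); all values are hypothetical.
# Columns in every language: animal term 1, animal term 2, finance term 1, finance term 2.
base = np.array([[1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 1, 1],
                 [0, 0, 0, 1],
                 [0, 1, 0, 0]], dtype=float)
X_zh, X_en_a = base, base      # Chinese-English parallel corpus
X_en_b, X_fr = base, base      # separate English-French parallel corpus

P_zh, P_en_a = fit_latent_space(X_zh, X_en_a, k=2)
P_en_b, P_fr = fit_latent_space(X_en_b, X_fr, k=2)

def zh_query_to_enfr_space(q_zh):
    """Chain the two models through English (assumed pivot construction)."""
    # 1) Chinese terms -> zh-en concepts, 2) concepts -> pseudo-English term weights.
    pseudo_en = (q_zh @ P_zh) @ P_en_a.T
    # 3) English term weights -> en-fr concepts.
    return pseudo_en @ P_en_b

q_zh = np.array([0, 1, 0, 0], dtype=float)            # query with one animal term
doc_fr_animal = np.array([1, 0, 0, 0], dtype=float)   # French animal document
doc_fr_finance = np.array([0, 0, 1, 1], dtype=float)  # French finance document

q_latent = zh_query_to_enfr_space(q_zh)
score_animal = cosine(q_latent, doc_fr_animal @ P_fr)
score_finance = cosine(q_latent, doc_fr_finance @ P_fr)
print(round(score_animal, 3), round(score_finance, 3))
```

     In this sketch the Chinese query scores the French animal document far higher than the finance document without any Chinese-French parallel text, which is the practical appeal of a pivot language when a direct parallel corpus is unavailable.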
