Research on the Application of Latent Semantic Analysis in Cross-Language Information Retrieval
Abstract
As the Internet has grown, more and more people face the problem of how to find relevant foreign-language documents effectively. In the Internet's early days, web content was mostly in English and most users came from developed countries such as the United States and the United Kingdom. Since then, the number of websites and users from other countries has risen steadily, which poses new problems for traditional information retrieval technology that treats English as the only language. It has therefore become necessary to study retrieval in which users query directly in their native language, and bilingual or multilingual cross-language information retrieval has become an active research topic.
     Cross-language information retrieval studies how a query written in one natural language can be used to search documents in any language. Because monolingual information retrieval is mature and already in practical use, the basic framework of current cross-language retrieval technology is inherited from it. However, different languages carry very different cultural backgrounds and conventions, and machine translation quality still falls short of what users need, so monolingual retrieval methods alone cannot solve deeper problems such as semantic matching in cross-language information retrieval.
     This thesis first introduces the research scope of cross-language information retrieval, the related techniques, and the international evaluation standards, and then analyzes the principle, modeling method, and applications of latent semantic analysis. Exploiting properties of latent semantic analysis such as its language independence, the thesis applies it to bilingual text to build a word translation model and introduces the idea of bidirectional translation to raise translation accuracy. The experimental results show that precision is better than with the traditional vector space model. Next, to address the shortcomings of query expansion in traditional cross-language information retrieval, the thesis proposes a new expansion method that combines k-means clustering with the strengths of the latent semantic analysis model in representing texts and words, which reduces the impact of translation errors and translation ambiguity on query results. Finally, it updates the traditional formula for computing query term weights, which improves the average precision of retrieval.
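The abstract does not spell out how the word translation model is built, so the following is only a minimal sketch of one common way to do it, not the thesis's implementation. It assumes the merged-document approach to cross-language latent semantic analysis: each parallel pair is concatenated into one bilingual document, a truncated SVD is taken of the term-document matrix, and a word is translated to its nearest neighbour (by cosine similarity) in the other language. The bidirectional check keeps a translation only when translating back returns the original word. The toy corpus, the pinyin stand-ins for Chinese words, the choice k=3, and the function names `build_space`, `best_match`, and `translate` are all hypothetical.

```python
import numpy as np

# Hypothetical toy parallel corpus: (English text, Chinese text in pinyin).
PAIRS = [
    ("cat cat milk", "mao mao niunai"),
    ("dog bone", "gou gutou"),
    ("cat dog", "mao gou"),
    ("milk bone", "niunai gutou"),
]


def build_space(pairs, k):
    """Fit a rank-k latent semantic space on merged bilingual documents."""
    en_vocab = sorted({w for en, _ in pairs for w in en.split()})
    zh_vocab = sorted({w for _, zh in pairs for w in zh.split()})
    # Tag each term with its language so identical spellings cannot collide.
    terms = [("en", w) for w in en_vocab] + [("zh", w) for w in zh_vocab]
    index = {term: i for i, term in enumerate(terms)}

    # Term-document count matrix: one column per merged bilingual document.
    counts = np.zeros((len(terms), len(pairs)))
    for j, (en, zh) in enumerate(pairs):
        for w in en.split():
            counts[index[("en", w)], j] += 1
        for w in zh.split():
            counts[index[("zh", w)], j] += 1

    # Truncated SVD: keep the k largest singular values.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    vectors = u[:, :k] * s[:k]  # term coordinates in the latent space
    return terms, index, vectors


def best_match(term, target_lang, terms, index, vectors):
    """Return the target-language term with the highest cosine similarity."""
    v = vectors[index[term]]
    best, best_sim = None, -np.inf
    for cand in terms:
        if cand[0] != target_lang:
            continue
        c = vectors[index[cand]]
        sim = float(v @ c / (np.linalg.norm(v) * np.linalg.norm(c)))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best


def translate(word, terms, index, vectors):
    """Translate en -> zh; accept only if zh -> en leads back to `word`."""
    forward = best_match(("en", word), "zh", terms, index, vectors)
    backward = best_match(forward, "en", terms, index, vectors)
    return forward[1] if backward == ("en", word) else None


terms, index, vectors = build_space(PAIRS, k=3)
result = {w: translate(w, terms, index, vectors)
          for w in ("cat", "dog", "milk", "bone")}
print(result)
```

In this toy corpus each English word has exactly the same distribution over documents as its counterpart, so the pairs coincide in the latent space. Real parallel text is noisier, which is where the back-translation check can reject unreliable candidates.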