Research on Bilingual Topic Models and Algorithms in Cross-Language Information Retrieval
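Abstract
With the rapid development of the Internet and the acceleration of globalization, the information resources available on the Internet are no longer concentrated in English and a few other languages, and the demand for using one's native language to query information expressed in other languages keeps growing. Cross-language Information Retrieval (CLIR) is a fast and effective means of representing, storing, organizing, and accessing multilingual information resources, and it is a challenging, cutting-edge research area within information retrieval.
     CLIR focuses on how to use a query expressed in one language to search for information expressed in another; one of its key problems is establishing bilingual semantic correspondences by various methods. Topic models, which have attracted considerable attention in recent years in machine learning, information retrieval, and natural language processing, have become an effective approach to CLIR. This thesis arises from the National Natural Science Foundation of China project "Research on the Theory and Algorithms of Cross-Language Information Retrieval Based on Latent Semantic Dual Spaces" (grant no. 60963014) and the Jiangxi Provincial Department of Education Youth Science Fund project "Retrieval-Oriented Parallel Corpus Construction and Cross-Language Retrieval Models" (grant no. GJJ101168). It systematically studies a cross-language retrieval model, a cross-language text classification method, and a cross-language text clustering method, all based on a bilingual topic space. Without relying on cross-language resources such as machine translation or bilingual dictionaries, these effectively address the many-to-many problem of word translation in CLIR and partly address the out-of-vocabulary word problem. The main work of this thesis is as follows: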
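     (1) An overall framework for CLIR based on a bilingual topic space
     From the standpoint of natural language understanding, texts in multiple languages are multi-view representations of the objects that language describes, each view expressing them in the meaningful symbols of a different language. In essence, these views are semantically equivalent. This thesis assumes that bilingual parallel documents share the same semantic information. Using the partial least squares (PLS) theory of statistical data analysis, it extracts the semantic information shared by parallel documents from a bilingual parallel corpus and builds topic spaces with bilingual correspondence, thereby establishing an overall framework for CLIR based on a bilingual topic space.
     Within this unified framework, a series of topics extracted from the bilingual parallel corpus forms the topic space of each language. Each language's topic space exists independently, and the bilingual topic space is built through bilingual semantic correspondences. The bilingual topic space reflects the semantic correspondences between documents, between documents and words, and between words, and it reveals the inherent structure and internal relations between and within languages. It is an abstract concept space and an intermediate representation of the original documents in each language. The representation may be linear or nonlinear. Mathematically, the two topic spaces are approximately equivalent. By projecting queries and documents onto the bilingual topic space, cross-language retrieval, classification, and clustering can be carried out without direct translation.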
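     (2) Construction of a Chinese-English parallel corpus for CLIR
     A corpus is a very important basic data resource for CLIR. CLIR can use corpora for performance evaluation, translation, bilingual dictionary construction, and word sense disambiguation.
     This thesis collected Chinese and English news web pages from the websites of the Wall Street Journal, the Financial Times, Hong Kong government news, and others. Following a pipeline of identifying parallel web pages, preprocessing files, aligning paragraphs, labeling document categories, building a retrieval query set, and judging document relevance, it built a Chinese-English parallel corpus, a CLIR evaluation corpus, and a cross-language text classification evaluation corpus. A TREC-9 Chinese-English bilingual parallel corpus was built by translating the TREC-9 document collection with a program that uses the Google API 1.0 interface.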
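     (3) A cross-language retrieval model based on a topic dual space
     The Cross-Language Latent Semantic Indexing (CL-LSI) model concatenates each pair of bilingual documents into a single document and uses the co-occurrence of bilingual vocabulary to capture the semantic links between the two languages, but it does not fully consider each language's inherent characteristics or the semantic correlation between the languages. This thesis assumes that, in a bilingual parallel corpus, the document collections of the two languages imply the same topic content, and it represents the bilingual topics with a linear semantic dual space, thereby proposing a cross-language retrieval model based on a topic dual space (Topic Dual Space model, TDS). By capturing the co-occurrence of bilingual terms in parallel documents, the TDS model establishes their statistical dependencies and derives their translation relations, relatedness, and so on.
     Experiments on the CLIR evaluation corpus built in this thesis show that the TDS model translates words effectively and extracts bilingual topics that are thematically distinctive and semantically linked across languages, and that its document mate search and cross-language retrieval performance exceeds that of CL-LSI. Cross-language and monolingual experiments on TREC-5&6 and TREC-9 show that the overall performance of TDS exceeds that of CL-LSI.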
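     (4) A bilingual topic correlation retrieval model for CLIR
     How to extract cross-language semantic information from a bilingual parallel corpus is very important for improving CLIR performance. In the TDS model, the document matrices of the two languages stand in a predictive relationship; the method is asymmetric and does not treat the two languages equally. Its time and space complexity are proportional to the number of bilingual documents, so it cannot handle large document collections effectively. This thesis assumes that bilingual parallel documents share the same topics and that, in a concrete model, these bilingual topics can be expressed as semantic correlation. We regard bilingual parallel documents as two language representations of the same semantic content and construct a latent semantic space for each language from the bilingual parallel corpus, thereby proposing the Bilingual Topic Correlation (BiTC) model.
     Experiments on a Chinese-English bilingual news collection show that the new model's document mate search and pseudo-query CLIR performance is significantly better than that of the CL-LSI model; on the TREC-9 bilingual parallel corpus obtained with Google Translate, the new model also achieves good retrieval performance.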
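     (5) Cross-language text classification/clustering based on bilingual semantic correspondence analysis
     Bilingual text correspondence analysis plays an important role in processing multilingual text data and overcoming language barriers, yet cross-language latent semantic indexing does not fully consider bilingual semantic correlation or document category structure. This thesis regards bilingual parallel documents as two language expressions of the same semantic content, uses partial least squares to model the semantic correlation of bilingual texts, builds a separate latent semantic space for each language, and carries out cross-language classification and clustering in these two spaces.
     Experiments on the cross-language text classification evaluation corpus built in this thesis show that cross-language and monolingual text classification in the bilingual topic space constructed by the proposed method performs close to or better than monolingual classification in the original feature space. Cross-language text clustering likewise performs close to or better than monolingual document clustering, and the method is robust.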
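     The main contributions of this thesis are as follows:
     (1) A cross-language retrieval model based on a topic dual space (TDS). To address the "mixing" of bilingual semantics caused by CL-LSI's simple concatenation of bilingual parallel documents, we propose representing the bilingual topic space with a linear semantic dual space. The TDS model captures the co-occurrence of bilingual terms in parallel documents to establish statistical dependencies between bilingual semantic information, which enables functions such as translation and query expansion.
     (2) A bilingual topic correlation retrieval model for CLIR (BiTC). The model assumes that bilingual parallel documents have semantically correlated topics and constructs a latent semantic space for each language from the bilingual parallel corpus, thereby establishing bilingual semantic links. The new model overcomes CL-LSI's insufficient consideration of bilingual semantic links and TDS's inability to handle large-scale data effectively.
     (3) A cross-language text classification/clustering method based on bilingual semantic correspondence analysis. Because cross-language latent semantic indexing does not fully consider the multiple semantic correlations between the two languages or document structure information, this thesis builds a separate low-dimensional topic space for each language and establishes bilingual semantic correspondences. Its cross-language text classification/clustering performance is close to or better than monolingual classification/clustering.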
With the rapid development of the Internet and the acceleration of globalization, information resources on the Internet are no longer confined to English and a few other common languages, and the need to search for information in non-native languages is increasing. Because the Internet holds multilingual resources and users are often not proficient in non-native languages, Internet users inevitably face language barriers. Cross-language Information Retrieval (CLIR) is an effective way to represent, store, organize, and access multilingual information. It is a challenging and cutting-edge field in information retrieval (IR).
     Cross-language Information Retrieval addresses the problem of retrieving documents in one language by querying in another. The key to the problem is how to build the semantic relationship between the query in the source language and the documents in the target language. Topic models have become an effective method for CLIR, and in recent years they have also drawn the attention of researchers in machine learning, information retrieval, natural language processing, and related fields. This thesis focuses on a CLIR model, a cross-language text categorization (CLTC) method, and a cross-language text clustering (or multi-language text clustering) method, all based on bilingual topics. These models and methods can effectively address the problem of multiple meanings in translation and partly solve the problem of translating unknown words. The main research findings of this thesis can be summarized as follows:
     (1) A CLIR framework based on bilingual topic space
     Natural language can be regarded as strings of meaningful symbols that describe semantic objects in the real world. Texts in multiple languages are multiple views of the same object, and these views are semantically equivalent. We assume that the topics in a parallel text share the same semantic meaning across languages and are sampled from the same topic-document distribution. On this basis we propose a CLIR framework based on a bilingual topic space. In this framework, the semantic meaning shared by parallel documents is extracted with the partial least squares (PLS) method, and a topic space is built to model the semantic relationship across languages.
     The topic space for each language consists of the topics extracted from the bilingual parallel corpus. Each topic space is independent. The bilingual topic space models the semantic relationship between languages and is an abstract concept space. It reveals the semantic correspondences between documents, between documents and terms, and between terms, and it uncovers the inherent structure and internal relations of the corpus. Mathematically, the two topic spaces are approximately equivalent. Once queries and documents are projected onto the bilingual topic space, cross-language information retrieval, cross-language text classification, and cross-language text clustering can be carried out without direct translation or a bilingual dictionary.
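The projection idea in this framework can be illustrated with a minimal sketch. It assumes the simplest PLS variant (a singular value decomposition of the cross-language term co-occurrence matrix, often called PLS-SVD), uncentered count matrices, and cosine ranking; the abstract does not state which PLS algorithm the thesis uses, so the function names, the toy matrices, and the choice of `k` are all hypothetical.

```python
import numpy as np

def fit_bilingual_topics(X, Y, k):
    """Fit k paired topic directions from parallel document-term matrices.

    X: (n_docs, n_src_terms), Y: (n_docs, n_tgt_terms); row i of X and row i
    of Y are the two language versions of the same document.
    """
    # Cross-language term co-occurrence accumulated over parallel documents.
    C = X.T @ Y
    # PLS-SVD (assumed variant): paired singular vectors maximize the
    # covariance between source-side and target-side document scores.
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], Vt[:k].T

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by decreasing cosine similarity."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(d @ q))

# Toy parallel corpus: documents 0-1 share one topic, documents 2-3 another.
X = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
Y = np.array([[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]], dtype=float)

W_src, W_tgt = fit_bilingual_topics(X, Y, k=2)

# A source-language query containing only source term 0 is matched against
# target-language documents; neither side is translated.
query = np.array([1.0, 0.0, 0.0, 0.0])
ranking = rank_by_cosine(query @ W_src, Y @ W_tgt)
```

In this toy setting the two target documents on the query's topic are ranked ahead of the other two, even though the query and the documents share no vocabulary.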
     (2) Construction of a Chinese-English parallel corpus for CLIR
     A corpus is an important basic data resource for CLIR. It is used for evaluation, translation, and the construction of bilingual dictionaries for CLIR.
     We collected bilingual news stories from the websites of the Wall Street Journal, the Financial Times, and Hong Kong government news to construct a CLIR evaluation corpus, a bilingual parallel corpus, and a CLTC evaluation corpus. The steps for constructing the corpora include selecting parallel web pages, preprocessing documents, aligning passages, labeling document classes, building the query set, and judging document relevance. The TREC-9 document set for CLIR was translated by a Google API 1.0 interface program to create a TREC-9 bilingual parallel corpus.
     (3) A CLIR model based on topic dual space
     In the cross-language latent semantic indexing (CL-LSI) model, each pair of documents is concatenated into a dual document, and the semantic relationship between languages is captured by exploiting the co-occurrence of terms across languages. However, this mixing of documents does not fully consider the inherent features of each language or the semantic correlation across languages. Based on the assumption that parallel documents share the same topics, we present a method that represents the bilingual topic space using a linear latent semantic dual space. The two topic spaces in the bilingual topic space are linear function spaces and are dual to each other. Each pair of topics is semantically independent. We therefore propose a topic dual space (TDS) model for CLIR. The TDS model captures the co-occurrence of terms in parallel documents and builds statistical dependencies between them.
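For contrast with TDS, the CL-LSI baseline described at the start of this part can be sketched as follows. This is only an illustration of the concatenate-then-SVD idea under simple assumptions (raw counts, no term weighting, fold-in without singular-value scaling); the helper names and toy data are hypothetical and are not taken from the thesis.

```python
import numpy as np

def fit_cl_lsi(X, Y, k):
    """Fit a shared k-dimensional LSI space from dual documents.

    X: (n_docs, n_src_terms), Y: (n_docs, n_tgt_terms), rows are parallel.
    Returns a (n_src_terms + n_tgt_terms, k) term-loading matrix.
    """
    # Each parallel pair becomes one "dual document" with both vocabularies.
    Z = np.hstack([X, Y])
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:k].T

def fold_in(D, T, lang, n_src):
    """Project monolingual rows D into the shared space.

    A monolingual document is a dual document whose other-language half is
    all zeros, so only that language's rows of T contribute.
    """
    rows = T[:n_src] if lang == "src" else T[n_src:]
    return D @ rows

# Toy parallel corpus: documents 0-1 share one topic, documents 2-3 another.
X = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
Y = np.array([[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]], dtype=float)
n_src = X.shape[1]

T = fit_cl_lsi(X, Y, k=2)
q = fold_in(np.array([[1.0, 0.0, 0.0, 0.0]]), T, "src", n_src)[0]
docs = fold_in(Y, T, "tgt", n_src)

# Cosine similarity between the source-language query and target documents.
sims = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
cl_lsi_ranking = np.argsort(-sims)
```

Because both languages load onto a single set of latent dimensions, the two vocabularies are mixed in one space, which is the property the TDS model is designed to avoid.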
     Experiments on a self-designed bilingual corpus demonstrate that the TDS model can find 97.00% of translated counterparts and translate words correctly. Experimental results on the in-house dataset indicate that TDS outperforms CL-LSI in mate search and cross-language information retrieval. TDS is a language-independent model for mono- and cross-lingual retrieval, and it can extract bilingual topics that are thematically distinctive and semantically related across languages. Evaluations on the bilingual corpora TREC-5&6 and TREC-9 show that our model outperforms CL-LSI in mono- and cross-lingual retrieval tasks.
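Mate search, as the term is commonly used in the CL-LSI literature, asks whether a document's translation is retrieved first when the document itself is used as the query. A minimal sketch of a top-1 mate search score is given below; the abstract does not define the exact measure used in the thesis, so the top-1 criterion, the function name, and the toy vectors are assumptions.

```python
import numpy as np

def mate_search_accuracy(src_vecs, tgt_vecs):
    """Fraction of source documents whose most similar target document
    (by cosine similarity) is their parallel mate, i.e. the same row index."""
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = s @ t.T                      # sims[i, j]: source i vs. target j
    best = np.argmax(sims, axis=1)      # top-ranked target for each source
    return float(np.mean(best == np.arange(len(s))))

# Hypothetical topic-space vectors for three parallel document pairs.
src_topic_vecs = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
tgt_topic_vecs = np.array([[0.9, 0.1, 0.0],
                           [0.1, 0.8, 0.1],
                           [0.7, 0.3, 0.0]])   # a poorly aligned mate

accuracy = mate_search_accuracy(src_topic_vecs, tgt_topic_vecs)
```

Here the first two documents find their mates and the third does not, so the score is two out of three.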
     (4) A CLIR model based on bilingual topic correlation
     How to extract cross-language semantic meaning from bilingual parallel documents is important for improving CLIR. In the TDS model, the matrices for the two languages stand in a predictive relationship: they are asymmetric, and the two languages are not treated equally. Its time and space complexity are proportional to the number of documents, so the TDS model cannot process large-scale document sets effectively. Bilingual parallel documents share the same topics, which are semantically correlated. We propose a new bilingual topic correlation (BiTC) model for CLIR. The model views parallel documents as two language representations of the same semantic content and builds a single topic space for each language from the bilingual parallel corpus. Cross-lingual information retrieval is conducted in these new topic spaces. The new model overcomes the deficiency of CL-LSI, which does not fully take bilingual semantic relationships into account.
     Experimental results on the aligned Chinese-English news collection show that BiTC significantly outperforms CL-LSI in mate search and cross-lingual pseudo-query retrieval, and that it also performs well on the TREC-9 bilingual parallel corpus translated with Google Translate.
     (5) A cross-lingual text categorization/clustering method based on bilingual semantic corresponding analysis
     Bilingual text correspondence analysis can help to bridge the language barrier in cross-lingual corpora. Corpus-based cross-lingual latent semantic indexing does not fully take bilingual semantic relationships into account. This thesis proposes a new method that builds the semantic relationship between bilingual parallel documents via partial least squares. In this method, parallel documents are viewed as two language representations of the same semantic content, so that a unified latent semantic space can be constructed for the two languages. Cross-lingual text categorization is performed in the new bilingual latent semantic spaces.
     The Chinese-English document-aligned dataset used for evaluation was collected from the Hong Kong government news website. Experimental results on mono- and cross-lingual classification show that the performance of the presented method is close to or better than that of monolingual classification in the original feature spaces.
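The cross-lingual categorization setting can be sketched as training on labeled documents in one language and classifying documents in the other, with both projected into topic spaces fitted on unlabeled parallel text. The sketch below assumes a PLS-SVD projection and a nearest-centroid classifier; the abstract names neither the PLS variant nor the classifier, so both choices, along with all names and data, are hypothetical.

```python
import numpy as np

def fit_bilingual_topics(X, Y, k):
    """PLS-SVD (assumed variant): paired projections from parallel rows."""
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, :k], Vt[:k].T

def nearest_centroid_predict(train_vecs, train_labels, test_vecs):
    """Assign each test vector the label of the most cosine-similar centroid."""
    classes = np.unique(train_labels)
    centroids = np.stack([train_vecs[train_labels == c].mean(axis=0)
                          for c in classes])
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    t = test_vecs / (np.linalg.norm(test_vecs, axis=1, keepdims=True) + 1e-12)
    return classes[np.argmax(t @ c.T, axis=1)]

# Parallel corpus used only to fit the two projections (no labels needed).
X_par = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
Y_par = np.array([[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]], dtype=float)
W_src, W_tgt = fit_bilingual_topics(X_par, Y_par, k=2)

# Labeled training documents exist only in the source language.
train_docs = X_par
train_labels = np.array([0, 0, 1, 1])

# Unlabeled test documents are written in the target language.
test_docs = np.array([[3, 1, 0, 0], [0, 0, 1, 3]], dtype=float)

predicted = nearest_centroid_predict(train_docs @ W_src, train_labels,
                                     test_docs @ W_tgt)
```

The same projected vectors could be passed to any clustering routine for the clustering task; nearest-centroid is used here only because it keeps the example short.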
     The contributions of the thesis can be summarized as follows.
     (1) We propose a CLIR model based on a topic dual space (TDS). The model uses a linear semantic dual space to construct the bilingual topic space, addressing the problem that CL-LSI concatenates each pair of documents into a dual document. The TDS model captures the co-occurrence of terms in parallel documents and builds statistical dependencies that support translation and query expansion.
     (2) We present a bilingual topic correlation (BiTC) model for CLIR. It assumes that bilingual parallel documents share semantically correlated topics. The BiTC model constructs a single topic space for each language from the bilingual parallel corpus in order to build the bilingual semantic relationship. The new model addresses two problems: CL-LSI does not fully consider bilingual semantic relationships, and TDS cannot process large-scale data effectively.
     (3) We propose a cross-lingual text categorization/clustering method based on bilingual semantic correspondence analysis (BiSCAN). To address the problem that CL-LSI does not fully consider multiple correlations and structure information, BiSCAN constructs a single low-dimensional topic space for each language and builds bilingual semantic correspondences. The performance of CLTC and multi-language document clustering (MLDC) using BiSCAN is close to or better than that of monolingual classification and clustering in the original feature spaces.
引文
[1] W3techs. Usage of content languages for websites [EB/OL].2013, April.http://w3techs.com/technologies/overview/content_language/all.
    [2] I. W. Stats. Language internet word users by languages [EB/OL].2011, May31(2012, March18). http://www.internetworldstats.com/stats7.htm.
    [3]科技部.国家重点基础研究发展计划和重大科学研究计划2013年重要支持方向[EB/OL].2012年,2月10日. http://www.most.gov.cn/fggw/zfwj/zfwj2012/201202/W020120210626443434599.doc.
    [4] R. Baeza-Yates,B. Riberiro-Neto.现代信息检索[M].第2版.黄萱菁,张奇,邱锡鹏.北京:机械工业出版社,2012.
    [5] C. D. Manning, P. Raghavan,H. Sch tze.信息检索导论[M].第1版.王斌.北京:人民邮电出版社,2010.
    [6]闵金明,孙乐,张俊林.重新审视跨语言信息检索[J].中文信息学报,2006,20(4):33-40.
    [7] D. W. Oard,B. J. Dorr. A survey of multilingual text retrieval, UMIACS-TR-96-19CS-TR-3615[R]. College Park, MD, USA: University of Maryland at College Park,1996.
    [8]刘挺,秦兵,张宇,等.信息检索系统导论[M].北京:机械工业出版社,2008.
    [9] D. A. Grossman,O. Frieder.信息检索:算法与启发式方法(第2版)[M].第1版.张华平,李恒训,刘治华,等.北京:人民邮电出版社,2010.
    [10] K. Kishida. Technical issues of cross-language information retrieval: A review [J].Information Processing&Management,2005,41(3):433-455.
    [11] D. W. Oard,A. R. Diekema. Cross-language information retrieval [J]. Annual Review ofInformation Science and Technology,1998,33:223-256.
    [12] D. Zhou, M. Truran, T. Brailsford, et al. Translation techniques in cross-language informationretrieval [J]. ACM Comput. Surv.,2012,45(1):1-44.
    [13] G.-A. Levow, D. W. Oard,P. Resnik. Dictionary-based techniques for cross-languageinformation retrieval [J]. Information Processing&Management,2005,41(3):523-547.
    [14] J. Gao, J.-Y. Nie, E. Xun, et al. Improving query translation for cross-language informationretrieval using statistical models [C]. Proceedings of the24th annual international ACMSIGIR conference on Research and development in information retrieval (Sigir'01), NewOrleans, Louisiana, USA, September9-12,2001:96-104.
    [15]聂建云,陈江.利用平行网页建立中英文统计翻译模型[J].中文信息学报,2001,15(1):1-12.
    [16] Y. Zhang,P. Vines. Using the web for automated translation extraction in cross-languageinformation retrieval [C]. Proceedings of the27th annual international ACM SIGIRconference on Research and development in information retrieval (Sigir '04), Sheffield,United Kingdom, July25–29,2004:162-169.
    [17] C.-J. Lee, J. S. Chang, J.-S. R. Jang. Alignment of bilingual named entities in parallel corporausing statistical models and multiple knowledge sources [J]. ACM Transactions on AsianLanguage Information Processing,2006,5(2):121-145.
    [18] C.-C. Hsu,C.-H. Chen. Mining synonymous transliterations from the world wide web [J].2010,9(1):1-28.
    [19] Z. Wang, J. Li, Z. Wang, et al. Cross-lingual knowledge linking across wiki knowledge bases[C]. Proceedings of the21st international conference on World Wide Web (WWW '12), Lyon,France, April16-20,2012:459-468.
    [20] B. Herbert, G. Szarvas, I. Gurevych. Combining query translation techniques to improvecross-language information retrieval [C]. Proceedings of the33rd European Conference onAdvances in Information Retrieval (ECIR'11), Dublin, Ireland, April18-21,2011:712-715.
    [21] A. Shakery, C. Zhai. Leveraging comparable corpora for cross-lingual information retrieval inresource-lean language pairs [J]. Information Retrieval,2013,16(1):1-29.
    [22] R. Rahimi, A. Shakery. A language modeling approach for extracting translation knowledgefrom comparable corpora [C]. Proceedings of the35th European conference on Advances inInformation Retrieval, Moscow, Russia,2013:606-617.
    [23] W. Magdy, G. J. F. Jones. An efficient method for using machine translation technologies incross-language patent search [C]. Proceedings of the20th ACM international conference onInformation and knowledge management (CIKM'11), Glasgow, Scotland, UK, October24–28,2011:1925-1928.
    [24] J. Zhu, H. Wang. The effect of translation quality in mt-based cross-language informationretrieval [C]. Proceedings of the21st International Conference on Computational Linguisticsand the44th annual meeting of the Association for Computational Linguistics (COLING&ACL '06), Sydney, Australia, July17-21,2006:593-600.
    [25]张玥杰,郭依昆,连理,等.基于英汉机译实现跨语言信息检索[J].小型微型计算机系统,2004,25(7):1135-1140.
    [26] K. Parton, K. R. Mckeown, J. Allan, et al. Simultaneous multilingual search for translingualinformation retrieval [C]. Proceeding of the17th ACM conference on Information andknowledge management (CIKM'08), Napa Valley, California, USA, October26–30,2008:719-728.
    [27] V. Nikoulina, S. Clinchant. Domain adaptation of statistical machine translation models withmonolingual data for cross lingual information retrieval [C]. Proceedings of the35thEuropean conference on Advances in Information Retrieval, Moscow, Russia,2013:768-771.
    [28] D. A. Hull, G. Grefenstette. Querying across languages: A dictionary-based approach tomultilingual information retrieval [C]. Proceedings of the19th annual international ACMSIGIR conference on Research and development in information retrieval (Sigir'96), Zurich,Switzerland, August18-22,1996:49-57.
    [29] L. Ballesteros, W. B. Croft. Dictionary methods for cross-lingual information retrieval [C].Proceedings of the7th International Conference on Database and Expert SystemsApplications (DEXA '96), Zurich, Switzerland, September9-13,1996:791-801.
    [30] M. Federico,N. Bertoldi. Statistical cross-language information retrieval using n-best querytranslations [C]. Proceedings of the25th annual international ACM SIGIR conference onResearch and development in information retrieval (Sigir'02), Tampere, Finland, August11-15,2002:167-174.
    [31] L. Ballesteros, W. B. Croft. Resolving ambiguity for cross-language retrieval [C].Proceedings of the21st annual international ACM SIGIR conference on Research anddevelopment in information retrieval (Sigir'98), Melbourne, Australia, August24-28,1998:64-71.
    [32] J. Gao, M. Zhou, J.-Y. Nie, et al. Resolving query translation ambiguity using a decayingco-occurrence model and syntactic dependence relations [C]. Proceedings of the25th annualinternational ACM SIGIR conference on Research and development in information retrieval(Sigir '02), Tampere, Finland, August11-15,2002:183-190.
    [33]林建方.词搭配抽取及在信息检索中的应用研究[D]:博士学位论文.哈尔滨:哈尔滨工业大学,2010.
    [34] Y. Liu, R. Jin, J. Y. Chai. A maximum coherence model for dictionary-based cross-languageinformation retrieval [C]. Proceedings of the28th annual international ACM SIGIRconference on Research and development in information retrieval (Sigir'05), Salvador, Brazil,August15–19,2005:536-543.
    [35] W.-H. Lu, L.-F. Chein, H.-J. Lee. Anchor text mining for translation extraction of query terms[C]. Proceedings of the24th annual international ACM SIGIR conference on Research anddevelopment in information retrieval (Sigir'01), New Orleans, Louisiana, USA, September9-13,2001:388-389.
    [36] L. Ballesteros, W. B. Croft. Phrasal translation and query expansion techniques forcross-language information retrieval [C]. Proceedings of the20th annual international ACMSIGIR conference on Research and development in information retrieval (Sigir'97),Philadelphia, Pennsylvania, United States, July27-31,1997:84-91.
    [37] P. Mcnamee, J. Mayfield. Comparing cross-language query expansion techniques bydegrading translation resources [C]. Proceedings of the25th annual international ACMSIGIR conference on Research and development in information retrieval (Sigir'02), Tampere,Finland, August11-15,2002:159-166.
    [38] D. He, D. Wu. Enhancing query translation with relevance feedback in translingualinformation retrieval [J]. Information Processing&Management,2011,47(1):1-17.
    [39] M. K. Chinnakotla, K. Raman,P. Bhattacharyya. Multilingual prf: English lends a helpinghand [C]. Proceeding of the33rd international ACM SIGIR conference on Research anddevelopment in information retrieval (Sigir'10), Geneva, Switzerland, July19–23,2010:659-666.
    [40] M. K. Chinnakotla, K. Raman, P. Bhattacharyya. Multilingual pseudo-relevance feedback:Performance study of assisting languages [C]. Proceedings of the48th Annual Meeting of theAssociation for Computational Linguistics (ACL'10), Uppsala, Sweden, July11-16,2010:1346-1356.
    [41] V. M. Orengo, C. Huyck. Relevance feedback and cross-language information retrieval [J].Information Processing&Management,2006,42(5):1203-1217.
    [42]吴丹,何大庆,王惠临.基于伪相关反馈的跨语言查询扩展[J].情报学报,2010,29(2):232-239.
    [43] J. Gao, J.-Y. Nie. A study of statistical models for query translation: Finding a good unit oftranslation [C]. Proceedings of the29th annual international ACM SIGIR conference onResearch and development in information retrieval (Sigir'06), Seattle, Washington, USA,August6–11,2006:194-201.
    [44] D. Zhou, M. Truran, T. Brailsford, et al. Gcon: A graph-based technique for resolvingambiguity in query translation candidates [C]. Proceedings of the2008ACM symposium onApplied computing (SAC '08), Fortaleza, Ceara, Brazil, March16-20,2008:1566-1573.
    [45] W. Gao, C. Niu, J.-Y. Nie, et al. Cross-lingual query suggestion using query logs of differentlanguages [C]. Proceedings of the30th annual international ACM SIGIR conference onResearch and development in information retrieval (Sigir'07), Amsterdam, The Netherlands,July23–27,2007:463-470.
    [46]胡蓉. Web日志和子空间聚类挖掘算法研究[D]:博士学位论文.武汉:华中科技大学,2008.
    [47]王进,陈恩红,张振亚,等.基于本体的跨语言信息检索模型[J].中文信息学报,2004,18(3):1-8.
    [48]郑德权,李生,赵铁军,等.结合本体论和统计方法的跨语言信息检索模型[J].哈尔滨工业大学学报,2008,40(1):77-80.
    [49] G. D. Melo, S. Siersdorfer. Multilingual text classification using ontologies [C]. Proceedingsof the29th European Conference on Advances in Information Retrieval (ECIR'07), Rome,Italy, April2-5,2007:541-548.
    [50] F. C. Gey. Search between chinese and japanese text collections [C]. Proceedings ofNTCIR-6Workshop, Tokyo, Japan, May15-18,2007.
    [51] T. Gollins, M. Sanderson. Improving cross language retrieval with triangulated translation [C].Proceedings of the24th annual international ACM SIGIR conference on Research anddevelopment in information retrieval (SIGIR'01), New Orleans, Louisiana, USA, September09-12,2001:90-95.
    [52]徐戈,王厚峰.自然语言处理中主题模型的发展[J].计算机学报,2011,34(8):1423-1436.
    [53] D. M. Blei. Probabilistic topic models [J]. Communications of the ACM,2012,55(4):77-84.
    [54] Q. Wang, J. Xu, H. Li, et al. Regularized latent semantic indexing [C]. Proceedings of the34th international ACM SIGIR conference on Research and development in InformationRetrieval (Sigir '11), Beijing, China, July24–28,2011:685-694.
    [55] T. Hofmann. Probabilistic latent semantic indexing [C]. Proceedings of the22nd annualinternational ACM SIGIR conference on Research and development in information retrieval(Sigir '96), Berkeley, California, USA, August18-22,1999:50-57.
    [56] D. Zhang, Q. Mei, C. Zhai. Cross-lingual latent topic extraction [C]. Proceedings of the48thAnnual Meeting of the Association for Computational Linguistics (ACL'10), Uppsala,Sweden, July11-16,2010:1128-1137.
    [57] T. Muramatsu, T. Mori. Integration of plsa into probabilistic clir model [C]. Proceedings ofNTCIR-04, Tokyo, April2003-June2004,2004.
    [58]金千里,赵军,徐波.弱指导的统计隐含语义分析及其在跨语言信息检索中的应用[C].语言计算与基于内容的文本处理——全国第七届计算语言学联合学术会议论文集,哈尔滨,黑龙江,8月9-11日,2003:527-533.
    [59] D. M. Blei, A. Y. Ng, M. I. Jordan. Latent dirichlet allocation [J]. The Journal of MachineLearning Research,2003,3:993-1022.
    [60] M. Rosen-Zvi, T. Griffiths, M. Steyvers, et al. The author-topic model for authors anddocuments [C]. Proceedings of the20th conference on Uncertainty in artificial intelligence(UAI'04), Banff, Canada, July7-11,2004:487-494.
    [61] D. M. Blei,J. D. Lafferty. A correlated topic model of science [J]. The Annals of AppliedStatistics,2007,1(1):17-35.
    [62] S. Negi, V. Kunj. Mining bilingual topic hierarchies from unaligned text [C]. Proceedings ofthe5th International Joint Conference on Natural Language Processing (JCNLP'11),Chiang Mai, Thailand, November8-13,2011:992–1000.
    [63] D. M. Blei, J. D. Lafferty. Dynamic topic models [C]. Proceedings of the23rd internationalconference on Machine learning (ICML'06), Pittsburgh, Pennsylvania, USA, June25-29,2006,2006:113-120.
    [64] D. Mimno, H. M. Wallach, J. Naradowsky, et al. Polylingual topic models [C]. Proceedings ofthe2009Conference on Empirical Methods in Natural Language Processing (EMNLP'09),Singapore, August6-7,2009:880-889.
    [65] I. Vuli, W. D. Smet, M.-F. Moens. Cross-language information retrieval models based onlatent topic models trained with document-aligned comparable corpora [J]. InformationRetrieval,2012:1-38.
    [66] I. Vulic, W. D. Smet, M.-F. Moens. Cross-language information retrieval with latent topicmodels trained on a comparable corpus [C]. Proceedings of the7th Asia conference onInformation Retrieval Technology (AIRS'11), Dubai, United Arab Emirates, December18-20,2011:37-48.
    [67] I. Vulic, W. D. Smet, M.-F. Moens. Identifying word translations from comparable corporausing latent topic models [C]. Proceedings of the49th Annual Meeting of the Association forComputational Linguistics: Human Language Technologies (ACL&HLT'11), Portland,Oregon, USA, June19-24,2011:479-484.
    [68] S. T. Dumais, T. A. Letsche, M. L. Littman, et al. Automatic cross-language retrieval usinglatent semantic indexing, Proceedings of AAAI-97Spring Symposium Series:Cross-Language Text and Speech Retrieval, AAAI Technical Report SS-97-05[R].Providence, Rhode Island, USA,1997.
    [69] M. W. Berry, P. G. Young. Using latent semantic indexing for multilanguage informationretrieval [J]. Computers and the Humanities,1995,29(6):413-429.
    [70] T. K. Landauer, M. L. Littman. Fully automatic cross-language document retrieval usinglatent semantic indexing [C]. In Proceedings of the Sixth Annual Conference of the UWCentre for the New Oxford English Dictionary and Text Research, University of Waterloo,Waterloo, Ontario, Canada, October28-30,1990:31-38.
    [71] M. L. Littman, S. T. Dumais, T. K. Landauer. Automatic cross-linguistic information retrievalusing latent semantic indexing [C]. In Proceedings of SIGIR'96Workshop onCross-Linguistic Information Retrieval, Zurich, Switzerland, October10,1996:16-24.
    [72] J. C. Platt, K. Toutanova, W.-T. Yih. Translingual document representations fromdiscriminative projections [C]. Proceedings of the2010Conference on Empirical Methods inNatural Language Processing (EMNLP'10), Cambridge, Massachusetts, USA, October9-11,2010:251-261.
    [73] H. Wang, H. Huang, F. Nie, et al. Cross-language web page classification via dual knowledgetransfer using nonnegative matrix tri-factorization [C]. Proceedings of the34th internationalACM SIGIR conference on Research and development in Information (Sigir'11), Beijing,China, July24-28,2011:933-942.
    [74] J. Pan, G.-R. Xue, Y. Yu, et al. Cross-lingual sentiment classification via bi-viewnon-negative matrix tri-factorization [C]. Proceedings of the15th Pacific-Asia conference onAdvances in knowledge discovery and data mining-Volume Part I, Shenzhen, China, May24-27,2011:289-300.
    [75]宁健,林鸿飞.基于改进潜在语义分析的跨语言检索[J].中文信息学报,2010,24(3):105-111.
    [76] R. Udupa, M. Khapra. Transliteration equivalence using canonical correlation analysis [C].Proceedings of the32nd European Conference on Advances in Information Retrieval (ECIR'10), Milton Keynes, UK, March28-31,2010:75-86.
    [77] B. Fortuna, J. Rupnik, B. Pajntar, et al. Cross-lingual search over22european languages [C].Proceedings of the31st annual international ACM SIGIR conference on Research anddevelopment in information retrieval (Sigir '08), Singapore, Singapore, July20-24,2008:883-883.
    [78] A. Vinokourov, J. Shawe-Taylor, N. Cristianini. Inferring a semantic representation of text viacross-language correlation analysis [C]. Advances in neural information processing systems(NIPS '03), Vancouver, British Columbia, Canada, December8-13,2003:1473-1480.
    [79] Y. Li, J. Shawe-Taylor. Using kcca for japanese–english cross-language information retrievaland document classification [J]. Journal of Intelligent Information Systems,2006,27(2):117-133.
    [80] Y. Li, J. Shawe-Taylor. Advanced learning algorithms for cross-language patent retrieval andclassification [J]. Information Processing&Management,2007,43(5):1183-1199.
    [81] N. Bel, C. Koster, M. Villegas. Cross-lingual text categorization [C]. Proceedings of the25thEuropean Conference on Advances in Information Retrieval (ECIR'03), Pisa, Italy, April14-16,2003:126-139.
    [82] J. S. Olsson, D. W. Oard, J. Hajic. Cross-language text classification [C]. Proceedings of the28th annual international ACM SIGIR conference on Research and development ininformation retrieval (Sigir'05), Salvador, Brazil, August15-19,2005:645-646.
    [83] A. Gliozzo, C. Strapparava. Exploiting comparable corpora and bilingual dictionaries forcross-language text categorization [C]. Proceedings of the21st International Conference onComputational Linguistics and the44th annual meeting of the Association for ComputationalLinguistics (COLING&ACL'06), Sydney, Australia, July17–21,2006:553-560.
    [84] C.-P. Wei, Y.-T. Lin, C. C. Yang. Cross-lingual text categorization: Conquering languageboundaries in globalized environments [J]. Information Processing&Management,2011,47(5):786-804.
    [85] K. Wu, X. Wang, B. Lv. Cross language text categorization using a bilingual lexicon [C]. InProceedings of the Third International Joint Conference on Natural Language Processing(IJCNLP'08), Hyderabad, India, January7-12,2008:165-172.
    [86] K. Wu, B.-L. Lu. A refinement framework for cross language text categorization [C].Proceedings of the4th Asia information retrieval conference on Information retrievaltechnology (AIRS'08), Harbin, China, January15-18,2008:401-411.
    [87]高影繁,王惠临,徐红姣.基于跨语言文本分类的跨语言特征提取方法研究[J].情报学报,2011,30(12):1242-1248.
    [88] L. Rigutini, M. Maggini, B. Liu. An em based training algorithm for cross-language textcategorization [C]. The Proceedings IEEE/WIC/ACM International Conference on WebIntelligence (WI'05), Compiègne University of Technology, France, September19-22,2005:529-535.
    [89] B. Wei, C. Pal. Cross lingual adaptation: An experiment on sentiment classifications [C]. Proceedings of the ACL 2010 Conference Short Papers (ACL'10), Uppsala, Sweden, July 11-16,2010:258-262.
    [90] P. Prettenhofer, B. Stein. Cross-lingual adaptation using structural correspondence learning [J]. ACM Transactions on Intelligent Systems and Technology,2011,3(1):1-22.
    [91] P. Prettenhofer, B. Stein. Cross-language text classification using structural correspondence learning [C]. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL'10), Uppsala, Sweden, July 11-16,2010:1118-1127.
    [92] Y. Zhang, F. S. Tsai, A. T. Kwee. Multilingual sentence categorization and novelty mining [J]. Information Processing & Management,2011,47(5):667-675.
    [93] B. Lu, C. Tan, C. Cardie, et al. Joint bilingual sentiment classification with unlabeled parallel corpora [C]. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL&HLT '11), Portland, Oregon, USA, June 19-24,2011:320-330.
    [94] Z. Lin, S. Tan, X. Cheng. Language-independent sentiment classification using three common words [C]. Proceedings of the 20th ACM international conference on Information and knowledge management (CIKM '11), Glasgow, Scotland, UK, October 24–28,2011:1041-1046.
    [95] W. D. Smet, J. Tang, M.-F. Moens. Knowledge transfer across multilingual corpora via latent topics [C]. Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining (PAKDD'11), Shenzhen, China, May 24-27,2011:549-560.
    [96] X. Ni, J.-T. Sun, J. Hu, et al. Cross lingual text classification by mining multilingual topics from Wikipedia [C]. Proceedings of the fourth ACM international conference on Web search and data mining (WSDM'11), Hong Kong, China, February 9-12,2011:375-384.
    [97] Y. Wu, D. W. Oard. Bilingual topic aspect classification with a few training examples [C]. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'08), Singapore, Singapore, July 20–24,2008:203-210.
    [98] C.-H. Lee, H.-C. Yang. Construction of supervised and unsupervised learning systems for multilingual text categorization [J]. Expert Systems with Applications,2009,36(2, Part 1):2400-2410.
    [99] F. Zhuang, P. Luo, C. Du, et al. Triplex transfer learning: Exploiting both shared and distinct concepts for text classification [C]. Proceedings of the sixth ACM international conference on Web search and data mining (WSDM 2013), Rome, Italy, February 4-8,2013:425-434.
    [100]熊超,王明文,吴福英,等.基于潜在语义对偶空间的跨语言文本分类研究[J].广西师范大学学报(自然科学版),2010,28(1):157-160.
    [101] X. Wan. Bilingual co-training for sentiment classification of Chinese product reviews [J]. Computational Linguistics,2011,37(3):587-616.
    [102] X. Wan. Co-training for cross-lingual sentiment classification [C]. Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, Suntec, Singapore, August 2-7,2009:235-243.
    [103] M. R. Amini, N. Usunier, C. Goutte. Learning from multiple partially observed views - an application to multilingual text categorization [C]. Advances in Neural Information Processing Systems (NIPS'09), Vancouver, British Columbia, Canada, December 7-10,2009:28-36.
    [104] M.-R. Amini, C. Goutte. A co-classification approach to learning from multilingual corpora [J]. Machine Learning,2010,79(1-2):105-121.
    [105] M. R. Amini, C. Goutte, N. Usunier. Combining coregularization and consensus-based self-training for multilingual text categorization [C]. Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR'10), Geneva, Switzerland, July 19–23,2010:475-482.
    [106] Y. Guo, M. Xiao. Cross language text classification via subspace co-regularized multi-view learning [C]. Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, UK, June 26-July 1,2012.
    [107] W. Dai, Y. Chen, G.-R. Xue, et al. Translated learning: Transfer learning across different feature spaces [C]. Proceedings of the Advances in Neural Information Processing Systems (NIPS'08), British Columbia, Canada, December 8-11,2008:353-360.
    [108] X. Ling, G.-R. Xue, W. Dai, et al. Can Chinese web pages be classified with English data source? [C]. Proceeding of the 17th international conference on World Wide Web (WWW'08), Beijing, China, April 21–25,2008:969-978.
    [109] C. Wan, R. Pan, J. Li. Bi-weighting domain adaptation for cross-language text classification [C]. The Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI'11), Barcelona, Spain, July 16-22,2011:1535-1540.
    [110] L. Shi, R. Mihalcea, M. Tian. Cross language text classification by model translation and semi-supervised learning [C]. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP'10), Cambridge, Massachusetts, USA, October 9-11,2010:1057-1067.
    [111] B. Mathieu, R. Besançon, C. Fluhr. Multilingual document clusters discovery [C]. Recherche d'Information Assistée par Ordinateur Proceedings (RIAO 2004), Avignon, France, April 26-28,2004:1-10.
    [112] H.-H. Chen, C.-J. Lin. A multilingual news summarizer [C]. Proceedings of the 18th conference on Computational linguistics - Volume 1, Saarbrucken, Germany,2000:159-165.
    [113] D. K. Evans, J. L. Klavans. A platform for multilingual news summarization [R]. New York: C. U. Department of Computer Science,2003.
    [114] N. K. Kumar, K. G. S. Santosh, V. Varma. Multilingual document clustering using Wikipedia as external knowledge [C]. Proceedings of the Second international conference on Multidisciplinary information retrieval facility (IRFC'11), LNCS 6653, Vienna, Austria,2011:108-117.
    [115] B. Pouliquen, R. Steinberger, C. Ignat, et al. Multilingual and cross-lingual news topic tracking [C]. Proceedings of the 20th international conference on Computational Linguistics (COLING '04), Geneva, Switzerland,2004:959.
    [116]唐国瑜,夏云庆,张民,等.基于跨语言广义向量空间模型的跨语言文档聚类方法[J].中文信息学报,2012,(2):116-120.
    [117] H.-H. Chen, J.-J. Kuo, T.-C. Su. Clustering and visualization in a multi-lingual multi-document summarization system [C]. Proceedings of the 25th European conference on IR research, Pisa, Italy,2003:266-280.
    [118] K. Kishida. Double-pass clustering technique for multilingual document collections [J]. Journal of Information Science,2011,37(3):304-321.
    [119] S. Montalvo, R. Martinez, A. Casillas, et al. Multilingual document clustering: An heuristic approach based on cognate named entities [C]. Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Sydney, Australia,2006:1145-1152.
    [120] S. Montalvo, R. Martinez, A. Casillas, et al. Multilingual news clustering: Feature translation vs. Identification of cognate named entities [J]. Pattern Recognition Letters,2007,28(16):2305-2311.
    [121] N. K. Kumar, G. S. K. Santosh, V. Varma. A language-independent approach to identify the named entities in under-resourced languages and clustering multilingual documents [C]. CLEF'11 Proceedings of the Second international conference on Multilingual and multimodal information access evaluation, Amsterdam, The Netherlands,2011:74-82.
    [122] X. Wang, B. Qian, I. Davidson. Improving document clustering using automated machine translation [C]. Proceedings of the 21st ACM international conference on Information and knowledge management, Maui, Hawaii, USA, October 29-November 2,2012:645-653.
    [123] G. Tholpadi, M. K. Das, C. Bhattacharyya, et al. Cluster labeling for multilingual scatter/gather using comparable corpora [C]. Proceedings of the 34th European conference on Advances in Information Retrieval, Barcelona, Spain,2012:388-400.
    [124] C.-P. Wei, C. C. Yang, C.-M. Lin. A latent semantic indexing-based approach to multilingual document clustering [J]. Decision Support Systems,2008,45(3):606-620.
    [125] Y.-M. Kim, M.-R. Amini, C. Goutte, et al. Multi-view clustering of multilingual documents [C]. Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, Geneva, Switzerland,2010:821-822.
    [126] Y. Jiang, J. Liu, Z. Li, et al. Collaborative PLSA for multi-view clustering [C]. 2012 21st International Conference on Pattern Recognition (ICPR), Tsukuba, Japan, 11-15 Nov,2012:2997-3000.
    [127] J. M. Ponte, W. B. Croft. A language modeling approach to information retrieval [C]. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, Melbourne, Australia, August 24-28,1998:275-281.
    [128] C. Zhai. Statistical language models for information retrieval [M]. Morgan & Claypool Publishers,2008.
    [129] Y. Guo, M. Xiao. Transductive representation learning for cross-lingual text classification [C]. Data Mining (ICDM), 2012 IEEE 12th International Conference on, 10-13 Dec. 2012,2012:888-893.
    [130] I. Vulić, M.-F. Moens. A unified framework for monolingual and cross-lingual relevance modeling based on probabilistic topic models [M]. Springer, ECIR 2013, LNCS 7814,2013:98-109.
    [131] M. Rogati, Y. Yang. Resource selection for domain-specific cross-lingual IR [C]. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'04), Sheffield, United Kingdom, July 25-29,2004:154-161.
    [132] J.-Y. Nie. Cross-language information retrieval [J]. Synthesis Lectures on Human Language Technologies,2010,3(1):1-125.
    [133] S. Deerwester, S. T. Dumais, G. W. Furnas, et al. Indexing by latent semantic analysis [J]. Journal of the American Society for Information Science,1990,41(6):391-407.
    [134] T. Mori, T. Kokubu, T. Tanaka. Cross-lingual information retrieval based on LSI with multiple word spaces [C]. In Proceedings of the 2nd NTCIR Workshop Meeting on Evaluation of Chinese & Japanese Text Retrieval and Text Summarization, Tokyo: National Institute of Informatics, May 2000-March 2001,2001:
    [135] B. Rehder, M. L. Littman, S. Dumais, et al. Automatic 3-language cross-language information retrieval with latent semantic indexing [J]. NIST SPECIAL PUBLICATION SP,1998:233-240.
    [136] W.-T. Yih, K. Toutanova, J. C. Platt, et al. Learning discriminative projections for text similarity measures [C]. Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL'11), Portland, Oregon, USA, June 23–24,2011:247-256.
    [137]黄国斌,王明文,叶浩.一种新的基于中间语义的跨语言信息检索模型[J].中文信息学报,2009,(2):77-82.
    [138]邹小芳,王明文,左家莉,等.新的基于中间语义的多语言信息检索模型[J].小型微型计算机系统,2010,(04):696-701.
    [139] X.-Q. Zeng, M.-W. Wang, J.-Y. Nie. Text classification based on partial least square analysis [C]. Proceedings of the 2007 ACM symposium on Applied computing (SAC'07), Seoul, Korea, March 11-15,2007:834-838.
    [140] P. A. Chew, B. W. Bader, T. G. Kolda, et al. Cross-language information retrieval using PARAFAC2 [C]. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (SIGKDD'07), San Jose, California, USA, August 12–15,2007:143-152.
    [141] M. L. Littman, F. Jiang, G. A. Keim. Learning a language-independent representation for terms from a partially aligned corpus [C]. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, Wisconsin, USA, July 24-27,1998:314-322.
    [142]王惠文,吴载斌,孟洁.偏最小二乘回归的线性与非线性方法[M].第1版.北京:国防工业出版社,2006.
    [143]王桂增,叶昊.主元分析与偏最小二乘法[M].第1版.北京:清华大学出版社,2012.
    [144] V. Esposito Vinzi, G. Russolillo. Partial least squares algorithms and methods [J]. Wiley Interdisciplinary Reviews: Computational Statistics,2013,5(1):1-19.
    [145] G. H. Golub, C. F. V. Loan.矩阵计算(第3版)[M].袁亚湘等.北京:科学出版社,2011.
    [146] L. Li, X. Jin, M. Long. Topic correlation analysis for cross-domain text classification [C]. Twenty-Sixth AAAI Conference on Artificial Intelligence,2012.
    [147] W.-X. Bi, M.-W. Wang, Y.-S. Luo, et al. A new cross language text categorization based on interlingua semantic [J]. Journal of Computational Information Systems,2008,4(1):105-110.
