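With the rapid development of the Internet and the acceleration of globalization, the information resources available on the Internet are no longer concentrated in English and a few other languages, and the demand for using one's native language to search for information expressed in other languages keeps growing. Cross-language Information Retrieval (CLIR) is a fast and effective means of representing, storing, organizing and accessing multilingual information resources, and is a challenging, cutting-edge research area within information retrieval.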
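     From the perspective of natural language understanding, texts in multiple languages are multi-view representations in which different linguistic symbol systems give meaning to the objects being described. In essence, these views are semantically equivalent. This thesis assumes that bilingual parallel documents share the same semantic information. Using the statistical theory of partial least squares (PLS), it extracts the semantic information shared by parallel documents from a bilingual parallel corpus and builds topic spaces with bilingual correspondence, thereby establishing a general CLIR framework based on a bilingual topic space.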
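     We collected Chinese-English news pages from websites such as the Wall Street Journal, the Financial Times and the Hong Kong government news site. Following a pipeline of identifying parallel web pages, preprocessing files, aligning paragraphs, labeling document categories, building a query set and judging document relevance, we built our own Chinese-English parallel corpus, CLIR evaluation corpus and cross-language text categorization evaluation corpus. A TREC-9 Chinese-English bilingual parallel corpus was built by translating the TREC-9 document collection with a program that uses the Google API 1.0 interface.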
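     The cross-language latent semantic indexing model (Cross-Language Latent Semantic Indexing, CL-LSI) concatenates each pair of bilingual documents into a single document and uses the co-occurrence of bilingual terms to capture the semantic links between the two languages, but it does not fully consider the inherent characteristics of each language or the bilingual semantic correlation. This thesis assumes that, in a bilingual parallel corpus, the document sets of the two languages carry the same underlying topics. It represents the bilingual topics with a linear semantic dual space and, on that basis, proposes a cross-language retrieval model based on a topic dual space (Topic Dual Space model, TDS). By capturing the co-occurrence of bilingual terms in parallel documents, the TDS model establishes their statistical dependencies and derives relations between them such as translation and relatedness.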
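     How to extract the semantic correspondence between languages from a bilingual parallel corpus is of great importance for improving CLIR performance. In the TDS model, the document matrices of the two languages stand in a predictive relationship; the method is asymmetric and does not treat the two languages equally. Its time and space complexity are proportional to the number of bilingual documents, so it cannot handle large document collections effectively. This thesis assumes that bilingual parallel documents have the same topics, which in a concrete model appear as semantic correlation. We treat bilingual parallel documents as two language representations of the same semantic content and construct a latent semantic space for each language from the bilingual parallel corpus, which leads to the bilingual topic correlation model (Bilingual Topic Correlation, BiTC).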
With the rapid development of the Internet and the acceleration of globalization, information resources on the Internet are no longer concentrated in English and a few other common languages, and the need to search for information in non-native languages is increasing. Because the Internet holds multilingual resources and most users are not proficient in non-native languages, users inevitably face language barriers. Cross-language Information Retrieval (CLIR) is an effective way to represent, store, organize and access multilingual information. It is a challenging, cutting-edge field in information retrieval (IR).
     Cross-language information retrieval addresses the problem of retrieving documents in one language with queries in another. The key is how to build the semantic relationship between a query in the source language and a document in the target language. Topic models have become an effective method in CLIR and have, in recent years, drawn the attention of researchers in machine learning, information retrieval, natural language processing and related fields. This thesis focuses on a CLIR model, a cross-language text categorization (CLTC) method and a cross-language text clustering (or multi-language text clustering) method, all based on bilingual topics. These models and methods can effectively address translation ambiguity and partly solve the problem of translating unknown words. The main research findings of this thesis are summarized as follows:
     (1) A CLIR framework based on bilingual topic space
     Natural language can be regarded as strings of meaningful symbols that describe semantic objects in the real world. Texts in multiple languages are multiple views of the same object, and these views are semantically equivalent. We assume that the topics of a parallel text share the same semantic meaning across languages, that is, they are sampled from the same document-topic distribution. On this basis we propose a CLIR framework based on a bilingual topic space. In the framework, the semantic meaning shared by parallel documents is extracted with the partial least squares (PLS) method, and topic spaces are built to model the semantic relationship across languages.
     The topic space of each language consists of the topics extracted from the bilingual parallel corpus, and each topic space is independent. The bilingual topic space models the semantic relationship between languages. It is an abstract concept space that reveals the semantic correspondence between documents, between documents and terms, and between terms, and it uncovers the inherent structure and internal relations of the corpus. Mathematically, the two topic spaces are approximately equivalent. Once a query or document has been projected onto the bilingual topic space, cross-language information retrieval, cross-language text classification and cross-language text clustering can be carried out without direct translation or a bilingual dictionary.
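     The projection step described above can be illustrated with a minimal sketch. It assumes one particular PLS flavor (a truncated SVD of the cross-product matrix X^T Y, often called PLS-SVD) and omits the usual centering and scaling so that the toy numbers stay easy to follow; the abstract does not say which PLS variant the thesis uses. The function names, the four-term vocabularies and the toy count matrices are all hypothetical.

```python
import numpy as np

def fit_bilingual_topics(X, Y, k):
    """Return term-to-topic maps (U_k, V_k) for the two languages.

    X: parallel documents x source-language terms
    Y: parallel documents x target-language terms (row i is the mate of X[i])
    Assumption: PLS-SVD on the raw cross-product matrix, no centering.
    """
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, :k], Vt[:k].T

def cosine_scores(q_topics, D_topics):
    """Cosine similarity between one query and every document, in topic space."""
    q = q_topics / (np.linalg.norm(q_topics) + 1e-12)
    D = D_topics / (np.linalg.norm(D_topics, axis=1, keepdims=True) + 1e-12)
    return D @ q

# Toy parallel corpus: documents 0-1 are about finance, documents 2-3 about weather.
# Hypothetical source vocabulary: [bank, loan, rain, storm]; target vocabulary aligned by topic.
X = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
Y = np.array([[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]], dtype=float)

U_k, V_k = fit_bilingual_topics(X, Y, k=2)

# A source-language query ("loan") is projected with U_k,
# target-language documents with V_k; no dictionary is consulted.
query = np.array([0, 1, 0, 0], dtype=float)
scores = cosine_scores(query @ U_k, Y @ V_k)
ranking = np.argsort(-scores)
```

     In this toy setting the two finance documents score close to 1 and the two weather documents close to 0, which is the behavior the framework relies on: relevance is decided in the shared topic space rather than through term-level translation.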
     (2) Construction of a Chinese-English parallel corpus for CLIR
     Corpora are an important basic data resource for CLIR. They are used for evaluation, for translation and for constructing bilingual dictionaries.
     We collected bilingual news stories from the websites of the Wall Street Journal, the Financial Times and Hong Kong government news to construct a CLIR evaluation corpus, a bilingual parallel corpus and a CLTC evaluation corpus. The construction steps were selecting parallel web pages, preprocessing documents, aligning passages, labeling document classes, building a query set and judging document relevance. The TREC-9 document set for CLIR was translated with a Google API 1.0 interface program to create a TREC-9 bilingual parallel corpus.
     (3) A CLIR model based on topic dual space
     In the cross-language latent semantic indexing model (CL-LSI), each pair of documents is concatenated into a dual document, and the semantic relationship between languages is captured by exploiting the co-occurrence of terms across languages. However, mixing the documents in this way does not fully consider the inherent features of each language or the semantic correlation across languages. Assuming that parallel documents share the same topics, we represent the bilingual topic space with a linear latent semantic dual space. The two topic spaces in the bilingual topic space are linear function spaces that are dual to each other, and each pair of topics is semantically independent. On this basis we propose a topic dual space model for CLIR (TDS). The TDS model captures the co-occurrence of terms in parallel documents and builds statistical dependencies between them.
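     For comparison, the CL-LSI baseline described at the start of this paragraph can be sketched as follows: stack the two term spaces so that every parallel pair becomes one dual document, take a truncated SVD, and fold zero-padded monolingual vectors into the resulting space. This is a sketch of the general CL-LSI idea only, not of the thesis's experimental setup; the helper names and the toy matrices are hypothetical.

```python
import numpy as np

def fit_cl_lsi(X, Y, k):
    """Truncated SVD of the term-by-document matrix of concatenated (dual) documents."""
    A = np.vstack([X.T, Y.T])  # (source terms + target terms) x documents
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(term_vector, U_k, s_k):
    """Project a zero-padded term vector into the k-dimensional LSI space."""
    return (term_vector @ U_k) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy parallel corpus: documents 0-1 are about finance, documents 2-3 about weather.
X = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
Y = np.array([[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]], dtype=float)
n_src, n_tgt = X.shape[1], Y.shape[1]

U_k, s_k = fit_cl_lsi(X, Y, k=2)

# A source-language query only fills the source half of the dual vocabulary;
# a target-language document only fills the target half.
query = np.concatenate([np.array([0, 1, 0, 0], dtype=float), np.zeros(n_tgt)])
target_docs = np.hstack([np.zeros((Y.shape[0], n_src)), Y])

q_hat = fold_in(query, U_k, s_k)
scores = [cosine(q_hat, fold_in(d, U_k, s_k)) for d in target_docs]
```

     Note that the two languages share one set of singular vectors here, which is exactly the "mixture of documents" that the paragraph criticizes; the sketch shown for the PLS-based framework keeps a separate projection for each language.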
     Experiments on a self-built bilingual corpus show that the TDS model retrieves 97.00% of the translated counterparts and correctly translated words. Results on the in-house dataset indicate that TDS outperforms CL-LSI in mate search and cross-language information retrieval. TDS is a language-independent model for mono- and cross-lingual retrieval, and it can extract bilingual topics that exhibit thematic characteristics and bilingual semantic relationships. Evaluations on the TREC-5&6 and TREC-9 bilingual corpora show that our model outperforms CL-LSI in mono- and cross-lingual retrieval tasks.
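     The abstract does not define how mate search is scored. A common convention, assumed here, is to use each document in one language as a query and count a hit when its parallel counterpart (its mate) is ranked first among the documents of the other language. The function name and the similarity values below are hypothetical.

```python
from typing import Sequence

def mate_search_accuracy(sim: Sequence[Sequence[float]]) -> float:
    """Fraction of source documents whose mate is the top-ranked target document.

    sim[i][j] is the similarity between source document i and target
    document j; the mate of source document i is target document i.
    """
    if not sim:
        raise ValueError("similarity matrix must not be empty")
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(sim)

# Illustrative similarities for three parallel pairs: the third source
# document is closer to target document 1 than to its own mate.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.5, 0.6, 0.1],
]
accuracy = mate_search_accuracy(sim)  # two of the three mates are ranked first
```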
     (4) A CLIR model based on bilingual topic correlation
     How to extract cross-language semantic meaning from bilingual parallel documents is important for improving CLIR. In the TDS model, the matrices of the two languages stand in a predictive relationship; the model is asymmetric and does not treat the two languages equally. Its time and space complexity are proportional to the number of documents, so TDS cannot effectively process large-scale document sets. Bilingual parallel documents share the same topics, which are semantically correlated. We therefore propose a new bilingual topic correlation model (BiTC) for CLIR. The model views parallel documents as two language representations of the same semantic content and builds a single topic space for each language from the bilingual parallel corpus. Cross-lingual information retrieval is conducted in these new topic spaces. The new model overcomes the deficiency of CL-LSI, which does not fully take the bilingual semantic relationship into account.
     Experimental results on the aligned Chinese-English news collection show that BiTC significantly outperforms CL-LSI in mate search and cross-lingual pseudo-query retrieval, and that it also performs better on the TREC-9 bilingual parallel corpus translated with Google Translate.
     (5) A cross-lingual text categorization/clustering method based on bilingual semantic corresponding analysis
     Bilingual text corresponding analysis can help bridge the language barrier between cross-lingual corpora. Corpus-based cross-lingual latent semantic indexing does not fully take the bilingual semantic relationship into account. We propose a new method that builds the semantic relationship between bilingual parallel documents via partial least squares. In this method, parallel documents are viewed as two language representations of the same semantic content, so that a unified latent semantic space can be constructed for the two languages. Cross-lingual text categorization is then performed in the new bilingual latent semantic spaces.
     The Chinese-English document-aligned evaluation dataset was collected from the Hong Kong government news website. Experimental results on mono- and cross-lingual classification show that the performance of the proposed method is better than or close to that of monolingual classification in the original feature spaces.
     The contributions of the thesis can be summarized as follows.
     (1) We propose a CLIR model based on a topic dual space (TDS). The model uses a linear semantic dual space to construct the bilingual topic space, addressing the problem that CL-LSI concatenates each pair of documents into a dual document. The TDS model captures the co-occurrence of terms in parallel documents and builds statistical dependencies that support translation and query expansion.
     (2) We present a bilingual topic correlation model for CLIR (BiTC). It assumes that bilingual parallel documents share semantically correlated topics. BiTC constructs a single topic space for each language from the bilingual parallel corpus to build the bilingual semantic relationship. The new model addresses two problems: the failure of CL-LSI to fully consider the bilingual semantic relationship, and the inability to process large-scale data effectively.
     (3) We propose a cross-lingual text categorization/clustering method based on bilingual semantic corresponding analysis (BiSCAN). To address the problem that CL-LSI does not fully consider multiple correlations and structural information, BiSCAN constructs a single low-dimensional topic space for each language and builds the bilingual semantic correspondence. With BiSCAN, the performance of CLTC and MLDC is better than or close to that of monolingual classification in the original feature spaces.
