A cross-lingual document similarity calculation method based on bilingual LDA

Original title (Chinese): 基于双语LDA的跨语言文本相似度计算方法研究
Authors: CHENG Wei (程蔚); XIAN Yan-tuan (线岩团); ZHOU Lan-jiang (周兰江); YU Zheng-tao (余正涛); WANG Hong-bin (王红斌)
Affiliations: School of Information Engineering and Automation, Kunming University of Science and Technology; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology
Keywords: bilingual LDA; cross-lingual document similarity calculation; cosine similarity; topic frequency-inverse document frequency
Journal code: JSJK
Journal: Computer Engineering & Science
Publication date: 2017-05-15
Publisher: Computer Engineering & Science (计算机工程与科学)
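Year: 2017
Volume/No.: v.39; No.269
Funding: National Natural Science Foundation of China (61363044, 61462054); General Program of the Yunnan Provincial Department of Science and Technology (2015FB135); Scientific Research Fund of the Yunnan Provincial Department of Education (2014Z021); Kunming University of Science and Technology provincial talent training project (KKSY201403028)
Language: Chinese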
Record ID: JSJK201705026
Page count: 6
Issue: 05
CN: 43-1258/TP
Pages: 161-166

Abstract

Building on the bilingual topic model, we analyze the similarity of bilingual documents and propose a cross-lingual document similarity calculation method based on bilingual LDA. We first train a bilingual LDA model on a bilingual parallel corpus and then use the trained model to predict the topic distributions of a new corpus, which maps the bilingual documents of the new corpus into the same topic vector space. We then compute the similarity of these bilingual documents from their topic distributions using cosine similarity, with feature topic weights given by a topic frequency-inverse document frequency method that we improve by accounting for the dispersion of topic distributions between and within categories. Experiments show that the improved weighting substantially raises the recall of the bilingual-LDA-based similarity algorithm, and that the algorithm is not restricted to particular categories and is reasonably reliable.
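
The similarity step described in the abstract can be illustrated with a minimal Python sketch. It assumes that each document has already been mapped to a topic distribution over the same K topics, for example by a trained bilingual LDA model, and it only shows a cosine similarity with optional per-topic weights. The function name `weighted_cosine`, the example vectors, and the weight values are hypothetical; the record does not give the paper's improved topic frequency-inverse document frequency formula, so the weights here are placeholders rather than that method.

```python
import math
from typing import Optional, Sequence


def weighted_cosine(p: Sequence[float], q: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Cosine similarity between two topic distributions.

    p and q are document-topic vectors in the same K-dimensional topic
    space. weights holds one non-negative weight per topic; None means
    every topic is weighted equally (plain cosine similarity).
    """
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(p)
    if len(weights) != len(p):
        raise ValueError("need exactly one weight per topic")
    # Scale each topic component by its weight before comparing.
    wp = [w * x for w, x in zip(weights, p)]
    wq = [w * x for w, x in zip(weights, q)]
    dot = sum(a * b for a, b in zip(wp, wq))
    norm = math.sqrt(sum(a * a for a in wp)) * math.sqrt(sum(b * b for b in wq))
    # Return 0.0 when either weighted vector is all zeros.
    return dot / norm if norm > 0 else 0.0


# Hypothetical topic distributions over K = 4 shared topics.
zh_doc = [0.70, 0.10, 0.10, 0.10]     # e.g. a Chinese document
en_doc = [0.60, 0.20, 0.10, 0.10]     # e.g. its English counterpart
other_doc = [0.05, 0.05, 0.10, 0.80]  # an unrelated document

# Hypothetical topic weights; placeholders, not values from the paper.
topic_weights = [1.5, 1.0, 0.5, 1.0]

print(weighted_cosine(zh_doc, en_doc))
print(weighted_cosine(zh_doc, en_doc, topic_weights))
print(weighted_cosine(zh_doc, other_doc, topic_weights))
```

In this sketch, documents that concentrate on the same topics score close to 1 regardless of language, because the comparison happens in the shared topic space rather than over words.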
