Abstract
[Objective] To retrieve translated counterpart content in English-Chinese cross-lingual plagiarism documents. [Methods] Similarity is analyzed on the basis of bilingual lexicons. Several lexicons are merged and organized to improve the accuracy and efficiency of word-level matching. Ambiguity and multiple matches are resolved using features such as the overall word frequency distribution and match positions. Sentence and paragraph similarities are then computed as a weighted combination of features such as word correspondence and word position. [Results] Experiments on real translation corpora show a retrieval precision of 0.841 and a recall of 0.748. [Limitations] Translation relations for out-of-vocabulary (OOV) words are hard to determine from lexicons. [Conclusions] Lexicon-based retrieval of cross-lingual similar content is simple to implement and widely applicable.
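The lexicon-based sentence similarity outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon entries, the function name `sentence_similarity`, the `position_weight` parameter, and the Dice-style normalization are all assumptions made for illustration. Each English word is matched to at most one Chinese token listed as its translation, and when several candidates exist the one whose relative position is closest is chosen.

```python
from typing import Dict, List, Set

# Toy bilingual lexicon (hypothetical entries, for illustration only).
LEXICON: Dict[str, Set[str]] = {
    "bank": {"银行", "河岸"},
    "river": {"河", "河流"},
    "near": {"附近", "靠近"},
    "walk": {"走", "散步"},
}


def sentence_similarity(
    en_tokens: List[str],
    zh_tokens: List[str],
    lexicon: Dict[str, Set[str]],
    position_weight: float = 0.3,  # assumed weight, not taken from the paper
) -> float:
    """Score an English/Chinese sentence pair in [0, 1] using a lexicon.

    Assumes both sentences are already tokenized (Chinese pre-segmented).
    """
    if not en_tokens or not zh_tokens:
        return 0.0

    used: Set[int] = set()  # Chinese positions already matched
    total = 0.0
    for i, word in enumerate(en_tokens):
        translations = lexicon.get(word.lower(), set())
        rel_en = i / max(len(en_tokens) - 1, 1)

        # Among unused Chinese tokens that translate this word,
        # prefer the one at the closest relative position.
        best_j, best_closeness = None, -1.0
        for j, tok in enumerate(zh_tokens):
            if j in used or tok not in translations:
                continue
            rel_zh = j / max(len(zh_tokens) - 1, 1)
            closeness = 1.0 - abs(rel_en - rel_zh)
            if closeness > best_closeness:
                best_j, best_closeness = j, closeness

        if best_j is not None:
            used.add(best_j)
            # Weighted combination of "a match exists" and position closeness.
            total += (1.0 - position_weight) + position_weight * best_closeness

    # Dice-style normalization by the two sentence lengths (an assumption).
    return 2.0 * total / (len(en_tokens) + len(zh_tokens))


if __name__ == "__main__":
    en = ["walk", "near", "river", "bank"]
    zh = ["河岸", "附近", "散步"]
    print(round(sentence_similarity(en, zh, LEXICON), 3))
```

A paragraph-level score could be built the same way by aggregating sentence scores, and a word with no lexicon entry simply contributes nothing, which mirrors the out-of-vocabulary limitation noted in the abstract.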