Abstract
Bilingual parallel corpora are an important resource for multilingual natural language processing and are widely used in machine translation, machine-aided human translation, translation knowledge extraction, and cross-language information retrieval. This paper addresses the automatic alignment of Chinese-Indonesian parallel corpora and the automatic extraction of comparable corpora. First, a paragraph alignment method that combines anchor points with a dictionary is proposed. On this basis, a length model based on confidence intervals is used to align sentences. In addition, to speed up the construction of the Chinese-Indonesian parallel corpus, a comparable corpus extraction method based on cross-language document similarity is proposed. Experimental results show that the proposed alignment and extraction methods achieve significantly higher accuracy than traditional methods, indicating that they are effective and feasible.
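The abstract does not spell out the confidence-interval length model, so the following is only a minimal sketch of one common reading, in the spirit of length-based aligners: assume the target sentence length is roughly normally distributed around `ratio * source_length` with variance proportional to the source length, and accept a candidate sentence pair only if the target length falls inside the resulting interval. The function names and the values of `ratio`, `variance`, and `z` are hypothetical placeholders, not parameters reported in the paper. In practice they would be estimated from aligned Chinese-Indonesian data.

```python
import math


def length_interval(src_len, ratio=3.0, variance=4.0, z=1.96):
    """Return the (low, high) range of plausible target lengths.

    Assumption (not from the paper): target length ~ Normal(ratio * src_len,
    variance * src_len). ratio, variance and z are illustrative placeholders.
    """
    mean = ratio * src_len
    half_width = z * math.sqrt(variance * src_len)
    # A length cannot be negative, so clamp the lower bound at zero.
    return max(0.0, mean - half_width), mean + half_width


def is_plausible_pair(src, tgt, ratio=3.0, variance=4.0, z=1.96):
    """Check whether the character length of tgt lies in the interval for src."""
    low, high = length_interval(len(src), ratio, variance, z)
    return low <= len(tgt) <= high
```

A candidate pair whose target length lies outside the interval would be rejected or passed to another alignment pattern (for example 1-2 or 2-1). How the paper handles such cases is not stated in the abstract.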