Research on Automatic Alignment Methods for Chinese-Indonesian Parallel Corpora
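  • English title: Study on the Automatic Alignment of Mandarin-Indonesian Bilingual Texts
  • Authors: 郑铿涛; 林楠铠; 付颖; 王连喜; 蒋盛益
  • Authors (English): ZHENG Kengtao; LIN Nankai; FU Yingwen; WANG Lianxi; JIANG Shengyi
  • Affiliations (as given in English): School of Information Science and Technology, Guangdong University of Foreign Studies; Eastern Language Processing Center, Guangdong University of Foreign Studies
  • Keywords: parallel corpus; corpus construction; comparable corpus; paragraph alignment; sentence alignment
  • Journal code: GXSF
  • Journal: Journal of Guangxi Normal University (Natural Science Edition)
  • Institutions: School of Information Science and Technology, Guangdong University of Foreign Studies; Guangzhou Key Laboratory of Intelligent Processing for Less Commonly Used Languages (广州市非通用语种智能处理重点实验室), Guangdong University of Foreign Studies
  • Publication date: 2019-01-10
  • Publisher: Journal of Guangxi Normal University (Natural Science Edition)
  • Year: 2019
  • Volume: v.37
  • Funding: National Natural Science Foundation of China (61572145); National Social Science Foundation of China, Youth Project (17CTQ045); Major Basic Research and Applied Research Projects of the Guangdong Provincial Department of Education (2017KZDXM031); 2018 Guangdong University Student Science and Technology Innovation Cultivation Special Fund (pdjhb0177)
  • Language: Chinese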
  • Record ID: GXSF201901010
  • Page count: 9
  • Issue: 01
  • CN: 45-1067/N
  • Pages: 93-101
Abstract
Bilingual parallel corpora are an important resource for multilingual natural language processing and are widely used in machine translation, machine-assisted human translation, translation knowledge extraction, and cross-language information retrieval. This paper addresses the automatic alignment of Chinese-Indonesian parallel corpora and the automatic extraction of comparable corpora. It proposes a paragraph alignment method that combines anchor points with a dictionary and, on that basis, uses a length model based on confidence intervals to align sentences. To speed up construction of the Chinese-Indonesian parallel corpus, it also proposes a comparable corpus extraction method based on cross-language document similarity. Experimental results show that the proposed alignment and extraction methods are significantly more accurate than traditional methods, indicating that they are effective and feasible.
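The abstract names a length model based on a confidence interval for sentence alignment but gives no details. As a rough illustration only, and not the authors' implementation, the Python sketch below assumes one simple reading: estimate the mean and standard deviation of the target/source character-length ratio from a few trusted sentence pairs, then accept a candidate pair only if its ratio falls inside mean ± z·std. The function names, the use of character length, and the default z = 1.96 are all assumptions made for this example.

```python
from statistics import mean, stdev


def length_ratio_interval(seed_pairs, z=1.96):
    """Estimate an interval for the target/source character-length ratio.

    seed_pairs: list of (source, target) strings assumed to be correct
    translations of each other (hypothetical seed data). At least two
    pairs are needed because the sample standard deviation is used.
    """
    ratios = [len(tgt) / len(src) for src, tgt in seed_pairs if src]
    mu = mean(ratios)
    sigma = stdev(ratios)
    return mu - z * sigma, mu + z * sigma


def is_plausible_pair(src, tgt, interval):
    """Return True if the length ratio of (src, tgt) lies in the interval."""
    if not src:
        return False
    low, high = interval
    return low <= len(tgt) / len(src) <= high


# Toy seed pairs with length ratios 2.0, 2.0, 2.5 and 1.5 (placeholder strings).
seed = [("ab", "abcd"), ("abc", "abcdef"), ("abcd", "abcdefghij"), ("ab", "abc")]
interval = length_ratio_interval(seed)
```

In an aligner built along these lines, a check like `is_plausible_pair` would act as a filter on candidate sentence pairs inside already-aligned paragraphs; the anchor and dictionary evidence mentioned in the abstract would have to be handled separately.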
