A Study on Chinese-Mongolian Hybrid Machine Translation and the Related Technologies (基于混合策略的汉蒙机器翻译及相关技术研究)


English title: A Study on Chinese-Mongolian Hybrid Machine Translation and the Related Technologies
Author: 王斯日古楞
Degree level: Doctoral
Discipline: Chinese Minority Languages and Literature
Keywords: hybrid approach; Chinese-Mongolian machine translation; morphological analysis; Chinese-Mongolian reordering rules; Chinese-Mongolian quantifier translation; automatic Mongolian sentence segmentation
Degree year: 2009
Advisor: 那顺乌日图
Discipline code: 050107
Degree-granting institution: Inner Mongolia University (内蒙古大学)
Thesis submission date: 2009-12-05

Abstract

Each machine translation (MT) method has its own strengths and limitations. Hybrid MT aims to exploit the strengths of the individual methods while avoiding their weaknesses, so as to optimize the translation output and improve the overall performance of the MT system.
     Drawing on the theory and methods of earlier MT research, and taking into account the current state of Mongolian information processing and the resources available for it, this thesis studies and implements a Chinese-Mongolian MT system based on a hybrid strategy. We built a phrase-based Chinese-Mongolian statistical machine translation (SMT) system using existing open-source tools, and created the language resources needed by an automatic evaluation platform for Chinese-Mongolian MT. To improve the performance of the phrase-based SMT system, the thesis carried out research and experiments in the following areas:
     (1) By adding a Chinese-Mongolian bilingual dictionary and performing morphological analysis of the case, plural, and possessive suffixes of Mongolian nouns, we resolved the large number of out-of-vocabulary words that appeared in the translations.
     (2) We propose a method for reordering Chinese sentences according to Mongolian word order, which addresses the many word-order errors produced by phrase-based SMT. The Chinese sentence is first parsed syntactically; it is then reordered by reordering rules so that its word order is as close as possible to that of Mongolian; finally, the reordered sentence is passed to the statistical decoder for monotone decoding.
     (3) To address quantifier translation errors in Chinese-Mongolian MT, we studied quantifier translation between Chinese and Mongolian and, on that basis, propose translating with a quantifier table. We identify one-to-one, many-to-one, one-to-zero, and one-to-many correspondences between Chinese and Mongolian quantifiers and give a translation method for each.
     Evaluation plays an important role in advancing MT research. For the CWMT2009 machine translation evaluation, we supplied the training corpus, development set, and test set for the Chinese-Mongolian daily-expressions task. To prepare these corpora, we developed a rule-based program for automatic Mongolian sentence segmentation and a program for converting Latin-transliterated Mongolian to UTF-8 encoding; the thesis describes the methods and process used to develop them. Finally, we present the experiments on the hybrid Chinese-Mongolian MT system and an analysis of the results.
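
The reordering step in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Node` class, the `HEAD_FINAL_RULES` table, and the two rules in it are hypothetical, chosen only to show how rules applied to a parse tree can move a Chinese (subject-verb-object) sentence toward Mongolian (subject-object-verb) order before monotone decoding. The parse tree is written by hand here; a real system would obtain it from a Chinese syntactic parser.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A parse-tree node; children are sub-nodes or word strings."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


# Hypothetical rule table (an assumption, not taken from the thesis):
# for each parent label, the child labels that are moved to the end.
HEAD_FINAL_RULES = {
    "VP": {"VV"},  # verb follows its object (SVO -> SOV)
    "PP": {"P"},   # preposition becomes a postposition
}


def reorder(node):
    """Recursively apply the head-final rules and return a new tree."""
    if isinstance(node, str):
        return node
    kids = [reorder(child) for child in node.children]
    move = HEAD_FINAL_RULES.get(node.label, set())
    stay = [k for k in kids if not (isinstance(k, Node) and k.label in move)]
    moved = [k for k in kids if isinstance(k, Node) and k.label in move]
    return Node(node.label, stay + moved)


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node.children:
        words.extend(leaves(child))
    return words


# Hand-written parse of "我 读 书" ("I read books").
tree = Node("IP", [
    Node("NP", [Node("PN", ["我"])]),
    Node("VP", [Node("VV", ["读"]), Node("NP", [Node("NN", ["书"])])]),
])
reordered = leaves(reorder(tree))
print(" ".join(reordered))  # 我 书 读
```

Because the reordered source already follows the target word order, the decoder can translate left to right without a distortion model, which is what monotone decoding refers to.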

