Discontinuous Phrase Template Extraction and Phrase Combination in Phrase-Based Statistical Machine Translation

Original title (Chinese): 非连续短语模板抽取及短语合并在统计机器翻译中的应用
Author: Duan Nan (段楠)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Phrase-Based SMT ; Discontinuous Phrase Template ; Phrase Combination ; Phrase Translation Table
Degree year: 2007
Advisors: He Pilian (何丕廉) ; Li Mu (李沐)
Discipline code: 081203
Degree-granting institution: Tianjin University (天津大学)
Thesis submission date: 2007-06-01

Abstract

Machine translation (MT) is the use of a computer to translate text or utterances from one natural language into another while preserving their meaning. Given a source-language text, MT is a decision problem: find the target-language text that best matches the source in meaning. Among the various kinds of MT systems, phrase-based statistical machine translation (Phrase-Based SMT) is arguably the most effective approach.
     The Phrase-Based SMT approach allows general many-to-many relations between source and target words. Phrases extracted from alignment matrices are stored in a phrase translation table. As a result, the translation model takes word context into account, and local changes in word order between the source and target languages are learned explicitly. On the Chinese-English translation task, Phrase-Based SMT performs significantly better than single-word-based SMT.
     However, the approach also has shortcomings. Because the maximum length of a Chinese phrase is limited, some fixed structures whose parts lie far apart cannot be extracted as a whole unit. These structures are discontinuous in the Chinese sentence, yet their translations are continuous in the English sentence. Moreover, translating the parts of such a structure separately and concatenating the results is not equivalent to translating the structure as a whole.
     This thesis adds discontinuous phrase templates and merged phrases to the phrase translation table to improve the quality of Phrase-Based SMT. The templates and merged phrases are learned from an aligned bilingual corpus alone, without any syntactic information. The thesis describes the extraction and combination algorithms and reports comparative experiments on the 2002-2005 NIST (National Institute of Standards and Technology) standard test sets, with BLEU as the evaluation metric. The results show that adding the phrase templates and merged phrases improves translation quality over the baseline Phrase-Based SMT system.
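The length limitation described in the abstract can be illustrated with a small sketch. It is not the thesis's algorithm: it assumes the common alignment-consistency criterion for phrase extraction, a toy sentence pair and alignment invented for illustration, and a simplified template form in which a single source word stands on each side of a gap X. The function names `target_span`, `contiguous_phrases`, and `gapped_templates` are hypothetical.

```python
def target_span(alignment, src_positions):
    """Return (lo, hi), the target block aligned to src_positions, or None
    if nothing is aligned or the block violates alignment consistency."""
    tgt = {j for i, j in alignment if i in src_positions}
    if not tgt:
        return None
    lo, hi = min(tgt), max(tgt)
    # Consistency: no target word inside [lo, hi] may align to a source
    # word outside src_positions.
    for i, j in alignment:
        if lo <= j <= hi and i not in src_positions:
            return None
    return lo, hi


def contiguous_phrases(src, tgt, alignment, max_len):
    """Extract contiguous source phrases of at most max_len words."""
    pairs = []
    for start in range(len(src)):
        for end in range(start, min(start + max_len, len(src))):
            span = target_span(alignment, set(range(start, end + 1)))
            if span is not None:
                lo, hi = span
                pairs.append((" ".join(src[start:end + 1]),
                              " ".join(tgt[lo:hi + 1])))
    return pairs


def gapped_templates(src, tgt, alignment):
    """Extract templates 'left X right' (assumed simplification: one source
    word on each side of the gap, contiguous target side)."""
    templates = []
    for left in range(len(src)):
        for right in range(left + 2, len(src)):
            span = target_span(alignment, {left, right})
            if span is not None:
                lo, hi = span
                templates.append((f"{src[left]} X {src[right]}",
                                  " ".join(tgt[lo:hi + 1])))
    return templates


# Toy example (invented): "在 ... 上" as a whole corresponds to "on".
src = ["在", "桌子", "上"]
tgt = ["on", "the", "table"]
alignment = {(0, 0), (2, 0), (1, 2)}  # (source index, target index)

short = contiguous_phrases(src, tgt, alignment, max_len=2)
full = contiguous_phrases(src, tgt, alignment, max_len=3)
templates = gapped_templates(src, tgt, alignment)
```

With a maximum phrase length of 2, the structure 在 ... 上 is never extracted as a unit in this toy example, because neither 在 nor 上 alone is consistent with the alignment. The gapped template recovers it without raising the length limit.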

