Implementation and Analysis of Tree to String Alignment Template Model in Statistical Machine Translation

Original title: 统计机器翻译中树到串对齐模板模型系统实现和比较研究
Author: 张春越 (Zhang Chunyue)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: tree to string alignment template model; statistical machine translation; decoder; named entity translation; transliteration
Degree year: 2010
Advisor: 赵铁军 (Zhao Tiejun)
Discipline code: 081202
Degree-granting institution: Harbin Institute of Technology (哈尔滨工业大学)
Thesis submission date: 2010-06-01

Abstract

Statistical machine translation uses statistical methods to translate text automatically from one natural language into another. Recently, researchers in statistical machine translation have turned their attention to translation models that incorporate linguistic information. Among these models, the tree-to-string alignment template model is a good representative.
     First, this thesis gives a fairly comprehensive account of the syntax-directed tree-to-string alignment template model and implements a decoder based on it. The formal definition of the model, its parameter estimation, and its decoding method are discussed in detail. To speed up decoding, a cube pruning strategy is used to prune hypotheses, which significantly reduces decoding time.
     Second, the tree-to-string alignment template model is analyzed empirically and compared in detail with the phrase-based model in three respects. First, it has greater generative capacity and can express the non-contiguous collocations that are common in language. Second, it handles long-distance reordering better than the phrase-based model. Third, despite these advantages, it cannot express contiguous phrases that are not syntactic constituents. The decoder is then evaluated on the NIST 2005 and NIST 2008 MT test sets, with Moses as the baseline system.
     Finally, the thesis explores statistical transliteration of foreign person names from Chinese into English. It first classifies the common statistical transliteration methods and describes two models in detail: one based on sequence labeling and one based on the noisy channel model. Experiments then lead to the following conclusions for the noisy-channel transliteration model: the Chinese side should use the Chinese character as its basic unit, and syllabifying the English names gives better translation performance with low-order language models. Lastly, reranking greatly improves the model's performance.
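
The cube pruning strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's decoder: the function name `cube_prune`, the `combo_cost` parameter, and the use of plain numeric costs are all assumptions made for illustration. The sketch combines two lists of sub-hypothesis costs, each sorted in ascending order, and lazily enumerates the k cheapest combinations with a priority queue instead of scoring every pair.

```python
import heapq


def cube_prune(left, right, k, combo_cost=lambda a, b: 0.0):
    """Return up to k lowest-cost (cost, i, j) combinations.

    left, right: lists of costs sorted ascending (lower is better).
    combo_cost: hypothetical stand-in for a non-local score such as a
    language-model cost; when it is nonzero the search is approximate.
    """
    if not left or not right or k <= 0:
        return []

    def score(i, j):
        return left[i] + right[j] + combo_cost(left[i], right[j])

    # Start from the top-left corner of the grid of combinations.
    heap = [(score(0, 0), 0, 0)]
    seen = {(0, 0)}
    best = []
    while heap and len(best) < k:
        cost, i, j = heapq.heappop(heap)
        best.append((cost, i, j))
        # Push the two grid neighbours of the cell just popped.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (score(ni, nj), ni, nj))
    return best


if __name__ == "__main__":
    for cost, i, j in cube_prune([1.0, 2.0, 4.0], [0.5, 1.5, 3.0], k=4):
        print(cost, i, j)
```

With `combo_cost` left at zero the total cost is monotone over the grid, so the k results are exact. In a real decoder the language-model score breaks that monotonicity, so the same procedure becomes a heuristic that trades a small search error for a large reduction in the number of combinations scored.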

