Research on Statistical Machine Translation at Document Level

Author: Gong Zhengxian (贡正仙)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Statistical Machine Translation ; Document-level SMT ; Cache-based Technology ; Tense Model ; Automatic Evaluation for MT
Degree year: 2014
Advisor: Zhou Guodong (周国栋)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Thesis submission date: 2014-03-01

Abstract

Machine translation is an active research topic in natural language understanding. It effectively promotes information sharing and has broad research and application value. Statistical Machine Translation (SMT) is currently the mainstream machine translation technology. However, SMT systems that translate sentences in isolation can use only the information in the current sentence, and they completely ignore the relationships between neighboring sentences and the global information of the text. Yet document-level information such as style, topic, and genre is extremely useful for machine translation: it not only guides the system toward correct disambiguation of word form and word sense, but also keeps the translation consistent with the source in language style and key content.
     The idea of discourse-based machine translation was proposed as early as 1992, yet most machine translation systems still work at the level of isolated sentences. There are objective reasons for this, such as the lack of document information in parallel corpora, but on the whole the slow progress shows how challenging the task is. This dissertation studies document-level SMT. Its main contents are:
     1. System frameworks for document-level SMT. Drawing on the working process of human translators, this dissertation first proposes a document-level SMT framework based on multi-strategy caches. The framework contains three kinds of cache, which capture the background knowledge, the topic, and the lexical cohesion of a document, respectively. The information in these caches is designed as features in the SMT log-linear model, so that a traditional SMT system can use document-level knowledge flexibly and effectively. The second framework is a post-processing approach based on the N-best list produced by the SMT system; it focuses on keeping the topic content of the translation consistent with that of the source. Inspired by extractive summarization, it uses a topic model to select from the N-best lists a set of translation hypotheses that better matches the content of the source text. Experiments show that both frameworks successfully integrate document-level information; the first has the greater advantage, and its performance is significantly better than that of a traditional sentence-level SMT system.
     2. Tense models for document-level SMT. Tense modeling is an effective knowledge extension for document-level SMT. It is built on the cache-based framework and brings more contextual knowledge into the translation system. Exploiting the continuity of tense within a document, this dissertation proposes an N-gram-based tense model that reflects tense variation at two levels, within a sentence and between sentences. On this basis, it further proposes a classifier-based tense model with better generalization ability. Experiments show that each tense model has its own strengths, and that an SMT system combined with a tense model significantly improves translation quality; the best system gains 0.97 BLEU points.
     3. Automatic evaluation metrics for document-level machine translation. This dissertation explores document-level evaluation from two directions. First, because a translation should reflect the key content of the source text, it proposes a topic-sentence-driven evaluation metric and a topic-model-based evaluation metric. Second, because a document-level translation should preserve lexical cohesion, it proposes an evaluation metric based on lexical chains. Experiments show that the proposed metrics improve, to varying degrees, the Spearman correlation with document-level human assessments.
     Together, these three parts form a coherent whole and cover several core problems that document-level SMT urgently needs to address. Related research in China and abroad is still in its infancy, and this work is likewise exploratory. Its contributions are novel and are expected to provide a valuable reference for future research on document-level SMT.
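
The cache-based framework summarized in the abstract turns cached document-level information into features of the SMT log-linear model. The Python below is a minimal sketch of that general idea, not the dissertation's implementation: it keeps phrase pairs from earlier sentences in a bounded cache and adds a single "cache hit" feature to each hypothesis before scoring. The class `PhraseCache`, the function names, the feature names (`tm`, `lm`, `cache_hits`), the weights, and the toy German-to-English data are all assumptions made for illustration.

```python
from collections import deque


class PhraseCache:
    """Hypothetical bounded cache of (source, target) phrase pairs
    taken from the best translations of earlier sentences."""

    def __init__(self, max_size=100):
        # deque(maxlen=...) silently drops the oldest pair when full.
        self.pairs = deque(maxlen=max_size)

    def add(self, source_phrase, target_phrase):
        self.pairs.append((source_phrase, target_phrase))

    def hit_count(self, phrase_pairs):
        """Count how many phrase pairs of a hypothesis are in the cache."""
        cached = set(self.pairs)
        return sum(1 for pair in phrase_pairs if pair in cached)


def log_linear_score(features, weights):
    """Weighted sum of feature values: sum_i lambda_i * h_i."""
    return sum(weights[name] * value for name, value in features.items())


def pick_best(hypotheses, cache, weights):
    """Return (hypothesis, score) with the highest log-linear score
    after adding the cache feature to each hypothesis."""
    scored = []
    for hyp in hypotheses:
        features = dict(hyp["features"])  # copy, leave the input untouched
        features["cache_hits"] = cache.hit_count(hyp["phrase_pairs"])
        scored.append((hyp, log_linear_score(features, weights)))
    return max(scored, key=lambda item: item[1])


# Toy data (assumed): the sentence-level features slightly prefer "bench",
# but an earlier sentence already translated the same word as "bank".
cache = PhraseCache()
cache.add("Bank", "bank")
weights = {"tm": 1.0, "lm": 1.0, "cache_hits": 0.5}
hypotheses = [
    {"text": "bank", "features": {"tm": -2.0, "lm": -3.0},
     "phrase_pairs": [("Bank", "bank")]},
    {"text": "bench", "features": {"tm": -1.9, "lm": -3.0},
     "phrase_pairs": [("Bank", "bench")]},
]
best, score = pick_best(hypotheses, cache, weights)
print(best["text"], score)
```

With an empty cache the sentence-level features alone would select "bench"; with the cached pair the extra feature tips the choice to "bank", which illustrates how a cache can favor translations consistent with the rest of the document. In a real system the feature weights would be tuned together with the other model weights rather than set by hand.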

