Machine translation is a major research topic in Natural Language Understanding. It effectively promotes information sharing and therefore has broad research and application value. Statistical Machine Translation (SMT) is currently the mainstream machine translation technology. However, most SMT systems translate documents sentence by sentence under strict independence assumptions. They therefore use only the information in the current sentence and completely ignore the relationships between neighboring sentences and the global information of the text. Yet document-level information, such as style, topic, and genre, is extremely useful for machine translation: it guides the system toward correct disambiguation of word forms and word senses, and it keeps the translation consistent with the source text in language style and key content.
     The idea of performing machine translation at the discourse level was put forward as early as 1992; however, most machine translation systems still work at the level of isolated sentences. The reasons are manifold, for example the lack of document-level information in parallel corpora. The slow progress of research also shows that this is a tremendously challenging task. The main content of this dissertation includes:
     1. The research on designing reliable frameworks for document-level SMT.
     To simulate the human translation process more closely, we first present a cache-based document-level SMT system. The caches fall into three categories, which capture the background, topic, and lexical cohesion of the text, respectively. Three kinds of features for the SMT log-linear model are designed to exploit the information in these caches. This framework guides traditional SMT systems to use document-level knowledge effectively. The second framework operates on the N-best lists produced by an SMT system, so we refer to it as a post-processing procedure. Its key idea is to enforce consistency between the topic models of the source-side and target-side texts. Inspired by extractive summarization, the system builds the final set of hypotheses by dynamically selecting translation hypotheses from the N-best lists under the topic-model consistency assumption. Both frameworks successfully integrate document-level knowledge into SMT systems, and the experimental results show that the former achieves more significant improvements.
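As an illustration only, the way a cache can enter a log-linear model may be sketched as follows. The feature definition (fraction of hypothesis words found in the cache), the function names, and the weights are all assumptions made for this sketch; they are not the feature set or the decoder integration used in the dissertation.

```python
def cache_feature(target_words, cache):
    """Hypothetical cache feature: fraction of hypothesis words found in the cache."""
    if not target_words:
        return 0.0
    hits = sum(1 for word in target_words if word in cache)
    return hits / len(target_words)


def loglinear_score(features, weights):
    """Log-linear model score: weighted sum of feature values."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, cache, weights):
    """Rescore (words, base_features) pairs with the cache feature added.

    Returns a (score, words) tuple for the highest-scoring hypothesis.
    """
    best = None
    for words, base_features in hypotheses:
        features = dict(base_features)
        features["cache"] = cache_feature(words, cache)
        score = loglinear_score(features, weights)
        if best is None or score > best[0]:
            best = (score, words)
    return best


# Toy usage: the cache holds words from previously translated sentences.
cache = {"bank", "loan"}
hypotheses = [
    (["river", "shore"], {"tm": -1.0}),
    (["bank", "loan"], {"tm": -1.2}),
]
weights = {"tm": 1.0, "cache": 0.5}
score, words = best_hypothesis(hypotheses, cache, weights)
```

In this toy case the second hypothesis has a slightly worse base score but wins once the cache feature is added, which is the intended effect of document-level context.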
     2. The research on tense model for document-level SMT.
     Tense modeling is an effective extension of the knowledge available to document-level SMT. The tense model runs on top of our cache-based SMT system and can integrate rich contextual knowledge. Based on the temporal continuity within a document, this dissertation puts forward an N-gram-based tense model, which reflects tense variation both between sentences and within sentences. It further proposes a classifier-based tense model with better generalization ability. Experiments show that combining SMT with the tense model effectively improves translation quality, and the best SMT system is improved by 0.97 percent in BLEU score.
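A minimal sketch of an N-gram model over tense labels, assuming N = 2, a three-way tense inventory, and add-one smoothing. These choices, and every name in the code, are assumptions for illustration; the abstract does not specify the order of the model, the tense set, or the smoothing method.

```python
from collections import Counter

# Hypothetical tense inventory used only for this sketch.
TENSES = ("past", "present", "future")


def train_bigram(sequences):
    """Count tense bigrams over sequences of per-sentence tense labels."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def tense_prob(prev, cur, context_counts, bigram_counts):
    """P(cur | prev) with add-one smoothing over the tense inventory."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(TENSES))


# Toy usage: each inner list is the tense sequence of one document.
documents = [["past", "past", "present"], ["past", "past", "past"]]
context_counts, bigram_counts = train_bigram(documents)
p_stay = tense_prob("past", "past", context_counts, bigram_counts)
p_shift = tense_prob("past", "present", context_counts, bigram_counts)
```

Under temporal continuity, staying in the same tense should receive a higher probability than switching, and such a probability could serve as one more feature in the log-linear model.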
     3. The research on automatic evaluation metrics for document-level SMT.
     A translation should reflect the main content of the original text, so we first propose a topic-sentence-driven evaluation metric and a topic-model-based evaluation metric. Second, a document-level translation should preserve lexical cohesion, so we propose an evaluation metric based on lexical chains. Experimental results show that the proposed metrics improve the Spearman correlation with human assessments.
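For reference, the Spearman correlation between metric scores and human assessments can be computed as the Pearson correlation of their ranks. The sketch below uses average ranks for ties; the function names and the sample scores are made up for illustration and are not data from the dissertation.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy usage with invented scores for four documents.
metric_scores = [0.31, 0.42, 0.28, 0.55]
human_scores = [3, 4, 2, 5]
rho = spearman(metric_scores, human_scores)
```

A metric whose ranking of documents matches the human ranking exactly obtains a correlation of 1, and a fully reversed ranking obtains -1.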
     This dissertation covers the core issues of document-level SMT comprehensively. Related research, both domestic and international, is still in its infancy. The work is innovative within SMT and offers considerable reference value for future research on document-level SMT.
