Automatic Acquisition of Web Bilingual Parallel Corpora and Its Application in Statistical Machine Translation
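Abstract
Bilingual parallel corpora have many important applications in natural language processing: they provide indispensable training data for statistical machine translation models and are an important basic resource for applications such as lexicography and cross-language information retrieval. However, large-scale bilingual parallel corpora are not easy to obtain, and existing parallel corpora cannot yet meet the practical needs of processing real text in terms of scale, timeliness, and domain balance. With the spread and rapid growth of the Internet, more and more bilingual websites are being created and more and more information is published in multiple languages, which offers a rich source for building bilingual and multilingual corpora. Some researchers have proposed Web-based methods for automatically mining bilingual or multilingual parallel corpora, providing effective approaches to building such corpora automatically. This paper aims to build a system for automatically acquiring a large-scale bilingual parallel corpus from the Web. The main results are as follows: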
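     1. Automatic discovery and acquisition of mixed-language web pages
     Bilingual parallel resources on the Internet fall into two main categories. In the first, the bilingual content is spread across two pages that are written in different languages and are translations of each other; we call these bilingual parallel pages. In the second, the bilingual content sits within a single page; we call these mixed-language pages. Earlier systems were mainly based on bilingual parallel pages, but we observed that the Web contains a large number of mixed-language pages, and that their bilingual content is more neatly aligned and of relatively high translation quality, which makes them a very valuable source of bilingual resources.
     Bilingual parallel pages resemble each other in address or structure, and methods for handling them are already mature, but those methods do not apply to mixed-language pages. Candidate mixed-language pages are usually distributed unpredictably and lack the common heuristic cues, so they are harder to obtain. This paper proposes a method for automatically discovering mixed-language pages based on a tentative-download strategy; candidate mixed-language sites obtained with this method have a high accuracy.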
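     2. Extracting parallel sentence pairs from mixed-language pages
     Extracting parallel sentence pairs from mixed-language pages involves three main tasks: web-page noise filtering, confirmation of mixed-language pages, and sentence alignment. This paper studies and implements two noise-removal methods: a dedicated template-based method and a general method based on the HTML tag tree. Mixed-language pages are confirmed in two steps: a coarse check based on the bilingual character counts, followed by a fine check based on a dictionary. Finally, a sentence alignment method based on hybrid information converts document-level bilingual parallel text into bilingual parallel sentence pairs. This paper addresses these three difficult problems and implements a system for automatically mining parallel corpora from mixed-language pages.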
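     3. Practical application of Web bilingual parallel corpora
     This paper applies the bilingual parallel sentence pairs obtained from the Web to the training of statistical machine translation models, and proposes two different strategies for adding the Web parallel corpus to the training set: sentence-pair quality ranking and domain information retrieval. Experiments show that both strategies improve translation system performance; on the IWSLT evaluation tasks the BLEU score improves by 2 to 5 percentage points.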
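     The coarse character-count check mentioned under item 2 could look roughly like the minimal sketch below. The function names, the Chinese/English language pair, and the thresholds (`min_chars`, `low`, `high`) are assumptions made for illustration; the source does not state the actual criteria.

```python
def char_counts(text):
    """Count CJK ideographs and ASCII letters in a page's visible text."""
    zh = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    en = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return zh, en


def looks_mixed(text, min_chars=50, low=0.1, high=0.9):
    """Coarse filter: keep pages where neither language dominates.

    All three thresholds are hypothetical and would need tuning on real data.
    """
    zh, en = char_counts(text)
    total = zh + en
    if total < min_chars:
        # Too little text to judge reliably.
        return False
    zh_share = zh / total
    return low <= zh_share <= high
```

     Pages that pass such a cheap filter would then go on to the more expensive dictionary-based check.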
Bilingual parallel corpora have many important applications in natural language processing: they provide essential training data for statistical machine translation and are used in lexicography and cross-language information retrieval. However, large-scale bilingual parallel corpora are not easy to obtain, and existing parallel corpora cannot meet practical needs in terms of scale, timeliness, and domain balance. With the spread and rapid development of the Internet, more and more bilingual sites have been created and more and more information has been published in multiple languages, which can serve as a source for bilingual and multilingual corpora. Some researchers have proposed effective Web-based methods for automatically mining bilingual or multilingual parallel corpora. This paper aims to build a Web-based system for the automatic acquisition of a large-scale bilingual parallel corpus. The main contributions are as follows:
     1. Automatic discovery and acquisition of mixed-language Web pages.
     Bilingual parallel resources on the Internet can be divided into two categories. In the first, the bilingual content is distributed across two pages written in different languages with the same meaning; these are called bilingual parallel pages. In the second, the bilingual content is located within a single page; these are called mixed-language pages. Previous systems were mainly based on the first category, but we observed that there are a large number of mixed-language pages on the Web, that their parallel texts are more neatly aligned, and that their translation quality is higher, which makes them a very valuable source of bilingual corpora.
     Bilingual parallel pages show address or structural similarity, and methods for handling them are already mature, but these methods cannot be applied to mixed-language pages. The distribution of candidate mixed-language pages is usually uncertain, and the lack of common heuristic information makes discovery more difficult. This paper presents a method for automatically discovering mixed-language pages based on a tentative-download strategy; the candidate mixed-language pages obtained with this method reach an accuracy close to 100%.
     2. Extraction of bilingual parallel sentence pairs from mixed-language pages.
     Extracting bilingual parallel sentence pairs from mixed-language pages can be divided into three main tasks: Web-noise filtering, identification of mixed-language pages, and sentence alignment. We implemented two methods for filtering Web noise: a dedicated template-based approach and a general approach based on the HTML tag tree. Mixed-language pages are identified in two steps: the first is based on the character-count ratio and the second on the translation ratio. Finally, we convert the parallel passages into parallel sentences using a hybrid-information-based alignment method.
     This paper addresses these three difficult problems and implements an automatic mining system based on mixed-language pages.
     3. Application of Web bilingual parallel corpora.
     We apply the bilingual parallel sentences obtained from the Web to the training of statistical machine translation models, and propose two methods for loading the Web corpus into the training set: sentence quality sorting and information retrieval. The results show that both strategies improve translation system performance. Experiments conducted on the IWSLT tasks show gains of +2 to +5 BLEU over the baseline.
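     One possible reading of the sentence quality sorting strategy in item 3 is to score each Web sentence pair, rank the pairs by score, and add only the top-ranked ones to the training set. The sketch below assumes a dictionary-coverage score and a hypothetical `lexicon` mapping from English words to Chinese translations; the source does not specify how quality is actually measured, so this is an illustration rather than the method used in the paper.

```python
def coverage_score(en_sentence, zh_sentence, lexicon):
    """Fraction of English words with a lexicon translation found in the Chinese side.

    This scoring function is an assumption made for illustration only.
    """
    words = en_sentence.lower().split()
    if not words:
        return 0.0
    hits = sum(
        1 for w in words
        if any(t in zh_sentence for t in lexicon.get(w, ()))
    )
    return hits / len(words)


def select_top_pairs(pairs, lexicon, k):
    """Rank (english, chinese) pairs by score and keep the k best."""
    ranked = sorted(
        pairs,
        key=lambda p: coverage_score(p[0], p[1], lexicon),
        reverse=True,
    )
    return ranked[:k]
```

     The selected pairs would then be appended to the baseline training data before retraining; how many pairs to keep (`k`) is a tuning choice.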
