A Study of High-Precision Machine Translation Strategies for Chinese Organization Names and Addresses
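Abstract
Put simply, machine translation uses a computer to translate one natural language into another. Because named entities carry much of a text's information, the quality of their translation strongly affects overall translation quality, and named-entity translation has become a focus of research.
     Now that transliteration techniques have largely solved the translation of person names and place names, attention in named-entity translation has shifted to non-transliterated items such as organization names and addresses. Chinese-English bilingual corpora of organization names and addresses are extremely scarce, so the currently dominant statistical machine translation techniques cannot play to their strengths. In response, this thesis builds an organization-name translation system centered on a high-precision segmentation method based on representation patterns, and a Chinese organization-address translation system that combines a machine-translation-oriented address segmentation method with a translation mechanism based on address units. Specifically, the work covers the following:
     1. Drawing on a large number of data instances, we use a context-free grammar to abstract representation patterns that match how organization names are composed, and design a high-precision segmentation method based on these patterns. The method resolves ambiguity in organization names by merging two segmentation results, one from an organization-only segmentation mode and one from an address-only segmentation mode.
     2. We study the composition of Chinese addresses in depth, define what counts as a valid address unit, build an address-recognition knowledge base that matches the structure of Chinese addresses, and implement an organization-address segmentation method oriented toward machine translation. Experiments show that the method is highly effective for the specific task of organization-address translation.
     3. Once a Chinese organization address has been segmented into a sequence of address units, a matching translation mechanism is needed to complete the Chinese-to-English translation. We therefore design a translation method based on address units that handles the different types of address units. Automatic acquisition of CTR resolves the rule conflicts that are widespread in rule-based translation systems.
     4. We design and implement a Chinese organization-name translation system and a Chinese organization-address translation system. Experiments show that, with only a few thousand standard Chinese-English bilingual pairs, the two systems reach translation accuracies of 97.28% and 91.26% respectively under a 5-point scoring standard, a level suitable for practical use.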
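To make the idea of address units concrete, here is a minimal Python sketch of segmenting an address into units and then translating unit by unit. It is not the thesis's implementation: the suffix table `UNIT_SUFFIXES`, the lexicon `STEM_LEXICON`, and all function names are hypothetical, and the greedy rule (close a unit at the first known suffix character) is an assumption that stands in for the address-recognition knowledge base described above.

```python
# Hypothetical suffix table: Chinese unit marker -> English keyword (assumed, not from the thesis).
UNIT_SUFFIXES = {
    "省": "Province",
    "市": "City",
    "区": "District",
    "路": "Road",
    "号": "No.",
}

# Suffixes whose English keyword precedes the stem (e.g. "29号" -> "No. 29").
PREFIX_KEYWORDS = {"号"}

# Hypothetical stem lexicon; a real system would need a far larger resource.
STEM_LEXICON = {
    "北京": "Beijing",
    "海淀": "Haidian",
    "学院": "Xueyuan",
}


def segment_address(address):
    """Split an address into (stem, suffix) units, closing a unit at each known suffix."""
    units = []
    stem = ""
    for ch in address:
        if ch in UNIT_SUFFIXES and stem:
            units.append((stem, ch))
            stem = ""
        else:
            stem += ch
    if stem:
        # Trailing text without a recognized suffix is kept as its own unit.
        units.append((stem, ""))
    return units


def translate_unit(stem, suffix):
    """Translate one unit: look up the stem, then attach the English keyword."""
    english_stem = STEM_LEXICON.get(stem, stem)
    if not suffix:
        return english_stem
    keyword = UNIT_SUFFIXES[suffix]
    if suffix in PREFIX_KEYWORDS:
        return f"{keyword} {english_stem}"
    return f"{english_stem} {keyword}"


def translate_address(address):
    """Chinese addresses run large-to-small; English addresses run small-to-large."""
    units = segment_address(address)
    return ", ".join(translate_unit(stem, suffix) for stem, suffix in reversed(units))


print(translate_address("北京市海淀区学院路29号"))
```

The greedy rule breaks as soon as a stem itself contains a suffix character (for example a road name containing "市"), which is the kind of ambiguity a knowledge-base-driven segmenter would need to resolve.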
Machine translation is the use of a computer to translate one natural language into another. Because named entities are the main carriers of information, their translation quality has a major impact on the translation of the whole text, and named-entity translation has become a research focus.
     With the transliteration of person names and place names largely addressed, the translation of organization names and addresses is the next issue to be resolved. Because existing Chinese-English bilingual corpora of organization names and addresses are extremely scarce, statistical machine translation (SMT), the current mainstream technology, cannot play to its advantages. To address this, we propose a Chinese organization-name translation system built around a high-precision segmentation method based on representation patterns, and a Chinese organization-address translation system that combines an MT-oriented organization-address segmentation method with a unit-based translation mechanism. In detail, this thesis covers the following:
     1. A context-free grammar is used to abstract representation patterns for Chinese organization names, and we design a high-precision segmentation method based on these patterns. Ambiguities in organization names are eliminated by combining the results of organization-only segmentation and address-only segmentation.
     2. Based on a thorough study of how organization addresses are composed, we define a valid address unit, build an address-recognition knowledge base, and propose a segmentation approach oriented toward MT. The experimental results show that this method is highly effective for organization-address translation.
     3. After a Chinese organization address has been divided into a sequence of address units, a corresponding translation mechanism is needed for the translation task. We therefore propose a unit-based translation approach that translates the different types of address units. Automatic acquisition of CTR resolves the rule conflicts that are widespread in rule-based translation systems.
     4. We design and implement a Chinese organization-name translation system and a Chinese organization-address translation system. The experiments show that, with only a few thousand standard bilingual pairs, the two systems reach translation accuracies of 97.28% and 91.26% respectively under a 5-point scoring standard, a level suitable for practical use.
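As an illustration of how a representation pattern can filter ambiguous segmentations of an organization name, the following Python sketch enumerates every dictionary segmentation and keeps only those whose tag sequence fits a pattern. Everything here is an assumption made for illustration: the tag set (P for place, K for keyword, T for organization type), the tiny `LEXICON`, and the pattern `P* K+ T`, which is deliberately simplified to a regular expression rather than a full context-free grammar. It does not reproduce the thesis's grammar or its fusion of two segmentation modes.

```python
import re

# Hypothetical tagged lexicon: P = place, K = keyword, T = organization type.
LEXICON = {
    "哈尔滨": "P",
    "工业": "K",
    "大学": "T",
    "工业大学": "T",  # deliberately ambiguous with "工业" + "大学"
}

# Assumed pattern: optional places, at least one keyword, then one type word.
ORG_PATTERN = re.compile(r"P*K+T")


def segmentations(text):
    """Enumerate every way to split text into words found in LEXICON."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in LEXICON:
            for rest in segmentations(text[end:]):
                results.append([word] + rest)
    return results


def pattern_segment(text):
    """Keep only the segmentations whose tag sequence matches ORG_PATTERN."""
    valid = []
    for words in segmentations(text):
        tags = "".join(LEXICON[word] for word in words)
        if ORG_PATTERN.fullmatch(tags):
            valid.append(words)
    return valid


print(pattern_segment("哈尔滨工业大学"))
```

With this toy lexicon the name has two dictionary segmentations, and only the one tagged P K T survives the pattern check; the P T reading is rejected because it contains no keyword.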