A Phrase Alignment Method for Chinese-English Parallel Sentence Pairs Combining Rule-Based and Statistical Approaches
Abstract
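Bilingual phrase alignment is a current research focus and a hard problem in cross-language information retrieval and computer-aided machine translation. In natural language processing, the term "phrase" has no single agreed definition, and its meaning often varies with the research method. Some researchers take a "phrase" to be a linguistic unit of phrase structure grammar with a definite structure and hierarchy. Others do not require any internal structure and treat any contiguous, meaningful word string as a "phrase", which covers a broader range. This thesis adopts the latter sense: some of its phrases have a fairly simple internal structure (namely base noun phrases, hereafter BaseNP), while most are simply contiguous word strings.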
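     The method first classifies Chinese-English sentence pairs into two classes: simple short sentences and complex long sentences. Simple short sentences are aligned with the combined rule-based and statistical method proposed in this thesis. More complex long sentences are first split into several short sentences by shallow parsing and then aligned with the short-sentence method.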
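     In the phrase identification stage, the Chinese and English sentences are first segmented into phrases using sets of Chinese and English "marker words", which yields "marker word" phrases. Base noun phrases are then identified with a method based on the bilingual corpus. Finally, the "marker word" phrases and the identified base noun phrases are merged to give what this thesis calls "hybrid" noun phrases.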
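Marker-word segmentation can be illustrated with a minimal sketch. It assumes a tokenized sentence and a small, hypothetical English marker set (`EN_MARKERS`). The thesis's actual marker sets and segmentation rules are not given in the abstract, so the rule used here (open a new chunk at a marker word once the current chunk holds a content word) is an assumption.

```python
def marker_chunks(tokens, markers):
    """Split a token list into chunks, opening a new chunk at a marker word.

    A new chunk is opened only if the current chunk already holds at least
    one non-marker word, so consecutive markers stay in the same chunk.
    """
    chunks, current, has_content = [], [], False
    for tok in tokens:
        is_marker = tok.lower() in markers
        if is_marker and has_content:
            chunks.append(current)
            current, has_content = [], False
        current.append(tok)
        if not is_marker:
            has_content = True
    if current:
        chunks.append(current)
    return chunks


# Hypothetical marker set, for illustration only.
EN_MARKERS = {"the", "a", "an", "in", "on", "of", "to", "with", "and"}

sentence = "the cat sat on the mat in the room".split()
chunks = marker_chunks(sentence, EN_MARKERS)
for chunk in chunks:
    print(" ".join(chunk))
```

The same function could be applied to a segmented Chinese sentence with a Chinese marker set. Merging the resulting chunks with BaseNP output is not shown.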
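     In the phrase alignment stage, the first step performs one-to-many phrase alignment. "Anchor" word alignments are first used to obtain "anchor" phrase alignments. For phrases that cannot be aligned from the "anchor" word alignment information, candidate alignments are generated from the word alignment and then scored and ranked by a maximum entropy ranking model, and the top-scoring candidate is taken as the alignment result. The second step derives many-to-many phrase alignments from the one-to-many phrase alignments.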
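One way to picture the anchor step is to project word-level anchor links onto phrase spans. The sketch below assumes phrases are given as `(start, end)` token spans (end exclusive) and anchors as `(source_index, target_index)` pairs. These data structures and the function name `align_phrases` are illustrative assumptions, not the thesis's algorithm.

```python
def align_phrases(src_spans, tgt_spans, links):
    """Map each source phrase index to the target phrase indices it is linked to.

    src_spans / tgt_spans: lists of (start, end) token spans, end exclusive.
    links: iterable of (src_token_index, tgt_token_index) anchor word links.
    """
    def owner(spans, pos):
        # Return the index of the span that contains token position `pos`.
        for i, (start, end) in enumerate(spans):
            if start <= pos < end:
                return i
        return None

    result = {i: set() for i in range(len(src_spans))}
    for s, t in links:
        si, ti = owner(src_spans, s), owner(tgt_spans, t)
        if si is not None and ti is not None:
            result[si].add(ti)
    return {i: sorted(ts) for i, ts in result.items()}


src_spans = [(0, 2), (2, 3), (3, 5), (5, 6)]
tgt_spans = [(0, 1), (1, 3), (3, 4), (4, 6)]
links = {(0, 1), (1, 2), (2, 0), (3, 3), (4, 5)}

alignment = align_phrases(src_spans, tgt_spans, links)
unaligned = [i for i, targets in alignment.items() if not targets]
print(alignment)
print(unaligned)
```

Source phrases that end up in `unaligned` have no anchor link. In the method described above, these are the phrases that would be passed on to candidate generation and ranking.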
Phrase alignment is an important research field as well as a tough problem for machine translation and cross-language information retrieval. Phrases can be structural (i.e., base noun phrases) or non-structural. This thesis proposes a model for Chinese-English phrase alignment. The model consists of three interconnected parts. The first is a word alignment module, which aligns Chinese and English words and provides the basis for the later alignment work. The second is a phrase identification module based on the Marker Hypothesis and shallow parsing (mainly base noun phrase identification). The last is a phrase alignment module. For word alignment, this thesis uses a knowledge-based approach and an approach based on the EM algorithm. The knowledge-based method uses a bilingual dictionary as its main knowledge base. This simple approach is efficient and has high precision, but its recall is low. To address the low recall, we use a synonym dictionary to expand Chinese words. The EM-based approach uses the EM algorithm to estimate the probability of a translation pair of words.
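The abstract does not say which statistical model the EM step is applied to. As a hedged illustration, the sketch below runs EM in the style of IBM Model 1 (without a NULL word) to estimate word translation probabilities t(e|f) from sentence pairs. The function name `ibm1_em` and the toy corpus are assumptions made for this example.

```python
from collections import defaultdict


def ibm1_em(pairs, iterations=10):
    """Estimate t(e|f) by EM on (source_tokens, target_tokens) pairs.

    Simplified IBM Model 1: no NULL source word, uniform initialisation.
    Returns a dict-like table keyed by (target_word, source_word).
    """
    tgt_vocab = {e for _, tgt in pairs for e in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (e, f) co-translations
        total = defaultdict(float)  # expected count of f
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for e in tgt:
                z = sum(t[(e, f)] for f in src)
                for f in src:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # M-step: renormalise the expected counts per source word.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


# Toy corpus, for illustration only.
pairs = [
    ("蓝 房子".split(), "blue house".split()),
    ("房子".split(), "house".split()),
]
t = ibm1_em(pairs)
print(round(t[("house", "房子")], 3), round(t[("blue", "蓝")], 3))
```

On this toy corpus the second pair pins "房子" to "house", so the probability mass for "蓝" shifts to "blue" over the iterations.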
     In the phrase alignment stage, we first design an algorithm for 1-to-n phrase alignment that uses the word alignment results. For Chinese phrases whose corresponding English phrases cannot be found by this algorithm, we generate candidate English phrases for each Chinese phrase and rank the candidates within a maximum entropy framework. Finally, many-to-many phrase alignments are derived from the one-to-many phrase alignment results.
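Ranking candidates within a maximum entropy framework amounts to scoring each candidate with a log-linear model and normalising over the candidate set. The sketch below assumes the weights have already been trained. The feature names (`lexical_overlap`, `length_ratio`, `position_gap`) and their values are hypothetical, since the abstract does not list the features used.

```python
import math


def rank_candidates(candidates, weights):
    """Rank candidates by a log-linear (maximum entropy) model.

    candidates: dict mapping a candidate phrase to its feature dict.
    weights: dict mapping a feature name to its weight.
    Returns (candidate, probability) pairs, most probable first.
    """
    scores = {
        name: sum(weights.get(k, 0.0) * v for k, v in feats.items())
        for name, feats in candidates.items()
    }
    z = sum(math.exp(s) for s in scores.values())  # normaliser over candidates
    ranked = [(name, math.exp(s) / z) for name, s in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical weights and features, for illustration only.
weights = {"lexical_overlap": 2.0, "length_ratio": 0.5, "position_gap": -1.0}
candidates = {
    "the blue house": {"lexical_overlap": 1.0, "length_ratio": 1.0, "position_gap": 0.0},
    "the house": {"lexical_overlap": 0.5, "length_ratio": 0.5, "position_gap": 0.0},
    "in the room": {"lexical_overlap": 0.0, "length_ratio": 1.0, "position_gap": 1.0},
}

ranked = rank_candidates(candidates, weights)
best_candidate = ranked[0][0]
print(best_candidate)
```

The top-ranked candidate is then taken as the alignment for the Chinese phrase. Training the weights (for example by generalized iterative scaling) is outside the scope of this sketch.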
