日语依存句法分析技术研究

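English title: Research on Japanese Dependency Parsing Technology
Author: 成姣
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 日语依存关系解析 ; 条件随机场 ; 机器学习 ; 上下文信息 ; 错误驱动
English keywords: Japanese dependency analysis ; Conditional Random Fields ; Machine learning ; Contextual information ; Error-driven technique
Degree year: 2011
Advisor: 蔡东风
Discipline code: 081203
Degree-granting institution: 沈阳航空航天大学 (Shenyang Aerospace University)
Thesis submission date: 2010-12-08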

Abstract

Japanese dependency analysis is a basic technique in Japanese sentence analysis: it determines the dependency relations between bunsetsu (phrasal segments) in a sentence according to Japanese dependency grammar. Syntactic parsing is the primary foundation for deeper natural language processing such as semantic analysis, and it is an indispensable component of many natural language processing applications. Dependency analysis has important applications in machine translation, information extraction, automatic question answering, and other fields.
     Existing research on Japanese dependency analysis has concentrated on modifying the learning framework, and the machine learning algorithms used are mostly Support Vector Machines or other margin-based and memory-based learning methods. Conditional Random Fields (CRFs) are an excellent sequence labeler that has been applied successfully to many natural language processing tasks, yet no work applying them to Japanese dependency analysis has been reported. This thesis combines the cascaded chunking algorithm with CRFs for Japanese dependency analysis, incorporating rich contextual information so that each labeling unit receives an optimal label from the perspective of the whole sentence. Experiments on the Kyoto University Text Corpus (Version 4.0) show that the method achieves good dependency accuracy and sentence accuracy even without dynamic features.
     Rule-based methods remain a useful complement to statistical methods and are still widely used in many areas of natural language processing. Traditionally, rules are written by hand by knowledge engineers according to their experience, so rule acquisition depends entirely on the linguistic knowledge of those engineers, and building a rule set requires substantial manpower and material resources. To address this shortcoming, this thesis adopts an error-driven mechanism based on CRFs: the output of a first CRF pass is added as a feature to the feature template of a second CRF pass, so that error patterns are learned automatically by statistical means and the trained model is used to correct errors. Experiments on the same corpus show that this method further improves the accuracy of dependency analysis.
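The cascaded chunking loop mentioned in the abstract can be illustrated with a minimal sketch. It follows the general cascaded chunking procedure as it is commonly described: tag each undecided segment "D" if it modifies its right neighbour, remove decided segments, and repeat. It is not the thesis implementation. In the thesis the tags come from a CRF that labels the whole sequence; here `label_sequence` is a hypothetical stand-in, and `toy_labeler`, `TOY_PAIRS`, and the example sentence are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence

# A labeler maps the remaining segments to one tag per segment:
# "D" = modifies its right neighbour, "O" = still undecided.
Labeler = Callable[[List[str]], List[str]]


def cascaded_chunking(segments: Sequence[str], label_sequence: Labeler) -> List[int]:
    """Return the head index of every segment (-1 for the root)."""
    heads = [-1] * len(segments)
    remaining = list(range(len(segments)))  # indices not yet attached
    tags = ["O"] * len(segments)

    while len(remaining) > 1:
        # Re-label only the segments that are still undecided.
        predicted = label_sequence([segments[i] for i in remaining])
        for pos, cur in enumerate(remaining[:-1]):
            if tags[cur] == "O" and predicted[pos] == "D":
                tags[cur] = "D"
        # Japanese is head-final, so the penultimate segment always
        # modifies the last one; this also guarantees progress.
        tags[remaining[-2]] = "D"

        # Attach and remove every "D" segment whose left neighbour is
        # absent or undecided; other segments stay for the next round.
        kept = []
        for pos, cur in enumerate(remaining):
            left_is_d = pos > 0 and tags[remaining[pos - 1]] == "D"
            if tags[cur] == "D" and not left_is_d:
                heads[cur] = remaining[pos + 1]
            else:
                kept.append(cur)
        remaining = kept
    return heads


# Hypothetical toy labeler: a lookup of (modifier, head) pairs
# stands in for the trained CRF.
TOY_PAIRS = {
    ("彼は", "感動した"),
    ("彼女の", "真心に"),
    ("温かい", "真心に"),
    ("真心に", "感動した"),
}


def toy_labeler(segs: List[str]) -> List[str]:
    tags = ["D" if (a, b) in TOY_PAIRS else "O" for a, b in zip(segs, segs[1:])]
    return tags + ["O"]  # the last segment is never a modifier


segments = ["彼は", "彼女の", "温かい", "真心に", "感動した"]
heads = cascaded_chunking(segments, toy_labeler)
print(heads)
```

For this toy input the loop needs four rounds and yields the heads `[4, 3, 3, 4, -1]`. A second, error-driven pass of the kind the abstract describes would feed the first-pass tags back in as extra features to another labeler; that step is not shown here.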

