Building Semantic Knowledge-Bank Based on the Binary Combinatorial Grammar

Original title (Chinese): 基于二元组合文法的语义知识库构建
Author: Xu Zhongming (徐忠明)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: natural language processing; syntactic analysis; semantic analysis; semantic knowledge bank
Degree year: 2008
Advisor: Wan Jiancheng (万建成)
Discipline code: 081202
Degree-granting institution: Shandong University
Thesis submission date: 2008-04-05

Abstract

Syntactic analysis has long been one of the most active topics in natural language processing, and research in this area has made considerable progress. Since the 1980s, the focus of syntactic analysis has gradually shifted toward semantic processing, and within semantic processing the main emphasis is on word-level linguistic units. Whether the task is machine translation, information extraction, or word sense disambiguation, semantic knowledge is an indispensable foundational resource.
     This thesis first describes the Binary Combinatorial Grammar on which both the overall system and the semantic knowledge bank are based, and then presents the overall architecture of the syntactic analysis system. During parsing, syntactic and semantic analysis interact with each other, and the semantic knowledge bank is the knowledge source for semantic analysis and semantic disambiguation.
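     The following chapter introduces the main semantic theories relevant to the design, together with several representative semantic dictionaries. Semantic theory is the theoretical basis for designing the knowledge bank. The description schemes of these dictionaries cover many aspects, including hierarchical classification relations as well as synonymy and same-class relations. In general, however, none of them directly meets the needs of Chinese information processing applications, although they can serve as learning resources for the knowledge bank built here.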
     Starting from the practical needs of syntactic analysis, we designed the description scheme and organization of the semantic knowledge bank. The bank is composed of a word library, a semantic collocation library, classification and component libraries, and a maintenance subsystem. The word library is the center of the whole bank. The semantic collocation library stores binary semantic collocation relations between pairs of words. The classification library describes the relative position of words within a given classification system, and the component library describes whole-part relations between words. The maintenance subsystem maintains the bank and provides interfaces for retrieving, adding, and deleting semantic knowledge.
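     The organization described above can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions: the class name, method names, and attribute labels below are hypothetical and are not taken from the thesis, which does not specify its storage format here.

```python
from collections import defaultdict


class SemanticKnowledgeBank:
    """Toy in-memory bank. Every name in this class is hypothetical."""

    def __init__(self):
        self.words = set()                    # word library: the hub of the bank
        self.collocations = defaultdict(set)  # (word_a, word_b) -> attribute labels
        self.parent_class = {}                # classification library: word -> class
        self.parts = defaultdict(set)         # component library: whole -> parts

    # Maintenance interface: add knowledge.
    def add_collocation(self, a, b, attribute):
        self.words.update((a, b))
        self.collocations[(a, b)].add(attribute)

    def add_class(self, word, cls):
        self.words.update((word, cls))
        self.parent_class[word] = cls

    def add_part(self, whole, part):
        self.words.update((whole, part))
        self.parts[whole].add(part)

    # Maintenance interface: retrieve knowledge.
    def lookup_collocation(self, a, b):
        # Return a copy so callers cannot mutate the stored labels.
        return set(self.collocations.get((a, b), set()))

    def is_a(self, word, cls):
        # Walk up the classification chain; `seen` guards against cycles.
        seen = set()
        while word in self.parent_class and word not in seen:
            seen.add(word)
            word = self.parent_class[word]
            if word == cls:
                return True
        return False

    # Maintenance interface: delete knowledge.
    def remove_collocation(self, a, b, attribute):
        labels = self.collocations.get((a, b))
        if not labels or attribute not in labels:
            return False
        labels.remove(attribute)
        if not labels:
            del self.collocations[(a, b)]
        return True


bank = SemanticKnowledgeBank()
bank.add_collocation("eat", "apple", "verb-object")
bank.add_class("apple", "fruit")
bank.add_class("fruit", "food")
bank.add_part("apple", "core")
```

     Keeping every entry keyed by words from the word library mirrors the statement that the word library sits at the center of the bank; a real system would presumably use persistent storage rather than dictionaries.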
     The last chapter then discusses methods for adding semantic knowledge to the bank. It first introduces the HIT (Harbin Institute of Technology) dependency treebank and proves that a dependency tree can be converted into a binary combination tree. Drawing on statistics-based collocation identification algorithms, we combine collocation attribute categories with statistics to extract collocation attribute knowledge directly from the dependency treebank. Compared with using statistics alone, this improves precision and recall and rapidly enlarges the semantic collocation library. For the classification and component libraries, we take HowNet and WordNet as knowledge sources and rely mainly on manual discovery and judgment, so that the hierarchy stays consistent; a pattern-based recognition method is then used to extract hierarchical knowledge automatically from text. The result is a preliminary semantic knowledge bank that meets the needs of semantics-based syntactic analysis.
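     The idea of combining collocation categories with statistics can be sketched as follows. This is not the thesis's algorithm, whose details are not given in the abstract: the sketch assumes dependency arcs are available as (head, dependent, relation) triples, filters them by relation label, and ranks the survivors by pointwise mutual information. The function name, the thresholds, and the sample relation labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def extract_collocations(arcs, allowed_relations, min_count=2, min_pmi=0.0):
    """Filter dependency arcs by relation type, then score pairs with PMI.

    arcs: iterable of (head, dependent, relation) triples (assumed input format).
    Returns a list of (head, dependent, relation, pmi), highest PMI first.
    """
    # Category step: keep only arcs whose relation label is of interest.
    kept = [(h, d, r) for h, d, r in arcs if r in allowed_relations]
    total = len(kept)

    pair_counts = Counter(kept)
    head_counts = Counter(h for h, _, _ in kept)
    dep_counts = Counter(d for _, d, _ in kept)

    results = []
    for (h, d, r), n in pair_counts.items():
        if n < min_count:
            continue  # statistical step: drop rare pairs
        # PMI = log2( P(h, d) / (P(h) * P(d)) ), estimated from the kept arcs.
        pmi = math.log2(n * total / (head_counts[h] * dep_counts[d]))
        if pmi >= min_pmi:
            results.append((h, d, r, pmi))

    results.sort(key=lambda item: -item[3])
    return results


# Hypothetical arcs; the labels "VOB" and "ATT" are used only as examples.
sample_arcs = (
    [("eat", "apple", "VOB")] * 3
    + [("eat", "rice", "VOB")]
    + [("drink", "tea", "VOB")] * 4
    + [("apple", "red", "ATT")] * 5
)
found = extract_collocations(sample_arcs, allowed_relations={"VOB"})
```

     Filtering by relation before counting means the statistics are computed only over arcs of the wanted category, which is one plausible reading of how a category constraint could raise precision over a purely statistical window-based method.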
     Building a semantic knowledge bank is a large and difficult undertaking, and at present the work can only be carried out toward limited goals. Nevertheless, we have found a viable technical path and provided a basic resource for implementing the syntactic analysis system. The knowledge bank can also serve as a basic resource for other Chinese information processing applications, so its application prospects are broad.

