Research on Automatic Recognition of Chinese Functional Chunks
Abstract
Chinese functional chunks are functional constituents defined at the sentence level. They generally occupy functional positions in the sentence such as subject, predicate, object, adverbial, attributive, and head, and embody the basic skeleton of a Chinese sentence. The goal of functional chunk recognition is to correctly annotate a sentence's functional chunk labels, covering each basic information unit formed by top-down decomposition of the event sentence pattern, so as to reveal the sentence's basic structure and skeleton at the clause level and to provide a minimal functional-chunk description sequence for further event-skeleton tree analysis.
     This paper casts the automatic recognition of Chinese functional chunks as a sequence labeling problem, using Conditional Random Fields (CRFs) as the sequence labeler. CRFs are conditional probability models over undirected graphs: arbitrary effective feature vectors can be added, long-distance dependencies and overlapping features can be expressed, and the label bias problem is largely avoided. For these reasons, this paper builds its functional chunk labeling model on CRFs.
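The reduction of chunk recognition to sequence labeling can be sketched with a standard BIO encoding, where each word receives a B-X/I-X/O tag and X is a chunk type. The chunk-type abbreviations below (S, P, O for subject, predicate, object) are illustrative, not the paper's actual tag set:

```python
def chunks_to_bio(words, chunks):
    """Encode chunk spans as BIO tags.
    chunks: list of (start, end, label) spans over `words`, end exclusive."""
    tags = ["O"] * len(words)
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

def bio_to_chunks(tags):
    """Recover (start, end, label) spans from a BIO tag sequence."""
    chunks = []
    start = label = None
    for i, tag in enumerate(tags):
        # Close the open chunk when the current tag cannot continue it.
        if label is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != label):
            chunks.append((start, i, label))
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if label is not None:
        chunks.append((start, len(tags), label))
    return chunks
```

With this encoding, a CRF only has to predict one tag per word, and chunk boundaries are recovered deterministically from the tag sequence.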
     To build a strong automatic recognition system for functional chunks, this paper first applies a feature-template optimization strategy, obtaining a precision of 85.84%, a recall of 85.07%, and an F1-measure of 85.45% for functional chunk recognition; for the four typical chunk types (subject, predicate, object, and adverbial), the F1-measures reach 85.16%, 88.22%, 81.75%, and 91.98%, respectively.
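The kind of feature templates such an optimization strategy searches over can be sketched as window features over the word and POS-tag sequences. The concrete window size and bigram templates below are assumptions for illustration, not the optimized template set reported above:

```python
def extract_features(words, pos_tags, i):
    """Instantiate window feature templates for position i
    over a +/-2 window of words and POS tags."""
    feats = {}
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(words):
            feats["w[%d]" % offset] = words[j]
            feats["p[%d]" % offset] = pos_tags[j]
        else:
            feats["w[%d]" % offset] = "<PAD>"  # pad beyond sentence edges
            feats["p[%d]" % offset] = "<PAD>"
    # Bigram templates combining adjacent POS tags.
    feats["p[-1]|p[0]"] = feats["p[-1]"] + "|" + feats["p[0]"]
    feats["p[0]|p[1]"] = feats["p[0]"] + "|" + feats["p[1]"]
    return feats
```

Template optimization then amounts to adding, dropping, or combining such templates and keeping the set that maximizes F1 on held-out data.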
     On this basis, this paper is the first to introduce semantic information into a Chinese functional chunk recognition system. Using the thesaurus "Tongyici Cilin", which organizes words by sense aggregation, as the semantic resource, its semantic information is added as features to the recognition process. This alleviates the impact of data sparseness and ambiguity on the results, raising the three performance figures above to 86.21%, 85.31%, and 85.76%, a considerable improvement over the method using the CRF model alone.
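How thesaurus-derived semantic features ease data sparseness can be sketched as follows: synonyms share a category code, so evidence learned for one word transfers to unseen synonyms. The tiny dictionary and category codes below are hypothetical stand-ins for the real Tongyici Cilin entries:

```python
# Hypothetical stand-in for the Tongyici Cilin thesaurus, whose entries
# carry hierarchical semantic category codes.
CILIN = {
    "苹果": "Bh07",   # illustrative code shared by fruit words
    "香蕉": "Bh07",
    "吃":   "Fa21",
}

def semantic_feature(word):
    """Return a coarse semantic-category feature for `word`.

    Synonyms map to the same code, so a CRF trained on "苹果" can
    generalize to "香蕉" even if the latter never appears in training."""
    code = CILIN.get(word)
    if code is None:
        return "SEM=<UNK>"          # out-of-vocabulary back-off
    return "SEM=" + code[:2]        # truncate to a coarse level
```

In the full system, such a feature would simply be appended to the per-position feature dictionary alongside the word and POS templates.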
This paper casts the automatic parsing of Chinese functional chunks as a sequence labeling problem. We build a sequence labeling model for Chinese functional chunks based on Conditional Random Fields (CRFs), a conditional probability model over undirected graphs. Arbitrary effective feature vectors can be added to a CRF; the model can express long-distance dependencies and overlapping features, and thus avoids the label bias problem. Moreover, all features undergo global normalization, so the model finds a globally optimal solution. Unlike Hidden Markov Models, CRFs make no strong assumptions about the probability distributions of the input or output, which makes them well suited to sequence labeling; we therefore choose CRFs for labeling Chinese functional chunks.
     We focus on building a system that labels Chinese functional chunks by detecting chunk boundaries and labeling functional information in sentences that have been correctly word-segmented and POS-tagged. This paper proposes an approach that combines a feature-template optimization strategy with a Conditional Random Field model for automatically labeling Chinese functional chunks. On the test set, the precision, recall, and F1-measure for Chinese functional chunks reach 85.84%, 85.07%, and 85.45%, respectively; the F1-measures for the subject, predicate, object, and adverbial chunks reach 85.16%, 88.22%, 81.75%, and 91.98%, respectively, ranking first in the closed test of the CIPS-ParsEval-2009 Task 3 on functional chunking.
     On the basis of combining the feature-template optimization strategy with the CRF model, the existing Chinese thesaurus "Tongyici Cilin" is introduced into the processing module: its semantic information is added to the feature templates, mitigating the effects of data sparseness and ambiguity. The three performance figures thus rise to 86.21%, 85.31%, and 85.76%, respectively, outperforming the earlier method based on the CRF model alone.
