Statistics-Based Automatic Recognition of the Usages of Common Chinese Adverbs
Abstract
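Automatic recognition of the usages of Modern Chinese adverbs is an important part of research on an NLP-oriented knowledge base of Modern Chinese adverbs. To address the shortcomings of the rule-based recognition method, this paper builds on existing work and proposes a statistics-based method for automatically recognizing the usages of common Chinese adverbs. Using conditional random field (CRF), maximum entropy (ME), and support vector machine (SVM) models, statistical experiments were carried out on eight common Modern Chinese adverbs over the word-segmented and part-of-speech-tagged People's Daily corpus of January 1998. The experiments show that the statistical approach recognizes adverb usages well, predicts previously unseen usages well, achieves high accuracy on real text, and yields a considerably higher average accuracy than the rule-based method. They indicate that the statistical approach has good application prospects for the automatic recognition of common Modern Chinese adverb usages.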
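     Following the idea, proposed by Yu Shiwen et al., of building a "trinity" knowledge base of Modern Chinese function words, this paper focuses on the automatic recognition of Modern Chinese adverb usages and aims to achieve it with statistical machine learning methods.
     The main work of this paper is as follows:
     (1) For the preliminary Modern Chinese adverb knowledge base, the example sentences in the adverb usage dictionary are used as a corpus to examine the adverb usage rules, analyze the problems in those rules, and revise them, thereby improving the adverb knowledge base.
     (2) The rule-based method is used to automatically recognize adverb usages in the People's Daily corpus, and the results are manually proofread to form an adverb usage corpus, which serves as the experimental corpus. During the proofreading, the problems in the rule-based recognition results are analyzed, and the adverb usage dictionary and the adverb usage rule base are further improved.
     (3) To address the shortcomings of the rule-based method, statistics-based automatic recognition of common Modern Chinese adverb usages is implemented, further improving recognition accuracy.
     Finally, the paper summarizes the research, looks ahead to future work, and points out the feasibility of combining rule-based and statistical methods for the automatic recognition of Modern Chinese adverb usages.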
Research on the automatic recognition of the usages of Modern Chinese adverbs is an important part of the NLP-oriented Chinese Adverb Knowledge Base. To address the problems of the existing rule-based method for recognizing adverb usages, this paper builds on previous work and further studies the automatic recognition of Chinese adverb usages with statistical methods. Three statistical models, namely CRF, ME, and SVM, are used to label the usages of several common Chinese adverbs on the tagged People's Daily corpus (January 1998). The experiments show that the statistical method is effective at automatically recognizing adverb usages and has good application prospects.
     Following the idea of building a "Trinity" knowledge base of function words, this paper focuses on an important part of the adverb knowledge base, the automatic recognition of adverb usages, and uses statistical methods to achieve it.
     The main work is as follows:
     (1) Using the example data in the Chinese Adverb Knowledge Base as our corpus, we examine the adverb usage rules, analyze their problems, and complete the adverb knowledge base.
     (2) We use the rule-based method to recognize adverb usages in our corpus and then manually check the tagging results several times. The result forms a standard corpus, which we use as the experimental corpus. At the same time, we further refine the information dictionary and the rule base of adverb usages.
     (3) To address the shortcomings of the rule-based method, we implement statistics-based automatic recognition of adverb usages and further improve recognition precision.
     Finally, this paper summarizes the research, looks ahead to future work, and points out the feasibility of combining the rule-based and statistical methods for automatically recognizing adverb usages.
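The abstract does not state which features the CRF, ME, and SVM models were trained on. As a rough illustration only, the sketch below shows one common way to turn a word-segmented, POS-tagged sentence into context-window features for a target adverb, which a statistical classifier could then map to a usage label. The `word/pos` input format, the `d` tag for adverbs, the window size, the feature names, and all function names here are assumptions made for this example, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def parse_tagged_sentence(line: str) -> List[Token]:
    """Split a whitespace-separated 'word/pos' line into (word, pos) pairs."""
    tokens = []
    for item in line.split():
        word, _, pos = item.rpartition("/")
        if word:  # skip items that carry no 'word/pos' structure
            tokens.append((word, pos))
    return tokens


def context_features(tokens: List[Token], index: int, window: int = 2) -> Dict[str, str]:
    """Collect the words and POS tags within `window` positions of tokens[index]."""
    feats = {"w0": tokens[index][0]}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        position = index + offset
        if 0 <= position < len(tokens):
            word, pos = tokens[position]
        else:
            word, pos = "<PAD>", "<PAD>"  # outside the sentence boundary
        feats[f"w{offset:+d}"] = word
        feats[f"p{offset:+d}"] = pos
    return feats


def adverb_instances(line: str, targets: Set[str], window: int = 2) -> List[Tuple[int, Dict[str, str]]]:
    """Return (position, features) for every target adverb tagged 'd' (assumed adverb tag)."""
    tokens = parse_tagged_sentence(line)
    return [
        (i, context_features(tokens, i, window))
        for i, (word, pos) in enumerate(tokens)
        if pos == "d" and word in targets
    ]


if __name__ == "__main__":
    # Hypothetical tagged sentence; the adverb 就 is the recognition target.
    for position, feats in adverb_instances("他/r  就/d  来/v  了/y", {"就"}):
        print(position, feats)
```

Each feature dictionary would still need a usage label from the manually proofread corpus before any of the three models could be trained on it; that labeling step is outside this sketch.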
