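Research on Chinese Automatic Word Segmentation Based on Rules and Statistics

Author: Li Dan (李丹)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Chinese word segmentation; mechanical matching; Chinese personal name recognition; support vector machines; transformation-based error-driven learning
Degree year: 2010
Supervisor: Zhao Wei (赵伟)
Discipline code: 081202
Degree-granting institution: Changchun University of Technology (长春工业大学)
Thesis submission date: 2010-03-01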

Abstract

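With the growth of the Internet, the volume of digital information is increasing rapidly, and Chinese information processing is drawing more and more attention; research on processing modern Chinese text has become correspondingly important. Automatic Chinese word segmentation and named entity recognition are foundational research topics in Chinese information processing, and studying and implementing them has both theoretical significance and practical value. Because their results directly affect research in machine translation, syntactic analysis, semantic analysis, speech recognition, information retrieval, information filtering, and related fields, the demand for better segmentation and named entity recognition has become increasingly urgent and has drawn sustained attention.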
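     Compared with other languages, automatic word segmentation and named entity recognition in Chinese pose their own particular difficulties. We consider two factors to affect segmentation accuracy: (1) segmentation ambiguity, and (2) proper nouns such as Chinese personal names, place names, and organization names. At present, the results of Chinese automatic segmentation and named entity recognition still leave room for improvement. This thesis studies two problems separately, Chinese automatic word segmentation and Chinese personal name recognition (a subproblem of named entity recognition), and proposes a mechanical matching algorithm combined with word frequency and a Chinese personal name recognition algorithm that combines SVM with error-driven learning.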
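     Chinese automatic word segmentation is an important step in Chinese information processing and the foundation of many of its application areas. Current methods fall mainly into rule-based, statistical, and understanding-based approaches. This thesis analyzes existing segmentation algorithms in depth and, on that basis, focuses on rule-based and statistical segmentation, proposing a mechanical matching algorithm combined with word frequency. The method first segments by length priority combined with word-frequency priority; for any strings left unmatched, it then applies improved forward maximum matching and reverse maximum matching combined with entropy rate. Experimental results show that this algorithm further improves segmentation accuracy.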
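To make the matching step concrete, here is a minimal Python sketch of forward and reverse maximum matching with a word-frequency tie-break. It is an illustration under stated assumptions, not the thesis's algorithm: the dictionary `FREQ` and its counts are invented, the scoring rule (fewer words first, then higher summed log-frequency) is one plausible reading of "length priority combined with word-frequency priority", and the entropy-rate step is not reproduced because the abstract does not define it.

```python
import math

# Toy dictionary with made-up frequencies (hypothetical; not from the thesis).
FREQ = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "的": 100, "起源": 30}
MAX_LEN = max(len(w) for w in FREQ)


def forward_max_match(text, freq=FREQ, max_len=MAX_LEN):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in freq:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, freq=FREQ, max_len=MAX_LEN):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in freq:  # fall back to a single character
                words.append(piece)
                j -= size
                break
    return words[::-1]


def score(words, freq=FREQ):
    """Assumed ranking: fewer words first, then higher summed log-frequency."""
    log_freq = sum(math.log(freq.get(w, 1)) for w in words)
    return (-len(words), log_freq)


def segment(text):
    """Return whichever of the two matchings scores better."""
    candidates = [forward_max_match(text), reverse_max_match(text)]
    return max(candidates, key=score)


print(forward_max_match("研究生命的起源"))  # ['研究生', '命', '的', '起源']
print(segment("研究生命的起源"))            # ['研究', '生命', '的', '起源']
```

On this input both matchings yield four words, so the frequency term decides; the forward result contains the rare single-character word "命" and loses to the reverse result.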
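     Chinese personal name recognition is an important part of unknown-word recognition in Chinese word segmentation, and handling it well should effectively improve the precision of unknown-word recognition. This thesis proposes a Chinese personal name recognition method that combines support vector machines (SVM) with transformation-based error-driven learning. Transformation-based error-driven learning is used to correct the SVM's recognition output; the transformation rules handle special cases in the language well and further improve on the SVM's results. Experimental results in an open test show that, compared with name recognition using the SVM model alone, adding error-driven learning improves precision, recall, and F-measure for Chinese personal name recognition.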
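The correction stage can be sketched as a small Brill-style learner in Python. This is a hedged illustration only: the `baseline` tags stand in for SVM output (no SVM is trained here), the B/I/O tag set, the single rule template "change tag X to Y when the previous token is T", and the example sentence are all assumptions rather than details from the thesis.

```python
def apply_rule(rule, tokens, tags):
    """Rule = (from_tag, to_tag, prev_token): retag when the previous token matches."""
    from_tag, to_tag, prev_token = rule
    out = list(tags)
    for i in range(1, len(tokens)):
        if tags[i] == from_tag and tokens[i - 1] == prev_token:
            out[i] = to_tag
    return out


def errors(tags, gold):
    """Number of positions where the current tags disagree with the gold tags."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(tokens, baseline, gold, max_rules=10):
    """Greedy error-driven loop: repeatedly add the rule that removes the most errors."""
    tags, learned = list(baseline), []
    for _ in range(max_rules):
        # Propose one candidate rule for each position that is currently wrong.
        candidates = {
            (tags[i], gold[i], tokens[i - 1])
            for i in range(1, len(tokens))
            if tags[i] != gold[i]
        }
        best, best_gain = None, 0
        for rule in sorted(candidates):
            gain = errors(tags, gold) - errors(apply_rule(rule, tokens, tags), gold)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no rule reduces the error count any further
            break
        tags = apply_rule(best, tokens, tags)
        learned.append(best)
    return learned, tags


# Hypothetical training sentence: B = first character of a name, I = inside, O = other.
tokens = ["记者", "王", "明", "报道", "记者", "李", "华", "说"]
gold = ["O", "B", "I", "O", "O", "B", "I", "O"]
baseline = ["O", "O", "I", "O", "O", "O", "I", "O"]  # stand-in for SVM output

rules, corrected = learn_rules(tokens, baseline, gold)
print(rules)              # [('O', 'B', '记者')]
print(corrected == gold)  # True
```

The learned rules are ordered, so at prediction time they would be applied to new baseline output in the same sequence in which they were learned.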
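Precision, recall, and F-measure for name recognition can be computed as in the Python sketch below. It assumes the balanced F1 form of the F-measure and represents each recognized name as a hypothetical `(start, end)` span; the abstract does not state which F variant or matching criterion the experiments used, and the spans here are invented for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of recognized names, e.g. (start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # exact-match spans only (an assumption)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up spans: four gold names, three predictions, two of them correct.
gold_spans = {(1, 3), (5, 7), (9, 11), (12, 14)}
predicted_spans = {(1, 3), (5, 7), (8, 10)}

p, r, f = precision_recall_f1(predicted_spans, gold_spans)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.6667 0.5 0.5714
```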

