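Automatic Identification of Chinese Organization Names Based on SVM and Maximum Entropy

Original title: SVM和最大熵相结合的中文机构名自动识别
Author: 杨德来 (Yang Delai)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords (Chinese): 中文机构名; 驱动式; 最大熵; 支持向量机
Keywords (English): Chinese Organization Name; Drive mode; Maximum Entropy; Support Vector Machine (SVM)
Degree year: 2006
Supervisor: 黄德根 (Huang Degen)
Discipline code: 081202
Degree-granting institution: 大连理工大学 (Dalian University of Technology)
Thesis submission date: 2006-12-10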

Abstract

Recognizing unknown (out-of-vocabulary) words is one of the main difficulties in automatic Chinese word segmentation. Chinese organization names are an important class of unknown words: they cover a wide range of domains, come in many types and forms, and the vast majority are not listed in dictionaries. Chinese organization name recognition is a named entity recognition task and a basic part of Chinese lexical analysis. Unrecognized organization names in a text reduce the correctness of segmentation and of the analysis that follows, so recognizing them automatically is important for improving the precision of both Chinese word segmentation and syntactic analysis.
    This thesis proposes a method for automatically recognizing Chinese organization names that combines a Support Vector Machine (SVM) with Maximum Entropy. The recognition scope is limited to complete organization names that end with an organization-name feature word. Based on the characteristics of organization names, recognition is split into two parts: rear-boundary decision and front-part tagging. For each word in the text that appears in the feature-word dictionary, an SVM decides whether the word is actually acting as an organization-name feature word (rear-boundary decision). Starting from the word immediately before a recognized feature word, a Maximum Entropy model then tags words one by one toward the beginning of the text, and stops when it reaches a word that is not part of the organization name (front-part tagging). The same process is then repeated over the rest of the text.
    To make the rear-boundary decision more efficient, a driven ("drive mode") recognition method is proposed: the rear-boundary decision is applied only to words in the text that are listed in the feature-word dictionary, and front-part tagging starts only from words recognized as feature words. The rear-boundary decision is therefore a binary classification problem. Because SVM is a strong binary classifier, an SVM-based rear-boundary model can solve the feature-word recognition problem effectively. The model is built from a statistical analysis of organization-name feature words and their grammatical features.
    The words in the front part of an organization name have a fairly complex composition. Maximum Entropy can flexibly combine many scattered, fragmentary pieces of knowledge, works well on complex problems, and handles multi-class classification with reasonable efficiency, so a Maximum Entropy front-part tagging model is well suited to recognizing these words. Based on the characteristics of the front-part words and the statistical analysis results, Maximum Entropy feature templates are defined, a feature set is constructed, and parameters are estimated to obtain the Maximum Entropy front-part tagging model.
    Experiments show that the combined SVM and Maximum Entropy method is effective: in the open test, recall and precision reach 91.05% and 93.59% respectively, and the F-measure is 92.84%. Compared with similar work in the current literature, the system achieves comparatively good recognition results. The method also generalizes well and can be used to recognize other unknown words, such as person names and place names.
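The two-stage procedure described in the abstract (rear-boundary decision followed by front-part tagging) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation: the function names, the toy dictionary, and the two rule-based stand-in classifiers are all hypothetical. In the thesis, the first decision is made by a trained SVM and the second by a trained Maximum Entropy tagger.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def recognize_org_names(
    tokens: List[str],
    feature_dict: Set[str],
    is_feature_word: Callable[[List[str], int], bool],
    is_org_part: Callable[[List[str], int], bool],
) -> List[Span]:
    """Return spans of candidate organization names in a segmented sentence."""
    spans: List[Span] = []
    for i, tok in enumerate(tokens):
        # Driven recognition: only dictionary hits trigger the classifier.
        if tok not in feature_dict:
            continue
        # Rear-boundary decision (an SVM in the thesis, a stand-in here).
        if not is_feature_word(tokens, i):
            continue
        # Front-part tagging: walk toward the sentence start until a word
        # is judged not to belong to the organization name.
        start = i
        while start > 0 and is_org_part(tokens, start - 1):
            start -= 1
        if start < i:  # assumption: a name needs at least one front word
            spans.append((start, i + 1))
    return spans


# Hypothetical stand-ins for the two trained models.
FEATURE_DICT = {"大学", "公司"}
FRONT_WORDS = {"大连", "理工", "北京"}


def toy_is_feature_word(tokens: List[str], i: int) -> bool:
    # Toy rule: after verbs such as "上" or "读", "大学" is a common noun.
    return i > 0 and tokens[i - 1] not in {"上", "读"}


def toy_is_org_part(tokens: List[str], j: int) -> bool:
    return tokens[j] in FRONT_WORDS


sentence = ["他", "在", "大连", "理工", "大学", "学习"]
found = recognize_org_names(
    sentence, FEATURE_DICT, toy_is_feature_word, toy_is_org_part
)
names = ["".join(sentence[s:e]) for s, e in found]
```

Passing the two classifiers in as callables keeps the control flow separate from the models, so trained SVM and Maximum Entropy predictors could be substituted for the toy rules without changing the loop. The sketch does not try to resolve overlapping or nested candidates.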
