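Chinese POS Tagging Based on Maximum Entropy

Original title: 基于最大熵的汉语词性标注
Author: Kong Haixia (孔海霞)
Degree level: Master's
Discipline: Computer Applications
Keywords: part-of-speech (POS) tagging; maximum entropy (ME); template; unknown words
Degree year: 2007
Advisor: Huang Degen (黄德根)
Discipline code: 081203
Degree-granting institution: Dalian University of Technology
Submission date: 2007-12-09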

Abstract

Part-of-speech (POS) tagging assigns the correct part of speech to every word in a text. It is a foundational task in natural language processing (NLP), and its accuracy directly affects the accuracy of later syntactic parsing or chunking. Errors made during POS tagging are amplified as they propagate through the processing chain, so accurate tagging matters a great deal. The goal of this thesis is to implement Chinese POS tagging on top of word-segmented text, providing a basis for later syntactic parsing and other NLP tasks.
The thesis first reviews the current state and significance of research on Chinese POS tagging, then implements a Chinese POS tagging system based on maximum entropy (ME), building on a thorough study of ME theory, and finally applies statistical rules and POS restriction to further tag unknown words.
Different templates are used to feed different contextual information into the ME model. Four ME tagging models are built, and the template with the best tagging performance is chosen as the final template. To simplify the model, three feature selection methods are used to prune the candidate features. To further improve tagging accuracy, rules and POS restriction are combined with ME to re-tag unknown words. The thesis presents the algorithm of the ME tagging model, its tagging results, and the results after the further tagging of unknown words.
POS tagging is a comparatively complex problem. Because ME can make full use of a word's contextual information at different levels, it handles such problems well, and using it for POS tagging gave good results.
Experimental results show that ME is effective for Chinese POS tagging: accuracy on the open test is 94.96%, and accuracy on unknown words is 63.32%.
The results of this thesis can be applied in practical machine translation systems, where they provide a basis for later NLP processing, and can be extended to other NLP areas such as information retrieval and text classification.
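To make the idea of feeding context into an ME model through a template more concrete, here is a minimal Python sketch. It is not the thesis's implementation: the template (current word, neighbouring words, previous tag), the tag set `n`/`v`/`d`, the example sentence, and the hand-picked weights are all assumptions made for illustration. It shows only the general shape of a conditional ME model, where the probability of a tag is the normalized exponential of the summed weights of the active (feature, tag) pairs.

```python
import math


def extract_features(words, i, prev_tag):
    """Instantiate a small, hypothetical feature template at position i."""
    prev_word = words[i - 1] if i > 0 else "<BOS>"
    next_word = words[i + 1] if i + 1 < len(words) else "<EOS>"
    return [
        f"w0={words[i]}",     # current word
        f"w-1={prev_word}",   # previous word
        f"w+1={next_word}",   # next word
        f"t-1={prev_tag}",    # previous tag
    ]


def tag_distribution(features, tags, weights):
    """Return p(tag | context) = exp(sum of active weights) / Z(context)."""
    scores = {
        t: sum(weights.get((f, t), 0.0) for f in features) for t in tags
    }
    z = sum(math.exp(s) for s in scores.values())  # normalization constant
    return {t: math.exp(s) / z for t, s in scores.items()}


# Illustrative tags and weights chosen by hand; these are not trained values.
TAGS = ["n", "v"]
WEIGHTS = {
    ("w0=研究", "n"): 0.5,  # the word is ambiguous between noun and verb
    ("w0=研究", "v"): 0.5,
    ("t-1=d", "v"): 1.0,    # a preceding adverb favours the verb reading
}

words = ["他", "正在", "研究", "问题"]
dist = tag_distribution(extract_features(words, 2, "d"), TAGS, WEIGHTS)
best_tag = max(dist, key=dist.get)
```

In a real system the weights would be estimated from an annotated corpus rather than set by hand, and changing the template only means changing the list returned by `extract_features`, which is what makes it easy to compare several templates against one another.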

