Research on Automatic Correction Methods for Chinese Word Segmentation and Part-of-Speech Tagging
Abstract
Corpus construction is a foundational undertaking in Chinese information processing. The basic processing of a Chinese corpus comprises two stages: automatic word segmentation and part-of-speech tagging. Both play a key role in many practical applications, including automatic retrieval, filtering, classification and summarization of Chinese text, automatic proofreading of Chinese text, Chinese-to-foreign-language machine translation, post-processing for Chinese character recognition and Chinese speech recognition, Chinese speech synthesis, sentence-level Chinese keyboard input, and conversion between simplified and traditional Chinese characters, and they provide important resources and strong support for a wide range of corpus-based research.
     Effective use of a corpus depends to a large extent on the level and quality of its segmentation and annotation. Although current processing of Chinese corpora has achieved certain results, national evaluation results show that it still falls well short of practical needs and requires further improvement.
     With the goal of further raising the accuracy of word segmentation and part-of-speech tagging, and thereby improving the overall processing quality of Chinese corpora, this thesis studies the two stages separately:
     1. It reviews and analyzes the state of the art in automatic word segmentation and proposes a rule-based method for the automatic correction of Chinese word segmentation. By learning from machine-segmented corpora and their manually corrected counterparts, the method automatically acquires segmentation-correction rules and then applies them to correct the machine output (a minimal sketch of this idea is given after the list).
     2. It reviews and analyzes the state of the art in part-of-speech tagging and proposes a rough-set-based method for automatically acquiring correction rules for the tagging of ambiguous (multi-category) words. Working from a large-scale Chinese corpus and using rough set theory and methods as the tool, it mines tagging-correction rules and applies them to correct the machine-tagged output (a second sketch after the list illustrates the rule-extraction step).
     3. It designs and implements an experimental system for the automatic correction of Chinese word segmentation and part-of-speech tagging, together with closed tests, open tests and an analysis of the results. In the experiments, segmentation correction achieved an accuracy of 93.75% in the closed test and 81.05% in the open test; tagging correction achieved 90.40% and 84.85%, respectively.
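The abstract does not spell out the rule representation used in the segmentation-correction method of item 1. As a minimal, purely illustrative sketch, the Python fragment below assumes sentence-aligned machine-segmented and hand-corrected corpora, records recurring locally differing word spans as rewrite rules, and applies them greedily; the function names, the span-based rule format, and the support threshold are assumptions made here, not the thesis's actual design.

# Illustrative sketch only: learn span-rewrite rules from parallel
# machine-segmented and hand-corrected sentences, then apply them.
from collections import Counter
from difflib import SequenceMatcher

def learn_segmentation_rules(machine_sents, gold_sents, min_support=2):
    """machine_sents / gold_sents: parallel lists of sentences, each a list of word tokens."""
    candidates = Counter()
    for machine, gold in zip(machine_sents, gold_sents):
        matcher = SequenceMatcher(a=machine, b=gold, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Record the locally differing word spans as a rewrite candidate.
                candidates[(tuple(machine[i1:i2]), tuple(gold[j1:j2]))] += 1
    # Keep only non-empty spans that recur, to filter out one-off corrections.
    return {wrong: list(right) for (wrong, right), n in candidates.items()
            if n >= min_support and wrong}

def correct_segmentation(machine_sent, rules, max_len=4):
    """Greedily rewrite any span of up to max_len words that matches a learned rule."""
    out, i = [], 0
    while i < len(machine_sent):
        for length in range(min(max_len, len(machine_sent) - i), 0, -1):
            span = tuple(machine_sent[i:i + length])
            if span in rules:
                out.extend(rules[span])
                i += length
                break
        else:
            out.append(machine_sent[i])
            i += 1
    return out

Under these assumptions a rule is simply a recurring (machine span, corrected span) pair, so the correction step amounts to a longest-match dictionary rewrite over the machine-segmented output.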
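For the rough-set step in item 2, the sketch below builds a decision table whose condition attributes are the tags of the neighbouring words and whose decision attribute is the hand-corrected tag of the ambiguous word, and keeps only the "certain" rules backed by the lower approximation of each decision class. The choice of context features, the omission of attribute reduction, and all names are simplifying assumptions for illustration; the thesis's actual rough-set procedure is not reproduced here.

# Illustrative sketch only: extract certain (lower-approximation) rules from a
# decision table for one ambiguous word, then use them to correct machine tags.
from collections import defaultdict

def build_decision_table(machine_sents, gold_sents, ambiguous_word):
    """machine_sents / gold_sents: parallel, word-aligned lists of [(word, tag), ...] sentences."""
    table = []
    for machine, gold in zip(machine_sents, gold_sents):
        for i, (word, machine_tag) in enumerate(machine):
            if word != ambiguous_word:
                continue
            prev_tag = machine[i - 1][1] if i > 0 else "BOS"
            next_tag = machine[i + 1][1] if i + 1 < len(machine) else "EOS"
            # Condition attributes: (previous tag, machine tag, next tag);
            # decision attribute: the tag assigned by the human corrector.
            table.append(((prev_tag, machine_tag, next_tag), gold[i][1]))
    return table

def certain_rules(table):
    """Keep conditions whose occurrences all share one decision, i.e. rules with
    no conflicting rows (the lower approximation of a decision class)."""
    decisions = defaultdict(set)
    for condition, decision in table:
        decisions[condition].add(decision)
    return {cond: next(iter(ds)) for cond, ds in decisions.items() if len(ds) == 1}

def correct_tags(machine_sent, ambiguous_word, rules):
    """Re-tag occurrences of the ambiguous word whenever a certain rule matches."""
    out = list(machine_sent)
    for i, (word, tag) in enumerate(out):
        if word != ambiguous_word:
            continue
        prev_tag = out[i - 1][1] if i > 0 else "BOS"
        next_tag = out[i + 1][1] if i + 1 < len(out) else "EOS"
        out[i] = (word, rules.get((prev_tag, tag, next_tag), tag))
    return out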