Research and Implementation of Automatic Chinese Text Classification Based on Word Segmentation

Author: Zhang Haiyan (张海燕)
Degree level: Master's
Discipline: Control Theory and Control Engineering
Keywords: automatic Chinese text classification; word segmentation; 2-gram phrase indexing; information processing; k-nearest neighbor (KNN); Naive Bayes; simple vector distance method; corpus
Degree year: 2002
Advisors: Tong Tiaosheng (童调生); Chen Zhiping (陈治平)
Discipline code: 081101
Degree-granting institution: Hunan University (湖南大学)
Submission date: 2002-12-01

Abstract

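With the continuing development of information technology, and especially the spread of Internet applications, the volume of online information is growing exponentially. How to process this mass of information automatically, so that large text collections can be maintained effectively, has become an important research topic. One way to manage texts effectively is to classify them systematically, which is the problem of automatic text classification. Automatic text classification is an important intelligent information processing technique and a foundation of text retrieval. It is highly valuable in applications such as automatic news classification, electronic conferencing, automatic e-mail classification, and information filtering.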
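    This thesis analyzes in detail the model construction of Chinese text classification and the corresponding classification algorithms, and evaluates the commonly used ones (chiefly SVM, Boosting, Naive Bayes, KNN, and methods based on the vector space model). Text classification is a supervised learning method. Classifying texts automatically requires solving several problems: obtaining a training document set, building a document representation model, selecting document features, choosing a classification algorithm, and choosing a performance evaluation model.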
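    The thesis gives particular attention to word segmentation for Chinese text classification. For automatic classification in an information-filtering setting, dictionary-based segmentation is not a required step. The thesis therefore proposes a segmentation method based on 2-gram phrase indexing, which combines the segmentation-mark method with a method based on word-frequency statistics. It can identify words that dictionary-based methods cannot handle, such as personal names, place names, and technical terms. Because the information this method needs is simple to obtain, classifying with it frees the document classification system from its dependence on complex segmentation programs and large dictionaries, so it can replace dictionary-based mechanical segmentation.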
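The abstract does not spell out the 2-gram indexing procedure, so the following Python sketch is only one plausible reading of it, not the thesis's implementation. It assumes that cut marks are non-Chinese characters plus a small, hypothetical set of function characters (`CUT_CHARS`), and that a bigram becomes an index term once it occurs at least `min_count` times. All names and the threshold are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical cut marks: any run of non-Chinese characters (punctuation,
# digits, Latin letters, whitespace) plus a few common function characters.
# The mark set actually used in the thesis is not given in the abstract.
CUT_CHARS = "的了是在和与及也"
CUT_PATTERN = re.compile(r"[^\u4e00-\u9fff]+|[" + CUT_CHARS + "]")


def split_at_cut_marks(text):
    """Break text into runs of Chinese characters that contain no cut mark."""
    return [chunk for chunk in CUT_PATTERN.split(text) if chunk]


def bigram_index_terms(text, min_count=2):
    """Count overlapping character bigrams inside each run and keep the
    ones that occur at least min_count times as candidate index terms."""
    counts = Counter()
    for chunk in split_at_cut_marks(text):
        for i in range(len(chunk) - 1):
            counts[chunk[i:i + 2]] += 1
    return {bigram: n for bigram, n in counts.items() if n >= min_count}
```

No dictionary is consulted, so a repeated name such as 湖南 is picked up purely from its frequency. The same mechanism also keeps boundary-straddling bigrams such as 南大 when they repeat, so a practical system would need additional filtering on top of this sketch.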
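    Building on the segmentation work of Chapter 3, an automatic classification system based on word segmentation is built by combining the KNN, Naive Bayes, and simple vector distance classification algorithms. The system uses the automatic segmentation method based on 2-gram phrase indexing to extract the feature words required by the vector space model, which characterize each document's content, and represents each document as a vector. The segmentation module consists of two parts: preprocessing and segmentation proper. The dimensionality of the vectors is then reduced, which lowers the system's complexity and at the same time improves classification accuracy. Finally, the system is validated on a news corpus of 500 articles downloaded from the Internet, all classified in advance by domain experts into ten categories (politics, economics, military affairs, and so on) according to the Chinese Library Classification. The experimental results demonstrate the effectiveness of the segmentation algorithm.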
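As a rough illustration of two of the classifiers named above, the Python sketch below represents each document as a sparse term-weight vector and classifies it either by similarity to a class centroid (one common reading of the "simple vector distance" method) or by a similarity-weighted vote among the k nearest training documents. The use of cosine similarity, the weighted vote, and all function names are assumptions made for the example. The abstract does not state which similarity measure or voting rule the thesis uses.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centroids(train):
    """train: list of (term_vector, label). Returns label -> mean vector."""
    sums = defaultdict(Counter)
    sizes = Counter()
    for vec, label in train:
        sums[label].update(vec)  # Counter.update adds the weights term by term
        sizes[label] += 1
    return {label: {t: w / sizes[label] for t, w in total.items()}
            for label, total in sums.items()}


def classify_by_centroid(vec, centroids):
    """Simple vector distance: pick the class whose centroid is most similar."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def classify_by_knn(vec, train, k=3):
    """KNN: the k most similar training documents vote, weighted by similarity."""
    nearest = sorted(train, key=lambda item: cosine(vec, item[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for neighbor, label in nearest:
        votes[label] += cosine(vec, neighbor)
    return max(votes, key=votes.get)
```

The centroid classifier compares a new document against one vector per class, while KNN compares it against every training document, which is why reducing the vector dimensionality matters more for KNN as the corpus grows.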
With the development of information technology, and especially the spread of Internet applications, information on the Internet is growing exponentially. How to manage this mass of information automatically, so that large text collections can be maintained, is currently an important research task. One way to manage texts efficiently is to classify them, which is the problem of text classification. Automatic text classification is an important intelligent information processing technique with valuable applications in news classification, electronic conferencing, automatic e-mail classification, and similar fields.
    This thesis analyzes in detail the model construction and methods of Chinese text classification, such as SVM, Boosting, and KNN. Classifying documents requires solving several problems, including obtaining the training documents, establishing the document representation model, and selecting the classification method.
    The thesis gives particular attention to word segmentation for Chinese text classification. It proposes a segmentation method based on 2-gram phrase indexing that combines the segmentation-mark method with a method based on word-frequency statistics, and that can recognize vocabulary that dictionary-based methods cannot handle. Because the information it needs is easy to obtain, the method removes the dependence on dictionaries and on complex segmentation programs, and it can replace dictionary-based mechanical segmentation.
    Finally, an automatic classification system is built that combines the KNN, Naive Bayes, and simple vector distance classification methods, and it validates the effectiveness of the segmentation method.
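For the Naive Bayes component, a minimal Python sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing is given below. Whether the thesis uses the multinomial or the multivariate Bernoulli event model is not stated in the abstract, so the choice here, along with every function name, is an assumption made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (list_of_terms, label). Collects the counts needed for
    class priors and per-class term probabilities."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for terms, label in docs:
        class_docs[label] += 1
        term_counts[label].update(terms)
        vocab.update(terms)
    return class_docs, term_counts, vocab


def classify_naive_bayes(terms, model):
    """Return the label with the highest log posterior. Terms never seen in
    training are skipped; seen terms use add-one smoothing."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)
        for term in terms:
            if term not in vocab:
                continue
            score += math.log((term_counts[label][term] + 1)
                              / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numeric underflow when a document contains many terms, and the smoothing keeps a single unseen term from driving a class probability to zero.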

