Research on Concept-Based Automatic Text Categorization

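Author: Su Weifeng (苏伟峰)
Thesis level: Master's
Discipline: Computer Applications
Keywords (Chinese): 文本分类 ; 文本表示 ; kNN ; 知网 ; 召回率 ; 精确率 ; 义原 ; 可分义原 ; 向量空间 ; 向量
Keywords (English): text categorization ; text representation ; kNN ; HowNet ; recall ; precision ; sememe ; classifiable sememe ; vector space ; vector
Degree year: 2002
Advisor: Li Shaozi (李绍滋)
Discipline code: 081203
Degree-granting institution: Xiamen University (厦门大学)
Thesis submission date: 2002-05-01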

Abstract

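With the rapid growth of the Internet, people are confronted with a flood of information, and the problem of how to manage the information they obtain has become increasingly pressing. Organizing texts by category is a commonly used approach to document management.
     This thesis proposes a concept-based model for the automatic categorization of natural language texts. The model uses HowNet (《知网》) as its main source of conceptual knowledge and takes the concepts denoted by words as the basis for categorization. Concepts are further decomposed into sememes, and text categorization is carried out in the vector space spanned by the classifiable sememes. The model can be outlined as follows. The text categorization system consists of a training module and a categorization module, and sememes are divided into classifiable and unclassifiable sememes. After preprocessing, keywords are extracted from a text according to certain rules. Ambiguous keywords are disambiguated at the concept level according to their part of speech and context. Each keyword is then decomposed into sememes according to the HowNet definition of the concept it denotes, and the unclassifiable sememes are discarded, so that the text is represented as a vector in the classifiable-sememe vector space. Once all texts in the training set have been represented as vectors in this space, similar vectors form text clusters. A text to be categorized is represented as a vector in the same way; its k nearest neighbors in the training set are found, and the category of these neighbors is taken as the category of the text. Experiments show that, compared with keyword-based text categorization, the model achieves better recall and precision, requires less space during categorization, and takes relatively less computation time.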
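The pipeline outlined above (keywords, decomposition into sememes, removal of unclassifiable sememes, kNN in the resulting vector space) can be illustrated with a minimal sketch. It is not the thesis's implementation. The toy lexicon, the set of classifiable sememes, the training texts, and the function names are all hypothetical; real sememe definitions would come from HowNet. Keywords are assumed to be already extracted and disambiguated. Cosine similarity and a simple majority vote are assumed as the nearest-neighbor criterion, since the abstract does not specify the distance measure.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: keyword -> sememes of its (already disambiguated) concept.
# These entries are invented for illustration and are NOT taken from HowNet.
TOY_LEXICON = {
    "doctor": ["human", "occupation", "medical"],
    "hospital": ["institution", "place", "medical"],
    "striker": ["human", "occupation", "sport"],
    "stadium": ["facility", "place", "sport"],
    "bank": ["institution", "place", "finance"],
    "loan": ["affairs", "finance"],
}

# Hypothetical set of classifiable sememes (those assumed to carry domain information).
CLASSIFIABLE = {"medical", "sport", "finance"}


def to_vector(keywords):
    """Represent a text as a sparse count vector over classifiable sememes."""
    counts = Counter()
    for word in keywords:
        for sememe in TOY_LEXICON.get(word, []):
            if sememe in CLASSIFIABLE:  # unclassifiable sememes are discarded
                counts[sememe] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(keywords, training, k=3):
    """Assign the most frequent category among the k most similar training texts."""
    query = to_vector(keywords)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, to_vector(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (keywords of a text, category label).
TRAINING = [
    (["doctor", "hospital"], "health"),
    (["hospital"], "health"),
    (["striker", "stadium"], "sports"),
    (["stadium"], "sports"),
    (["bank", "loan"], "finance"),
]
```

In this sketch, the general sememes such as "human" or "place" never enter the vector, so the dimensionality is bounded by the number of classifiable sememes rather than by the vocabulary size. This is consistent with the abstract's claim of lower space requirements, although the sketch itself does not measure that.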
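     This thesis contributes new ideas in three respects. First, it is the first to divide sememes into classifiable and unclassifiable sememes, and it proposes principles and methods for making this division. The division makes it possible to capture the most important domain characteristics of a concept during text categorization. Second, although the existing literature has proposed representing texts by concepts, those concept representations are all based on synonyms. Decomposing concepts into sememes better reflects the nature of a concept and the relatedness between concepts, and representing a text by sememes better reflects the central meaning the text is meant to convey. Third, it is the first to introduce concept disambiguation into text categorization, and it proposes a new concept disambiguation algorithm.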
With the rapid growth of the Internet, a great deal of information surges toward us.
    How to manage all the information we obtain has become an urgent problem.
    Text categorization (TC) is an important method that people commonly use to deal
    with this problem.
    This paper proposes a new concept-based model for the automatic categorization of
    natural language texts. The model takes HowNet as its main source of knowledge and
    the concepts of words as the basis of text categorization. The concepts of words
    are reduced to sememes, and TC is performed in the Classifiable Sememe Vector
    Space (CSVS). The TC model can be summarized as follows. The TC system is divided
    into two parts: a training part and a categorization part. Sememes are divided into
    classifiable sememes and unclassifiable sememes. Keywords are extracted from the
    text after it has been preprocessed. The keywords are disambiguated according to
    their parts of speech and context. The concepts of the keywords are then reduced to
    sememes according to their definitions in HowNet. As a result, the text is
    represented as a vector in the CSVS after all unclassifiable sememes have been
    removed. Similar texts form a cluster in the CSVS. A new text is represented as
    a vector in the same way, and its k nearest neighbors are found among the vectors
    of the training texts. The most frequent category among those k texts is taken as
    the category of the new text. Experiments show that the recall and precision
    of this TC model are better than those of keyword-based TC models. The model
    also takes less computing time and working space.
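    The recall and precision mentioned above can be computed per category as in the
    following minimal sketch. The function name and the sample labels are hypothetical
    and are not results from the thesis.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Return (precision, recall) for one category.

    precision = correctly assigned to category / all assigned to category
    recall    = correctly assigned to category / all truly in category
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical gold labels and classifier output for four texts.
true_labels = ["sports", "sports", "health", "finance"]
predicted_labels = ["sports", "health", "health", "sports"]
```

    With these sample labels, two texts are assigned to "health" and one of them is
    correct, while the only text that truly belongs to "health" is found, so
    precision is 0.5 and recall is 1.0 for that category.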
    This paper puts forward new ideas in three respects. 1. Sememes are divided
    into classifiable and unclassifiable sememes, and we propose principles and a
    method for identifying classifiable sememes. In this way, we can capture the most
    important domain attributes of a concept. 2. Although some papers use concepts to
    represent a text, those representations are based on synonyms. Reducing a
    concept to sememes represents the nature of the concept more accurately and
    the relatedness between concepts more naturally. As a result, the main idea of a
    text is represented more accurately by sememes. 3. Word sense disambiguation is
    applied to text categorization for the first time, and a new disambiguation
    algorithm is put forward in this paper.

