Research on Text Categorization Based on Feature Selection and Centroid Construction

Author: Xie Hua (谢华)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Text Classification ; Semantic-Based ; Centroid-Based ; Feature Selection ; Cosine Similarity
Degree year: 2010
Advisor: Wang Jian (王健)
Discipline code: 081203
Degree-granting institution: Dalian University of Technology (大连理工大学)
Thesis submission date: 2010-11-01

Abstract

With the development of information technology, the amount of information available to people is growing explosively. Faced with this ever-increasing volume, handling information by manual means alone is becoming more and more difficult, and automated tools are needed to help people manage and filter it. Text categorization is an automated text-processing tool proposed against this background.
     Text categorization assigns each document in a collection to one of a set of predefined categories. With machine learning, the goal is to learn a classifier from labeled examples and then use that classifier to categorize documents automatically; this is a supervised learning problem. Many text categorization methods exist, such as Naive Bayes, k-nearest neighbor, neural networks, centroid-based approaches, and SVM. Text categorization has been widely applied in many fields, for example web resource classification and spam filtering.
     This thesis studies text representation based on rich semantic information and proposes a new centroid-based text categorization method called FSCC. It first introduces the background and the current state of research on text categorization. It then describes the general text categorization pipeline: text representation, classifier selection and training, and evaluation of the classification results. Next, it studies text representation methods based on semantic information and compares them with the traditional bag-of-words (BOW) representation. Finally, building on the traditional centroid-based classification method, it proposes an improved centroid-based method, FSCC. In FSCC, a feature selection method is first used to compute a feature selection value between each feature and each category. A new formula for the weight of a feature in a centroid is then defined from these values, and the centroid vector of each category is built from it. Lastly, a denormalized cosine measure is used to compute the similarity between a document vector and a centroid. Experiments on different corpora show that FSCC significantly outperforms both the classical centroid-based approach and SVM.
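The three FSCC steps described above (feature selection values, centroid construction, denormalized cosine similarity) can be illustrated with a minimal sketch. It is not the thesis's formulation. The abstract names neither the feature selection measure nor the centroid weight formula, so the code assumes the chi-square statistic as the feature selection value and uses that value directly as the centroid weight. It also assumes that the denormalized cosine measure divides the dot product by the document norm only, leaving the centroid unnormalized. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0


def build_centroids(docs, labels):
    """Build one sparse weight vector per class from term/class scores.

    Assumption: the centroid weight of a term is its chi-square score
    with the class; the thesis defines its own weight formula.
    """
    n_docs = len(docs)
    class_size = Counter(labels)
    df = Counter()                      # documents containing each term
    df_in_class = defaultdict(Counter)  # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[label][term] += 1

    centroids = {}
    for label, size in class_size.items():
        weights = {}
        for term, n11 in df_in_class[label].items():
            n10 = df[term] - n11            # term present, other classes
            n01 = size - n11                # term absent, this class
            n00 = n_docs - n11 - n10 - n01  # term absent, other classes
            # Keep only terms positively associated with the class.
            if n11 * n00 > n10 * n01:
                weights[term] = chi_square(n11, n10, n01, n00)
        centroids[label] = weights
    return centroids


def denormalized_cosine(tokens, centroid):
    """Dot product divided by the document norm only (assumed form)."""
    tf = Counter(tokens)
    doc_norm = math.sqrt(sum(count * count for count in tf.values()))
    if doc_norm == 0.0:
        return 0.0
    dot = sum(count * centroid.get(term, 0.0) for term, count in tf.items())
    return dot / doc_norm


def classify(tokens, centroids):
    """Return the class whose centroid is most similar to the document."""
    return max(centroids,
               key=lambda label: denormalized_cosine(tokens, centroids[label]))
```

With tokenized training documents `docs` and their class labels `labels`, `classify(tokens, build_centroids(docs, labels))` returns the predicted class of a new document. Leaving the centroid unnormalized keeps the magnitude of the class weights in the score, which is the usual motivation given for a denormalized measure.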

