Research on a Normal-Vector-Based Feature Selection Algorithm for SVM Text Classification
Abstract
With the rapid growth of the Internet, text classification has become one of the core tasks in organizing online information and a key component of many applications. Compared with other learning algorithms, SVM has shown superior performance in text classification.
     In SVM-based text classification, the vector space in which documents are represented usually has a very high dimensionality, so the training process consumes a large amount of system resources. Under resource constraints it is often impossible to work directly in the original feature space of the documents, which makes an effective feature selection algorithm quite necessary.
     This thesis introduces a feature selection method based on the weights of the normal vector of an SVM model and applies it to SVM-based Chinese text classification. The method provides an effective way to significantly reduce the dimensionality of the feature space while essentially preserving classifier performance, thereby improving the efficiency with which system resources are used. The key points of this work are as follows:
     First, to characterize the consumption of computing resources during SVM training, the concept of "sparsity" is introduced. Sparsity here refers to the average number of non-zero features in the vector representing each text sample. The sparsity of the document vectors directly affects the computational cost, where the resources concerned include both the memory occupied by the sparse vectors and the time spent on computation.
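     As an illustration only (not code from the thesis), the following Python sketch shows how such a sparsity measure could be computed; the toy corpus and the use of scikit-learn's TfidfVectorizer are assumptions made for this example.

        # Hypothetical example: measuring the "sparsity" of a document collection,
        # i.e. the average number of non-zero features per document vector.
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = [
            "support vector machines for text classification",
            "feature selection reduces the dimensionality of the feature space",
            "sparse document vectors save memory and training time",
        ]

        # Documents become rows of a sparse TF-IDF matrix of shape (n_docs, n_features).
        X = TfidfVectorizer().fit_transform(docs)

        # Sparsity as defined above: the mean count of non-zero components per document.
        sparsity = X.getnnz(axis=1).mean()
        print(f"vocabulary size: {X.shape[1]}, "
              f"average non-zero features per document: {sparsity:.1f}")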
     Second, a feature selection method based on the weights of the normal vector is introduced. The method selects a subset of the training data, pre-trains an SVM model on it, takes the weights of the normal vector as the evaluation measure for the features, and then ranks the features accordingly.
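     A minimal sketch of this ranking step, assuming a scikit-learn linear SVM (the function name, parameters, and variable names are illustrative, not the thesis implementation):

        # Hypothetical sketch: rank features by the absolute weights of the normal
        # vector w of a linear SVM pre-trained on a subset of the training data.
        import numpy as np
        from sklearn.svm import LinearSVC

        def normal_weight_ranking(X_subset, y_subset, k):
            """Return the indices of the k features with the largest |w| components."""
            svm = LinearSVC().fit(X_subset, y_subset)   # pre-train on the subset
            # coef_ has shape (1, n_features) for binary problems and
            # (n_classes, n_features) for one-vs-rest multi-class problems;
            # use the largest per-class weight magnitude as the feature score.
            scores = np.abs(svm.coef_).max(axis=0)
            return np.argsort(scores)[::-1][:k]

        # Usage (names illustrative): keep only the top-ranked columns.
        # selected = normal_weight_ranking(X_train_subset, y_train_subset, k=2000)
        # X_train_reduced = X_train[:, selected]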
     Third, under limited computing resources, text classification performance is studied in two settings: using the feature selection algorithm to keep only part of the features while retaining as many training documents as possible, and reducing the number of training documents while retaining as many features as possible.
     Fourth, for the linear SVM classifier, the effect on text classification performance is studied for the normal-vector-based feature selection algorithm and for the traditional feature selection algorithms based on odds ratio and information gain.
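     For reference, a minimal sketch of the information-gain baseline on binary term presence (an assumed textbook-style re-implementation for illustration, not the thesis code):

        # Hypothetical sketch of information gain: score each term by the reduction
        # in class entropy obtained from observing the term's presence or absence.
        import numpy as np

        def information_gain(X_bin, y):
            """X_bin: (n_docs, n_terms) 0/1 term-presence matrix, y: array of class labels.
            Returns one information-gain score per term."""
            def entropy(counts):
                p = counts / counts.sum()
                p = p[p > 0]
                return -(p * np.log2(p)).sum()

            n = len(y)
            _, class_counts = np.unique(y, return_counts=True)
            h_y = entropy(class_counts)                     # H(Y)

            scores = np.empty(X_bin.shape[1])
            for j in range(X_bin.shape[1]):
                present = X_bin[:, j] == 1
                h_cond = 0.0                                # P(t)H(Y|t) + P(not t)H(Y|not t)
                for mask in (present, ~present):
                    if mask.any():
                        _, c = np.unique(y[mask], return_counts=True)
                        h_cond += (mask.sum() / n) * entropy(c)
                scores[j] = h_y - h_cond
            return scores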
     Experiments show that, for the linear SVM classifier, using a feature selection algorithm to keep only part of the features while correspondingly retaining more training documents yields better classification performance than keeping all the features but only part of the training documents, which provides strong support for applying feature selection when resources are limited. At the same time, comparing the classification performance of the normal-vector-based feature selection algorithm with that of the odds-ratio-based and information-gain-based algorithms shows that, for the linear SVM classifier, the normal-vector-based algorithm achieves the best classification performance. The normal-vector-based feature selection algorithm can substantially reduce the consumption of computing resources while essentially maintaining the performance of the resulting classifier, and thus offers a practical approach to SVM text classification under resource constraints.
With the rapid growth of the Internet, text classification has become one of the key tasks in organizing online information and a key component of many applications. Compared with other learning algorithms, the SVM learning algorithm performs better in text classification.
     For text classification based on the SVM learning algorithm, there is usually an abundance of training data, which consumes a large amount of computing resources during training, so classifiers cannot be trained over the full data set when computing resources are limited. In this situation it is important to introduce feature selection methods.
     This thesis introduces a feature selection method based on the weights of the normal vector of an SVM model and applies it to text classification based on the SVM learning algorithm. The method provides an effective way to maintain classification performance while reducing the dimensionality of the feature space, thereby significantly improving the efficiency of computing resource usage. The research covers the following points:
     Firstly, in order to describe the cost of computing resources in the SVM training process, we introduce the concept of "sparsity". Sparsity is defined here as the average number of non-zero components in the vector representation of the data. The sparsity of the vectors directly affects the cost of computing resources, which include both system memory and computation time.
     Secondly, we introduce a feature selection method based on the weights of the normal vector of an SVM model. The method first trains a linear SVM on a subset of the training data to create an initial classifier, and then takes the weights of the normal vector of this model as the measure by which features are ranked.
     Thirdly, when computing resources are limited, we compare text classification performance in two situations: eliminating part of the features by feature selection so as to retain as much training data as possible, and eliminating part of the training data so as to retain as many features as possible.
     Fourthly, for the linear SVM classifier, we explore the performance of the normal-based feature selection method by comparing it with two traditional feature selection methods: odds ratio and information gain.
     Experimental results show that, for the linear SVM classifier, eliminating part of the features by feature selection so as to retain as much training data as possible performs better than eliminating part of the training data so as to retain as many features as possible, which provides strong support for performing feature selection when computing resources are limited. At the same time, compared with the traditional odds ratio and information gain methods, the normal-based method yields better classification performance. This feature selection method thus provides an effective way to maintain classification performance while reducing the dimensionality of the feature space and significantly improving the efficiency of computing resource usage.
