Theoretical Analysis and Algorithm Research on Feature Selection in Text Classification
Abstract
In the past few decades, with the growth of the Internet, the amount of freely retrievable data has increased rapidly, and most of it is text. Text mining has therefore become an active research field, and text classification (also known as text categorization) is one of its important subfields. A large body of text classification algorithms and results has appeared in the literature in recent years. However, as the Internet and application requirements change, many text classification and text processing methods run into difficulties on large-scale data: huge data volumes, too many features, long computation times, noisy data, and low classification accuracy. Features are the basic unit of processing and analysis in data mining, and in theory the computational complexity of text processing methods grows exponentially with the number of features. Dimensionality reduction has therefore become an important research problem in recent years.
     Dimensionality reduction can improve the processing capability of learning algorithms, speed up the learning process, reduce storage consumption, and make the learned model easier to understand; by extracting the most important features, it also helps users understand the data and the relationships within it. Dimensionality reduction methods fall into two broad categories: feature extraction and feature selection. Feature extraction constructs a new set of features from the original ones; the new features represent the characteristics of the data better, are easier to learn from, and can yield better learning performance than the original feature set. Its drawback is a high computational cost: Principal Component Analysis (PCA), for example, takes O(n^3) time, so feature extraction is practical only for datasets or applications with a small number of features and becomes unacceptable when the feature set is large. Feature selection, in contrast, does not transform the original feature set but directly selects a subset of it as the new feature set. Feature selection methods are fast, typically O(n) for filter methods and O(n^2) for wrapper methods, which makes them well suited to large-scale datasets such as text corpora and microarray (gene expression) data.
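     To make the contrast between extraction and selection concrete, the sketch below (illustrative only, not code from this thesis) shows a filter-style selector: each word feature is scored independently against the class labels with the standard chi-square statistic and the k best features are kept, so scoring grows linearly with the number of features, in contrast to the O(n^3) cost of a PCA-style transformation. The document-term matrix layout and the function names are assumptions made for the example.

```python
# Illustrative filter-style feature selection (assumed data layout, not the
# thesis's implementation): score each word feature independently with the
# chi-square statistic, then keep the k best-scoring features.
import numpy as np

def chi_square_scores(X, y):
    """X: (n_docs, n_features) 0/1 document-term matrix; y: class labels.
    Returns one chi-square score per feature (max over classes)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_docs, n_features = X.shape
    scores = np.zeros(n_features)
    for j in range(n_features):                      # one independent pass per feature
        chi = 0.0
        for cls in np.unique(y):
            in_cls = (y == cls)
            A = np.sum(X[in_cls, j])                 # feature present, doc in class
            B = np.sum(X[~in_cls, j])                # feature present, doc not in class
            C = np.sum(in_cls) - A                   # feature absent, doc in class
            D = np.sum(~in_cls) - B                  # feature absent, doc not in class
            denom = (A + C) * (B + D) * (A + B) * (C + D)
            if denom > 0:
                chi = max(chi, n_docs * (A * D - B * C) ** 2 / denom)
        scores[j] = chi
    return scores

def select_top_k(X, y, k):
    """Keep the indices of the k highest-scoring features; the original
    feature values are unchanged, only a subset of them is retained."""
    order = np.argsort(chi_square_scores(X, y))[::-1]
    return np.sort(order[:k])
```

For a bag-of-words matrix X and labels y, select_top_k(X, y, 1000) would return the indices of the 1000 retained word features.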
     This thesis focuses on feature selection for text classification. After surveying classic text classification methods, the prevalent feature selection methods, and feature extraction methods, we propose a graph-based feature selection method. For each word feature and each category, the method computes the stationary distribution of a Markov chain built for that category; it then combines the per-category values with a combination model to obtain an overall score for each feature, ranks the features by this score, and keeps the highest-scoring features as the selected subset. The model analysis shows that graph-based feature selection is well founded, and experiments show that the graph-based method almost always outperforms the CHI-square method and achieves better classification performance than Information Gain (IG), the most popular feature selection method, when the reduced dimensionality is relatively low (about 10 to 1000 features).
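     The abstract does not spell out the thesis's exact graph construction or combination model, so the following is only a hypothetical sketch of the general idea: build a word co-occurrence graph for each category, take the stationary distribution of the induced random walk (computed PageRank-style by power iteration) as that category's feature scores, and combine the per-category scores into one overall score per word, here with a simple size-weighted sum. The input format, the window-based co-occurrence graph, the damping factor, and the weighted-sum combination are all illustrative assumptions.

```python
# Hypothetical sketch of graph-based feature scoring: per-category stationary
# distributions of a random walk on a word co-occurrence graph, combined into
# one overall score per word. Graph, damping, and combination are assumptions.
import numpy as np

def stationary_distribution(A, damping=0.85, iters=100):
    """Power iteration for the stationary distribution of a random walk on the
    weighted adjacency matrix A, with uniform teleportation (PageRank-style)."""
    n = A.shape[0]
    row_sums = A.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0, A / np.maximum(row_sums, 1e-12), 1.0 / n)
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = damping * (pi @ P) + (1.0 - damping) / n
    return pi / pi.sum()

def graph_based_scores(docs_by_category, vocab, window=5):
    """docs_by_category: {category: list of token lists} (assumed input format).
    Builds one co-occurrence graph per category over `vocab`, scores each word
    by the stationary distribution, and combines the per-category scores with
    a category-size-weighted sum (one possible combination model)."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    total_docs = sum(len(docs) for docs in docs_by_category.values())
    overall = np.zeros(n)
    for docs in docs_by_category.values():
        A = np.zeros((n, n))
        for tokens in docs:
            ids = [index[t] for t in tokens if t in index]
            for i, u in enumerate(ids):              # link words that co-occur
                for v in ids[i + 1:i + 1 + window]:  # within a small window
                    A[u, v] += 1.0
                    A[v, u] += 1.0
        overall += (len(docs) / total_docs) * stationary_distribution(A)
    return overall
```

Ranking the vocabulary by these scores and keeping the top 10-1000 words would reproduce the low-dimensional setting in which the method is compared against CHI-square and IG above.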
