Research on High-Dimensional Feature Selection Methods Based on the Minimum Description Length Principle
Abstract
Feature selection for high-dimensional data, also known as the sparse modeling problem, is one of the most active topics in current machine learning research. It aims to address the fact that most existing feature-modeling methods fail to produce meaningful results in high-dimensional feature spaces. The dominant methodology is regularization of the model parameters, most notably L1-penalized and L0-penalized methods. However, the prevalent L1-norm approaches share inherent theoretical drawbacks: they lack the ability to select grouped correlated features, and they cannot select more features than the number of samples. Traditional L0-penalized methods, in turn, are prone to over-fitting in sparse modeling practice, largely because the constraints they place on model complexity are inadequate.
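For orientation, the two families of penalized objectives contrasted above can be written in their standard textbook form (this notation is supplied here for reference and is not quoted from the thesis): with design matrix $X \in \mathbb{R}^{n \times p}$, response $y$, coefficient vector $\beta$ and regularization weight $\lambda > 0$,
\[
\hat{\beta}_{\ell_0} \;=\; \arg\min_{\beta}\; \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_0,
\qquad
\hat{\beta}_{\ell_1} \;=\; \arg\min_{\beta}\; \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1,
\]
where $\lVert \beta \rVert_0$ counts the nonzero coefficients (best-subset selection) and $\lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert$ is its convex lasso relaxation.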
     Recent theoretical research has revealed that stepwise regression with L0 regularization can perform better than L1 algorithms in sparse settings. Building on this, we propose three novel L0-penalized high-dimensional feature selection approaches derived from the minimum description length (MDL) principle, namely:
     1. An easily computable feature evaluation criterion, derived from the Fisher information approximation to the stochastic complexity criterion and simplified by imposing a Gaussian assumption on the model parameters (the classical form of this approximation is written out after the list). Based on this simplified criterion, we propose a stepwise regression approach for sparse modeling. Numerical results on synthetic data and publicly available microarray datasets show that the proposed approach outperforms prevalent L1-penalized methods as well as recently reported state-of-the-art alternatives.
     2. A biased risk inflation criterion for feature selection, obtained by combining an L2 penalty on the model parameters with the risk inflation criterion (RIC), which extends the RIC from low-dimensional to high-dimensional feature spaces (the standard RIC penalty is also sketched after the list). Based on this criterion, we propose another stepwise regression approach for choosing among nested parametric models. Empirical studies on the same synthetic and gene-selection benchmarks again demonstrate its superiority over the popular alternatives tested above.
     3. A biased minimum description length criterion, proposed with the intention of borrowing strength from the two approaches above and thereby constructing a more general solution to the sparse modeling problem. By adding a Tikhonov-type penalty on the model parameters to the stochastic complexity criterion, the resulting method relaxes the theoretical restrictions of the pure MDL approach; a generic stepwise-selection skeleton of this kind is sketched in code after the list. Experimental results on synthetic datasets, real microarray datasets and real image spam datasets show that the proposed method handles sparse modeling tasks efficiently, outperforms prevalent L1-penalized methods and other alternatives in all tests, and performs best among the three models proposed in this work.
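For reference, the two classical criteria underlying items 1 and 2 are quoted here in their standard textbook forms, not in the modified forms derived in the thesis. Rissanen's Fisher-information approximation to the stochastic complexity of data $x^n$ under a $k$-parameter model with maximum-likelihood estimate $\hat\theta$ and per-sample Fisher information matrix $I(\theta)$ is
\[
\mathrm{SC}(x^n) \;\approx\; -\log f(x^n \mid \hat\theta) \;+\; \frac{k}{2}\log\frac{n}{2\pi} \;+\; \log \int_{\Theta} \sqrt{\det I(\theta)}\; d\theta ,
\]
and the risk inflation criterion of Foster and George selects the feature subset of size $k$ minimizing
\[
\mathrm{RIC} \;=\; \mathrm{RSS}_k \;+\; 2\,\sigma^2\, k \log p ,
\]
where $\mathrm{RSS}_k$ is the residual sum of squares, $\sigma^2$ the noise variance, and $p$ the number of candidate features.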
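The first two models are explicitly stepwise regression procedures: repeatedly add the feature whose inclusion most reduces a penalized description-length-style score, and stop when no addition helps. The sketch below illustrates only that generic skeleton; it assumes an ordinary least-squares fit, a RIC-style log(p) complexity charge, and illustrative function names of our own, not the exact criteria derived in the thesis.

import numpy as np

def penalized_score(X, y, support, penalty_per_param):
    # Data-fit term: (n/2) * log(RSS/n), i.e. the Gaussian negative log-likelihood up to
    # constants, plus a generic complexity charge per selected parameter.  The thesis's
    # MDL/RIC criteria are more refined; this is only an illustrative stand-in.
    n = len(y)
    if support:
        Xs = X[:, sorted(support)]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
    else:
        rss = float(np.sum((y - y.mean()) ** 2))
    return 0.5 * n * np.log(max(rss, 1e-12) / n) + penalty_per_param * len(support)

def forward_stepwise(X, y, penalty_per_param):
    # Greedy forward selection: add the feature that most reduces the penalized score,
    # and stop as soon as no single addition improves it.
    p = X.shape[1]
    support = set()
    best = penalized_score(X, y, support, penalty_per_param)
    while True:
        best_j = None
        for j in range(p):
            if j in support:
                continue
            score = penalized_score(X, y, support | {j}, penalty_per_param)
            if score < best:
                best, best_j = score, j
        if best_j is None:
            return sorted(support)
        support.add(best_j)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, k_true = 60, 200, 3                      # many more features than samples
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:k_true] = 2.0
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    # RIC-style complexity charge: roughly log(p) per selected feature.
    print(forward_stepwise(X, y, penalty_per_param=np.log(p)))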
     These theoretical studies and the corresponding experimental results show that L0-penalized regularization methods are not only applicable in high-dimensional feature spaces but can also achieve better sparse-modeling performance than L1-norm methods, lending further support to the claim that stepwise regression with L0 regularization outperforms L1 algorithms in sparse settings. It is the author's hope that the proposed models and conclusions will offer a new perspective on high-dimensional feature selection and encourage further investigations along the lines suggested.