Lp Regularization Methods for Feature Selection in Support Vector Machines
Abstract
Feature selection is an important research topic in machine learning. It removes redundant and noisy features from a dataset and yields a compact feature subset with stronger discriminative power, which helps avoid overfitting during learning, improves the generalization ability and interpretability of the model, reduces the amount of data that must be collected and stored, and saves training and prediction time.
     Lp regularization methods occupy an important place in feature selection and have become a hot research topic. The L2 norm used in the standard support vector machine (SVM) has no feature selection ability. To perform feature selection while learning the classification model, L0-norm or L1-norm regularization is commonly used. However, the L0-SVM is a combinatorial optimization problem that is hard to solve, and the L1-SVM tends to be insufficiently sparse, so the Lp-SVM (0 < p < 1), which lies between the two, has attracted much attention. The main work of this thesis is as follows.
     1. We reformulate the Lp-SVM (0 < p < 1) as an optimization model with a linear objective function and smooth constraints (LOSC-SVM), so that it can be solved by efficient numerical methods for smooth constrained optimization; we establish the equivalence between the LOSC-SVM and the Lp-SVM and show some good properties of the LOSC-SVM (a standard form of the Lp-SVM model and one reformulation of this kind are sketched after this abstract).
     2. Extensive computational studies have shown that L1/2 regularization can serve as a representative of Lp (0 < p < 1) regularization. Experiments on artificial data show that, compared with the L0-SVM and the L1-SVM, the L1/2-SVM identifies relevant and non-redundant features more accurately. Experiments on real data show that the L1/2-SVM achieves more accurate classification than the L0-SVM and sparser feature selection than the L1-SVM.
     3. We study a penalty successive linear programming (PSLP) algorithm for solving the L1/2-SVM. The algorithm approaches the optimal solution through a sequence of linear programs and is suitable for large-scale problems with many variables and constraints. We apply the PSLP algorithm to gene expression datasets, which are characterized by high dimensionality, small sample sizes, heavy noise, and heavy redundancy. Numerical results show that the PSLP algorithm is more accurate than the FSV algorithm for the L0-SVM. Compared with the L1-SVM, the PSLP algorithm not only selects fewer feature genes but also achieves better or comparable classification results. We also report the ten most frequently selected genes in each dataset as a reference for further biological research.
     4. We analyze theoretically the feature selection ability of the Lp-regularized support vector machine. We first analyze the possibility of feature selection for specific data and show that whether an SVM performs feature selection depends not only on the norm used in the objective function but also on the data itself. We then derive a probability formula for measuring the feature selection ability of the SVM and use it to compute the feature selection probability of the Lp-SVM for different values of p. The computations show that a smaller regularization order p helps to enhance the feature selection ability of the Lp-SVM.
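For reference, the display below gives one common primal form of the linear Lp-SVM discussed above, together with one way of rewriting it with a linear objective and smooth constraints. It is a sketch of the general construction under standard SVM notation, not necessarily the exact LOSC-SVM model of the thesis; here x_i are the training samples, y_i in {-1, +1} their labels, w the weight vector, b the bias, xi the slack variables, and C > 0 the penalty parameter.

\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \sum_{j=1}^{n} |w_j|^{p} + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y_i\bigl(w^{\top} x_i + b\bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1,\dots,m, \qquad 0 < p < 1.
\end{aligned}

Introducing variables t_j \ge |w_j|^{p}, equivalently |w_j| \le t_j^{1/p} with t_j \ge 0, yields an equivalent problem whose objective is linear and whose constraint functions are continuously differentiable (because 1/p > 1):

\begin{aligned}
\min_{w,\,b,\,\xi,\,t}\quad & \sum_{j=1}^{n} t_j + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y_i\bigl(w^{\top} x_i + b\bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1,\dots,m, \\
& -\,t_j^{1/p} \le w_j \le t_j^{1/p}, \qquad t_j \ge 0, \quad j = 1,\dots,n.
\end{aligned}

At any optimum t_j = |w_j|^{p}, so the two problems share the same optimal (w, b).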
Feature selection plays an important role in machine learning. Feature selection eliminates redundant or irrelevant features, which results in a simplified but more informative feature subset. The advantages of feature selection include preventing overfitting and improving generalization performance, improving data interpretability and visualization, reducing the computation cost, and reducing data measurement and storage requirements.
     We study feature selection strategies for the support vector machine. Recently, the so-called Lp-SVM (0 < p < 1) has attracted much attention. The main contributions of this thesis are as follows.
     1. We reformulate the Lp-SVM into an optimization model with a linear objective function and smooth constraints (LOSC-SVM), so that it can be solved by efficient numerical methods for smooth constrained optimization. We establish the equivalence between the LOSC-SVM and the Lp-SVM, and show some good properties of the LOSC-SVM. This reformulation opens a new path for developing numerical methods for the Lp-SVM. Numerical experiments on artificial datasets illustrate the effectiveness of the LOSC-SVM (0 < p < 1).
     2. Recent numerical studies have shown that the L1/2 regularization problem can be regarded as a representative of the Lp (0 < p < 1) regularization problems. Experiments on artificial data show that, compared with the L0-SVM and the L1-SVM, the L1/2-SVM identifies relevant and non-redundant features more accurately; experiments on real data show that the L1/2-SVM obtains more accurate classification results than the L0-SVM and sparser feature selection than the L1-SVM.
     3. We introduce a penalty successive linear programming (PSLP) algorithm to solve the L1/2-SVM (an illustrative sketch of this kind of iteration is given after this abstract). The PSLP can be applied to problems as large as the LP code can handle and is especially efficient on highly constrained problems. Under mild conditions, we establish the convergence of the PSLP algorithm. We explore the application of the L1/2-SVM to gene selection for cancer classification. Gene expression data are characterized by high dimensionality, heavy redundancy, heavy noise, and severe class imbalance. Numerical results on three well-known gene expression datasets support the effectiveness of the proposed method compared with other commonly used regularized SVMs.
     4. To our knowledge, there is still no formal proof or comprehensive mathematical understanding of how Lp regularization brings about feature selection. We analyze theoretically the feature selection ability of the Lp-SVM. We first discuss the possibility of feature selection for specified data. The results show that feature selection depends not only on the parameter p but also on the data itself. Then we provide a formula for computing the probabilities that measure the feature selection ability. Based on this formula, we compute the probabilities for several popular values of p. The computations show that a small p really helps to enhance the feature selection ability of the Lp-SVM.
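To make the successive-linear-programming idea of item 3 concrete, the following Python sketch fits an L1/2-penalized linear SVM by a sequence of weighted-L1 subproblems, each of which is a linear program solved with scipy.optimize.linprog. It is a minimal illustration of approaching the nonconvex L1/2 penalty through LPs, not the thesis's PSLP algorithm; the function name l_half_svm_slp, the damping constant eps, and the stopping rule are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog


def l_half_svm_slp(X, y, C=1.0, eps=1e-6, max_iter=20, tol=1e-5):
    """Approximately solve
        min   sum_j |w_j|^(1/2) + C * sum_i xi_i
        s.t.  y_i (w^T x_i + b) >= 1 - xi_i,   xi_i >= 0,
    by linearizing each |w_j|^(1/2) around the current iterate, so that every
    subproblem is a weighted-L1 SVM, i.e. a linear program."""
    m, n = X.shape
    # LP variables: [u (n), v (n), b (1), xi (m)] with w = u - v and u, v >= 0.
    bounds = [(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * m
    # Margin constraints  y_i((u - v)^T x_i + b) + xi_i >= 1  rewritten as
    # -y_i x_i^T u + y_i x_i^T v - y_i b - xi_i <= -1.
    Yx = y[:, None] * X
    A_ub = np.hstack([-Yx, Yx, -y[:, None], -np.eye(m)])
    b_ub = -np.ones(m)

    w, b = np.zeros(n), 0.0
    lam = np.ones(n)                      # first subproblem is a plain L1-SVM
    for _ in range(max_iter):
        c = np.concatenate([lam, lam, [0.0], C * np.ones(m)])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        if not res.success:
            break
        w_new, b = res.x[:n] - res.x[n:2 * n], res.x[2 * n]
        done = np.linalg.norm(w_new - w) <= tol * (1.0 + np.linalg.norm(w))
        w = w_new
        if done:
            break
        # Reweighting: the slope of t -> t^(1/2) at |w_j| is 0.5/sqrt(|w_j|),
        # damped by eps so that coefficients at zero are not frozen forever.
        lam = 0.5 / np.sqrt(np.abs(w) + eps)
    return w, b


if __name__ == "__main__":
    # Tiny synthetic check: only the first two features are informative.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 20))
    y = np.sign(X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(60))
    w, b = l_half_svm_slp(X, y, C=1.0)
    print("selected features:", np.flatnonzero(np.abs(w) > 1e-4))

The first subproblem uses unit weights and is therefore an ordinary L1-SVM; the subsequent reweighting steps shrink small coefficients more aggressively, which is what gives the L1/2 penalty its extra sparsity over L1.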