Variable Selection Ensemble Methods (变量选择集成方法)
  • Authors: ZHANG Chun-xia; LI Jun-li (School of Mathematics and Statistics, Xi'an Jiaotong University)
  • Keywords: high-dimensional data analysis; variable selection; linear regression model; ensemble learning; stability
  • Journal: Chinese Journal of Engineering Mathematics (工程数学学报)
  • Publication date: 2019-02-15
  • Year: 2019
  • Volume: 36
  • Pages: 5-21 (17 pages)
  • Fund: National Natural Science Foundation of China (11671317; 61572393)
  • Language: Chinese
  • CN: 61-1269/O1
Abstract
With the emergence of massive high-dimensional data in many research and application fields, it is crucial to mine valuable information by exploiting the sparsity of such data. As an effective tool for building interpretable models and improving the accuracy of statistical inference and prediction, variable selection plays an increasingly important role in the analysis of high-dimensional data. Because ensemble learning can significantly improve selection accuracy, alleviate the instability of traditional selection procedures, and reduce the chance of falsely including noise variables, variable selection ensemble (VSE) methods have attracted considerable research interest in recent years. To provide a systematic reference for researchers in related fields, this paper presents a detailed survey of existing VSEs, categorizes them into two classes according to the strategies used to construct the ensemble, and analyzes the main characteristics of each class. Numerical experiments are also carried out to compare the variable selection and prediction performance of representative VSE techniques. Finally, several directions in which VSEs deserve further study are discussed.
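The core idea surveyed above can be illustrated with a small sketch: run a base selector (here the lasso) on many bootstrap resamples of the data and aggregate how often each variable is selected, in the spirit of stability selection and Bolasso. This is a minimal, illustrative assumption of a VSE, not the specific algorithm of any method reviewed in the paper; the threshold 0.8 and all other parameters are arbitrary choices for the example.

```python
# Illustrative variable selection ensemble: lasso on bootstrap resamples,
# aggregated into per-variable selection frequencies. All settings here
# (alpha, n_boot, the 0.8 threshold) are assumptions for demonstration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated sparse linear model: only the first 3 of 20 variables matter.
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.standard_normal(n)

def selection_frequencies(X, y, alpha=0.1, n_boot=50, rng=rng):
    """Fraction of bootstrap replicates in which each variable is selected."""
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of the rows
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts += coef != 0               # count variables with nonzero coefficients
    return counts / n_boot

freq = selection_frequencies(X, y)
selected = np.where(freq >= 0.8)[0]       # keep consistently selected variables
print(selected)
```

Averaging over resamples is what gives ensembles their stabilizing effect: a noise variable picked up by one lucky lasso fit rarely survives the frequency threshold, while genuinely relevant variables are selected in nearly every replicate.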
