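Abstract
When the number of variables in a data set far exceeds the number of samples, collinearity among the variables becomes especially pronounced. Partial least squares (PLS), a latent-variable method, converts the original variables through linear combinations into a few new latent variables that are used to model and explain the response variable, but the complex collinearity among the variables makes variable selection very difficult. This paper uses a principal factor approximation method to separate out the collinearity information among the original variables and then performs variable selection. Simulation studies show that the principal factor approximation method can effectively improve the accuracy of variable selection.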
Collinearity among variables becomes particularly acute when the number of variables far exceeds the number of samples. As a latent-variable method, partial least squares (PLS) transforms the original variables into a few new factors by linear combination, and these factors are used to model and interpret the response variable. However, the complex correlation structure of the sample data makes variable selection a difficult task. In this paper, we introduce a principal factor approximation (PFA) method to directly eliminate the effect of sample correlation on the observed values of the regression coefficients. Simulation studies were performed under three typical sample data correlation structures, and the results showed that PFA and PLS perform comparably well.
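The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure, which the abstract does not spell out. The sketch assumes that the common (collinear) part of the predictors can be approximated by the leading principal components of the centred data matrix, and that the latent variables are built with the standard NIPALS recursion for a single response (PLS1). The function names `remove_leading_factors` and `pls1_scores`, the simulated data, and the choices of one factor and three components are all hypothetical.

```python
import numpy as np


def remove_leading_factors(X, k):
    """Subtract the rank-k SVD approximation from column-centred X.

    Assumption for illustration: the k leading principal components
    stand in for the common factors that drive collinearity.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    common = (U[:, :k] * s[:k]) @ Vt[:k]  # part explained by k factors
    return Xc - common


def pls1_scores(X, y, n_components):
    """Build PLS1 latent variables (scores) with the NIPALS recursion."""
    Xk = X - X.mean(axis=0)  # centre predictors
    yk = y - y.mean()        # centre response
    n, p = Xk.shape
    T = np.zeros((n, n_components))  # scores: the new latent variables
    W = np.zeros((p, n_components))  # unit-norm weight vectors
    for a in range(n_components):
        w = Xk.T @ yk                  # direction of maximal covariance
        w /= np.linalg.norm(w)
        t = Xk @ w                     # linear combination of variables
        tt = t @ t
        loading = Xk.T @ t / tt
        q = yk @ t / tt
        Xk = Xk - np.outer(t, loading)  # deflate predictors
        yk = yk - q * t                 # deflate response
        T[:, a] = t
        W[:, a] = w
    return T, W, yk


# Hypothetical simulated data with far more variables than samples
# (p = 200, n = 30) and one shared factor that induces collinearity.
rng = np.random.default_rng(0)
n, p = 30, 200
f = rng.normal(size=n)
X = np.outer(f, rng.normal(size=p)) + 0.5 * rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n)

X_adj = remove_leading_factors(X, k=1)
T, W, y_res = pls1_scores(X_adj, y, n_components=3)
print(T.shape, W.shape)
```

In this sketch the columns of `T` are mutually orthogonal by construction, and the magnitudes of the entries of `W` could serve as one possible ranking of the variables. Whether the paper ranks variables this way is not stated in the abstract.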