Variable Selection Methods for Statistical Models of Survival Data
Abstract
Survival data arise widely in biomedicine, economics and finance, actuarial science, reliability engineering, and other fields. Because such data are usually censored, classical statistical methods for complete data are generally not applicable, and the statistical analysis of survival data remains an active research topic. Moreover, in many practical problems more than one failure time is observed for an individual, which gives rise to multivariate survival time data. A key feature of this type of data is that the survival times may be dependent within the same subject or cluster. This complex dependence, together with censoring, makes inference nontrivial. Nevertheless, because of its wide practical use, the statistical analysis of multivariate survival time data has attracted growing attention.
With the development of modern technology, massive data sets are encountered in many fields, especially bioinformatics, aerospace, artificial intelligence, and electronic commerce. Such data are typically very high dimensional and noisy, and extracting useful information from them is a fundamental problem. As an important tool for extracting information, variable selection has received great attention from statisticians. However, classical variable selection methods may fail completely on such high dimensional data, and many improved methods have therefore been proposed. Among them, the most popular are the regularization methods, such as the LASSO, SCAD, and MCP. In the framework of survival data, including multivariate survival time data, this dissertation addresses three questions about regularized variable selection: first, how to select important variables when the covariates have a group structure; second, how to carry out variable selection in the ultrahigh dimensional setting p ≫ n, where p is the number of covariates and n is the sample size; third, how to identify important variables in semiparametric regression models.
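For readers unfamiliar with the three penalties named above, the following is a minimal sketch of their standard scalar forms as defined in the cited literature (LASSO [81], SCAD [31], MCP [95]). The function names and the default tuning constants `a=3.7` and `gamma=3.0` are illustrative choices, not values taken from the dissertation.

```python
def lasso_penalty(beta, lam):
    """LASSO penalty: lam * |beta| (grows linearly without bound)."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li; requires a > 2.

    Linear near zero, quadratic in the middle, constant for large |beta|.
    """
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    return lam ** 2 * (a + 1) / 2


def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax concave penalty of Zhang; requires gamma > 1.

    Concave up to gamma * lam, then constant.
    """
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b ** 2 / (2 * gamma)
    return gamma * lam ** 2 / 2
```

All three agree near zero, where they behave like `lam * |beta|`. Unlike the LASSO, SCAD and MCP level off for large coefficients, which is why they shrink large effects less.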
In Chapter 2, we discuss variable selection in the additive hazards model when the covariates have a group structure. The aim is to identify simultaneously the important groups and the important variables within each group. To this end, we consider a hierarchical penalty method. For the case of a diverging number of covariates, we establish the large sample properties of the proposed estimator. Numerical results indicate that, when the covariates have a group structure, the hierarchically penalized method outperforms existing methods such as the LASSO, SCAD, and adaptive LASSO. Finally, we analyze a gene expression data set using the proposed method.
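As background for the setting of Chapter 2, the additive hazards model of Lin and Ying [66] and the least-squares-type loss usually penalized in this literature can be written as follows. The hierarchical reparametrization in the last display is an assumed form borrowed from the hierarchically penalized Cox regression of [88]; the abstract does not state the exact penalty used in the dissertation. The symbols K, p_k, \lambda_\gamma and \lambda_\theta are introduced here for illustration only.

```latex
% Additive hazards model: baseline hazard plus a linear covariate effect
\lambda(t \mid Z_i) = \lambda_0(t) + \beta^{\top} Z_i

% Risk-set average of the covariates, with at-risk indicators Y_j(t)
\bar Z(t) = \frac{\sum_{j=1}^{n} Y_j(t)\, Z_j}{\sum_{j=1}^{n} Y_j(t)}

% Least-squares-type loss; its gradient V\beta - b is, up to sign and scale,
% the Lin-Ying estimating function (N_i is the counting process of subject i)
L(\beta) = \tfrac{1}{2}\,\beta^{\top} V \beta - b^{\top}\beta, \qquad
V = \frac{1}{n}\sum_{i=1}^{n}\int_0^{\tau} Y_i(t)\,\{Z_i - \bar Z(t)\}^{\otimes 2}\,dt, \qquad
b = \frac{1}{n}\sum_{i=1}^{n}\int_0^{\tau} \{Z_i - \bar Z(t)\}\,dN_i(t)

% Assumed hierarchical penalty: group-level factor \gamma_k, within-group factor \theta_{kj}
\beta_{kj} = \gamma_k\,\theta_{kj}, \quad \gamma_k \ge 0, \qquad
\min_{\gamma,\theta}\; L(\beta) + \lambda_\gamma \sum_{k=1}^{K} \gamma_k
  + \lambda_\theta \sum_{k=1}^{K}\sum_{j=1}^{p_k} |\theta_{kj}|
```

Under this form, setting \gamma_k = 0 removes the whole k-th group, while a zero \theta_{kj} removes a single variable inside a retained group.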
In Chapter 3, we study the large sample properties of a class of nonconcave penalized procedures for the additive hazards model when the dimension of the covariates may grow nonpolynomially with the sample size n, namely p = O(exp(n^δ)) with δ > 0. Under a condition similar to the Irrepresentable Condition of Zhao and Yu [97], we prove that the proposed estimator possesses the strong oracle property. Interestingly, this property also holds for the LASSO. In addition, we establish the asymptotic normality of the nonconcave penalized estimator; this result does not cover the LASSO penalty.
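To make the Irrepresentable Condition concrete, the sketch below checks its original linear-model version from Zhao and Yu [97]: with C the covariance (Gram) matrix of the covariates and S the set of truly active variables, the quantity max over j outside S of |C[j,S] C[S,S]^{-1} sign(beta_S)| should stay below 1. The dissertation uses an analogous condition for the additive hazards model whose exact form is not given in the abstract, so this is only an illustration. The function names and the two example matrices are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def irrepresentable_gap(C, support, signs):
    """Return max_j |C[j,S] C[S,S]^{-1} signs| over indices j outside S.

    C       : symmetric positive definite matrix (list of lists)
    support : indices of the active variables S
    signs   : signs (+1 or -1) of the active coefficients, in the same order
    """
    S = list(support)
    C11 = [[C[i][j] for j in S] for i in S]
    w = solve(C11, signs)  # w = C[S,S]^{-1} sign(beta_S)
    rest = [j for j in range(len(C)) if j not in S]
    return max(abs(sum(C[j][s] * w[k] for k, s in enumerate(S))) for j in rest)


# Equicorrelated design (rho = 0.5): the condition holds.
C_ok = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]

# Inactive variable 2 is correlated 0.6 with both active variables: it fails.
C_bad = [[1.0, 0.0, 0.6],
         [0.0, 1.0, 0.6],
         [0.6, 0.6, 1.0]]

gap_ok = irrepresentable_gap(C_ok, [0, 1], [1, 1])
gap_bad = irrepresentable_gap(C_bad, [0, 1], [1, 1])
```

A value below 1 means the inactive variables cannot be "represented" too well by the active ones, which is the situation in which LASSO-type selection consistency results are typically stated.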
In Chapters 4 and 5, we study variable selection in the partially linear varying-coefficient marginal hazards model and the partially linear marginal hazards model, respectively, for multivariate survival time data. For the parametric part, covariates are selected and estimated mainly through the idea of one-step backfitting. For the nonparametric part, the important functions are identified through hypothesis testing. Under some regularity conditions, we obtain the oracle properties of the corresponding estimators. Simulation results demonstrate that the proposed methods perform well in variable selection. Finally, we apply these methods to the analysis of colon cancer data.
References
[1] Aalen, O. O. (1980). A model for regression analysis of counting processes. In Mathematical statistics and probability theory (pp. 1-25). New York: Springer.
[2] Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993). Statistical models based on counting processes. New York: Springer.
[3] Antoniadis, A. and Fan, J. (2001). Regularization of wavelet approximations. Journal of the American Statistical Association, 96, 939-967.
[4] Atkinson, K., Storb, R., Prentice, R. L., Weiden, P. L., Witherspoon, R. P., Sullivan, K. and Thomas, E. D. (1979). Analysis of late infections in 89 long-term survivors of bone marrow transplantation. Blood, 53, 720-731.
[5] Bhatia, R. (1997). Matrix analysis. New York: Springer.
[6] Bickel, P. J., Ritov, Y. A. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37, 1705-1732.
[7] Bradic, J., Fan, J. and Jiang, J. (2011). Regularization for Cox's proportional hazards model with NP-dimensionality. Annals of Statistics, 39, 3092-3120.
[8] Bradic, J., Fan, J. and Wang, W. (2011). Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. Journal of the Royal Statistical Society Series B, 73, 325-349.
[9] Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37, 373-384.
[10] Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350-2383.
[11] Buhlmann, P. L. and van de Geer, S. (2011). Statistics for high-dimensional data. New York: Springer.
[12] Byar, D. P. (1980). The Veterans Administration study of chemoprophylaxis for recurrent stage I bladder tumors: comparisons of placebo, pyridoxine, and topical thiotepa. Bladder tumors and other topics in urological oncology, 18, 363-370.
[13] Cai, J. and Prentice, R. L. (1995). Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika, 82, 151-164.
[14] Cai, J., Fan, J., Li, R. and Zhou, H. (2005). Variable selection for multivariate failure time data. Biometrika, 92, 303-316.
[15] Cai, J., Fan, J., Zhou, H. and Zhou, Y. (2007). Marginal hazard models with varying-coefficients for multivariate failure time data. Annals of Statistics, 35, 324-354.
[16] Cai, J., Fan, J., Jiang, J. and Zhou, H. (2007). Partially linear hazard regression for multivariate survival data. Journal of the American Statistical Association, 102, 538-551.
[17] Cai, J., Fan, J., Jiang, J. and Zhou, H. (2008). Partially linear hazard regression with varying-coefficients for multivariate survival data. Journal of the Royal Statistical Society Series B, 70, 141-158.
[18] Candes, E. J. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Annals of Statistics, 35, 2313-2404.
[19] Carroll, R., Fan, J., Gijbels, I. and Wand, M. (1997). Generalized partially linear single-index models. Journal of the American Statistical Association, 92, 477-489.
[20] Chen, K., Guo, S., Sun, L. and Wang, J. (2010). Global partial likelihood for nonparametric proportional hazards models. Journal of the American Statistical Association, 105, 750-760.
[21] Chen, Y. Q. and Wang, M. C. (2000). Analysis of accelerated hazards models. Journal of the American Statistical Association, 95, 608-618.
[22] Cook, R. J. and Lawless, J. F. (2007). The statistical analysis of recurrent events. New York: Springer.
[23] Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society Series B, 34, 187-220.
[24] Clayton, D. and Cuzick, J. (1985). Multivariate generalizations of the proportional hazards model. Journal of the Royal Statistical Society Series A, 148, 82-117.
[25] Dave, S. S., Wright, G., Tan, B., et al. (2004). Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. New England Journal of Medicine, 351, 2159-2169.
[26] Diabetic Retinopathy Study Research Group. (1981). Diabetic retinopathy study. Investigative Ophthalmology and Visual Science, 21, 149-208.
[27] Donoho, D. L. (2000). High-dimensional data analysis: the curses and blessings of dimensionality. Aide-Memoire of a Lecture at AMS Conference on Math Challenges of the 21st Century.
[28] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407-499.
[29] Fan, J. and Gijbels, I. (1996). Local polynomial modelling and its applications. London: Chapman and Hall.
[30] Fan, J., Gijbels, I. and King, M. (1997). Local likelihood and local partial likelihood in hazard regression. Annals of Statistics, 25, 1661-1690.
[31] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.
[32] Fan, J. and Li, R. (2002). Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics, 30, 74-99.
[33] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32, 928-961.
[34] Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: feature selection in knowledge discovery. Proceedings of the International Congress of Mathematicians, 3, 595-622.
[35] Fan, J., Lin, H. and Zhou, Y. (2006). Local partial likelihood estimation for life time data. Annals of Statistics, 34, 290-325.
[36] Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20, 101-148.
[37] Fan, J. and Lv, J. (2011). Non-concave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 57, 5467-5484.
[38] Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Annals of Statistics, 29, 153-193.
[39] Fleming, T. R. and Harrington, D. P. (1991). Counting processes and survival analysis. John Wiley & Sons.
[40] Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35, 109-148.
[41] Friedman, J., Hastie, T., Hofling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1, 302-332.
[42] Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7, 397-416.
[43] Gaiffas, S. and Guilloux, A. (2012). High dimensional additive hazards models and the lasso. Electronic Journal of Statistics, 6, 522-546.
[44] Gentleman, R. and Crowley, J. (1991). Local full likelihood estimation for the proportional hazards model. Biometrics, 47, 1283-1296.
[45] Gorst-Rasmussen, A. and Scheike, T. (2012). Coordinate descent methods for the penalized semiparametric additive hazards model. Journal of Statistical Software, 47, 1-17.
[46] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.
[47] Hardle, W., Gao, J. and Liang, H. (2000). Partially linear models. Heidelberg: Springer.
[48] Heckman, N. (1986). Spline smoothing in a partly linear model. Journal of the Royal Statistical Society Series B, 48, 244-248.
[49] Huang, J. (1999). Efficient estimation of the partly linear additive Cox model. Annals of Statistics, 27, 1536-1563.
[50] Huang, J., Horowitz, J. L. and Wei, F. (2010). Variable selection in nonparametric additive models. Annals of Statistics, 38, 2282-2313.
[51] Huang, J., Ma, S., Xie, H. and Zhang, C. (2009). A group bridge approach for variable selection. Biometrika, 96, 339-355.
[52] Huang, J., Sun, T., Ying, Z., Yu, Y. and Zhang, C. H. (2013). Oracle inequalities for the lasso in the Cox model. Annals of Statistics, 41, 1142-1165.
[53] Hougaard, P. (1987). Modelling multivariate survival. Scandinavian Journal of Statistics, 14, 291-304.
[54] Hougaard, P. (2000). Analysis of multivariate survival data. New York: Springer.
[55] Johnson, B. A., Lin, D. Y. and Zeng, D. L. (2008). Penalized estimating functions and variable selection in semiparametric regression models. Journal of the American Statistical Association, 103, 672-680.
[56] Johnson, B. A. (2008). Variable selection in semiparametric linear regression with censored data. Journal of the Royal Statistical Society Series B, 70, 351-370.
[57] Kalbfleisch, J. and Prentice, R. (2011). The statistical analysis of failure time data. John Wiley & Sons.
[58] Kim, Y., Choi, H. and Oh, H. S. (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association, 103, 1665-1673.
[59] Kosorok, M. R. (2008). Introduction to empirical processes and semiparametric inference. New York: Springer.
[60] Lawless, J. F. (1982). Statistical models and methods for lifetime data. John Wiley & Sons.
[61] Lee, E. W., Wei, L. J., Amato, D. A. and Leurgans, S. (1992). Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In Survival Analysis: State of the Art (pp. 237-247). Netherlands: Springer.
[62] Leng, C. and Ma, S. (2007). Path consistent model selection in additive risk model via lasso. Statistics in Medicine, 26, 3753-3770.
[63] Li, R. and Liang, H. (2008). Variable selection in semiparametric regression modeling. Annals of Statistics, 36, 261-286.
[64] Liang, K. Y., Self, S. G. and Chang, Y. (1993). Modeling marginal hazards in multivariate failure time data. Journal of the Royal Statistical Society Series B, 55, 441-453.
[65] Lin, D. Y. (1994). Cox regression analysis of multivariate failure time data: the marginal approach. Statistics in Medicine, 13, 2233-2247.
[66] Lin, D. Y. and Ying, Z. (1994). Semiparametric analysis of the additive risk model. Biometrika, 81, 61-71.
[67] Lin, W. and Lv, J. (2013). High-dimensional sparse additive hazards regression. Journal of the American Statistical Association, 108, 247-264.
[68] Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics, 37, 3498-3528.
[69] Ma, S. and Huang, J. (2007). Clustering threshold gradient descent regularization: with applications to microarray studies. Bioinformatics, 23, 466-472.
[70] Martinussen, T. and Scheike, T. (2009). Covariate selection for the semiparametric additive risk model. Scandinavian Journal of Statistics, 36, 602-619.
[71] Meinshausen, N. and Buhlmann, P. (2006). High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34, 1436-1462.
[72] Moertel, C., Fleming, T., MacDonald, J., Haller, D., Laurie, J., Goodman, P., Ungerleider, J., Emerson, W., Tormey, D., Glick, J., Veeder, M. and Maillard, J. (1990). Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma. New England Journal of Medicine, 322, 352-358.
[73] Murphy, S. A. (1995). Asymptotic theory for the frailty model. Annals of Statistics, 23, 182-198.
[74] Nielsen, G. G., Gill, R. D., Andersen, P. K. and Sorensen, T. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics, 19, 25-44.
[75] Oakes, D. (1989). Bivariate survival models induced by frailties. Journal of the American Statistical Association, 84, 487-493.
[76] O'Sullivan, F. (1988). Nonparametric estimation of relative risk using splines and cross-validation. SIAM Journal on Scientific and Statistical Computing, 9, 531-542.
[77] Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7, 186-199.
[78] Prentice, R. L., Williams, B. J. and Peterson, A. V. (1981). On the regression analysis of multivariate failure time data. Biometrika, 68, 373-379.
[79] Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. Annals of Statistics, 35, 1012-1030.
[80] Tibshirani, R. and Hastie, T. (1987). Local likelihood estimation. Journal of the American Statistical Association, 82, 559-567.
[81] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58, 267-288.
[82] Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine, 16, 385-395.
[83] Spiekerman, C. F. and Lin, D. Y. (1998). Marginal regression models for multivariate failure time data. Journal of the American Statistical Association, 93, 1164-1175.
[84] Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society Series B, 50, 413-434.
[85] van de Geer, S. (1995). Exponential inequalities for martingales with application to maximum likelihood estimation for counting processes. Annals of Statistics, 23, 1779-1801.
[86] van der Vaart, A. W. (1998). Asymptotic statistics. New York: Cambridge University Press.
[87] van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes: with applications to statistics. New York: Springer.
[88] Wang, S., Nan, B., Zhou, N. and Zhu, J. (2009). Hierarchically penalized Cox regression for censored data with grouped variables and its oracle property. Biometrika, 96, 307-322.
[89] Wang, H. and Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association, 104, 747-757.
[90] Wei, L., Lin, D. and Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association, 84, 1065-1073.
[91] Wu, T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2, 224-244.
[92] Yin, G., Li, H. and Zeng, D. (2008). Partially linear additive hazards regression with varying coefficients. Journal of the American Statistical Association, 103, 1200-1213.
[93] Yatchew, A. (2003). Semiparametric regression for the applied econometrician. Cambridge: Cambridge University Press.
[94] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68, 49-67.
[95] Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894-942.
[96] Zhang, H. H. and Lu, W. (2007). Adaptive lasso for Cox's proportional hazard model. Biometrika, 94, 691-703.
[97] Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541-2563.
[98] Zhao, P., Rocha, G. and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37, 3468-3497.
[99] Zhou, N. and Zhu, J. (2010). Group variable selection via hierarchical lasso and its oracle property. Statistics and Its Interface, 3, 557-574.
[100] Zou, H. and Zhang, H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733-1751.
[101] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.
