A Study of the Effect of the Proportion of Censored Data on the Cox Regression Model in Survival Analysis
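Abstract
Objective and significance
     In survival data research, Cox regression can handle censored data whatever the distribution of the survival times, which has made it the most widely used and most classical method in survival analysis. In practice it is not unusual for Cox regression to be applied to data with a very large censoring proportion. How reliable and accurate are the Cox estimates in that situation? Does the Cox model really place no restriction on the censoring proportion? No systematic study of these questions has been reported in China or abroad. This project investigates how the censoring proportion affects the results of a Cox model analysis and then determines a limit on the censoring proportion for survival analysis with the Cox model. Resolving this question matters for research on censored data and also gives applied survival analysis a standard to refer to, which makes risk-factor analysis more reliable and improves the quality of research conclusions.
     Methods
     Under Cox's partial likelihood algorithm, the regression coefficients are determined by the order in which events and censorings occur rather than by the actual survival times, and censored observations contribute information only through the risk sets of the partial likelihood function. If the censoring proportion is very large, however, the regression results will inevitably be biased. This study uses random simulation to examine the effect of censored data on Cox model results, assessing the bias, accuracy, and validity of the Cox regression model under different censoring proportions.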
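The statement that only the ordering of events and censorings matters can be checked numerically. The sketch below is an illustration written for this summary, not code from the thesis: it evaluates the Cox partial log-likelihood for a single covariate, assuming no tied event times, on a small made-up data set, and then again after a strictly increasing transformation of the times. The function name `cox_partial_loglik` and all data values are hypothetical.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood for one covariate, assuming no tied event times."""
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # a censored subject contributes no term of its own
        # Risk set: every subject still under observation at time t_i.
        risk = sum(math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += beta * x[i] - math.log(risk)
    return total

# Hypothetical toy data: 1 = event observed, 0 = censored.
times = [2.0, 3.5, 5.0, 7.2, 9.1]
events = [1, 0, 1, 1, 0]
x = [1.0, 0.0, 1.0, 0.0, 1.0]

original = cox_partial_loglik(0.5, times, events, x)
# A strictly increasing transformation changes every time value but not the ordering.
transformed = cox_partial_loglik(0.5, [math.log(t) ** 3 for t in times], events, x)
```

Because the risk sets are identical before and after the transformation, `original` and `transformed` agree. Censored subjects enter only through the risk-set sums, which is why heavy censoring leaves few terms in the likelihood.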
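     I. Parameter settings
     1. Number of covariates: single-covariate and multi-covariate settings; in the multi-covariate settings the number of covariates is 2, 4, or 8. In the multi-covariate settings some covariates are set to be unrelated to survival, in order to examine the Cox model's ability to select influential factors.
     2. Survival distribution: among the known survival distributions, only the exponential, Weibull, and Gompertz distributions satisfy the Cox proportional hazards assumption. The survival time distribution is set to each of these three types in turn.
     3. Censoring distribution: Type I censoring and Type III (random) censoring are examined. Type I censoring is set as a truncation distribution; Type III censoring is set as an exponential distribution or a uniform distribution.
     4. Covariate type: discrete and continuous random variables, with values drawn from two-point, normal, uniform, Gamma, and other distributions.
     5. Sample size: set as a multiple of the number of covariates. In the single-covariate setting it is 20, 40, 80, ... 200 times the number of covariates; the multi-covariate settings additionally consider 10 times and 500 times. According to the ratio of sample size to number of covariates, sample sizes are divided into three levels:
     a sample size below 20 times the number of covariates is defined as a small sample;
     a sample size of 20 to 100 times the number of covariates is defined as a medium sample;
     a sample size above 100 times the number of covariates is defined as a large sample.
     6. Number of simulation replications: sampling is repeated 500 times under every combination of parameters.
     II. Evaluation criteria
     1. Bias: the relative error of the regression coefficient (MAD) and the rate at which the sign of the regression coefficient changes (BIAS). The relative error of the coefficient estimate under a given censoring proportion is called MAD, and the proportion of estimates whose sign changes is denoted BIAS. The smaller MAD and BIAS are, the smaller the bias.
     2. Accuracy: the ratio of regression coefficient standard deviations (Stdratio). The standard deviation of the regression coefficient under a given censoring proportion is compared with that under complete data, and the ratio is denoted Stdratio. The smaller Stdratio is (the closer to 1), the more accurate the result.
     3. Validity: the proportion of significant regression results (Propower). Conditional on the Cox regression result being significant for the complete data, the proportion of results that remain significant under a given censoring proportion is computed and denoted Propower. The larger Propower is, the more valid the result.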
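The abstract names the four criteria but gives no formulas, so the sketch below is one plausible reading rather than the thesis's definitions. It assumes that each replicate yields a (coefficient, standard error, p-value) triple for the complete data and for the censored data, that MAD is averaged over replicates, and that Stdratio is summarised by its median. The function name `evaluate` and every number are hypothetical.

```python
from statistics import mean, median

def evaluate(full, censored, alpha=0.05):
    """Compare censored-data fits with complete-data fits, replicate by replicate.

    full, censored: equal-length lists of (coef, std_err, p_value) tuples.
    The formulas are assumptions; the abstract only describes them in words.
    """
    pairs = list(zip(full, censored))
    # MAD: mean relative error of the censored estimate against the complete-data one.
    mad = mean(abs(bc - bf) / abs(bf) for (bf, _, _), (bc, _, _) in pairs)
    # BIAS: share of replicates in which the coefficient changes sign.
    bias = mean(1.0 if bf * bc < 0 else 0.0 for (bf, _, _), (bc, _, _) in pairs)
    # Stdratio: median ratio of standard errors (censored over complete).
    stdratio = median(sc / sf for (_, sf, _), (_, sc, _) in pairs)
    # Propower: among replicates significant on complete data, share still significant.
    kept = [pc < alpha for (_, _, pf), (_, _, pc) in pairs if pf < alpha]
    propower = sum(kept) / len(kept) if kept else float("nan")
    return {"MAD": mad, "BIAS": bias, "Stdratio": stdratio, "Propower": propower}

# Hypothetical results from four replicates (not data from the study).
full = [(0.50, 0.10, 0.001), (0.40, 0.10, 0.002), (0.60, 0.12, 0.001), (0.10, 0.10, 0.30)]
cens = [(0.55, 0.15, 0.004), (0.30, 0.20, 0.130), (0.66, 0.18, 0.002), (-0.05, 0.25, 0.80)]
summary = evaluate(full, cens)
```

In this toy input one of four coefficients flips sign, and two of the three results that were significant on complete data stay significant after censoring.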
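     III. Simulation procedure
     1. Construct complete data according to the distribution of the survival times.
     For each type of survival distribution, the inverse of the cumulative baseline hazard function is derived; with different distribution parameters and covariates, complete survival time data are then generated under the corresponding conditions.
     2. Draw random samples from the complete data according to the censoring distribution to produce data sets with different censoring proportions.
     Given the type of censoring distribution and the target censoring proportion, the parameter of the censoring distribution is first determined by iterative calculation, and censoring times are then generated. Combining the survival times with the censoring times yields censored survival data sets with different censoring proportions.
     3. Taking the Cox model fitted to the complete data as the gold standard, evaluate the accuracy and reliability of the Cox results under different censoring proportions in terms of parameter estimation, significance testing, and so on, and compute the evaluation criteria for each censoring scenario.
     4. Analyse how the evaluation criteria change across censoring proportions.
     Every evaluation criterion is a monotone function of the censoring proportion, so differences are introduced to study this monotonic behaviour. The sign of the first-order difference indicates whether the function is increasing or decreasing. The second-order difference represents the acceleration of the monotonic change: values close to 0 indicate that the function is approximately linear, and the further the values are from 0, the stronger the increasing (or decreasing) trend.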
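Steps 1 and 2 of the procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: survival times come from inverting the cumulative baseline hazard at -log(U)·exp(-βx) with U uniform on (0, 1); the censoring distribution is taken to be exponential; and its rate is found by bisection on a fixed set of uniform draws so that the censored fraction is monotone in the rate. The parameter values, function names, and the choice of bisection are all hypothetical.

```python
import math
import random

def survival_time(u, lin_pred, dist, lam=0.1, shape=1.5):
    """Invert the cumulative baseline hazard H0 at -log(u) * exp(-lin_pred)."""
    target = -math.log(u) * math.exp(-lin_pred)
    if dist == "exponential":   # H0(t) = lam * t
        return target / lam
    if dist == "weibull":       # H0(t) = lam * t**shape
        return (target / lam) ** (1.0 / shape)
    if dist == "gompertz":      # H0(t) = (lam / shape) * (exp(shape * t) - 1)
        return math.log(1.0 + shape * target / lam) / shape
    raise ValueError(dist)

def censored_fraction(rate, times, v):
    """Share of subjects whose exponential censoring time precedes the event."""
    cens = [-math.log(vi) / rate for vi in v]
    return sum(c < t for c, t in zip(cens, times)) / len(times)

def calibrate_rate(target, times, v, lo=1e-6, hi=1e3, iters=60):
    """Bisection on the censoring rate; the fraction is nondecreasing in the rate."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint: the rate spans many magnitudes
        if censored_fraction(mid, times, v) < target:
            lo = mid
        else:
            hi = mid
    return hi

rng = random.Random(2024)
n, beta = 2000, 0.7
x = [rng.choice([0.0, 1.0]) for _ in range(n)]           # two-point covariate
times = [survival_time(1.0 - rng.random(), beta * xi, "weibull") for xi in x]
v = [1.0 - rng.random() for _ in range(n)]               # fixed draws for censoring
rate = calibrate_rate(0.70, times, v)
cens = [-math.log(vi) / rate for vi in v]
observed = [min(t, c) for t, c in zip(times, cens)]      # follow-up time
event = [int(t <= c) for t, c in zip(times, cens)]       # 1 = event, 0 = censored
achieved = 1.0 - sum(event) / n
```

Reusing the same uniform draws `v` at every trial rate is a deliberate choice here: it removes simulation noise from the search, so the bisection converges to the smallest rate that reaches the target proportion in this sample.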
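     Results
     I. Bias.
     Bias is characterised by the relative error of the regression coefficient (MAD) and the rate at which the sign of the regression coefficient changes (BIAS).
     1. MAD and BIAS behave similarly across the different survival distributions and covariate types.
     2. Bias is slightly smaller under Type I censoring (truncation distribution); results are similar across the various Type III censoring distributions.
     3. MAD depends on the magnitude of the regression coefficient: the smaller the coefficient, the larger MAD.
     4. As the censoring proportion increases, MAD and BIAS increase gradually, and at high censoring proportions the increase accelerates (accelerated bias). The point at which bias starts to accelerate depends on the sample size:
     for small samples, bias accelerates once the censoring proportion exceeds 70%;
     for medium samples, bias accelerates once the censoring proportion exceeds 80%;
     for large samples, bias accelerates once the censoring proportion exceeds 90%.
     II. Accuracy.
     Accuracy is characterised by the ratio of regression coefficient standard deviations (Stdratio).
     Stdratio depends mainly on the censoring proportion: it rises steadily as the censoring proportion increases, its median exceeds 1.7 at a censoring proportion of 70%, and the rise accelerates from there. Neither the rise nor its acceleration is affected by sample size, and the values are close under all parameter settings.
     III. Validity.
     Validity is characterised by the proportion of significant regression results (Propower).
     Propower depends on the standard deviation of the covariate, the sample size, and other factors, but it always declines as the censoring proportion increases.
     IV. Distribution of extreme values
     Extreme values arise fairly easily when the sample is small and censoring is heavy. Taking Stdratio greater than 100 as the definition of an extreme value, the minimum MAD among such cases is 4.5 and the maximum exceeds 1000, so the Cox regression estimates are meaningless. Extreme values occur less often under Type I censoring than under Type III censoring. With small samples, the occurrence of extreme values deserves attention. In the single-covariate setting, if the number of events (deaths) is below 10, the probability of an extreme value reaches 5%; if the number of events is below 6, it rises to 20%.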
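     Conclusion
     A larger censoring proportion reduces the accuracy and validity of Cox model results and increases their bias. Once the censoring proportion exceeds 70%, the median Stdratio exceeds 1.7 and grows at an accelerating rate, so accuracy falls sharply. Propower always declines as the censoring proportion increases.
     With small samples, bias accelerates once the censoring proportion exceeds 70%, and the possible occurrence of extreme values deserves attention. With medium samples, bias accelerates once the censoring proportion exceeds 80%. With large samples, bias accelerates once the censoring proportion exceeds 90%.
     To improve the accuracy and reliability of conclusions, the censoring proportion should be checked against a maximum limit whenever the Cox model is used for survival analysis: when the sample size is within 20 times the number of covariates, the censoring proportion should not exceed 70%; when it is between 20 and 100 times, the censoring proportion should not exceed 80%; when it is more than 100 times, the censoring proportion should not exceed 90%.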
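The recommended limits translate directly into a pre-analysis check. The sketch below is only an illustration: the abstract does not say which group a ratio of exactly 20 or exactly 100 belongs to, so assigning both boundaries to the lower group is an assumption, and the function names are hypothetical.

```python
def max_censoring(n, n_covariates):
    """Recommended upper limit on the censoring proportion for a Cox model.

    Boundary handling (ratio exactly 20 or exactly 100) is an assumption.
    """
    ratio = n / n_covariates
    if ratio <= 20:       # small sample
        return 0.70
    if ratio <= 100:      # medium sample
        return 0.80
    return 0.90           # large sample

def censoring_ok(n, n_events, n_covariates):
    """True if the observed censoring proportion is within the recommended limit."""
    censored = 1.0 - n_events / n
    return censored <= max_censoring(n, n_covariates)

# 200 subjects, 30 events, 4 covariates: ratio 50, censoring 85%, limit 80%.
flag_medium = censoring_ok(200, 30, 4)
# 1000 subjects, 150 events, 2 covariates: ratio 500, censoring 85%, limit 90%.
flag_large = censoring_ok(1000, 150, 2)
```

The same 85% censoring proportion fails the check in the medium-sample case and passes in the large-sample case, which is the sample-size dependence the conclusion describes.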
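     In summary, this study shows how the censoring proportion affects Cox model results and, on the basis of its findings, sets limits on the censoring proportion for survival analysis with the Cox model, providing a reference for practical applications.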
Objective
     Cox regression is one of the most common methods in survival studies. A question of wide practical interest is how reliable and accurate Cox regression is for survival data with a large censoring proportion, yet systematic research on this question is lacking at present. The aim of this study is to explore the effect of the censoring proportion on the Cox regression model and to determine limits on the censoring proportion when the Cox model is used. Resolving these issues matters for the study of censored data and also provides a reference standard for applications of survival analysis, so as to make the analysis of risk factors more reliable and improve the quality of research findings.
     Methods
     Under the Cox partial likelihood algorithm, the regression coefficients are determined by the order in which events and censorings occur rather than by the specific values of the survival times, and censored observations provide information only through the risk sets of the partial likelihood function. The Cox regression estimates will be biased when the censoring proportion is large. In this study, the Monte Carlo method was used to assess the bias, accuracy and validity of the Cox model under different censoring proportions.
     Parameter settings.
     1. Covariates. Single-covariate and multi-covariate settings were considered, the latter with two, four and eight covariates. In the multi-covariate analyses, some covariates were unrelated to survival, in order to evaluate the ability of the Cox model to select influential factors.
     2. Survival distribution. Of the known survival distributions, only three satisfy the Cox proportional hazards assumption. Survival times were simulated from each of these three: the exponential, Weibull and Gompertz distributions.
     3. Censoring distribution. Type I censoring was set as a truncation distribution; Type III censoring (random censoring) was set as an exponential distribution or a uniform distribution.
     4. Types of covariates. Discrete and continuous random variables were used, drawn from common distributions such as the two-point, uniform, normal and Gamma distributions.
     5. Sample size. Sample sizes were set as multiples of the number of covariates: 20, 40, 80, … 200 times in the single-covariate analysis, and additionally 10 and 500 times in the multi-covariate analyses. Sample size was divided into three levels by this multiple. A sample size below 20 times the number of covariates was defined as small. A sample size between 20 and 100 times the number of covariates was defined as moderate. A sample size above 100 times the number of covariates was defined as large.
     6. Simulation repetition. 500 replications were run for each combination of parameters.
     Evaluation criteria.
     1. Bias. The relative error of the regression coefficient (MAD) and the rate at which the sign of the regression coefficient changes (BIAS) were used to assess bias. MAD is the relative deviation of the regression coefficient estimated from censored data from that estimated from complete data. BIAS is the proportion of regression coefficients estimated from censored data whose sign differs from the complete-data estimate. The smaller BIAS and MAD are, the smaller the bias.
     2. Accuracy. The ratio of the standard deviation of the regression coefficient under censored data to that under complete data (Stdratio) was used to measure accuracy. The closer Stdratio is to 1, the more accurate the result.
     3. Validity. The proportion of results significant under complete data that remain significant under censored data (Propower) was used to evaluate validity. The larger Propower is, the more valid the result.
     Simulation procedure.
     1. Complete survival data sets were simulated from the three survival distributions mentioned above, with different parameters, after the inverse of the cumulative baseline hazard function had been derived.
     2. The parameter of the censoring distribution was determined by iterative calculation for each censoring condition, and censoring times were then generated by random sampling. Combining the censoring times with the survival times produced censored survival data sets with different censoring proportions.
     3. The gold standard was defined as the Cox model estimates obtained from the complete data. Models fitted under different censoring proportions were evaluated with respect to parameter estimation, significance testing, and so forth, and the evaluation criteria calculated from the censored-data models were compared with the gold-standard model.
     4. The simulation results were analysed in terms of the evaluation criteria under the various censoring conditions.
     The evaluation criteria are monotone functions of the censoring proportion. To study this monotonicity, differences were introduced. The sign of the first-order difference indicates whether the function is increasing or decreasing, whereas the second-order difference represents the acceleration. The function is approximately linear when the second-order difference is close to zero, and its increase (or decrease) accelerates as the second-order difference moves away from zero.
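A short sketch may make the difference-based reading concrete. The criterion values below are made-up numbers chosen for illustration, not results from the study, and the threshold `tol` as well as the function names are hypothetical.

```python
def differences(values):
    """First- and second-order differences of a criterion along the censoring grid."""
    first = [b - a for a, b in zip(values, values[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return first, second

def acceleration_onset(grid, values, tol):
    """First grid point at which the second difference departs from zero by more than tol."""
    _, second = differences(values)
    for i, d2 in enumerate(second):
        if abs(d2) > tol:
            return grid[i + 1]  # second[i] is centred on grid[i + 1]
    return None

# Hypothetical criterion values on a grid of censoring proportions.
grid = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
stdratio = [1.0, 1.1, 1.2, 1.3, 1.4, 1.7, 2.3]

first, second = differences(stdratio)
onset = acceleration_onset(grid, stdratio, tol=0.05)
```

Here every first difference is positive, so the series is increasing; the second differences stay near zero over the linear stretch and become clearly positive where the growth starts to accelerate.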
Results
     Bias.
     Bias of the Cox regression model was described by MAD and BIAS.
     1. The results for MAD and BIAS were similar across the different survival distributions and covariate types.
     2. Less bias occurred under Type I censoring. Under Type III censoring, the results were similar across the various censoring distributions.
     3. MAD was influenced by the magnitude of the regression coefficient: the smaller the coefficient, the larger MAD.
     4. Bias increased gradually as the censoring proportion increased. Moreover, the increase accelerated when the censoring proportion was large. The point at which bias began to accelerate was associated with sample size, expressed as a multiple of the number of covariates. For small samples (below 20 times), accelerated bias occurred at 70% censoring. For moderate samples (20 to 100 times), accelerated bias occurred at 80% censoring. For large samples (above 100 times), accelerated bias occurred at 90% censoring.
     Accuracy.
     The variability of the regression coefficients was described by Stdratio.
     The value of Stdratio is determined mainly by the censoring proportion. It is a monotonically increasing function of the censoring proportion, and the upward trend accelerates at 70% censoring. The plots show that neither the increase nor its acceleration is affected by sample size, and the values are close under the various parameter conditions.
     Validity.
     Validity of the Cox regression model was described by Propower.
     The value of Propower was influenced by the variation of the covariate, the sample size, and other study factors. As a rule, it decreased gradually as the censoring proportion increased.
     Extreme values.
     Extreme values occur frequently with small samples and heavy censoring. Among cases with Stdratio greater than 100, the minimum MAD is 4.5 and the maximum is more than 1000, so the estimates produced by Cox regression are meaningless. Fewer extreme values were detected under Type I censoring than under random censoring. More attention should be paid to the appearance of extreme values when the sample size is small. In the single-covariate setting, if the number of events is less than 10, extreme values occur with a probability of 5%, and the probability rises to 20% when the number of events is less than 6.
Conclusion
     An increasing censoring proportion increases bias and decreases accuracy and validity. Accuracy drops sharply, with Stdratio growing at an accelerating rate, once the censoring proportion reaches 70% or more. With small samples, the accelerating bias and the incidence of extreme values should be noted when censoring exceeds 70%. With moderate samples, bias accelerates once the censoring proportion reaches 80% or more. With large samples, bias accelerates once the censoring proportion reaches 90% or more.
     When conducting survival analysis with the Cox regression model, it is suggested that the censoring proportion be below 70% if the sample size is within 20 times the number of covariates, below 80% if the sample size is between 20 and 100 times the number of covariates, and below 90% if the sample size is over 100 times the number of covariates.
     In conclusion, the censoring proportion should be kept within a reasonable limit when the Cox regression model is used for survival analysis in practice.
