Comparison and Simulation Study of Missing-Data Imputation Methods in Cardiovascular Disease Epidemiological Surveys
Abstract
Objective
     Cardiovascular disease seriously endangers human health worldwide, and recent studies show that its incidence and mortality are rising in developing countries. Many large-scale epidemiological surveys of this chronic disease have been launched, providing new clues and large-sample evidence for cardiovascular disease prevention. However, because of the social and psychological characteristics of respondents, research data are often incomplete; that is, they contain missing data. When the proportion of missing data fell within a certain range, the usual practice was to delete the incomplete records directly. Deletion is simple, but it reduces the number of observations and therefore the statistical power of the analysis. In recent years, imputation methods have gained increasing acceptance among experts and scholars, and new methods have developed rapidly. This study handles missing data with single and multiple imputation, focuses on the differences among multiple imputation methods, and aims to identify imputation strategies and methods suited to routine epidemiological surveys of chronic disease.
     Methods
     Starting from a large-sample, multivariable data set from the cardiovascular field, we used Monte Carlo techniques to simulate missingness under a missing completely at random (MCAR) mechanism at four missing proportions: 5%, 10%, 20%, and 30%. The scenarios covered a single missing variable of each type (continuous, binary, ordinal, and nominal), two variables missing in a monotone pattern, and two variables missing in an arbitrary pattern. Each scenario was simulated 500 times. In each simulation, the incomplete data set was handled by deletion, single imputation, joint modeling (JM) multiple imputation, and fully conditional specification (FCS) multiple imputation. The evaluation indices of each method were then collected from every simulation and summarized to compare the methods' performance.
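The simulation loop described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the study's code: the data are synthetic, the helper names (`make_mcar`, `mean_impute`, `simulate`) are hypothetical, and mean imputation stands in for the imputation methods actually compared. Only the MCAR mechanism, the four missing proportions, and the repeated-replicate design come from the text.

```python
import random
import statistics


def make_mcar(values, proportion, rng):
    """Return a copy of `values` with a fixed share of entries set to None.

    Every entry is equally likely to be deleted, independent of its own
    value or of any other variable, which is the defining MCAR property.
    """
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Stand-in single imputation: fill each gap with the observed mean."""
    observed = [v for v in values if v is not None]
    observed_mean = statistics.fmean(observed)
    return [observed_mean if v is None else v for v in values]


def simulate(complete, proportions=(0.05, 0.10, 0.20, 0.30), n_reps=500, seed=1):
    """Mean absolute bias of the post-imputation mean, per missing proportion."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(complete)
    summary = {}
    for p in proportions:
        biases = []
        for _ in range(n_reps):
            incomplete = make_mcar(complete, p, rng)
            imputed = mean_impute(incomplete)
            biases.append(abs(statistics.fmean(imputed) - true_mean))
        summary[p] = statistics.fmean(biases)
    return summary


# Synthetic stand-in for one continuous variable (assumed, not study data).
data_rng = random.Random(0)
complete = [data_rng.gauss(130.0, 15.0) for _ in range(1000)]

# Fewer replicates than the study's 500, only to keep the demo quick.
summary = simulate(complete, n_reps=50)
```

Because the complete data are known before values are deleted, each replicate can compare the imputed result against the truth, which is what makes this design usable for ranking imputation methods.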
     Results
     For a single missing variable, joint modeling (JM) multiple imputation gave the overall mean closest to that of the complete data set when a single continuous variable was missing, and the highest rate of correctly imputed individual values when a single nominal variable was missing. Fully conditional specification (FCS) multiple imputation, however, imputed the individual missing values of a single continuous variable more accurately and gave smaller bias in the model parameters after imputation; it also imputed the individual missing values of a single binary variable more accurately. For a single missing categorical variable, the discriminant analysis method had a higher correct-imputation rate than the logistic regression method. As for the number of imputations, 15 imputations performed best for a single missing continuous variable, although the gain beyond 10 imputations was limited; for a single missing binary or nominal variable, 5 imputations performed best.
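To make the role of the number of imputations concrete, the sketch below pools estimates from m imputed data sets with Rubin's rules, the standard combining rules for multiple imputation. The abstract does not say how results were pooled, so applying Rubin's rules here is an assumption, and the function name `pool_rubin` and the example numbers are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m point estimates and their variances with Rubin's rules.

    Returns (pooled estimate, within-imputation variance,
    between-imputation variance, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching estimates and variances from m >= 2 imputations")
    pooled = statistics.fmean(estimates)        # average of the m estimates
    within = statistics.fmean(variances)        # average sampling variance
    between = statistics.variance(estimates)    # spread across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, within, between, total


# Made-up estimates of one parameter from three imputed data sets.
pooled, within, between, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The number of imputations enters the total variance only through the factor (1 + 1/m), which is 1.2 at m = 5, 1.1 at m = 10, and about 1.07 at m = 15. That shrinking increment is one general reason why adding imputations tends to bring diminishing returns; it is not offered as an explanation of the specific results reported here.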
     For multiple variables missing in a monotone pattern, joint modeling (JM) multiple imputation imputed individual missing values more accurately than fully conditional specification (FCS) multiple imputation. When a continuous variable was missing together with a binary, ordinal, or nominal variable in a monotone pattern, FCS imputed the individual values of the continuous variable more accurately than JM, whereas JM had a higher correct-imputation rate for the categorical variable.
     For multiple variables missing in an arbitrary pattern, when a continuous variable and a nominal variable were missing, combining predictive mean matching (regpmm) with the discriminant function method (discrim) imputed the individual values of the continuous variable more accurately and also gave a relatively high correct-imputation rate for the nominal variable. Across the four missing proportions, FCS (regpmm+discrim) with 5 imputations performed best overall.
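The idea behind predictive mean matching can be sketched as below. This is a simplified illustration under stated assumptions, not the regpmm procedure used in the study: it uses a single predictor, an ordinary least-squares line, and a hypothetical donor pool size `k`, and it skips the random draw of regression parameters that a proper multiple-imputation version would add. The names `fit_line` and `pmm_impute` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for y on a single predictor x.

    Assumes the x values are not all identical.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=5, rng=None):
    """Fill None entries of `ys` with observed values from similar cases.

    For each missing case, the k observed cases whose predicted means are
    closest to its own predicted mean form the donor pool, and one donor's
    observed value is copied at random.
    """
    rng = rng or random.Random()
    donors = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in donors], [y for _, y in donors])
    donor_preds = [(intercept + slope * x, y) for x, y in donors]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)  # observed values are left untouched
            continue
        target = intercept + slope * x
        nearest = sorted(donor_preds, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])
    return completed


# Synthetic example: y depends on x, and every fourth y is missing.
demo_rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + demo_rng.gauss(0.0, 1.0) for x in xs]
ys_missing = [None if i % 4 == 0 else y for i, y in enumerate(ys)]
completed = pmm_impute(xs, ys_missing, k=5, rng=random.Random(7))
```

Since every imputed value is copied from an observed case, the filled-in values always stay within the range of the observed data, which is a commonly cited practical advantage of this approach for continuous variables.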
     Conclusion
     Starting from a large complete data set from cardiovascular research, this study used simulated missingness to construct scenarios in which variables of different types were missing. For a single missing variable, joint modeling (JM) multiple imputation suits nominal variables, whereas fully conditional specification (FCS) multiple imputation suits binary and continuous variables. For multiple continuous variables missing in a monotone pattern, JM multiple imputation is more accurate; when both continuous and discrete variables are missing, JM multiple imputation suits the continuous variables and FCS multiple imputation suits the discrete variables. For multiple variables missing in an arbitrary pattern, FCS multiple imputation is more accurate.
