A principal component method to impute missing values for mixed data
  • Authors: Vincent Audigier ; François Husson…
  • Keywords: Missing values ; Mixed data ; Imputation ; Principal component method ; Factorial analysis of mixed data
  • Journal: Advances in Data Analysis and Classification
  • Publication year: 2016
  • Publication date: March 2016
  • Volume: 10
  • Issue: 1
  • Pages: 5-26
  • Full-text size: 698 KB
  • References: Benzécri JP (1973) L’analyse des données. L’analyse des correspondances. Dunod, Tome II
    Breiman L (2001) Random forests. Mach Learn 45(1):5–32
    Bro R, Kjeldahl K, Smilde AK, Kiers HAL (2008) Cross-validation of component model: a critical look at current methods. Anal Bioanal Chem 390:1241–1251
    Cornillon PA, Guyader A, Husson F, Jégou N, Josse J, Kloareg M, Matzner-Løber E, Rouvière L (2012) R for Statistics. Chapman and Hall/CRC, Boca Raton
    de Leeuw J, Mair P (2009) Gifi methods for optimal scaling in R: The package homals. J Statist Software 31(4):1–20, URL http://www.jstatsoft.org/v31/i04/
    Escofier B (1979) Traitement simultané de variables quantitatives et qualitatives en analyse factorielle. Les cahiers de l’analyse des données 4(2):137–146
    Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester
    Greenacre M, Blasius J (2006) Multiple correspondence analysis and related methods. Chapman and Hall/CRC.
    Husson F, Josse J (2012) missMDA: Handling missing values with/in multivariate data analysis (principal component methods). URL http://www.agrocampus-ouest.fr/math/husson, R package version 1.4
    Ilin A, Raiko T (2010) Practical approaches to principal component analysis in the presence of missing values. J Mach Learn Res 11:1957–2000, URL http://dl.acm.org/citation.cfm?id=1859890.1859917
    Josse J, Husson F (2011) Selecting the number of components in PCA using cross-validation approximations. Comput Statist Data Anal 56(6):1869–1879
    Josse J, Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153(2):1–21
    Josse J, Pagès J, Husson F (2009) Gestion des données manquantes en analyse en composantes principales. Journal de la Société Française de Statistique 150:28–51
    Josse J, Chavent M, Liquet B, Husson F (2012) Handling missing values with regularized iterative multiple correspondence analysis. J Classif 29:91–116
    Kiers HAL (1991) Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika 56:197–212
    Kiers HAL (1997) Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 62:251–266
    Lafaye de Micheaux P, Drouilhet R, Liquet B (2011) Le logiciel R. Springer, Paris
    Lang DT, Swayne D, Wickham H, Lawrence M (2012) rggobi: Interface between R and GGobi. URL http://CRAN.R-project.org/package=rggobi, R package version 2.1.19
    Lebart L, Morineau A, Warwick KM (1984) Multivariate descriptive statistical analysis. Wiley, New York
    Little RJA, Rubin DB (1987, 2002) Statistical analysis with missing data. Wiley series in probability and statistics, New York
    Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11:2287–2322
    Michailidis G, de Leeuw J (1998) The Gifi system of descriptive multivariate analysis. Statist Sci 13(4):307–336
    Peters A, Hothorn T (2012) ipred: Improved Predictors. URL http://CRAN.R-project.org/package=ipred, R package version 0.9-1
    R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, URL http://www.R-project.org/, ISBN 3-900051-07-0
    Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
    Schafer JL (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC, London
    Stekhoven D, Bühlmann P (2011) Missforest - nonparametric missing value imputation for mixed-type data. Bioinformatics 28:113–118
    Tenenhaus M, Young FW (1985) An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika 50:91–119
    Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525
    van Buuren S (2007) Multiple imputation of discrete and continuous data by fully conditional specification. Statist Method Med Res 16:219–242
    van Buuren S, Boshuizen H, Knook D (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statist Med 18:681–694
    van der Heijden P, Escofier B (2003) Multiple correspondence analysis with missing data. In: Analyse des correspondances, Presse universitaire de Rennes, pp 153–170
    Vermunt JK, van Ginkel JR, van der Ark LA, Sijtsma K (2008) Multiple imputation of incomplete categorical data using latent class analysis. Sociol Methodol 33:369–397
  • Author affiliations: Vincent Audigier (1)
    François Husson (1)
    Julie Josse (1)

    1. Agrocampus Ouest, 65 rue de St-Brieuc, 35042, Rennes, France
  • Journal category: Mathematics and Statistics
  • Journal subjects: Mathematics
    Statistics
    Statistical Theory and Methods
    Statistics for Business, Economics, Mathematical Finance and Insurance
    Statistics for Life Sciences, Medicine and Health Sciences
    Statistics for Engineering, Physics, Computer Science, Chemistry and Geosciences
    Statistics for Social Science, Behavioral Science, Education, Public Policy and Law
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1862-5355
Abstract
We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, the factorial analysis for mixed data, which balances the influence of the continuous and the categorical variables in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based on the similarity between individuals and on the relationships between variables. The properties of the method are illustrated via simulations, and the quality of the imputation is assessed using real data sets. The method is compared to a recent method based on random forests (Stekhoven and Bühlmann, Bioinformatics 28:113–118, 2011) and shows better performance, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.
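
The idea of predicting missing values from principal axes and components can be illustrated for the purely continuous case. The following is a minimal sketch, not the authors' algorithm: it uses plain iterative PCA, and it leaves out the coding and weighting of categorical variables that the factorial analysis for mixed data requires, as well as any regularization. The function name `iterative_pca_impute` and its parameters are hypothetical and chosen only for this illustration.

```python
import numpy as np


def iterative_pca_impute(X, n_components=1, max_iter=200, tol=1e-8):
    """Fill NaN entries of a continuous matrix with a low-rank PCA fit.

    Simplified illustration only: no categorical variables, no
    regularization (both are assumptions made for this sketch).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialise missing cells with the mean of the observed values per column.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        # Principal components and axes from the SVD of the centered matrix.
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        fitted = (U[:, :k] * s[:k]) @ Vt[:k] + mean
        # Observed cells are kept; only missing cells take the fitted values.
        updated = np.where(missing, fitted, X)
        change = np.sum((updated - filled) ** 2)
        filled = updated
        if change < tol:
            break
    return filled


# Toy data: the second column equals 2 * first column + 1, one cell missing.
X = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, np.nan],
              [4.0, 9.0], [5.0, 11.0], [6.0, 13.0]])
completed = iterative_pca_impute(X, n_components=1)
print(completed[2, 1])
```

On this toy matrix the two columns are exactly linearly related, so the imputed cell should move from the column mean toward the value implied by the other rows. Such a strong linear relationship between continuous variables is the setting in which the abstract reports an advantage over the random forest approach.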
