Missing data are often encountered in data analysis and machine learning. The usual practice is to impute the data first, using methods such as mean imputation, KNN imputation, hot-deck imputation, cold-deck imputation, regression imputation, or multiple imputation, and then to fit a model on the completed data. However, imputation is time-consuming, and inappropriate imputation may cause large errors or false results, thereby affecting the subsequent analysis. In this paper, we study methods of treating missing data for classification; the aim is to construct a classification model without imputation.
     We first combine graph-based semi-supervised learning with missing data and construct a graph-based semi-supervised learning model that handles missing data automatically by constructing similarity weights on the incomplete data. We then implement the algorithm in R. Finally, we perform experiments on UCI data sets (Letters, Spam, Diabetes, Wine, Segment). The experimental conclusions are as follows:
     1: We first handle the missing data with classical statistical imputation (stochastic imputation, mean imputation, median imputation), apply graph-based semi-supervised learning to the imputed data, and compare the results with our method. The experimental results show that our method is slightly better than the classical methods.
     2: Compared with a classical supervised learning model trained on data with no missing values, the proposed method, applied to data made incomplete by artificially removing some values, achieves similar results. This indicates that our method is reasonable, and it is convenient when the data contain missing values because no imputation is needed.
     3: Compared with traditional methods (impute the data first, then model on the completed data), the experimental results show that our method is better. Because it does not fill in missing data, it has a comparative advantage.
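     The idea of building graph weights directly on incomplete data can be illustrated with a small sketch. The abstract does not give the weight formula, so everything below is an assumption made for illustration: the distance is a partial squared Euclidean distance over the features observed in both samples (rescaled to the full dimension), the weight is a Gaussian kernel with a hypothetical bandwidth `sigma`, and the classifier is plain iterative label propagation with the known labels held fixed. It is written in Python with the standard library only, not in R as in the thesis, and it is not the author's implementation.

```python
import math

def partial_distance(x, y):
    """Squared Euclidean distance over jointly observed features.

    Missing values are encoded as None. The sum is rescaled by
    (total features / shared features); returns None when the two
    samples share no observed feature.
    """
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not shared:
        return None
    sq = sum((a - b) ** 2 for a, b in shared)
    return sq * len(x) / len(shared)

def similarity_matrix(X, sigma=1.0):
    """Gaussian-kernel weights; pairs with no shared feature get weight 0."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = partial_distance(X[i], X[j])
            if d is not None:
                W[i][j] = W[j][i] = math.exp(-d / (2 * sigma ** 2))
    return W

def propagate_labels(W, labels, n_classes, n_iter=200):
    """Iterative label propagation; labeled nodes are clamped to their label."""
    n = len(W)
    F = [[0.0] * n_classes for _ in range(n)]
    for i, y in enumerate(labels):
        if y is not None:
            F[i][y] = 1.0
    for _ in range(n_iter):
        new_F = []
        for i in range(n):
            total = sum(W[i])
            if labels[i] is not None or total == 0:
                new_F.append(F[i][:])  # keep known labels and isolated nodes as-is
                continue
            # Weighted average of the neighbours' current class scores.
            new_F.append([
                sum(W[i][j] * F[j][c] for j in range(n)) / total
                for c in range(n_classes)
            ])
        F = new_F
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]

# Toy data: two well-separated groups, each sample missing one feature.
X = [
    [0.0, 0.1, None],
    [0.2, None, 0.1],
    [None, 0.0, 0.2],
    [5.0, 5.1, None],
    [5.2, None, 4.9],
    [None, 5.0, 5.1],
]
labels = [0, None, None, 1, None, None]  # only one labeled sample per group

preds = propagate_labels(similarity_matrix(X), labels, n_classes=2)
print(preds)
```

     In this toy case no value is ever filled in: each pair of samples is compared only on the features both of them have, and the unlabeled samples take their class from the neighbours they are strongly connected to.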
