Semi-Supervised Graph Learning with Missing Data
Abstract
In data analysis and machine learning, missing data are common. The usual practice is to impute first (mean imputation, nearest-neighbour imputation, hot-deck imputation, cold-deck imputation, regression imputation, multiple imputation, and so on) and then build the model on the completed data set. Imputation, however, is laborious, and an ill-chosen imputation can bias the data relative to the original and distort the whole subsequent modelling analysis. Taking classification as the running example, this thesis makes a preliminary study of how to handle missing data directly, with the goal of constructing a classification model that needs no imputation.
     This thesis is the first to combine missing data with graph-based semi-supervised learning: by constructing similarity weights that remain well defined when data are missing, it proposes a semi-supervised graph algorithm that handles incomplete data automatically, and implements the algorithm in R. Experiments on the Letters, Spam, Diabetes, Wine and Segment data sets from the UCI machine-learning repository lead to the following conclusions:
     1. When the features are completed with classical statistical imputation methods (random, mean or median imputation) and a semi-supervised method is then run on the imputed data, the proposed no-imputation method performs slightly better than these classical baselines.
     2. When part of a complete data set is deleted artificially to create missing data, the proposed method applied to the incomplete data is only slightly weaker than a classical supervised method applied to the original complete data. Since the proposed method works from incomplete data, this supports it as a reasonable approach to classification problems with missing values.
     3. Compared with the traditional pipeline of imputing first and then modelling on the completed data set, the proposed method is slightly better, and because it never fills in the missing values it avoids the cost of imputation, which is a relative advantage.
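The thesis implements its algorithm in R and does not reproduce code in this abstract. As a minimal illustrative sketch (not the author's implementation), the core idea of "similarity weights that tolerate missing features, followed by graph-based label propagation" can be written in Python as below. The Gaussian weight restricted to mutually observed features, and the names `pairwise_weight` and `propagate`, are assumptions for illustration; the propagation loop follows the general harmonic-function style of Zhu et al.

```python
import math

def pairwise_weight(x, y, sigma=1.0):
    """Gaussian similarity using only features observed in BOTH points.
    A missing feature is encoded as None; the squared distance is averaged
    over the mutually observed features so scales stay comparable."""
    obs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not obs:
        return 0.0  # no shared information -> no edge
    d2 = sum((a - b) ** 2 for a, b in obs) / len(obs)
    return math.exp(-d2 / (2.0 * sigma ** 2))

def propagate(X, labels, sigma=1.0, iters=200):
    """Iterative label propagation: labeled points (label 0 or 1) are
    clamped, unlabeled points (label None) repeatedly take the weighted
    average of all other points' scores, then are thresholded at 0.5."""
    n = len(X)
    W = [[0.0 if i == j else pairwise_weight(X[i], X[j], sigma)
          for j in range(n)] for i in range(n)]
    f = [0.5 if y is None else float(y) for y in labels]
    for _ in range(iters):
        for i in range(n):
            if labels[i] is None:           # only unlabeled points move
                s = sum(W[i])
                if s > 0:
                    f[i] = sum(W[i][j] * f[j] for j in range(n)) / s
    return [1 if v >= 0.5 else 0 for v in f]
```

A toy run with two clusters and several missing entries, e.g. `propagate([(0.0, 0.1), (0.2, None), (None, 0.0), (3.0, 3.1), (2.9, None), (None, 3.0)], [0, None, None, 1, None, None])`, assigns the incomplete points to the nearby labeled cluster without any imputation step, which is the behaviour the abstract describes.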
