Research on Missing Data Imputation Methods Based on GMDH (基于GMDH的缺失数据插补方法研究)

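English title: The Research for Method of Missing Data Interpolation Based on GMDH
Author: Zhang Zhiyong (张智勇)
Degree level: Master's
Discipline: Management Science and Engineering
Keywords: GMDH algorithm; EM algorithm; K-nearest-neighbor algorithm; missing data
Degree year: 2007
Advisor: He Changzheng (贺昌政)
Discipline code: 1201
Degree-granting institution: Sichuan University (四川大学)
Submission date: 2007-03-28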

Abstract

With the development of information technology, the steady growth of people's capacity to collect data, and the spread of database, data warehouse, and Internet technologies, ever more data has accumulated, and data mining has emerged and developed in response. Most existing data mining algorithms assume an ideal data set, whereas in practice, for various reasons, collected data is often incomplete and contains missing values to some degree. The usual remedy is to estimate the missing values first and then mine the completed data set. The most widely used imputation methods today include regression imputation, neural network imputation, and K-nearest-neighbor imputation. These methods have weaknesses on noisy data: regression and neural network imputation tend to overfit the noise, and K-nearest-neighbor imputation is easily disturbed by noise when K is small.
     The GMDH method copes well with noisy data and small samples. Building on missing-data theory, this thesis introduces the noise-oriented GMDH method and establishes a GMDH-based framework for imputing missing values in noisy data.
     Different missingness mechanisms are assumed for different missingness patterns, and GMDH is combined with a different method in each case. For the univariate missing pattern under the missing at random (MAR) mechanism, the EM algorithm is combined with GMDH: a GMDH model of the relationship between variables is built and used to estimate the missing values. For the multivariate missing pattern, with the missingness mechanism ignored, the K-nearest-neighbor algorithm is combined with GMDH: a GMDH model is built over similar samples and used to estimate the missing values. The main work of the thesis is as follows:
     1. For the univariate missing pattern under the MAR mechanism:
     (1) A new method that combines the EM algorithm with the GMDH algorithm is proposed for imputing missing data; its basic assumptions are stated, its basic steps are designed, and a corresponding program is written.
     (2) Through theoretical analysis, numerical experiments, and an empirical study on Chinese economic data, GMDH-based imputation is compared with regression imputation. The comparison shows that the method is effective for univariate missing data under noise and that it outperforms regression imputation.
     2. For the multivariate missing pattern under an ignorable missingness mechanism:
     (1) A new method that combines the K-nearest-neighbor algorithm with the GMDH algorithm is proposed for imputing missing data; its basic assumptions are stated, its basic steps are designed, and a corresponding program is written.
     (2) Through theoretical analysis and an empirical study on the gross domestic product of China's provinces, GMDH-based imputation is compared with K-nearest-neighbor imputation. The comparison shows that the method is effective for multivariate missing data under noise and that it outperforms K-nearest-neighbor imputation.
     On the basis of this work, the main contributions of the thesis are:
     1. It studies missing data imputation under noisy data:
     (1) For the univariate missing pattern under the MAR mechanism, GMDH is combined with the EM algorithm, and the iterative imputation reduces the influence of noise on the imputed values. In the practical example, constraints on the range of the missing values speed up the iteration and overcome the problem that no model can be built when many values are missing and few are observed.
     (2) For the multivariate missing pattern with the missingness mechanism ignored, GMDH is combined with the nearest-neighbor algorithm, which reduces the influence of noise on the imputed values and makes the choice of K less critical. The internal and external criteria of the GMDH algorithm further improve the accuracy of the estimates.
     2. It links missingness patterns and mechanisms to the choice of imputation method, which provides a theoretical basis for selecting different imputation methods under different kinds of missing data.
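
The univariate scheme summarized in the abstract (alternately fitting a model on the current data and re-estimating the missing values until they stabilize) can be sketched roughly as below. This is an illustrative simplification and not the thesis's program. It assumes a single GMDH-style layer of Ivakhnenko quadratic polynomials over pairs of inputs, an even/odd row split as the external criterion, mean initialization, fully observed inputs, and at least two input columns. The names `quad_features`, `fit_best_pair`, and `impute_iteratively` are hypothetical.

```python
import itertools
import numpy as np


def quad_features(xi, xj):
    """Terms of the Ivakhnenko quadratic polynomial for one pair of inputs."""
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])


def fit_best_pair(X, y):
    """One GMDH-style layer (assumed simplification).

    Each pair of input columns is fitted by least squares on subset A
    (even rows) and scored by mean squared error on subset B (odd rows).
    The pair with the lowest error on B is returned with its coefficients.
    """
    idx = np.arange(len(y))
    a, b = idx[idx % 2 == 0], idx[idx % 2 == 1]
    best = None
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        F = quad_features(X[:, i], X[:, j])
        coef, *_ = np.linalg.lstsq(F[a], y[a], rcond=None)
        err = np.mean((F[b] @ coef - y[b]) ** 2)  # external criterion
        if best is None or err < best[0]:
            best = (err, i, j, coef)
    return best[1], best[2], best[3]


def impute_iteratively(X, y, max_iter=50, tol=1e-6):
    """EM-style loop: refit the model, re-estimate missing y, repeat.

    X is a fully observed 2-D array; y is a 1-D array whose missing
    entries are NaN. Observed entries of y are never modified.
    """
    y = np.asarray(y, dtype=float).copy()
    missing = np.isnan(y)
    if not missing.any():
        return y
    y[missing] = np.nanmean(y)  # start from the mean of the observed values
    for _ in range(max_iter):
        i, j, coef = fit_best_pair(X, y)
        new_vals = quad_features(X[missing, i], X[missing, j]) @ coef
        change = np.max(np.abs(new_vals - y[missing]))
        y[missing] = new_vals
        if change < tol:  # imputed values have stabilized
            break
    return y
```

Calling `impute_iteratively(X, y)` returns a copy of `y` with the NaN entries filled in. A full GMDH implementation would stack several such layers and stop by its own termination rule. The range constraints mentioned in the abstract could be approximated by clipping `new_vals` to known bounds, although that too is an assumption rather than the author's procedure.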

