Variable Selection in Dispersion Models
Abstract
An important issue in statistical modeling is how to pick out the important variables from a large pool of candidate explanatory variables, that is, the variable selection problem. A large body of literature has studied variable selection for linear and generalized linear models from various angles. As science and technology advance, increasingly complex data and model structures arise; hierarchical regression models are an important class among them and can better explain the sources and patterns of variation in the data. However, most existing work concentrates on variable selection for mean regression models; once the dispersion parameter is itself given a model structure, variable selection under the joint mean-dispersion modeling framework has rarely been studied. Our research shows that directly transplanting methods designed for mean models into the joint modeling framework may cause problems or lead to incorrect inferences, so it is necessary to investigate variable selection for such complex model structures. This dissertation studies variable selection under joint modeling of the mean and the dispersion parameter, as well as applications of the variable selection idea, and obtains the following three main results.
     For heteroscedastic regression models, we study simultaneous variable selection under joint modeling of the mean and the variance. When the number of parameters in the mean model is large relative to the sample size, the maximum likelihood estimates of the variance parameters are usually biased, and carrying out variable selection with such estimates increases the model risk. Starting from bias correction, we adopt the adjusted profile likelihood as the loss function and, on information-theoretic grounds, propose a new variable selection criterion, PICa. Unlike classical methods, this criterion combines the information in the mean model and the variance model and imposes appropriately different penalties on the variables of the two models, so that both sets of variables are selected simultaneously. We prove that, under certain regularity conditions, the criterion has the following asymptotic properties: for the mean model, PICa is consistent for model selection; for the variance model, the probability that the model chosen by PICa is underfitted tends to zero as the sample size grows. Monte Carlo simulations show that the new criterion outperforms conventional methods in many common settings.
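     To make the joint structure concrete, the following minimal sketch (in Python, not taken from the dissertation) fits the Gaussian joint model y_i ~ N(x_i'beta, sigma_i^2) with log sigma_i^2 = z_i'gamma by plain maximum likelihood, alternating weighted least squares for the mean with a Fisher scoring step for the log-variance. The adjusted profile likelihood correction that the dissertation uses to reduce the bias of the variance estimates is not implemented here.

import numpy as np

def fit_joint_mean_variance(y, X, Z, n_iter=100, tol=1e-8):
    # Maximum-likelihood fit of y_i ~ N(x_i'beta, sigma_i^2) with
    # log sigma_i^2 = z_i'gamma: weighted least squares for beta,
    # one Fisher-scoring step for gamma, iterated to convergence.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    gamma = np.zeros(Z.shape[1])              # start from constant variance
    for _ in range(n_iter):
        sigma2 = np.exp(Z @ gamma)
        XtW = X.T / sigma2                    # X'W with W = diag(1/sigma_i^2)
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        e2 = (y - X @ beta_new) ** 2
        # Fisher scoring for gamma: score = 0.5*Z'(e2/sigma2 - 1), info = 0.5*Z'Z
        gamma_new = gamma + np.linalg.solve(Z.T @ Z, Z.T @ (e2 / sigma2 - 1.0))
        step = max(np.max(np.abs(beta_new - beta)), np.max(np.abs(gamma_new - gamma)))
        beta, gamma = beta_new, gamma_new
        if step < tol:
            break
    return beta, gamma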
     For double generalized linear models, on the one hand, following the classical approach to variable selection, we use the extended quasi-likelihood to generalize the classical AIC criterion, and verify its effectiveness through simulations and a real-data analysis. On the other hand, we study variable selection for high-dimensional data. When the number of variables is large and the data are limited, traditional subset selection methods can hardly distinguish among the many candidate models and are computationally infeasible; for double generalized linear models the computation is even heavier, because the parameters of both the mean model and the dispersion model must be estimated. We propose a class of nonconcave penalized extended quasi-likelihood methods, prove that the resulting estimators possess the Oracle property, and develop a fast new algorithm. Moreover, since the good properties of the estimators depend on the choice of the tuning parameter in the penalty function, we improve the tuning parameter selection from the viewpoint of model selection consistency.
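     For reference, the extended quasi-likelihood of Nelder and Pregibon that underlies this criterion has the form Q+ = -1/2 * sum_i [ d(y_i, mu_i)/phi_i + log(2*pi*phi_i*V(y_i)) ], where d is the unit deviance and V the variance function. The sketch below (Python, written for the gamma variance function V(mu) = mu^2) evaluates Q+ and an AIC-style score built on it; the penalty actually used in the dissertation's criterion may differ.

import numpy as np

def eql_gamma(y, mu, phi):
    # Extended quasi-log-likelihood (Nelder and Pregibon, 1987) for the
    # gamma variance function V(mu) = mu^2:
    #   Q+ = -0.5 * sum( d(y, mu)/phi + log(2*pi*phi*V(y)) ),
    # with unit deviance d(y, mu) = 2*((y - mu)/mu - log(y/mu)).
    d = 2.0 * ((y - mu) / mu - np.log(y / mu))
    return -0.5 * np.sum(d / phi + np.log(2.0 * np.pi * phi * y ** 2))

def eql_aic(y, mu, phi, p_mean, q_disp):
    # AIC-style score built on the extended quasi-likelihood, penalizing the
    # parameter counts of both the mean model and the dispersion model.
    # (The exact penalty of the dissertation's criterion may differ.)
    return -2.0 * eql_gamma(y, mu, phi) + 2.0 * (p_mean + q_disp)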
     As a major component of the modeling process, the idea of variable selection essentially reflects how well a model fits the data, and can therefore also be applied to other modeling problems. For the interplay between outlying observations and data transformation in regression analysis, we combine the generalized information criterion for model selection with the constructed-variable technique and, from a variable selection perspective, propose a method for simultaneously diagnosing the need for a data transformation and the presence of outliers. The method considers the four candidate models formed by whether outliers are present and whether a transformation is needed; in some situations it both alleviates the strong influence of outliers on the choice of transformation and avoids the masking of outliers by the transformed data. Simulations and real examples demonstrate the effectiveness of the method, and comparisons with existing methods in the literature are made.
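     One standard way to cast "is a transformation needed?" as a variable selection problem is Atkinson's constructed variable for the Box-Cox family; the sketch below is an illustration of that ingredient, not the dissertation's exact construction, and the covariate names are introduced here only for the example.

import numpy as np

def boxcox_constructed_variable(y):
    # Atkinson-type constructed variable for testing the Box-Cox
    # transformation around lambda = 1 (no transformation):
    #   w_i = y_i * (log(y_i / gm) - 1),  gm = geometric mean of y.
    # Additive constants and the overall scale are absorbed by the
    # intercept, so this form suffices for a score-type assessment.
    gm = np.exp(np.mean(np.log(y)))
    return y * (np.log(y / gm) - 1.0)

def augmented_design(X, y, outlier_index=None):
    # Design with the constructed variable appended and, optionally, a 0/1
    # indicator for a suspect case, so that the need for a transformation
    # and a mean-shift outlier can be assessed within one fitted model.
    cols = [X, boxcox_constructed_variable(y)[:, None]]
    if outlier_index is not None:
        dummy = np.zeros((len(y), 1))
        dummy[outlier_index, 0] = 1.0
        cols.append(dummy)
    return np.hstack(cols)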
Variable selection is fundamental to statistical modeling, and a large number of researchers have been devoted to variable selection problems. With the development of modern technology, more and more complicated data and models have emerged; hierarchical regression models, which can describe such data better, are an important class among them. However, most references are concerned with variable selection for the mean regression model, and few methods have been proposed for joint mean-dispersion modeling. Our research finds that variable selection methods which are adequate for mean models may fail when directly extended to hierarchical regression models. It is therefore necessary to study variable selection for such complicated models. This dissertation is concerned with variable selection under joint modeling of the mean and the dispersion; furthermore, the idea of variable selection is applied to data diagnostics. Our research results include the following three contributions.
     For heteroscedastic regression models, simultaneous variable selection for the mean model and the variance model is discussed. When the number of mean parameters is a large fraction of the sample size, the maximum likelihood estimates of the variance parameters can be seriously biased, and the model risk is increased if selection is based on such estimates. We therefore propose a criterion named PICa, based on the adjusted profile log-likelihood, which has been used to reduce the bias of the variance component estimators. Our method differs from conventional ones in that it combines the information in the mean model with the information in the variance model, and PICa puts suitable penalty weights on the mean and variance variables, so that variables in both models are selected simultaneously. Under regularity conditions, we prove that PICa has the following asymptotic properties: for the mean model, PICa is consistent for model selection; for the variance model, the probability of underfitting tends to zero as the sample size grows. Monte Carlo simulations show that PICa performs better than conventional methods in many common situations.
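     The simultaneous search itself can be illustrated as follows: enumerate all pairs of mean and variance submodels and score each with a criterion of the form -2*loglik + a_n*p + b_n*q. The actual PICa penalties, and the adjusted profile likelihood on which it is built, are defined in the dissertation; the BIC-style placeholders a_n = b_n = log(n) below are assumptions made only for this sketch.

import itertools
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, y, X, Z):
    # Negative Gaussian log-likelihood of the joint mean-variance model.
    p = X.shape[1]
    beta, gamma = theta[:p], theta[p:]
    sigma2 = np.exp(Z @ gamma)
    resid = y - X @ beta
    return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + resid ** 2 / sigma2)

def criterion(y, X, Z, a_n=None, b_n=None):
    # Generic score -2*loglik + a_n*p + b_n*q for one candidate pair of
    # designs; a_n = b_n = log(n) is only a placeholder, not PICa itself.
    n = len(y)
    a_n = np.log(n) if a_n is None else a_n
    b_n = np.log(n) if b_n is None else b_n
    theta0 = np.zeros(X.shape[1] + Z.shape[1])
    fit = minimize(neg_loglik, theta0, args=(y, X, Z), method="BFGS")
    return 2.0 * fit.fun + a_n * X.shape[1] + b_n * Z.shape[1]

def select_mean_and_variance(y, X_full, Z_full):
    # Exhaustive search over nonempty subsets of mean and variance covariates.
    p, q = X_full.shape[1], Z_full.shape[1]
    subsets = lambda m: itertools.chain.from_iterable(
        itertools.combinations(range(m), k) for k in range(1, m + 1))
    best = (np.inf, None, None)
    for mi in subsets(p):
        for vi in subsets(q):
            score = criterion(y, X_full[:, list(mi)], Z_full[:, list(vi)])
            if score < best[0]:
                best = (score, mi, vi)
    return best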
     For double generalized linear models, on the one hand, we propose a variable selection criterion based on the extended quasi-likelihood. The new criterion extends Akaike's information criterion, and its performance is investigated through simulation studies and a real-data application. On the other hand, variable selection for high-dimensional generalized linear models with dispersion modeling is studied. When there are many variables and the data are limited, subset selection methods can hardly distinguish among the large number of candidate models and are hard to put into practice because of the heavy computation. We propose a class of nonconcave penalized extended quasi-likelihood methods, prove the Oracle property of the resulting estimates, and put forward a new algorithm for the procedure. In addition, since the properties of the estimates depend on the tuning parameter in the penalty function, we improve the choice of the tuning parameter from the viewpoint of model selection consistency.
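     The canonical member of the nonconcave penalty class is the SCAD penalty of Fan and Li (2001); its definition and derivative are transcribed below. How the penalty enters the penalized extended quasi-likelihood, and the improved tuning rule, are developed in the dissertation; only the penalty itself is shown here. A BIC-type choice of the tuning parameter lambda, with the penalty on model size growing like log n, is one standard route to selection consistency and is in the spirit of the improvement described above.

import numpy as np

def scad_penalty(theta, lam, a=3.7):
    # SCAD penalty of Fan and Li (2001); a = 3.7 is their usual default.
    t = np.abs(theta)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    -(t ** 2 - 2.0 * a * lam * t + lam ** 2) / (2.0 * (a - 1.0)),
                    (a + 1.0) * lam ** 2 / 2.0))

def scad_derivative(theta, lam, a=3.7):
    # p'_lam(t) = lam * { I(t <= lam) + (a*lam - t)_+ / ((a-1)*lam) * I(t > lam) },
    # the quantity used in local quadratic / one-step approximations of the
    # penalized (extended quasi-)likelihood.
    t = np.abs(theta)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam))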
     As part of the modeling strategy, variable selection is an important tool that reflects the essence of data fitting, so it can also be applied to other aspects of statistical modeling. We focus on the masking effects between the diagnosis of outliers and of a response transformation in regression analysis. Based on the idea of variable selection, a simultaneous diagnosis method is proposed by constructing covariates and employing the generalized information criterion. The efficiency of the proposed approach is compared with naive methods through a Monte Carlo simulation and two examples.
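     A minimal sketch of the four-way comparison follows, under the simplifying assumptions that the transformation considered is the log and that candidates are scored by a penalty-family rule of the form -2*loglik + kappa*df; the exact generalized information criterion and transformation family used in the dissertation may differ. The log-response fits are put on the original-data scale through the Jacobian term so that the four scores are comparable.

import numpy as np

def profile_gaussian_loglik(y, X):
    # Gaussian log-likelihood of an ordinary least squares fit with the
    # error variance profiled out.
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)

def compare_four_models(y, X, suspect, kappa=None):
    # Score the four candidates {raw, log response} x {without, with an
    # outlier dummy} by -2*loglik + kappa*df.  Log-response fits are put on
    # the original-data scale via the Jacobian term -sum(log y).  kappa
    # defaults to log(n); the dissertation's GIC penalty may differ.
    n = len(y)
    kappa = np.log(n) if kappa is None else kappa
    dummy = np.zeros((n, 1))
    dummy[suspect, 0] = 1.0
    scores = {}
    for resp_name, resp, jacobian in [("raw", y, 0.0),
                                      ("log", np.log(y), -np.sum(np.log(y)))]:
        for out_name, design in [("no outlier term", X),
                                 ("with outlier dummy", np.hstack([X, dummy]))]:
            loglik = profile_gaussian_loglik(resp, design) + jacobian
            df = design.shape[1] + 1                 # coefficients + error variance
            scores[(resp_name, out_name)] = -2.0 * loglik + kappa * df
    return min(scores, key=scores.get), scores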
