Kernel-Based Regularized Learning Algorithms
Abstract
Learning theory is the mathematical foundation of machine learning algorithms and has important applications in many areas of science and technology. In this thesis we mainly consider kernel-based regularized learning algorithms. Tikhonov regularization methods with convex loss functions and reproducing kernel Hilbert spaces, such as the support vector machine, have been studied extensively in the literature; here we introduce some non-standard frameworks and give a detailed analysis of the algorithms under these new frameworks.
     First, we discuss a regression algorithm with an ℓ¹ regularization term. Its hypothesis space depends on the data samples, and a nonsymmetric kernel function may be used. The sample-dependent hypothesis space introduces an additional hypothesis error into the error analysis, which is essentially different from the usual case of sample-independent hypothesis spaces. After estimating the regularization error, the sample error and the hypothesis error separately, we obtain a complete error estimate, which depends on the properties of the kernel, the input space, the marginal distribution density, and the regression function of the regression problem. After choosing a suitable regularization parameter, we obtain learning rates for the algorithm. When the kernel is a smooth function of high order on a Euclidean space, the learning rate of the ℓ¹-regularized algorithm can be improved further.
     Second, we analyze binary classification with samples drawn from non-identical probability measures. The main goal is to give satisfactory estimates for the excess misclassification error of the resulting classifiers, and our results show that the non-identical sampling setting behaves similarly to the i.i.d. setting. Moreover, by means of a comparison theorem for the error analysis of multi-class classifiers, error estimates can also be derived for multi-class classification problems.
     Finally, we consider the sparsity induced by ℓ¹ regularization, an issue that has recently attracted wide attention from researchers. We give some discussion of both theoretical and practical aspects for further study.
Learning theory is the mathematical foundation for machine learning algorithms, which have important applications in many areas of science and technology. In this thesis, we mainly consider learning algorithms involving kernel-based regularization schemes. While algorithms such as the support vector machine, given by Tikhonov regularization schemes associated with convex loss functions and reproducing kernel Hilbert spaces, have been studied extensively in the literature, we introduce some non-standard settings and provide an in-depth analysis of them.
     Firstly, we study a regression algorithm with an ℓ¹ regularizer, stated in a hypothesis space built from the data or samples by a nonsymmetric kernel. The data-dependent nature of the algorithm leads to an extra error term called the hypothesis error, which is essentially different from regularization schemes with data-independent hypothesis spaces. By dealing with the regularization error, the sample error and the hypothesis error, we estimate the total error in terms of properties of the kernel, the input space, the marginal distribution, and the regression function of the regression problem. Learning rates are derived by choosing suitable values of the regularization parameter. An improved error decomposition approach is used in our data-dependent setting.
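     As a concrete illustration of the scheme just described (a minimal sketch, assuming the least-squares loss; the exact loss and normalization used in the thesis may differ), let z = {(x_i, y_i)}_{i=1}^m be the sample, K a possibly nonsymmetric kernel, and λ > 0 the regularization parameter. The ℓ¹ coefficient-regularization estimator then takes the form

\[
  f_{\mathbf z}(x)=\sum_{i=1}^{m}\alpha_i^{\mathbf z}K(x,x_i),\qquad
  \boldsymbol\alpha^{\mathbf z}=\operatorname*{arg\,min}_{\boldsymbol\alpha\in\mathbb R^{m}}
  \frac{1}{m}\sum_{i=1}^{m}\Bigl(\sum_{j=1}^{m}\alpha_j K(x_i,x_j)-y_i\Bigr)^{2}
  +\lambda\sum_{j=1}^{m}\lvert\alpha_j\rvert .
\]

Because the minimization runs over the sample-dependent span of {K(·, x_i)}_{i=1}^m rather than a fixed reproducing kernel Hilbert space, K needs no symmetry or positive semi-definiteness, and the error analysis acquires the extra hypothesis error term:

\[
  \mathcal E(f_{\mathbf z})-\mathcal E(f_\rho)\;\le\;
  \text{(regularization error)}+\text{(sample error)}+\text{(hypothesis error)}.
\]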
     Secondly, we consider the binary classification problem of learning from samples drawn from a non-identical sequence of probability measures. Our main goal is to provide satisfactory estimates for the excess misclassification error of the produced classifiers. Similar results can be obtained for multi-class classification, since we give a comparison theorem for the error analysis of multi-class classifiers.
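     For reference, the excess misclassification error estimated here is the standard quantity from learning theory, written below for the binary case (in the non-identical setting, the measure against which it is evaluated is part of the thesis's framework and is left unspecified in this sketch): for a real-valued function f and the induced classifier sgn(f),

\[
  \mathcal R(\operatorname{sgn}(f))-\mathcal R(f_c),\qquad
  \mathcal R(\mathcal C)=\operatorname{Prob}\{\mathcal C(x)\neq y\},
\]

where the Bayes rule f_c takes the value 1 when Prob(y = 1 | x) ≥ 1/2 and −1 otherwise.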
     Finally, we consider the sparsity issue for ℓ¹ regularization schemes. This topic has attracted much attention recently, and we give some discussion of both theoretical and practical aspects for further study.
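     The sparsity effect itself is easy to observe numerically. The sketch below is purely illustrative and not taken from the thesis; the synthetic data, Gaussian kernel width s and penalty level lam are arbitrary choices. It fits an ℓ¹-penalized kernel expansion with a generic lasso solver and counts the coefficients that survive:

    import numpy as np
    from sklearn.linear_model import Lasso

    # Synthetic regression data: y = sin(pi * x) + noise.
    rng = np.random.default_rng(0)
    m = 200
    x = rng.uniform(-1.0, 1.0, size=(m, 1))
    y = np.sin(np.pi * x[:, 0]) + 0.1 * rng.standard_normal(m)

    # Gaussian kernel matrix K[i, j] = exp(-(x_i - x_j)^2 / (2 s^2)).
    s = 0.3
    K = np.exp(-((x - x.T) ** 2) / (2.0 * s ** 2))

    # l1-penalized coefficient fit for f(x) = sum_j alpha_j K(x, x_j):
    # scikit-learn's Lasso minimizes (1/(2m)) ||K @ alpha - y||^2 + lam * ||alpha||_1.
    lam = 0.01
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=50000)
    model.fit(K, y)

    print("nonzero coefficients:", int(np.count_nonzero(model.coef_)), "of", m)

Typically only a small fraction of the m coefficients remain nonzero, so the learned function is a short kernel expansion; how learning rates interact with this sparsity is the kind of question raised above for further study.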
