Learning from Data: Hermite Algorithms and Normal Vector Estimation on Riemannian Manifolds
Abstract
In this thesis, we study several algorithms in learning theory for regression, manifold learning, and data analysis. We discuss the design of these algorithms in detail and analyze their asymptotic behavior from the viewpoint of approximation theory.
In the first part of the thesis, we study the problem of learning from sample data involving both function values and gradients, within the framework of least-squares regularized regression in reproducing kernel Hilbert spaces. With the help of a representer theorem, the algorithm reduces to solving a linear system whose coefficient matrix involves the Gram matrix of kernel values and the Hessian matrices of kernel partial derivatives. The additional gradient samples improve the learning performance of the algorithm. We give an error analysis by means of sampling operators for the sample error and integral operators on Sobolev spaces for the approximation error.
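To fix ideas, one natural formulation consistent with this description (the exact weighting of the two data-fitting terms and the normalization are our assumptions, not necessarily those used in the thesis) is

\[
  f_{\mathbf z} \;=\; \arg\min_{f \in \mathcal H_K}\;
  \frac{1}{m}\sum_{i=1}^{m}\Big[\big(f(x_i)-y_i\big)^{2}
  + \big\|\nabla f(x_i)-g_i\big\|_{2}^{2}\Big]
  \;+\; \lambda\,\|f\|_{K}^{2},
\]

where the sample \( \mathbf z = \{(x_i, y_i, g_i)\}_{i=1}^{m} \) contains approximate function values \(y_i\) and gradients \(g_i\). For kernels with derivative-reproducing properties, a representer theorem yields

\[
  f_{\mathbf z}(x) \;=\; \sum_{i=1}^{m}\Big( c_i\,K(x,x_i)
  + \sum_{j=1}^{n} d_{i,j}\,\partial_{u_j} K(x,u)\big|_{u=x_i}\Big),
\]

so the coefficient vector \((c,d)\) is obtained from a linear system built from kernel values and kernel partial derivatives.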
Normal vector estimation is an important topic in processing point cloud data and in surface reconstruction in computer graphics. In the second part of the thesis, we consider the problem of estimating normal vectors for a codimension-one submanifold of a Euclidean space. Since the manifold is unknown, the normals must be estimated from sample points drawn randomly from the manifold. We propose a kernel-based learning algorithm, which is in fact an unsupervised form of gradient learning. The algorithm reduces to solving an eigenvector problem in linear algebra. Under suitable conditions on the true normals and the sampling distribution, we obtain error estimates for the algorithm.
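The algorithm proposed in the thesis is kernel based and reduces to an eigenvector problem. As a rough, simplified illustration of that flavor, the sketch below estimates normals by kernel-weighted local PCA, taking at each point the eigenvector of a Gaussian-weighted covariance matrix associated with its smallest eigenvalue; this is a classical stand-in, not the algorithm analyzed in the thesis, and the function name and bandwidth choice are ours.

import numpy as np

def estimate_normals(points, bandwidth):
    """Kernel-weighted local PCA normal estimation for points sampled from a
    hypersurface (codimension-one submanifold) in R^n. For each point, the
    estimated unit normal (sign/orientation not fixed) is the eigenvector of a
    Gaussian-weighted covariance matrix with the smallest eigenvalue."""
    m, n = points.shape
    normals = np.zeros((m, n))
    for k in range(m):
        diff = points - points[k]                       # (m, n) displacements
        w = np.exp(-np.sum(diff**2, axis=1) / (2 * bandwidth**2))
        mean = (w[:, None] * points).sum(axis=0) / w.sum()
        centered = points - mean
        cov = (w[:, None] * centered).T @ centered / w.sum()
        eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
        normals[k] = eigvecs[:, 0]                      # smallest-eigenvalue direction
    return normals

# Example: noisy samples from the unit circle in R^2; the estimated normals
# should approximately align, up to sign, with the radial directions.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.standard_normal((200, 2))
nrm = estimate_normals(pts, bandwidth=0.3)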
In the last part of the thesis, we discuss regularized regression in sample-dependent hypothesis spaces. For a given set of samples, the functions in such a hypothesis space are defined as linear combinations of a family of basis functions generated by a kernel function and the sample data, so each function is completely determined by its combination coefficients. The sample dependence of this kernel-based hypothesis space gives the learning algorithm great flexibility and adaptivity. The regularization schemes studied in these spaces differ essentially from the classical ones in reproducing kernel Hilbert spaces: the kernel we consider need not be symmetric, hence need not be positive semi-definite, and the regularizer, viewed as a functional acting on functions in the space, is taken to be the p-th power of the ℓ^p-norm of the corresponding combination coefficients. These differences add considerable difficulty to the error analysis.
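In symbols, with sample points x = (x_1, ..., x_m), the hypothesis space and regularization scheme described above take the form (up to normalization conventions)

\[
  \mathcal H_{K,\mathbf x} \;=\; \Big\{\, f_c = \sum_{i=1}^{m} c_i\,K(\cdot, x_i) \;:\; c=(c_1,\dots,c_m)\in\mathbb R^{m} \Big\},
\]
\[
  f_{\mathbf z} = f_{c^{\mathbf z}}, \qquad
  c^{\mathbf z} \;=\; \arg\min_{c\in\mathbb R^{m}}\;
  \frac{1}{m}\sum_{i=1}^{m}\big(f_c(x_i)-y_i\big)^{2}
  \;+\; \lambda\,\|c\|_{\ell^{p}}^{p},
\]

where the kernel K need not be symmetric or positive semi-definite.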
More specifically, we study two cases in this thesis: p = 1 and p = 2. When p = 1, the ℓ^1 regularizer often produces sparse solution vectors, which can greatly improve computational efficiency. When p = 2, the corresponding algorithm is linear and can be solved through a linear system. Both algorithms have been studied in the literature. In this thesis, we apply concentration techniques based on ℓ^2-empirical covering numbers to obtain the best learning rates known so far for these algorithms. Since our aim is a capacity-dependent analysis, we also establish, for the function spaces induced by the non-symmetric kernel that appear in the error analysis, bounds on the ℓ^2-empirical covering numbers of their unit balls; these bounds play a key role in our analysis.
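A minimal numerical sketch of the two cases follows, using a symmetric Gaussian kernel for simplicity even though the scheme allows general non-symmetric kernels; the solver choices (a closed-form linear system for p = 2, ISTA for p = 1), the function names, and the toy data are our own illustrations, not taken from the thesis.

import numpy as np

def gaussian_kernel(X, Y, sigma=0.5):
    """Gaussian kernel matrix K[i, j] = exp(-|X_i - Y_j|^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

def coef_regularization_p2(K, y, lam):
    """p = 2: minimize (1/m)||K c - y||^2 + lam ||c||_2^2,
    solved in closed form via the normal equations."""
    m = K.shape[0]
    return np.linalg.solve(K.T @ K / m + lam * np.eye(m), K.T @ y / m)

def coef_regularization_p1(K, y, lam, n_iter=5000):
    """p = 1: minimize (1/m)||K c - y||^2 + lam ||c||_1 by ISTA
    (proximal gradient with soft thresholding); solutions are typically sparse."""
    m = K.shape[0]
    step = m / (2 * np.linalg.norm(K, 2) ** 2)       # 1 / Lipschitz constant of the smooth part
    c = np.zeros(m)
    for _ in range(n_iter):
        grad = 2.0 / m * K.T @ (K @ c - y)
        z = c - step * grad
        c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return c

# Toy regression example on [0, 1].
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.standard_normal(40)
K = gaussian_kernel(x, x)
c2 = coef_regularization_p2(K, y, lam=1e-3)          # dense coefficient vector
c1 = coef_regularization_p1(K, y, lam=1e-2)          # typically sparse
print(np.sum(np.abs(c1) > 1e-8), "nonzero coefficients out of", len(c1))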