Tikhonov, Ivanov and Morozov regularization for support vector machine learning
  • Authors: Luca Oneto; Sandro Ridella; Davide Anguita
  • Keywords: Structural risk minimization; Tikhonov regularization; Ivanov regularization; Morozov regularization; Support vector machine
  • Journal: Machine Learning
  • Publication year: 2016
  • Publication date: April 2016
  • Volume: 103
  • Issue: 1
  • Pages: 103–136
  • Full text size: 1,165 KB
  • References: Anguita, D., Ghio, A., Greco, N., Oneto, L., & Ridella, S. (2010). Model selection for support vector machines: Advantages and disadvantages of the machine learning theory. In International joint conference on neural networks (pp. 1–8).
    Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2011). The impact of unlabeled patterns in Rademacher complexity theory for kernel classifiers. In Advances in neural information processing systems (pp. 585–593).
    Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2011). In-sample model selection for support vector machines. In International joint conference on neural networks (pp. 1154–1161).
    Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2012). In-sample and out-of-sample model selection and error estimation for support vector machines. IEEE Transactions on Neural Networks and Learning Systems, 23(9), 1390–1406.
    Anthony, M. (2001). Discrete mathematics of neural networks: Selected topics. Philadelphia: Society for Industrial Mathematics.
    Aronszajn, N. (1951). Theory of reproducing kernels. Cambridge: Harvard University.
    Bach, F. R., Thibaux, R., & Jordan, M. I. (2005). Computing regularization paths for learning multiple kernels. In Advances in neural information processing systems.
    Bartlett, P., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33(4), 1497–1537.
    Bartlett, P., Jordan, M., & McAuliffe, J. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
    Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525–536.
    Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3, 463–482.
    Bauschke, H. H., & Combettes, P. L. (2011). Convex analysis and monotone operator theory in Hilbert spaces. New York: Springer.
    Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. New York: Springer.
    Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.
    Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced lectures on machine learning (pp. 169–207).
    Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. The Journal of Machine Learning Research, 2, 499–526.
    Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
    Collins, M., Schapire, R., & Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1–3), 253–285.
    Cortes, C., Kloft, M., & Mohri, M. (2013). Learning kernels using local Rademacher complexity. In Advances in neural information processing systems.
    Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
    Dinuzzo, F., & Schölkopf, B. (2012). The representer theorem for Hilbert spaces: A necessary and sufficient condition. arXiv preprint arXiv:1205.1928.
    Duan, K., Keerthi, S., & Poo, A. (2003). Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51, 41–59.
    Elisseeff, A., Evgeniou, T., & Pontil, M. (2005). Stability of randomized learning algorithms. Journal of Machine Learning Research, 6, 55–79.
    Evgeniou, T., Pontil, M., & Elisseeff, A. (2004). Leave one out error, stability, and generalization of voting combinations of classifiers. Machine Learning, 55(1), 71–97.
    Fan, R., Chen, P., & Lin, C. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, 1889–1918.
    Feldman, V., Guruswami, V., Raghavendra, P., & Wu, Y. (2009). Agnostic learning of monomials by halfspaces is hard. In Annual IEEE symposium on foundations of computer science (pp. 385–394).
    Flannery, B., Press, W., Teukolsky, S., & Vetterling, W. (1992). Numerical recipes in C. New York: Press Syndicate of the University of Cambridge.
    Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
    Goldstein, A. A. (1977). Optimization of Lipschitz continuous functions. Mathematical Programming, 13(1), 14–22.
    Gunter, L., & Zhu, J. (2005). Computing the solution path for the regularized support vector regression. In Neural information processing systems (pp. 481–488).
    Guyon, I., Saffari, A., Dror, G., & Cawley, G. (2010). Model selection: Beyond the Bayesian/frequentist divide. The Journal of Machine Learning Research, 11, 61–87.
    Haagerup, U. (1981). The best constants in the Khintchine inequality. Studia Mathematica, 70(3), 231–283.
    Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. The Journal of Machine Learning Research, 5, 1391–1415.
    Intel. (2012). Intel Visual Fortran Composer XE. http://software.intel.com/en-us/articles/intel-compilers/. Accessed 18/07/2012.
    Ivanov, V. (1976). The theory of approximate methods and their application to the numerical solution of singular integral equations. New York: Springer.
    Keerthi, S., & Gilbert, E. (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46(1), 351–360.
    Keerthi, S., & Lin, C. (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7), 1667–1689.
    Keerthi, S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2001). Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637–649.
    Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5), 1902–1914.
    Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6), 2593–2656.
    Lawler, E., & Wood, D. (1966). Branch-and-bound methods: A survey. Operations Research, 14, 699–719.
    Lee, W., Bartlett, P., & Williamson, R. (1998). The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5), 1974–1980.
    Martein, L., & Schaible, S. (1987). On solving a linear program with one quadratic constraint. Decisions in Economics and Finance, 10(1), 75–90.
    McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141(1), 148–188.
    Mendelson, S. (2003). On the performance of kernel classes. The Journal of Machine Learning Research, 4, 759–771.
    Milenova, B., Yarmus, J., & Campos, M. (2005). SVM in Oracle Database 10g: Removing the barriers to widespread adoption of support vector machines. In International conference on very large data bases (pp. 1152–1163).
    Morozov, V., Nashed, Z., & Aries, A. (1984). Methods for solving incorrectly posed problems. New York: Springer.
    Munder, S., & Gavrila, D. (2006). An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1863–1868.
    Oneto, L., Ghio, A., Ridella, S., & Anguita, D. (2014). Fully empirical and data-dependent stability-based bounds. IEEE Transactions on Cybernetics. doi:10.1109/TCYB.2014.2361857.
    Park, M. Y., & Hastie, T. (2007). L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 659–677.
    Pelckmans, K., Suykens, J., & De Moor, B. (2004). Morozov, Ivanov and Tikhonov regularization based LS-SVMs. In Neural information processing systems (pp. 1216–1222).
    Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in kernel methods: Support vector learning, 208, 1–21.
    Platt, J. (1999). Using analytic QP and sparseness to speed training of support vector machines. In Advances in neural information processing systems (pp. 557–563).
    Poggio, T., Mukherjee, S., Rifkin, R., Rakhlin, A., & Verri, A. (2002). b. In Uncertainty in geometric computations.
    Poggio, T., Rifkin, R., Mukherjee, S., & Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428(6981), 419–422.
    Pontil, M., & Verri, A. (1998). Properties of support vector machines. Neural Computation, 10(4), 955–974.
    Schölkopf, B. (2001). The kernel trick for distances. In Neural information processing systems (pp. 301–307).
    Schölkopf, B., Herbrich, R., & Smola, A. (2001). A generalized representer theorem. In Computational learning theory (pp. 416–426).
    Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
    Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
    Shawe-Taylor, J., & Sun, S. (2011). A review of optimization methodologies in support vector machines. Neurocomputing, 74(17), 3609–3618.
    Suykens, J., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
    Tikhonov, A., Arsenin, V., & John, F. (1977). Solutions of ill-posed problems. Washington, DC: Winston.
    Tomasi, C. (2004). Learning theory: Past performance and future results. Nature, 428(6981), 378–378.
    Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
    Vapnik, V. (2000). The nature of statistical learning theory. New York: Springer.
    Vorontsov, K. (2010). Exact combinatorial bounds on the probability of overfitting for empirical risk minimization. Pattern Recognition and Image Analysis, 20(3), 269–285.
    Yuille, A., & Rangarajan, A. (2003). The concave–convex procedure. Neural Computation, 15(4), 915–936.
  • Author affiliations: Luca Oneto (1)
    Sandro Ridella (1)
    Davide Anguita (2)

    1. DITEN - University of Genoa, Via Opera Pia 11a, 16145, Genoa, Italy
    2. DIBRIS - University of Genoa, Via Opera Pia 13, 16145, Genoa, Italy
  • Journal category: Computer Science
  • Journal subjects: Artificial Intelligence and Robotics
    Automation and Robotics
    Computing Methodologies
    Simulation and Modeling
    Language Translation and Linguistics
  • Publisher: Springer Netherlands
  • ISSN: 1573-0565
Abstract
Learning according to the structural risk minimization (SRM) principle can be naturally expressed as an Ivanov regularization problem. Vapnik himself pointed out this connection when deriving an actual learning algorithm from the principle, namely the well-known support vector machine, but quickly suggested resorting to a Tikhonov regularization scheme instead. At that time this was the best choice, because the corresponding optimization problem is easier to solve and, in any case, under certain hypotheses the solutions obtained by the two approaches coincide. On the other hand, recent advances in learning theory clearly show that the Ivanov regularization scheme allows a more effective control of the learning hypothesis space and, therefore, of the generalization ability of the selected hypothesis. In this paper we prove the equivalence between the Ivanov and Tikhonov approaches and, for the sake of completeness, their connection to Morozov regularization, which has been shown to be useful when an effective estimate of the noise in the data is available. We also show that this equivalence holds under milder conditions on the loss function than in Vapnik's original proposal. These results allow us to derive several methods for performing SRM learning according to an Ivanov or Morozov regularization scheme while using Tikhonov-based solvers, which have been thoroughly studied in recent decades and for which very efficient implementations have been proposed.
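
For orientation, the three schemes compared in the abstract are usually written as follows for the soft-margin SVM; this is a standard textbook-style summary under assumed notation (feature map \phi, labels y_i \in \{-1,+1\}, slacks \xi_i), not a quotation from the paper:

\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t. } y_i(\langle w,\phi(x_i)\rangle + b) \ge 1-\xi_i,\ \xi_i \ge 0 \qquad \text{(Tikhonov)}

\min_{w,b,\xi}\ \sum_{i=1}^{n}\xi_i \quad \text{s.t. } \|w\|^2 \le \rho^2,\ y_i(\langle w,\phi(x_i)\rangle + b) \ge 1-\xi_i,\ \xi_i \ge 0 \qquad \text{(Ivanov)}

\min_{w,b,\xi}\ \|w\|^2 \quad \text{s.t. } \sum_{i=1}^{n}\xi_i \le \varepsilon,\ y_i(\langle w,\phi(x_i)\rangle + b) \ge 1-\xi_i,\ \xi_i \ge 0 \qquad \text{(Morozov)}

A practical consequence of the equivalence stated in the abstract is that an Ivanov-constrained solution can be recovered from an ordinary Tikhonov solver by searching for the value of C at which the constraint \|w\|^2 \le \rho^2 is saturated (when it is active). The following Python sketch illustrates that idea with a log-scale bisection over C on top of scikit-learn's linear SVC; the function name, tolerances, and the assumption that \|w(C)\|^2 is non-decreasing in C are illustrative choices, not the algorithms derived in the paper.

# Illustrative sketch only: enforce an Ivanov-style constraint ||w||^2 <= rho2
# by bisecting over the Tikhonov parameter C of a standard soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

def ivanov_svm_via_tikhonov(X, y, rho2, c_lo=1e-4, c_hi=1e4, tol=1e-3, max_iter=60):
    def fit(C):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w = clf.coef_.ravel()                 # primal weights (linear kernel only)
        return float(w @ w), clf

    n_hi, clf = fit(c_hi)
    if n_hi <= rho2:                          # constraint inactive even for large C
        return clf
    for _ in range(max_iter):
        c_mid = np.sqrt(c_lo * c_hi)          # geometric midpoint of the bracket
        n_mid, clf = fit(c_mid)
        if abs(n_mid - rho2) <= tol * rho2:   # constraint approximately saturated
            break
        if n_mid > rho2:
            c_hi = c_mid                      # norm too large: shrink C
        else:
            c_lo = c_mid                      # norm below the radius: grow C
    return clf

# Example usage on synthetic data:
#   from sklearn.datasets import make_classification
#   X, y = make_classification(n_samples=200, random_state=0)
#   model = ivanov_svm_via_tikhonov(X, y, rho2=1.0)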
