Generalization Error Curves for Analytic Spectral Algorithms under Power-law Decay (2401.01599v3)

Published 3 Jan 2024 in cs.LG, math.ST, and stat.TH

Abstract: The generalization error curve of a kernel regression method describes the exact order of its generalization error across different source conditions, noise levels, and choices of the regularization parameter, rather than only the minimax rate. In this work, under mild assumptions, we rigorously provide a full characterization of the generalization error curves of kernel gradient descent (and a large class of analytic spectral algorithms) in kernel regression. As consequences, we sharpen the near inconsistency of kernel interpolation and clarify the saturation effects of kernel regression algorithms with higher qualification. Through neural tangent kernel theory, these results greatly improve our understanding of the generalization behavior of wide neural networks during training. A novel technical contribution, the analytic functional argument, might be of independent interest.
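The abstract treats kernel gradient descent as one member of a family of spectral algorithms: in the kernel's eigenbasis, each such method acts through a filter function of the eigenvalues, and the generalization error curve traces the excess risk as the regularization (here, the number of iterations) varies. The following is a minimal, self-contained sketch of that viewpoint under a power-law spectrum; the sequence-model simplification, the helper names (gd_filter, risk_proxy), the parameter values, and the risk proxy itself are illustrative assumptions, not the paper's actual setup or results.

```python
import numpy as np

# Sequence-model caricature of kernel regression in the kernel's eigenbasis.
# Eigenvalues decay as a power law, lambda_i ~ i^{-beta}; the target's
# L2-coefficients satisfy a source condition, f_i^2 ~ i^{-(s*beta + 1)}.
# All numbers below are illustrative assumptions.

M = 2000       # number of retained eigen-directions (truncation)
beta = 2.0     # eigenvalue decay exponent
s = 1.0        # source-condition (smoothness) parameter
sigma = 0.1    # noise level
n = 500        # sample size

i = np.arange(1, M + 1, dtype=float)
lam = i ** (-beta)                      # power-law kernel spectrum
f_star = i ** (-(s * beta + 1) / 2)     # target coefficients under the source condition

def gd_filter(lam, t, eta=1.0):
    """Spectral filter of kernel gradient descent after t steps:
    phi_t(z) = (1 - (1 - eta*z)**t) / z.  The step size eta must satisfy
    eta * max(lam) <= 1 for the iteration to be stable."""
    return (1.0 - (1.0 - eta * lam) ** t) / lam

def risk_proxy(t):
    """Crude fixed-design proxy for the excess risk after t steps:
    squared bias from the unfitted part of the target plus a variance
    term of order sigma^2 / n weighted by the squared filter response."""
    shrink = gd_filter(lam, t) * lam          # phi_t(lambda_i) * lambda_i, in [0, 1]
    bias2 = np.sum((1.0 - shrink) ** 2 * f_star ** 2)
    var = sigma ** 2 / n * np.sum(shrink ** 2)
    return bias2 + var

# Sweep the number of iterations (1/t plays the role of the regularization
# parameter) to trace out a generalization-error curve: the risk first
# decreases as the bias shrinks, then increases as the variance takes over.
for t in [1, 10, 100, 1_000, 10_000, 100_000]:
    print(f"t = {t:7d}   risk proxy ~ {risk_proxy(t):.3e}")
```

Varying beta, s, sigma, or n in this toy model changes where the curve bottoms out, which is the kind of dependence on the source condition, noise level, and regularization parameter that the paper characterizes rigorously.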
