Optimal Rates of Kernel Ridge Regression under Source Condition in Large Dimensions (2401.01270v1)

Published 2 Jan 2024 in cs.LG

Abstract: Motivated by studies of neural networks (e.g., the neural tangent kernel theory), we study the large-dimensional behavior of kernel ridge regression (KRR) where the sample size $n \asymp d^{\gamma}$ for some $\gamma > 0$. Given an RKHS $\mathcal{H}$ associated with an inner product kernel defined on the sphere $\mathbb{S}^{d}$, we suppose that the true function $f_{\rho}^{*} \in [\mathcal{H}]^{s}$, the interpolation space of $\mathcal{H}$ with source condition $s>0$. We first determine the exact order (both upper and lower bounds) of the generalization error of KRR for the optimally chosen regularization parameter $\lambda$. We then further show that when $0<s\le 1$, KRR is minimax optimal, and when $s>1$, KRR is not minimax optimal (a.k.a. the saturation effect). Our results illustrate that the curve of the rate as a function of $\gamma$ exhibits periodic plateau behavior and multiple descent behavior, and they show how this curve evolves with $s>0$. Interestingly, our work provides a unified viewpoint on several recent works on kernel regression in the large-dimensional setting, which correspond to $s=0$ and $s=1$ respectively.
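
As a concrete illustration of the setting in the abstract, the minimal sketch below fits kernel ridge regression with an inner product (dot-product) kernel on the sphere in the large-dimensional regime $n \asymp d^{\gamma}$, using the standard closed form $\hat{f}(x) = k(x, X)(K + n\lambda I)^{-1} y$. The kernel choice $\Phi(t)=e^{t}$, the placeholder target function, the values of $d$ and $\gamma$, and the ad hoc choice $\lambda = n^{-1/2}$ are illustrative assumptions, not the paper's setup or its optimal regularization schedule.

```python
import numpy as np

def sphere_sample(n, d, rng):
    """Draw n points uniformly on the unit sphere in R^d (the paper works on S^d)."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def inner_product_kernel(X, Z):
    """Inner product kernel k(x, z) = Phi(<x, z>); Phi(t) = exp(t) is an illustrative choice."""
    return np.exp(X @ Z.T)

def krr_fit_predict(X_train, y_train, X_test, lam):
    """KRR estimator: f_hat(x) = k(x, X_train) (K + n * lam * I)^{-1} y_train."""
    n = X_train.shape[0]
    K = inner_product_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return inner_product_kernel(X_test, X_train) @ alpha

rng = np.random.default_rng(0)
d, gamma = 20, 1.5
n = int(d ** gamma)                               # large-dimensional scaling n ~ d^gamma
X_train = sphere_sample(n, d, rng)
X_test = sphere_sample(500, d, rng)
f_star = lambda X: X[:, 0]                        # placeholder target (the paper assumes f* in [H]^s)
y_train = f_star(X_train) + 0.1 * rng.standard_normal(n)
y_hat = krr_fit_predict(X_train, y_train, X_test, lam=n ** -0.5)  # lambda chosen ad hoc here
print("test MSE:", np.mean((y_hat - f_star(X_test)) ** 2))
```

Sweeping $\gamma$ (and hence $n$) in such a simulation is one way to visualize the plateau and multiple-descent behavior of the error curve that the paper characterizes theoretically.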
