Optimal Rates of Kernel Ridge Regression under Source Condition in Large Dimensions (2401.01270v1)
Abstract: Motivated by studies of neural networks (e.g., the neural tangent kernel theory), we study the large-dimensional behavior of kernel ridge regression (KRR), where the sample size $n \asymp d^{\gamma}$ for some $\gamma > 0$. Given an RKHS $\mathcal{H}$ associated with an inner product kernel defined on the sphere $\mathbb{S}^{d}$, we suppose that the true function $f_{\rho}^{*} \in [\mathcal{H}]^{s}$, the interpolation space of $\mathcal{H}$ with source condition $s>0$. We first determine the exact order (both upper and lower bounds) of the generalization error of KRR for the optimally chosen regularization parameter $\lambda$. We then show that KRR is minimax optimal when $0<s\le 1$ and is not minimax optimal when $s>1$ (a.k.a. the saturation effect). Our results illustrate that the curves of the rate as a function of $\gamma$ exhibit periodic plateau behavior and multiple descent behavior, and they show how these curves evolve with $s>0$. Interestingly, our work provides a unified viewpoint on several recent works on kernel regression in the large-dimensional setting, which correspond to $s=0$ and $s=1$, respectively.
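To make the setting concrete, below is a minimal sketch of KRR with an inner product kernel on the sphere in the regime $n \asymp d^{\gamma}$, using the standard closed-form estimator $\hat f_\lambda(x) = K(x, X)\,(K(X, X) + n\lambda I)^{-1} Y$. The kernel profile `phi`, the target `f_star`, and the choices of `d`, `gamma`, and `lam` are illustrative assumptions rather than the specific objects analyzed in the paper; sweeping `lam` only hints at the role of the optimally chosen regularization parameter.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere S^d in R^{d+1}."""
    x = rng.standard_normal((n, d + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def inner_product_kernel(X, Z, phi=np.exp):
    """Inner product kernel K(x, z) = phi(<x, z>); phi is an assumed profile."""
    return phi(X @ Z.T)

def krr_fit_predict(X_train, y_train, X_test, lam, phi=np.exp):
    """Closed-form KRR prediction: K(X_test, X) (K(X, X) + n*lam*I)^{-1} y."""
    n = X_train.shape[0]
    K = inner_product_kernel(X_train, X_train, phi)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return inner_product_kernel(X_test, X_train, phi) @ alpha

# Illustrative setting: n ≍ d^gamma with gamma = 1.5, noisy observations of a
# simple placeholder target (not the interpolation-space targets of the paper).
rng = np.random.default_rng(0)
d, gamma = 20, 1.5
n = int(d ** gamma)
X, X_test = sample_sphere(n, d, rng), sample_sphere(500, d, rng)
f_star = lambda X: X[:, 0] + 0.5 * X[:, 1] ** 2
y = f_star(X) + 0.1 * rng.standard_normal(n)

for lam in [1e-4, 1e-2, 1.0]:
    err = np.mean((krr_fit_predict(X, y, X_test, lam) - f_star(X_test)) ** 2)
    print(f"lambda = {lam:g}: test MSE ≈ {err:.4f}")
```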
- Strong inductive biases provably prevent harmless interpolation. In The Eleventh International Conference on Learning Representations, 2023.
- On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems, 32, 2019.
- Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
- On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
- On the inconsistency of kernel ridgeless regression in fixed dimensions. SIAM Journal on Mathematics of Data Science, 5(4):854–872, 2023.
- A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 32, 2019.
- Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pages 1024–1034. PMLR, 2020.
- S. Buchholz. Kernel interpolation in Sobolev spaces is not consistent in low dimensions. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 3410–3440. PMLR, July 2022.
- A. Caponnetto. Optimal rates for regularization operators in learning theory. Technical report, Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, 2006.
- A. Caponnetto and E. de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
- A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8(02):161–183, 2010.
- A. Cotsiolis and N. K. Tavoularis. Best constants for Sobolev inequalities for higher order fractional derivatives. Journal of Mathematical Analysis and Applications, 295(1):225–236, 2004.
- Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131–10143, 2021.
- F. Dai and Y. Xu. Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer Monographs in Mathematics. Springer New York, New York, NY, 2013. ISBN 978-1-4614-6659-8 978-1-4614-6660-4. doi: 10.1007/978-1-4614-6660-4.
- How rotational invariance of common kernels prevents generalization in high dimensions. In International Conference on Machine Learning, pages 2804–2814. PMLR, 2021.
- S.-R. Fischer and I. Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21:205:1–205:38, 2020.
- Spherical harmonics and linear representations of Lie groups. Differential Geometry and Lie Groups: A Second Course, pages 265–360, 2020.
- Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
- When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020.
- Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029 – 1054, 2021. doi: 10.1214/20-AOS1990. URL https://doi.org/10.1214/20-AOS1990.
- The three stages of learning dynamics in high-dimensional kernel methods. In International Conference on Learning Representations, 2021.
- Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 – 986, 2022. doi: 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133.
- H. Hu and Y. M. Lu. Sharp asymptotics of kernel ridge regression beyond the linear regime. arXiv preprint arXiv:2205.06798, 2022.
- Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- N. E. Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1 – 50, 2010. doi: 10.1214/08-AOS648. URL https://doi.org/10.1214/08-AOS648.
- Generalization ability of wide neural networks on $\mathbb{R}$. arXiv preprint arXiv:2302.05933, 2023.
- Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems, 32, 2019.
- Kernel interpolation generalizes poorly. arXiv preprint arXiv:2303.15809, 2023a.
- On the saturation effect of kernel ridge regression. In International Conference on Learning Representations, Feb. 2023b.
- On the asymptotic learning curves of kernel ridge regression under power-law decay. In Thirty-seventh Conference on Neural Information Processing Systems, 2023c.
- T. Liang and A. Rakhlin. Just interpolate: Kernel “Ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329 – 1347, 2020. doi: 10.1214/19-AOS1849. URL https://doi.org/10.1214/19-AOS1849.
- On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, pages 2683–2711. PMLR, 2020.
- J. Lin and V. Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. Journal of Machine Learning Research, 21(147):1–63, 2020.
- Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 48:868–890, 2018.
- Kernel regression in high dimensions: Refined analysis beyond double descent. In International Conference on Artificial Intelligence and Statistics, pages 649–657. PMLR, 2021.
- Optimal rate of kernel regression in large dimensions. arXiv preprint arXiv:2309.04268, 2023.
- S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
- Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3–84, 2022.
- T. Misiakiewicz. Spectrum of inner-product kernel matrices in the polynomial regime and multiple descent phenomenon in kernel ridge regression. arXiv preprint arXiv:2204.10425, 2022.
- On the embedding constant of the Sobolev type inequality for fractional derivatives. Nonlinear Theory and Its Applications, IEICE, 7(3):386–394, 2016.
- Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020. doi: 10.1109/JSAIT.2020.2984716.
- Reproducing kernels of Sobolev spaces on $\mathbb{R}^{d}$ and applications to embedding constants and tractability. Analysis and Applications, 16(05):693–715, 2018.
- A. Rakhlin and X. Zhai. Consistency of interpolation with laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory, pages 2595–2623. PMLR, 2019.
- I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, 2008.
- I. Steinwart and C. Scovel. Mercer’s theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.
- Optimal rates for regularized least squares regression. In COLT, pages 79–93, 2009.
- L. Tartar. An introduction to Sobolev spaces and interpolation spaces, volume 3. Springer Science & Business Media, 2007.
- A. Tsigler and P. L. Bartlett. Benign overfitting in ridge regression. Journal of Machine Learning Research, 24(123):1–76, 2023.
- Precise learning curves and higher-order scaling limits for dot product kernel regression. In Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.
- Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564 – 1599, 1999. doi: 10.1214/aos/1017939142. URL https://doi.org/10.1214/aos/1017939142.
- On the optimality of misspecified spectral algorithms. arXiv preprint arXiv:2303.14942, 2023a.
- On the optimality of misspecified kernel ridge regression. arXiv preprint arXiv:2305.07241, 2023b.