
Generalization in Kernel Regression Under Realistic Assumptions (2312.15995v2)

Published 26 Dec 2023 in cs.LG, cs.AI, and stat.ML

Abstract: It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise. Many recent works attempt to analyze this phenomenon in the relatively tractable setting of kernel regression. However, as we argue in detail, most past works on this topic either make unrealistic assumptions, or focus on a narrow problem setup. This work aims to provide a unified theory to upper bound the excess risk of kernel regression for nearly all common and realistic settings. Specifically, we provide rigorous bounds that hold for common kernels and for any amount of regularization and noise, any input dimension, and any number of samples. Furthermore, we provide relative perturbation bounds for the eigenvalues of kernel matrices, which may be of independent interest. These reveal a self-regularization phenomenon, whereby a heavy tail in the eigendecomposition of the kernel provides it with an implicit form of regularization, enabling good generalization. When applied to common kernels, our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression. As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime.
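To make the setting concrete, the sketch below illustrates kernel ridge regression and the qualitative "self-regularization" idea described in the abstract: the trailing eigenvalues of a common kernel matrix are individually small but sum to a non-negligible quantity that behaves like an implicit ridge. This is a minimal numerical illustration, not the paper's construction or bounds; the Laplace kernel, bandwidth, sample sizes, and the tail cutoff are arbitrary choices made for the example.

```python
# Illustrative sketch (not the paper's method): kernel ridge regression with a
# Laplace kernel, and the heavy-tailed eigenvalue decay of the kernel matrix
# that the abstract associates with implicit ("self-") regularization.
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 10                                   # samples and input dimension (arbitrary)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.sin(X @ rng.standard_normal(d)) + 0.1 * rng.standard_normal(n)  # noisy targets

def laplace_kernel(A, B, bandwidth=1.0):
    # K(x, z) = exp(-||x - z|| / bandwidth)
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / bandwidth)

K = laplace_kernel(X, X)
eigvals = np.linalg.eigvalsh(K)[::-1]            # eigenvalues, sorted descending

# Heavy tail: eigenvalues beyond the top-k are small individually, but their
# sum acts roughly like an added ridge on the top components.
k = 20
implicit_ridge = eigvals[k:].sum() / n
print(f"top eigenvalue: {eigvals[0]:.3f}, implicit ridge from tail: {implicit_ridge:.3f}")

# Kernel ridge regression: f(x) = k(x, X) @ alpha with explicit ridge lam >= 0;
# lam = 0 gives (ridgeless) interpolation whenever K is invertible.
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(n), y)

X_test = rng.standard_normal((50, d)) / np.sqrt(d)
y_pred = laplace_kernel(X_test, X) @ alpha
print("test predictions shape:", y_pred.shape)
```

Varying lam (including lam = 0) and the spectrum's tail in this toy setup gives a rough feel for the regimes the paper analyzes rigorously: interpolation with benign or tempered overfitting versus explicitly regularized regression.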
