Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression (2205.14846v3)

Published 30 May 2022 in cs.LG and stat.ML

Abstract: As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes. Currently, theoretical understanding of the learning curves that characterize how the prediction error depends on the number of samples is restricted to either large-sample asymptotics ($m\to\infty$) or, for certain simple data distributions, to the high-dimensional asymptotics in which the number of samples scales linearly with the dimension ($m\propto d$). There is a wide gulf between these two regimes, including all higher-order scaling relations $m\propto d^r$, which are the subject of the present paper. We focus on the problem of kernel ridge regression for dot-product kernels and present precise formulas for the mean of the test error, bias, and variance, for data drawn uniformly from the sphere with isotropic random labels in the $r$th-order asymptotic scaling regime $m\to\infty$ with $m/d^r$ held constant. We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales.
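
The setting described in the abstract (kernel ridge regression with a dot-product kernel on data drawn uniformly from the sphere) can be probed numerically with a short sketch. The snippet below is an illustrative Monte Carlo experiment, not the paper's derivation: the particular kernel $f(t) = (1+t)^2$, the linear target function, the label noise level, and the ridge parameter are assumptions chosen for demonstration, and the reported test error is an empirical estimate rather than the precise formulas derived in the paper.

```python
# Minimal sketch of kernel ridge regression with a dot-product kernel on the sphere.
# All modeling choices below (kernel, target, noise, ridge) are illustrative assumptions.
import numpy as np


def sample_sphere(m, d, rng):
    """Draw m points uniformly from the unit sphere in R^d."""
    x = rng.standard_normal((m, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def dot_product_kernel(X, Z):
    """A simple dot-product kernel k(x, z) = f(<x, z>) with f(t) = (1 + t)^2."""
    return (1.0 + X @ Z.T) ** 2


def krr_test_error(m, d, ridge=1e-3, m_test=2000, noise=0.1, seed=0):
    """Fit kernel ridge regression on m samples and return the empirical test MSE."""
    rng = np.random.default_rng(seed)
    X, X_te = sample_sphere(m, d, rng), sample_sphere(m_test, d, rng)
    # Illustrative target: a fixed linear functional of the input plus label noise.
    w = rng.standard_normal(d) / np.sqrt(d)
    y = X @ w + noise * rng.standard_normal(m)
    y_te = X_te @ w
    K = dot_product_kernel(X, X)
    alpha = np.linalg.solve(K + ridge * np.eye(m), y)  # KRR dual coefficients
    preds = dot_product_kernel(X_te, X) @ alpha
    return float(np.mean((preds - y_te) ** 2))


if __name__ == "__main__":
    d = 30
    for m in [d // 2, d, 2 * d, 5 * d]:  # sweep m around the r = 1 scale m ~ d^r / r!
        print(f"m = {m:4d}, empirical test MSE ~ {krr_test_error(m, d):.4f}")
```

Sweeping the sample size around $m \approx d^r/r!$ for small integer $r$ and averaging over several seeds is one way to empirically visualize the multiple sample-wise descent described above; the paper's contribution is the precise asymptotic formulas for this curve rather than such simulations.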
