
Scaling and renormalization in high-dimensional regression (2405.00592v4)

Published 1 May 2024 in stat.ML, cond-mat.dis-nn, and cs.LG

Abstract: From benign overfitting in overparameterized models to rich power-law scalings in performance, simple ridge regression displays surprising behaviors sometimes thought to be limited to deep neural networks. This balance of phenomenological richness with analytical tractability makes ridge regression the model system of choice in high-dimensional machine learning. In this paper, we present a unifying perspective on recent results on ridge regression using the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. We highlight the fact that statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This `deterministic equivalence' allows us to obtain analytic formulas for the training and generalization errors in a few lines of algebra by leveraging the properties of the $S$-transform of free probability. From these precise asymptotics, we can easily identify sources of power-law scaling in model performance. In all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. This allows us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.


Summary

  • The paper derives deterministic equivalents, via S-transforms and subordination formulas, that simplify the random matrices arising in high-dimensional regression.
  • It characterizes training and generalization errors in linear and kernel ridge regression models through detailed spectral analysis.
  • It uncovers scaling laws and variance-dominated regimes that provide actionable insights for designing robust high-dimensional learning systems.

Understanding High-Dimensional Regression Through the Lens of Random Matrix Theory

Linear Regression and Kernel Methods in High-Dimensional Spaces

Classical linear regression and modern kernel methods such as kernel ridge regression are both workhorses of high-dimensional data analysis. But as the dimensionality of the data (the number of predictors) grows, especially relative to the number of observations, classical analyses break down: estimators suffer from high variance, and low-dimensional intuition about overfitting no longer applies.

Focusing on the setting where data points are drawn from a high-dimensional Gaussian distribution, the paper develops an analytical framework based on tools from random matrix theory and free probability. This style of analysis is valuable because it yields precise, quantitative predictions for how estimators behave in high dimensions, connecting abstract mathematical machinery to practical regression problems.

Utilizing S-transforms in Random Matrix Theory

A key mathematical tool employed in this paper is the S-transform from free probability theory, which tames the complexity associated with products of random matrices: for freely independent matrices, the S-transform of the product is the product of the S-transforms. This is particularly useful when analyzing empirical covariance matrices, which in high-dimensional problems are themselves large random matrices.
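
To make the S-transform concrete, here is a minimal numerical sketch (not code from the paper; the sampling sizes and the moment-series inversion are illustrative choices). It estimates the S-transform of a white Wishart matrix W = XᵀX/n from its sampled eigenvalues and compares it with the textbook closed form S(w) = 1/(1 + qw), where q = d/n is the aspect ratio.

```python
import numpy as np

# Minimal sketch (not from the paper): estimate the S-transform of a white
# Wishart matrix from its eigenvalues and compare with the closed form
# S(w) = 1 / (1 + q w), where q = d / n is the aspect ratio.

rng = np.random.default_rng(0)
n, d = 4000, 1000                         # samples, dimensions; q = 0.25
q = d / n

X = rng.standard_normal((n, d))
evals = np.linalg.eigvalsh(X.T @ X / n)   # eigenvalues of the Wishart matrix

def psi(z, evals):
    """Moment series psi(z) = (1/d) * sum_i z*lam_i / (1 - z*lam_i)."""
    return np.mean(z * evals / (1.0 - z * evals))

def s_transform(w, evals, tol=1e-12):
    """Invert psi on (0, 1/lambda_max) by bisection; S(w) = psi^{-1}(w)*(1+w)/w."""
    lo, hi = 0.0, 1.0 / evals.max()
    for _ in range(200):                  # psi is increasing on this interval
        mid = 0.5 * (lo + hi)
        if psi(mid, evals) < w:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    z = 0.5 * (lo + hi)
    return z * (1.0 + w) / w

for w in [0.25, 0.5, 1.0, 2.0]:
    print(f"w={w:4.2f}  empirical S(w)={s_transform(w, evals):.4f}"
          f"  closed form 1/(1+qw)={1.0 / (1.0 + q * w):.4f}")
```

The same numerical inversion applies to any empirical spectrum, which is what makes the S-transform a practical calculational device rather than a purely formal one.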

Subordination Formulas and Deterministic Equivalence:

These results let us replace random matrices (such as empirical covariance matrices) with simpler deterministic equivalents inside traces and averages, under mild conditions. For ridge regression, the central instance is that statistical fluctuations of the empirical covariance can be absorbed into a renormalization of the ridge parameter, turning otherwise laborious calculations into a few lines of algebra.
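
As a concrete illustration (a minimal sketch under common assumptions, not the paper's code), one standard form of the statement is that, inside traces, the resolvent of the empirical covariance at ridge λ behaves like the resolvent of the population covariance Σ at a renormalized ridge κ ≥ λ, where κ solves the self-consistent equation λ = κ(1 − tr[Σ(Σ + κI)^{-1}]/n). The script below checks λ·tr[(Σ_hat + λI)^{-1}] ≈ κ·tr[(Σ + κI)^{-1}] for an anisotropic Σ, with Σ_hat = XᵀX/n.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's code): check the weak deterministic
# equivalence  lam * (Sigma_hat + lam I)^{-1}  ~  kappa * (Sigma + kappa I)^{-1}
# in trace, where kappa solves the self-consistent "renormalization" equation
#   lam = kappa * (1 - df1(kappa)/n),  df1(kappa) = tr[Sigma (Sigma + kappa I)^{-1}].

rng = np.random.default_rng(1)
n, d = 2000, 800
lam = 1e-2

spec = (1.0 + np.arange(d)) ** -1.5               # power-law population spectrum
Sigma = np.diag(spec)

X = rng.standard_normal((n, d)) * np.sqrt(spec)   # rows ~ N(0, Sigma)
Sigma_hat = X.T @ X / n

def df1(kappa):
    return np.sum(spec / (spec + kappa))

def renormalized_ridge(lam, n, iters=500):
    """Fixed-point iteration for kappa = lam + kappa * df1(kappa) / n."""
    kappa = lam
    for _ in range(iters):
        kappa = lam + kappa * df1(kappa) / n
    return kappa

kappa = renormalized_ridge(lam, n)

lhs = lam * np.trace(np.linalg.inv(Sigma_hat + lam * np.eye(d))) / d
rhs = kappa * np.sum(1.0 / (spec + kappa)) / d

print(f"renormalized ridge kappa = {kappa:.4f}  (bare lambda = {lam})")
print(f"empirical  (1/d) tr[lam (Sigma_hat + lam I)^-1] = {lhs:.5f}")
print(f"predicted  (1/d) tr[kappa (Sigma + kappa I)^-1] = {rhs:.5f}")
```

For reasonable sizes the two traces agree closely, which is the sense in which the renormalized ridge summarizes the fluctuations of the empirical covariance.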

Bridging Theory with Practical Learning Models

Linear and Kernel Ridge Regression:

The paper analyzes ridge regression, which adds a regularization term to combat overfitting in high-dimensional settings. By applying deterministic equivalence and subordination formulas, the authors derive asymptotic expressions for the training and generalization errors, sharply characterized in terms of the data's spectral properties (the eigenvalue distribution of the covariance matrix) and the ridge parameter.
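
The sketch below (an illustration under stated assumptions, not the paper's code) compares a simulation of ridge regression against the form the asymptotic generalization error typically takes in this literature, written entirely in terms of the population spectrum, the target weights, the noise level σ, and the renormalized ridge κ.

```python
import numpy as np

# Minimal sketch (assumed standard form of the asymptotic excess risk, not the
# paper's code): compare simulated ridge regression against the deterministic
# prediction expressed via the population spectrum and the renormalized ridge kappa.

rng = np.random.default_rng(2)
n, d, lam, sigma = 1000, 500, 1e-2, 0.5

spec = (1.0 + np.arange(d)) ** -1.5            # power-law covariance spectrum
w_star = rng.standard_normal(d) / np.sqrt(d)   # target weights

def kappa_of(lam, n):
    k = lam
    for _ in range(500):                       # fixed point of k = lam + k*df1(k)/n
        k = lam + k * np.sum(spec / (spec + k)) / n
    return k

def theory_excess_risk(lam, n):
    k = kappa_of(lam, n)
    gamma = np.sum(spec**2 / (spec + k) ** 2) / n
    bias = k**2 * np.sum(spec * w_star**2 / (spec + k) ** 2)
    return (bias + sigma**2 * gamma) / (1.0 - gamma)

def simulated_excess_risk(lam, n, trials=20):
    out = []
    for _ in range(trials):
        X = rng.standard_normal((n, d)) * np.sqrt(spec)    # rows ~ N(0, Sigma)
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
        out.append(np.sum(spec * (w_hat - w_star) ** 2))   # (w-w*)' Sigma (w-w*)
    return float(np.mean(out))

print(f"theory    : {theory_excess_risk(lam, n):.5f}")
print(f"simulation: {simulated_excess_risk(lam, n):.5f}")
```

Here γ = tr[Σ²(Σ + κI)^{-2}]/n plays the role of an effective degrees-of-freedom ratio that controls how much the error is amplified by fluctuations in the sampled data.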

Linear Random Features Model:

Extending the analysis to random feature models introduces another layer of stochasticity and complexity. Here, the paper scrutinizes how random projections of the inputs (the features) affect learning outcomes, again harnessing S-transforms to disentangle the randomness contributed by the features from that of the data.
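
As a point of reference, here is a minimal simulation sketch of a linear random feature model (the scaling conventions and sizes are illustrative assumptions, not the paper's setup): the learner only sees a fixed random projection of each input and fits ridge regression in that feature space, so the number of features N becomes an explicit resource alongside the number of samples.

```python
import numpy as np

# Minimal sketch of a *linear* random feature model (illustrative conventions, not
# the paper's code): the learner sees projected features  phi(x) = F x / sqrt(d)
# with a fixed random F of shape (N, d) and fits ridge regression on phi(x).
# Sweeping N shows how the number of random features limits performance.

rng = np.random.default_rng(3)
n, d, n_test, lam, sigma = 400, 300, 2000, 1e-3, 0.1

spec = (1.0 + np.arange(d)) ** -1.0
w_star = rng.standard_normal(d) / np.sqrt(d)

def sample(m):
    X = rng.standard_normal((m, d)) * np.sqrt(spec)        # rows ~ N(0, Sigma)
    y = X @ w_star + sigma * rng.standard_normal(m)
    return X, y

X_tr, y_tr = sample(n)
X_te, y_te = sample(n_test)

for N in [50, 100, 200, 400, 800, 1600]:
    F = rng.standard_normal((N, d))                        # random projection, fixed per model
    Phi_tr, Phi_te = X_tr @ F.T / np.sqrt(d), X_te @ F.T / np.sqrt(d)
    theta = np.linalg.solve(Phi_tr.T @ Phi_tr / n + lam * np.eye(N),
                            Phi_tr.T @ y_tr / n)
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"N = {N:5d}   test MSE = {test_mse:.4f}")
```

Sweeping N makes the role of feature randomness visible: near N comparable to the sample size the error can spike, and for large N it approaches the error of ordinary ridge regression on the raw inputs.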

Generalization Error and Scaling Laws:

A particularly practical aspect of the paper is its account of how the generalization error scales with the number of features and the number of training samples. From the precise asymptotics, one can read off when and why the error decays as a power law in these quantities, which connects the analysis to empirical neural scaling laws and is directly relevant to designing learning systems that are both efficient and robust.
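
The following sketch (using the same assumed asymptotic risk formula as above, with illustrative power-law exponents for the spectrum and the target) sweeps the number of samples and fits the resulting power-law exponent of the noise-free generalization error.

```python
import numpy as np

# Minimal sketch (assumed standard asymptotic risk formula, illustrative exponents):
# with a power-law covariance spectrum and power-law target coefficients, sweep the
# number of samples n and fit the power-law exponent of the noise-free error.

d = 30000
idx = 1.0 + np.arange(d)
spec = idx ** -1.5            # "capacity": eigenvalue decay
w2 = idx ** -1.5              # "source": squared target coefficients in the eigenbasis
lam = 1e-9                    # near-ridgeless

def kappa_of(n):
    k = lam
    for _ in range(2000):     # monotone fixed-point iteration for the renormalized ridge
        k = lam + k * np.sum(spec / (spec + k)) / n
    return k

def excess_risk(n):
    k = kappa_of(n)
    gamma = np.sum(spec**2 / (spec + k) ** 2) / n
    bias = k**2 * np.sum(spec * w2 / (spec + k) ** 2)
    return bias / (1.0 - gamma)

ns = np.array([100, 200, 400, 800, 1600, 3200])
errs = np.array([excess_risk(n) for n in ns])
slope = np.polyfit(np.log(ns), np.log(errs), 1)[0]

for n, e in zip(ns, errs):
    print(f"n = {n:5d}   predicted error = {e:.3e}")
print(f"fitted scaling exponent: error ~ n^{slope:.2f}")
```

The fitted exponent depends on how quickly the covariance spectrum and the target coefficients decay, which is exactly the source/capacity structure that drives scaling-law predictions in this line of work.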

Challenges and Opportunities

One takeaway is a nuanced understanding of "variance-dominated" regimes in random feature models: settings where performance is limited not by the bias of the idealized infinite-feature predictor but by the variance introduced by the finite, random set of features, even when the model is overparameterized. Detecting these regimes matters for model design, since they indicate when widening the model or ensembling over feature draws, rather than collecting more data, is the effective lever.
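
A simple empirical diagnostic for this (a rough sketch with illustrative conventions, not the paper's decomposition) is to ensemble several models trained on the same data but with independently drawn features: the ensembled predictor's error approximates the feature-bias contribution, and the gap to a single model's error approximates the variance contributed by the features.

```python
import numpy as np

# Minimal diagnostic sketch (illustrative, not the paper's decomposition): estimate
# how much of a random feature model's test error comes from the randomness of the
# features, by ensembling K independently drawn feature sets on the *same* training
# data. The ensembled predictor's error approximates the feature-bias part; the gap
# to a single model's error approximates the feature-variance part.

rng = np.random.default_rng(4)
n, d, N, n_test, lam, sigma, K = 300, 400, 600, 2000, 1e-3, 0.1, 20

spec = (1.0 + np.arange(d)) ** -1.0
w_star = rng.standard_normal(d) / np.sqrt(d)

def sample(m):
    X = rng.standard_normal((m, d)) * np.sqrt(spec)
    return X, X @ w_star + sigma * rng.standard_normal(m)

X_tr, y_tr = sample(n)
X_te, y_te = sample(n_test)

preds = []
for _ in range(K):
    F = rng.standard_normal((N, d))                       # independent feature draw
    Phi_tr, Phi_te = X_tr @ F.T / np.sqrt(d), X_te @ F.T / np.sqrt(d)
    theta = np.linalg.solve(Phi_tr.T @ Phi_tr / n + lam * np.eye(N),
                            Phi_tr.T @ y_tr / n)
    preds.append(Phi_te @ theta)
preds = np.stack(preds)

single = np.mean((preds - y_te) ** 2, axis=1).mean()      # average single-model error
ensemble = np.mean((preds.mean(axis=0) - y_te) ** 2)      # error of the averaged predictor

print(f"single-model test MSE : {single:.4f}")
print(f"ensembled test MSE    : {ensemble:.4f}")
print(f"approx. feature-variance contribution: {single - ensemble:.4f}")
```

A large gap signals a variance-dominated regime, where adding features or ensemble members is likely to pay off more than other interventions.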

Future Paths and Theoretical Implications

The analytical methods highlighted in the paper open pathways for exploring more complex models, including deep neural networks, within the rigorous mathematical framework offered by random matrix theory. This could lead to a deeper understanding of why certain deep learning models perform exceptionally well and how to systematically improve those that underperform.

As high-dimensional statistical models continue to grow in scale and ambition, the blend of theoretical rigor and practical relevance demonstrated in this paper will remain indispensable. The broader message is that abstract tools from random matrix theory and free probability can yield concrete insight into the data structures and learning algorithms that drive modern AI systems.