Scaling and renormalization in high-dimensional regression (2405.00592v4)
Abstract: From benign overfitting in overparameterized models to rich power-law scalings in performance, simple ridge regression displays surprising behaviors sometimes thought to be limited to deep neural networks. This balance of phenomenological richness with analytical tractability makes ridge regression the model system of choice in high-dimensional machine learning. In this paper, we present a unifying perspective on recent results on ridge regression using the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. We highlight the fact that statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This "deterministic equivalence" allows us to obtain analytic formulas for the training and generalization errors in a few lines of algebra by leveraging the properties of the $S$-transform of free probability. From these precise asymptotics, we can easily identify sources of power-law scaling in model performance. In all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. This allows us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
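As a concrete illustration of the renormalization described in the abstract, the following minimal sketch (not the paper's code) solves the standard self-consistent equation $\kappa = \lambda + \frac{\kappa}{n}\,\mathrm{tr}\!\left[\Sigma(\Sigma+\kappa)^{-1}\right]$ for the renormalized ridge $\kappa$, evaluates the deterministic-equivalent risk $E_g = \kappa^2\,\bar{w}^\top \Sigma(\Sigma+\kappa)^{-2}\bar{w}/(1-\gamma)$ with $\gamma = n^{-1}\mathrm{tr}[\Sigma^2(\Sigma+\kappa)^{-2}]$ for a noiseless linear target, and compares against simulated ridge regression. The train-test relation $E_{\mathrm{tr}} = (\lambda/\kappa)^2 E_g$ is included as an assumed check on the generalization-gap claim; the power-law spectrum, target, and all parameter values below are illustrative choices, not taken from the paper.

```python
import numpy as np

# --- Renormalized ridge: solve the self-consistent equation ----------------
def solve_kappa(lam, evals, n, tol=1e-12, max_iter=100_000):
    """Fixed-point solve of kappa = lam + (kappa/n) * sum_i evals_i/(evals_i + kappa).
    The initial guess lam + tr(Sigma)/n upper-bounds the root, so the
    iteration decreases monotonically onto it."""
    kappa = lam + evals.sum() / n
    for _ in range(max_iter):
        nxt = lam + (kappa / n) * np.sum(evals / (evals + kappa))
        if abs(nxt - kappa) <= tol * max(1.0, kappa):
            return nxt
        kappa = nxt
    return kappa

# --- Deterministic-equivalent risks, noiseless linear target ---------------
def theory_errors(lam, evals, wbar, n):
    kappa = solve_kappa(lam, evals, n)
    gamma = np.sum(evals**2 / (evals + kappa) ** 2) / n
    eg = kappa**2 * np.sum(evals * wbar**2 / (evals + kappa) ** 2) / (1.0 - gamma)
    etr = (lam / kappa) ** 2 * eg  # assumed train-test gap: E_tr = (lam/kappa)^2 E_g
    return eg, etr, kappa

# --- Monte Carlo ridge regression on Gaussian covariates -------------------
def empirical_errors(lam, evals, wbar, n, n_trials=50, seed=0):
    rng = np.random.default_rng(seed)
    d = evals.size
    eg, etr = [], []
    for _ in range(n_trials):
        X = rng.standard_normal((n, d)) * np.sqrt(evals)  # rows ~ N(0, diag(evals))
        y = X @ wbar                                      # noiseless linear target
        w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
        diff = w_hat - wbar
        eg.append(diff @ (evals * diff))                  # test risk w.r.t. Sigma
        etr.append(np.mean((y - X @ w_hat) ** 2))         # training risk
    return float(np.mean(eg)), float(np.mean(etr))

if __name__ == "__main__":
    d, n, lam = 400, 200, 1e-3                 # overparameterized: d/n = 2
    k = np.arange(1, d + 1)
    evals = k ** -1.5                          # illustrative power-law spectrum
    wbar = k ** -0.5
    wbar /= np.sqrt(np.sum(evals * wbar**2))   # unit signal power: wbar' Sigma wbar = 1
    eg_th, etr_th, kappa = theory_errors(lam, evals, wbar, n)
    eg_mc, etr_mc = empirical_errors(lam, evals, wbar, n)
    print(f"renormalized ridge kappa = {kappa:.3e} (bare lambda = {lam:.0e})")
    print(f"test error : theory {eg_th:.4f} vs simulation {eg_mc:.4f}")
    print(f"train error: theory {etr_th:.4f} vs simulation {etr_mc:.4f}")
```

Running this, the theoretical and simulated errors should agree up to finite-size fluctuations. Sweeping $n$ with the power-law spectrum $\lambda_k \propto k^{-1.5}$ then traces out the power-law scaling of $E_g$ that the abstract refers to, and the ratio $E_g/E_{\mathrm{tr}} = (\kappa/\lambda)^2$ makes the $S$-transform's role as the train-test gap explicit.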