A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities (2410.18938v1)
Abstract: A key property of neural networks is their capacity to adapt to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remains limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training on the asymptotic feature spectrum and, in particular, provides a theoretical grounding for how the tails of the feature spectrum change with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and we therefore believe it is of independent technical interest. In contrast to previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous, and allows for finitely supported second-layer initialization, which turns out to be crucial for studying the functional expressivity of the learned features. This provides a sharp description of the impact of feature learning on the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
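To make the setting concrete, the sketch below illustrates the kind of protocol the abstract describes: a two-layer network whose first-layer weights receive a single large gradient step, after which the second layer is fit by ridge regression on the updated features, whose empirical covariance spectrum is the object of study. This is a minimal illustration only; the dimensions, activation, single-index target, learning-rate scaling, and ridge fit are assumptions for the example, not the paper's exact construction.

```python
import numpy as np

# Illustrative sketch of the abstract's setting (assumed details: dimensions,
# tanh activation, single-index teacher, eta ~ sqrt(p) step, ridge second-layer fit).
rng = np.random.default_rng(0)
d, p, n = 500, 600, 2000            # input dimension, hidden width, batch size
eta = np.sqrt(p)                    # "aggressive" learning-rate scaling (assumption)

# Teacher: a simple single-index target (illustrative choice)
w_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = np.tanh(X @ w_star)

# Two-layer network at initialization: f(x) = a^T sigma(W x / sqrt(d)) / sqrt(p)
W0 = rng.standard_normal((p, d))
a0 = rng.choice([-1.0, 1.0], size=p)          # finitely supported second layer
sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

# One full-batch gradient step on the first layer (squared loss)
Z0 = X @ W0.T / np.sqrt(d)
resid = (sigma(Z0) @ a0) / np.sqrt(p) - y
grad_W = (dsigma(Z0) * np.outer(resid, a0)).T @ X / (n * np.sqrt(p * d))
W1 = W0 - eta * grad_W

# Updated features and their empirical covariance (the matrix whose spectrum
# the paper characterizes via a deterministic equivalent)
F = sigma(X @ W1.T / np.sqrt(d))
cov = F.T @ F / n

# Ridge regression on the learned features, then test error on fresh data
lam = 1e-2
a_hat = np.linalg.solve(F.T @ F / n + lam * np.eye(p), F.T @ y / n)
X_test = rng.standard_normal((n, d))
y_test = np.tanh(X_test @ w_star)
err = np.mean((sigma(X_test @ W1.T / np.sqrt(d)) @ a_hat - y_test) ** 2)
print(f"test mse after one feature-learning step: {err:.3f}")
```

Comparing the spectrum of `cov` (e.g., via `np.linalg.eigvalsh`) before and after the gradient step is a quick way to visualize the spike and tail modifications the abstract refers to.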