Asymptotics of Learning with Deep Structured (Random) Features (2402.13999v2)
Abstract: For a large class of feature maps we provide a tight asymptotic characterisation of the test error associated with learning the readout layer, in the high-dimensional limit where the input dimension, hidden layer widths, and number of training samples are proportionally large. This characterisation is formulated in terms of the population covariance of the features. Our work is partially motivated by the problem of learning with Gaussian rainbow neural networks, namely deep non-linear fully-connected networks with random but structured weights, whose row-wise covariances are further allowed to depend on the weights of previous layers. For such networks, we also derive a closed-form formula for the feature covariance in terms of the weight matrices. We further find that in some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
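To make the objects in the abstract concrete, here is a minimal sketch of the setup in a deliberately simplified form: a deep fully-connected feature map with unstructured i.i.d. Gaussian weights (rather than the structured, layer-dependent row-wise covariances of the rainbow model), a tanh nonlinearity, a linear teacher, and a ridge-regression readout, together with an empirical estimate of the population feature covariance. All dimensions, the nonlinearity, the teacher, and the ridge penalty are illustrative assumptions, and the paper's closed-form asymptotic formulas are not implemented here.

```python
# Minimal sketch (not the paper's derivation): deep random feature map,
# ridge readout, and empirical feature covariance.
import numpy as np

rng = np.random.default_rng(0)

d, widths, n_train, n_test, lam = 200, [300, 300], 400, 2000, 1e-2  # illustrative values

# Unstructured i.i.d. Gaussian weights for each hidden layer (a simplification
# of the structured rainbow weights described in the abstract).
Ws = []
prev = d
for k in widths:
    Ws.append(rng.standard_normal((k, prev)) / np.sqrt(prev))
    prev = k

def features(X):
    """Deep random feature map: x -> phi(x) through tanh layers."""
    H = X
    for W in Ws:
        H = np.tanh(H @ W.T)
    return H

# Gaussian inputs and a noisy linear teacher acting on the raw inputs.
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
theta = rng.standard_normal(d) / np.sqrt(d)
y_train = X_train @ theta + 0.1 * rng.standard_normal(n_train)
y_test = X_test @ theta

# Ridge regression on the last-layer features (the "readout layer").
Phi_train, Phi_test = features(X_train), features(X_test)
p = Phi_train.shape[1]
w = np.linalg.solve(Phi_train.T @ Phi_train + lam * np.eye(p), Phi_train.T @ y_train)

test_error = np.mean((Phi_test @ w - y_test) ** 2)
print(f"readout test error: {test_error:.4f}")

# Empirical estimate of the population covariance of the features, the object
# in which the paper's asymptotic characterisation is expressed.
Phi_pop = features(rng.standard_normal((20000, d)))
Phi_c = Phi_pop - Phi_pop.mean(axis=0)
Omega_hat = Phi_c.T @ Phi_c / Phi_pop.shape[0]
print("feature covariance shape:", Omega_hat.shape)
```

In the proportional regime the abstract refers to, d, the widths, and n_train would all grow together at fixed ratios; the fixed sizes above are only meant to illustrate the pipeline.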