Fundamental limits of overparametrized shallow neural networks for supervised learning (2307.05635v1)
Abstract: We carry out an information-theoretic analysis of a two-layer neural network trained from input-output pairs generated by a teacher network with matching architecture, in overparametrized regimes. Our results come in the form of bounds relating i) the mutual information between training data and network weights, or ii) the Bayes-optimal generalization error, to the same quantities for a simpler (generalized) linear model for which explicit expressions are rigorously known. Our bounds, expressed in terms of the number of training samples, the input dimension and the number of hidden units, thus yield fundamental performance limits for any neural network (and in fact any learning procedure) trained from limited data generated according to our two-layer teacher neural network model. The proof relies on rigorous tools from spin glasses and is guided by "Gaussian equivalence principles" lying at the core of numerous recent analyses of neural networks. With respect to the existing literature, which is either non-rigorous or restricted to the learning of the readout weights only, our results are information-theoretic (i.e., not specific to any learning algorithm) and, importantly, cover a setting where all the network parameters are trained.
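To fix ideas, a minimal sketch of the teacher model and of the quantities being bounded is given below; the notation, scalings and Gaussian noise model are assumptions following standard conventions in this line of work, not details stated in the abstract.

\[
y_\mu \;=\; \sum_{a=1}^{k} v_a^{*}\,\sigma\!\left(\frac{\langle W_a^{*}, x_\mu\rangle}{\sqrt{d}}\right) \;+\; \sqrt{\Delta}\, z_\mu,
\qquad \mu = 1,\dots,n,
\]

with inputs \(x_\mu \in \mathbb{R}^d\), \(k\) hidden units, teacher weights \((W^{*}, v^{*})\) drawn from a prior, activation \(\sigma\), and i.i.d. Gaussian noise \(z_\mu\). Under such a generative model, the bounds concern the per-sample mutual information \(\tfrac{1}{n} I\big((W^{*},v^{*});\{(x_\mu,y_\mu)\}_{\mu\le n}\big)\) and the Bayes-optimal generalization error on a fresh sample, both compared with their counterparts for a simpler (generalized) linear model as \(n\), \(d\) and \(k\) grow large.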