Fundamental limits of overparametrized shallow neural networks for supervised learning (2307.05635v1)

Published 11 Jul 2023 in cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, cs.IT, math.IT, math.ST, and stat.TH

Abstract: We carry out an information-theoretical analysis of a two-layer neural network trained from input-output pairs generated by a teacher network with matching architecture, in overparametrized regimes. Our results come in the form of bounds relating i) the mutual information between training data and network weights, or ii) the Bayes-optimal generalization error, to the same quantities but for a simpler (generalized) linear model for which explicit expressions are rigorously known. Our bounds, which are expressed in terms of the number of training samples, input dimension and number of hidden units, thus yield fundamental performance limits for any neural network (and actually any learning procedure) trained from limited data generated according to our two-layer teacher neural network model. The proof relies on rigorous tools from spin glasses and is guided by "Gaussian equivalence principles" lying at the core of numerous recent analyses of neural networks. With respect to the existing literature, which is either non-rigorous or restricted to the case of the learning of the readout weights only, our results are information-theoretic (i.e. are not specific to any learning algorithm) and, importantly, cover a setting where all the network parameters are trained.
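To make the setting concrete, below is a minimal NumPy sketch of the teacher-student data-generation model the abstract describes: a fixed two-layer "teacher" network produces labels for random inputs, and a "student" with matching architecture would be trained on those pairs. The Gaussian inputs, tanh activation, 1/sqrt scalings, noise level and all variable names are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the paper's code) of the two-layer teacher-student
# setup: n input-output pairs generated by a teacher network; any learning
# procedure then maps this data to student weights with the same architecture.
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 100, 50, 400   # input dimension, hidden units, training samples
noise_std = 0.1          # assumed label-noise level (illustrative)

# Teacher weights: drawn once and kept fixed while the data is generated.
W1_teacher = rng.standard_normal((k, d))   # first-layer weights
a_teacher = rng.standard_normal(k)         # readout weights

def teacher(X, W1, a):
    """Two-layer network: x -> a . tanh(W1 x / sqrt(d)) / sqrt(k)."""
    return np.tanh(X @ W1.T / np.sqrt(d)) @ a / np.sqrt(k)

# Training set: i.i.d. Gaussian inputs with noisy teacher labels.
X_train = rng.standard_normal((n, d))
y_train = teacher(X_train, W1_teacher, a_teacher) \
    + noise_std * rng.standard_normal(n)

# Fresh test inputs; a student's generalization error is measured against
# the clean teacher labels on such data.
X_test = rng.standard_normal((1000, d))
y_test = teacher(X_test, W1_teacher, a_teacher)
```

In this notation, the paper's bounds control the mutual information between (X_train, y_train) and the teacher weights, and the Bayes-optimal generalization error on fresh inputs, as functions of n, d and k, irrespective of which algorithm produces the student.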
