
Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks (2202.00293v4)

Published 1 Feb 2022 in stat.ML, cond-mat.dis-nn, and cs.LG

Abstract: Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular examine the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
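The "deterministic description of SGD" mentioned above refers to tracking a small set of order parameters (student-student and student-teacher overlaps) instead of the full weight vectors, in the spirit of the Saad & Solla analysis cited in the references. As a rough, hypothetical illustration of that setting (not the authors' code), the sketch below runs one-pass SGD on a teacher-student soft committee machine with Gaussian inputs and prints those overlaps; the dimensions, learning rate, erf activation, and all variable names are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): one-pass SGD on a teacher-student
# soft committee machine with Gaussian data, tracking the order parameters
# Q = W W^T / d (student-student overlaps) and M = W W*^T / d (student-teacher
# overlaps) that the deterministic description evolves. All sizes, the erf
# activation, and the learning rate are illustrative assumptions.
import numpy as np
from scipy.special import erf

d, p, k = 500, 4, 2            # input dimension, student width, teacher width (assumed)
eta, steps = 0.5, 100_000      # learning rate and number of online SGD steps (assumed)
rng = np.random.default_rng(0)

W_star = rng.standard_normal((k, d))      # fixed teacher weights
W = 0.01 * rng.standard_normal((p, d))    # student weights, small initialisation

def committee(W, x):
    """Soft committee machine: mean of erf units with a fixed second layer."""
    return erf(W @ x / np.sqrt(2 * d)).mean()

for t in range(steps):
    x = rng.standard_normal(d)            # fresh Gaussian sample at every step
    err = committee(W, x) - committee(W_star, x)
    pre = W @ x / np.sqrt(d)              # student pre-activations
    # gradient of the squared loss 0.5 * err**2 with respect to the student weights
    grad = err * np.sqrt(2 / np.pi) * np.exp(-pre**2 / 2)[:, None] * x[None, :] / (len(W) * np.sqrt(d))
    W -= eta * grad                       # one online SGD step

    if t % 20_000 == 0:
        Q = W @ W.T / d                   # student-student overlaps
        M = W @ W_star.T / d              # student-teacher overlaps
        print(f"step {t}: diag(Q) = {np.round(np.diag(Q), 3)}, ||M|| = {np.linalg.norm(M):.3f}")
```

In this normalization a single SGD step changes the overlaps by O(1/d), so they evolve on the time scale t/d, which is the regime where a deterministic (ODE) description applies; the abstract's phase diagram concerns how this picture depends on the learning rate, the time scale, and the number of hidden units.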

References (30)
  1. S. Mei, A. Montanari, and P.-M. Nguyen, “A mean field view of the landscape of two-layer neural networks,” Proceedings of the National Academy of Sciences, vol. 115, no. 33, pp. E7665–E7671, 2018.
  2. L. Chizat and F. Bach, “On the global convergence of gradient descent for over-parameterized models using optimal transport,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31.   Curran Associates, Inc., 2018.
  3. G. Rotskoff and E. Vanden-Eijnden, “Trainability and accuracy of artificial neural networks: An interacting particle system approach,” Communications on Pure and Applied Mathematics, vol. 75, no. 9, pp. 1889–1935, 2022.
  4. J. Sirignano and K. Spiliopoulos, “Mean field analysis of neural networks: A central limit theorem,” Stochastic Processes and their Applications, vol. 130, no. 3, pp. 1820–1852, 2020.
  5. D. Saad and S. A. Solla, “On-line learning in soft committee machines,” Phys. Rev. E, vol. 52, pp. 4225–4243, Oct 1995.
  6. S. Goldt, M. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová, “Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32.   Curran Associates, Inc., 2019.
  7. W. Kinzel and P. Ruján, “Improving a network generalization ability by selecting examples,” Europhysics Letters (EPL), vol. 13, no. 5, pp. 473–477, Nov 1990.
  8. O. Kinouchi and N. Caticha, “Optimal generalization in perceptrons,” Journal of Physics A: Mathematical and General, vol. 25, no. 23, pp. 6243–6250, Dec 1992.
  9. M. Copelli and N. Caticha, “On-line learning in the committee machine,” Journal of Physics A: Mathematical and General, vol. 28, no. 6, pp. 1615–1625, Mar 1995.
  10. M. Biehl and H. Schwarze, “Learning by on-line gradient descent,” Journal of Physics A: Mathematical and General, vol. 28, no. 3, pp. 643–656, Feb 1995.
  11. P. Riegler and M. Biehl, “On-line backpropagation in two-layered neural networks,” Journal of Physics A: Mathematical and General, vol. 28, no. 20, pp. L507–L513, Oct 1995.
  12. D. Saad and S. Solla, “Dynamics of on-line gradient descent learning for multilayer neural networks,” in Advances in Neural Information Processing Systems, D. Touretzky, M. C. Mozer, and M. Hasselmo, Eds., vol. 8.   MIT Press, 1996.
  13. R. Vicente, O. Kinouchi, and N. Caticha, “Statistical mechanics of online learning of drifting concepts: A variational approach,” Machine learning, vol. 32, no. 2, pp. 179–201, 1998.
  14. S. Mei, T. Misiakiewicz, and A. Montanari, “Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit,” in Proceedings of the Thirty-Second Conference on Learning Theory, ser. Proceedings of Machine Learning Research, A. Beygelzimer and D. Hsu, Eds., vol. 99.   PMLR, 25–28 Jun 2019, pp. 2388–2464.
  15. D. Saad and S. A. Solla, “Exact solution for on-line learning in multilayer neural networks,” Phys. Rev. Lett., vol. 74, pp. 4337–4340, May 1995.
  16. M. Refinetti, S. Goldt, F. Krzakala, and L. Zdeborová, “Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139.   PMLR, 18–24 Jul 2021, pp. 8936–8947.
  17. S. Goldt, B. Loureiro, G. Reeves, F. Krzakala, M. Mézard, and L. Zdeborová, “The Gaussian equivalence of generative models for learning with two-layer neural networks,” in Proceedings of Machine Learning Research, vol. 145.   2nd Annual Conference on Mathematical and Scientific Machine Learning, 2021, pp. 1–46.
  18. H. Hu and Y. M. Lu, “Universality laws for high-dimensional learning with random features,” IEEE Transactions on Information Theory, vol. 69, no. 3, pp. 1932–1964, 2023.
  19. A. Montanari and B. N. Saeed, “Universality of empirical risk minimization,” in Proceedings of Thirty Fifth Conference on Learning Theory, ser. Proceedings of Machine Learning Research, P.-L. Loh and M. Raginsky, Eds., vol. 178.   PMLR, 02–05 Jul 2022, pp. 4310–4312.
  20. C. Wang, Y. C. Eldar, and Y. M. Lu, “Subspace estimation from incomplete observations: A high-dimensional analysis,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1240–1252, 2018.
  21. Y. Yoshida, R. Karakida, M. Okada, and S.-i. Amari, “Statistical Mechanical Analysis of Online Learning with Weight Normalization in Single Layer Perceptron,” Journal of the Physical Society of Japan, vol. 86, no. 4, p. 044002, Apr 2017.
  22. P. Del Moral and A. Niclas, “A taylor expansion of the square root matrix function,” Journal of Mathematical Analysis and Applications, vol. 465, no. 1, pp. 259–266, 2018.
  23. B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová, “The committee machine: computational to statistical gaps in learning a two-layers neural network,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2019, no. 12, p. 124023, Dec 2019.
  24. A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31.   Curran Associates, Inc., 2018.
  25. L. Chizat, E. Oyallon, and F. Bach, “On lazy training in differentiable programming,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32.   Curran Associates, Inc., 2019.
  26. F. Bach and L. Chizat, “Gradient descent on infinitely wide neural networks: Global convergence and generalization,” arXiv preprint arXiv:2110.08084, 2021.
  27. Y. S. Tan and R. Vershynin, “Phase retrieval via randomized Kaczmarz: theoretical guarantees,” Information and Inference: A Journal of the IMA, vol. 8, no. 1, pp. 97–123, Apr 2018.
  28. G. B. Arous, R. Gheissari, and A. Jagannath, “Online stochastic gradient descent on non-convex losses from high-dimensional inference,” Journal of Machine Learning Research, vol. 22, no. 106, pp. 1–51, 2021.
  29. ——, “Algorithmic thresholds for tensor PCA,” The Annals of Probability, vol. 48, no. 4, pp. 2052 – 2087, 2020.
  30. C. Wang, H. Hu, and Y. Lu, “A solvable high-dimensional model of GAN,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32.   Curran Associates, Inc., 2019.
