Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent (2303.13093v4)

Published 23 Mar 2023 in cs.LG, math.OC, and physics.data-an

Abstract: Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem. We first show that saddle points in neural networks can be divided into two types, of which the Type-II saddles are especially difficult to escape because the gradient noise vanishes at the saddle. To leading order, the dynamics of SGD around these saddles are therefore described by a random matrix product process, so it is natural to study them using the notion of probabilistic stability and the associated Lyapunov exponent. Theoretically, we link the study of SGD dynamics to well-known concepts in ergodic theory, which we leverage to show that saddle points can be either attractive or repulsive for SGD, and that its dynamics can be classified into four phases depending on the signal-to-noise ratio of the gradient close to the saddle.
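
The stability criterion described in the abstract, the sign of the top Lyapunov exponent of the random matrix product that governs linearized SGD near a Type-II saddle, can be estimated numerically with a short Monte Carlo routine. The sketch below is illustrative only and is not the paper's code: the function names (top_lyapunov_exponent, make_sgd_step) and the diagonal noisy-Hessian toy model are assumptions chosen for brevity. A negative estimate suggests the saddle is probabilistically attractive for SGD; a positive one suggests SGD escapes.

```python
import numpy as np

def top_lyapunov_exponent(sample_matrix, dim, n_steps=10_000, seed=0):
    """Estimate the top Lyapunov exponent of the random product
    x_{t+1} = M_t x_t by tracking the log-growth rate of a unit vector."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    log_growth = 0.0
    for _ in range(n_steps):
        v = sample_matrix(rng) @ v      # apply one random update matrix
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)      # accumulate log of the growth factor
        v /= norm                       # renormalize to avoid overflow/underflow
    return log_growth / n_steps

def make_sgd_step(lr, h_mean, h_std):
    """Toy linearized SGD update near a saddle: M_t = I - lr * H_t, with H_t a
    diagonal minibatch Hessian whose mean plays the role of the signal and whose
    fluctuation plays the role of the noise (illustrative assumption)."""
    dim = len(h_mean)
    def sample(rng):
        H = np.diag(h_mean + h_std * rng.standard_normal(dim))
        return np.eye(dim) - lr * H
    return sample

if __name__ == "__main__":
    # Negative exponent: the saddle is probabilistically attractive;
    # positive exponent: SGD escapes along the unstable direction.
    step = make_sgd_step(lr=0.5, h_mean=np.array([1.0, -0.1]), h_std=2.0)
    print(top_lyapunov_exponent(step, dim=2))
```

Sweeping the learning rate and the noise level h_std in this toy model is one way to see how the signal-to-noise ratio moves the exponent across zero, which is the kind of phase boundary the abstract refers to.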
