
A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent (2310.00692v3)

Published 1 Oct 2023 in cs.LG and stat.ML

Abstract: In this paper, we provide a theoretical study of noise geometry for minibatch stochastic gradient descent (SGD), a phenomenon where the noise aligns favorably with the geometry of the local landscape. We propose two metrics, derived from analyzing how noise influences the loss and subspace projection dynamics, to quantify the alignment strength. We show that for (over-parameterized) linear models and two-layer nonlinear networks, when measured by these metrics, the alignment can be provably guaranteed under conditions independent of the degree of over-parameterization. To showcase the utility of our noise geometry characterizations, we present a refined analysis of the mechanism by which SGD escapes from sharp minima. We reveal that, unlike gradient descent (GD), which escapes along the sharpest directions, SGD tends to escape from flatter directions, and that cyclical learning rates can exploit this characteristic of SGD to navigate more effectively towards flatter regions. Lastly, extensive experiments are provided to support our theoretical findings.
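
To make the alignment idea concrete, here is a minimal NumPy sketch on a toy over-parameterized linear regression. The Frobenius-cosine score between the empirical minibatch-noise covariance and the loss Hessian used below is only an illustrative proxy for alignment, not one of the two metrics defined in the paper; the problem sizes, batch size, and number of noise samples are arbitrary assumptions made for the illustration.

```python
# Illustrative sketch (not the paper's exact metrics): estimate how well
# minibatch-SGD noise aligns with the local curvature of an over-parameterized
# linear regression loss, using the Frobenius cosine between the empirical
# noise covariance and the Hessian as a stand-in alignment score.
import numpy as np

rng = np.random.default_rng(0)

n, d = 64, 256                       # n samples, d parameters (over-parameterized: d > n)
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                       # interpolation is achievable

def full_gradient(w):
    # Gradient of the mean-squared loss over the whole dataset.
    return X.T @ (X @ w - y) / n

def minibatch_gradient(w, batch_size=8):
    # Gradient of the mean-squared loss over a random minibatch.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = rng.standard_normal(d)           # a point away from the minimizer
H = X.T @ X / n                      # Hessian of the quadratic loss

# Empirical covariance of the SGD noise xi = g_batch - g_full at the point w.
g = full_gradient(w)
noise = np.stack([minibatch_gradient(w) - g for _ in range(2000)])
Sigma = noise.T @ noise / len(noise)

# Alignment score in [0, 1]: 1 means the noise covariance is proportional to
# the Hessian, 0 means the two are orthogonal in the Frobenius sense.
alignment = np.trace(Sigma @ H) / (np.linalg.norm(Sigma, "fro") * np.linalg.norm(H, "fro"))
print(f"noise-curvature alignment (illustrative proxy): {alignment:.3f}")
```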
