Convergence of stochastic gradient descent under a local Lojasiewicz condition for deep neural networks (2304.09221v2)
Published 18 Apr 2023 in cs.LG, math.OC, and stat.ML
Abstract: We study the convergence of stochastic gradient descent (SGD) for non-convex objective functions. We establish local convergence with positive probability under the local Łojasiewicz condition introduced by Chatterjee (2022) and an additional local structural assumption on the loss landscape. A key component of our proof is to ensure that, with positive probability, the entire SGD trajectory stays inside the local region. We also provide examples of finite-width neural networks for which our assumptions hold.
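For orientation, the condition invoked here is, schematically, a local Polyak-Łojasiewicz-type inequality (the symbols $\alpha$, $r$, and $x_0$ below are illustrative, not the paper's exact notation or constants): for a non-negative loss $f$ and initialization $x_0$, one asks that

$$\|\nabla f(x)\|^2 \;\ge\; \alpha\, f(x) \qquad \text{for all } x \in B(x_0, r)$$

for some $\alpha > 0$ and radius $r > 0$. Under such a bound the gradient cannot vanish while the loss is still positive inside $B(x_0, r)$, which is why the key step highlighted in the abstract is showing that, with positive probability, the SGD iterates never leave this local region.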
- Z. Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. Advances in Neural Information Processing Systems, 31, 2018.
- Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. Advances in Neural Information Processing Systems, 31, 2018.
- Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
- M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités XXXIII, pages 1–68. Springer, 2006.
- D. P. Bertsekas et al. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
- L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- S. Chatterjee. Convergence of gradient descent for deep neural networks. arXiv preprint arXiv:2203.16462, 2022.
- L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31, 2018.
- L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
- K. L. Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, pages 463–483, 1954.
- A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32, 2019.
- From gradient flow on population loss to learning with stochastic gradient descent. In Advances in Neural Information Processing Systems.
- J. C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.
- B. Fehrman, B. Gess, and A. Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions. The Journal of Machine Learning Research, 21(1):5354–5401, 2020.
- S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
- SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pages 1315–1323. PMLR, 2021.
- SGD: General analysis and improved rates. In International Conference on Machine Learning, pages 5200–5209. PMLR, 2019.
- P. Goyal et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
- A. Jentzen and A. Riekert. On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks. arXiv preprint arXiv:2112.09684, 2021.
- H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pages 795–811. Springer, 2016.
- H. J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications. Springer, 2003.
- L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. Advances in Neural Information Processing Systems, 30, 2017.
- C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022.
- From optimization dynamics to generalization bounds via Łojasiewicz gradient inequality. Transactions on Machine Learning Research.
- S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- P. Mertikopoulos, N. Hallak, A. Kavis, and V. Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems, 33:1117–1128, 2020.
- E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24, 2011.
- A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
- B. Neyshabur, R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. Advances in Neural Information Processing Systems, 28, 2015.
- Global convergence of three-layer neural networks in the mean field regime. In International Conference on Learning Representations, 2020.
- B. T. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.
- S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323. PMLR, 2016.
- H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
- S. U. Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.
- R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. The Journal of Machine Learning Research, 21(1):9047–9076, 2020.
- S. Wojtowytsch. Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. Journal of Nonlinear Science, 33(3):45, 2023.
- D. Zou, Y. Cao, D. Zhou, and Q. Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109:467–492, 2020.