Non-convex Stochastic Composite Optimization with Polyak Momentum (2403.02967v4)
Abstract: The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is well known that this method can fail to converge in non-convex settings where the stochastic noise is significant (i.e., when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove that this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance-reduction effect of Polyak momentum in the composite optimization setting, and we show that the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.
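To make the setting concrete, below is a minimal sketch of a stochastic proximal gradient loop with a Polyak-momentum gradient estimator applied to an l1-regularized toy problem. It is illustrative only: the function names (`prox_sgd_polyak`, `soft_threshold`, `grad_fn`), the toy objective, and all parameter values are assumptions, and the update shown is a generic momentum-averaged estimator followed by a proximal step, not necessarily the paper's exact algorithm or step-size schedule.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_sgd_polyak(grad_fn, x0, gamma=0.05, beta=0.1, lam=0.01, iters=2000, seed=0):
    """Stochastic proximal gradient with a Polyak-momentum gradient estimator (sketch).

    Illustrative update:
        d_t     = (1 - beta) * d_{t-1} + beta * g_t        # momentum-averaged stochastic gradient
        x_{t+1} = prox_{gamma * lam * ||.||_1}(x_t - gamma * d_t)
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    d = grad_fn(x, rng)                  # initialize the estimator with one stochastic gradient
    for _ in range(iters):
        g = grad_fn(x, rng)              # fresh small-batch stochastic gradient
        d = (1.0 - beta) * d + beta * g  # Polyak-momentum averaging
        x = soft_threshold(x - gamma * d, gamma * lam)  # inexactness of the prox is not modeled here
    return x

if __name__ == "__main__":
    # Toy non-convex smooth part f(x) = sum(x_i^2 / (1 + x_i^2)) with additive gradient noise.
    def grad_fn(x, rng, noise=0.1):
        return 2 * x / (1 + x ** 2) ** 2 + noise * rng.standard_normal(x.shape)

    x_out = prox_sgd_polyak(grad_fn, x0=np.ones(10))
    print(np.round(x_out, 3))
```

The momentum parameter `beta` controls how strongly past stochastic gradients are averaged into the estimator; smaller values average over more samples, which is the variance-reduction effect the abstract refers to.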