On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions (2402.03982v1)
Abstract: The Adaptive Moment Estimation (Adam) algorithm is highly effective for training a wide range of deep learning models. Despite this, theoretical understanding of Adam remains limited, particularly for its vanilla form in non-convex smooth settings with potentially unbounded gradients and affine variance noise. In this paper, we study vanilla Adam under these challenging conditions. We introduce a comprehensive noise model that covers affine variance noise, bounded noise, and sub-Gaussian noise. We show that, under this general noise model, Adam finds a stationary point at an $\mathcal{O}(\text{poly}(\log T)/\sqrt{T})$ rate with high probability, where $T$ denotes the total number of iterations, matching the lower bound for stochastic first-order algorithms up to logarithmic factors. More importantly, we show that Adam requires no tuning of its step-sizes with respect to any problem parameters, giving it a better adaptation property than Stochastic Gradient Descent under the same conditions. We also provide a high-probability convergence result for Adam under a generalized smoothness condition, which allows unbounded smoothness parameters and has been shown empirically to capture the smoothness of many practical objective functions more accurately.
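For reference, the vanilla Adam update analyzed in this setting is the standard one of Kingma and Ba (2015): exponential moving averages of the gradient and of the squared gradient, combined into a coordinate-wise step. The sketch below is a minimal NumPy illustration of that update under the stated assumptions, not the paper's code; the function name `adam`, the stochastic-gradient oracle `grad_fn`, and the default hyper-parameters are illustrative choices.

```python
import numpy as np

def adam(grad_fn, x0, T, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal sketch of the vanilla Adam update (Kingma & Ba, 2015).

    grad_fn(x, t) is assumed to return a stochastic gradient of the
    objective at x; eta, beta1, beta2 are the usual default values.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first-moment (momentum) estimate
    v = np.zeros_like(x)  # second-moment estimate
    for t in range(1, T + 1):
        g = grad_fn(x, t)
        m = beta1 * m + (1 - beta1) * g      # EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g  # EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias corrections
        v_hat = v / (1 - beta2 ** t)
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Toy usage: noisy quadratic f(x) = 0.5 * ||x||^2 with Gaussian gradient noise.
rng = np.random.default_rng(0)
x_T = adam(lambda x, t: x + 0.1 * rng.standard_normal(x.shape),
           x0=np.ones(10), T=5000)
```

In the affine variance noise setting referenced in the abstract, `grad_fn` would return an unbiased gradient whose second moment may grow with the true gradient; a common formalization is $\mathbb{E}\|g - \nabla f(x)\|^2 \le A + B\|\nabla f(x)\|^2$ for constants $A, B \ge 0$, which recovers bounded noise when $B = 0$.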
Authors: Yusu Hong, Junhong Lin