On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm (2402.00389v5)
Abstract: Although adaptive gradient methods have been extensively used in deep learning, their convergence rates proved in the literature are all slower than that of SGD, particularly with respect to their dependence on the dimension. This paper considers the classical RMSProp and its momentum extension and establishes the convergence rate $\frac{1}{T}\sum_{k=1}^{T} E\left[\|\nabla f(x_k)\|_1\right]\leq O(\frac{\sqrt{d}C}{T^{1/4}})$ measured by the $\ell_1$ norm without the bounded gradient assumption, where $d$ is the dimension of the optimization variable, $T$ is the iteration number, and $C$ is a constant identical to the one appearing in the optimal convergence rate of SGD. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Since $\|x\|_2\ll\|x\|_1\leq\sqrt{d}\|x\|_2$ for problems with extremely large $d$, our convergence rate can be considered analogous to the $\frac{1}{T}\sum_{k=1}^{T} E\left[\|\nabla f(x_k)\|_2\right]\leq O(\frac{C}{T^{1/4}})$ rate of SGD in the ideal case of $\|\nabla f(x)\|_1=\varTheta(\sqrt{d}\,\|\nabla f(x)\|_2)$.
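For context, below is a minimal NumPy sketch of an RMSProp update with heavy-ball momentum, the kind of method the abstract refers to. The hyperparameter names (`lr`, `beta`, `mu`, `eps`) and the placement of the momentum buffer follow the common textbook form; the exact variant, step-size schedule, and hyperparameter scalings analyzed in the paper may differ.

```python
import numpy as np

def rmsprop_momentum(grad_fn, x0, lr=1e-3, beta=0.999, mu=0.9, eps=1e-8, T=1000):
    """Sketch of RMSProp with heavy-ball momentum (standard textbook form).

    grad_fn(x) returns a stochastic gradient of f at x. This is an
    illustrative sketch, not necessarily the exact update or
    hyperparameter choice analyzed in the paper.
    """
    x = x0.astype(float)
    v = np.zeros_like(x)   # running average of squared gradients (second moment)
    m = np.zeros_like(x)   # momentum buffer
    for _ in range(T):
        g = grad_fn(x)
        v = beta * v + (1.0 - beta) * g * g    # coordinate-wise second-moment estimate
        m = mu * m + g / (np.sqrt(v) + eps)    # momentum on the preconditioned gradient
        x = x - lr * m                         # parameter update
    return x

# Example usage: noisy gradients of f(x) = 0.5 * ||x||_2^2
rng = np.random.default_rng(0)
x_out = rmsprop_momentum(lambda x: x + 0.01 * rng.standard_normal(x.shape),
                         x0=np.ones(10), lr=1e-2, T=5000)
```

The coordinate-wise division by $\sqrt{v}$ is what distinguishes RMSProp from plain SGD with momentum, and it is this per-coordinate preconditioning that makes the dimension dependence of the rate, and hence the choice of the $\ell_1$ norm as the measure, the central issue in the analysis.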