A theoretical and empirical study of new adaptive algorithms with additional momentum steps and shifted updates for stochastic non-convex optimization (2110.08531v2)
Abstract: Adaptive optimization algorithms are a key pillar behind the rise of the Machine Learning field. The optimization literature contains numerous studies of accelerated gradient methods, but adaptive iterative techniques have only recently been analyzed from a theoretical point of view. In the present paper we introduce new adaptive algorithms endowed with momentum terms for stochastic non-convex optimization problems. Our purpose is to reveal a deep connection between accelerated methods endowed with different inertial steps and AMSGrad-type momentum methods. Our methodology is based on the framework of stochastic and possibly non-convex objective mappings, along with assumptions that are commonly used in the analysis of adaptive algorithms. Beyond the finite-time horizon analysis with respect to a given final iteration and the almost sure convergence to stationary points, we also study the worst-case iteration complexity and derive an estimate for the expectation of the squared Euclidean norm of the gradient. The theoretical analysis is supported by various computational simulations for the training of neural networks. For future research we point out several possible extensions of our work, among which the investigation of non-smooth objective functions and the theoretical analysis of a more general formulation that encompasses our adaptive optimizers in a stochastic framework.
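As a rough illustration of the class of methods the abstract refers to (and not the paper's specific schemes), the sketch below combines a standard AMSGrad-type update, built from exponential moving averages of the stochastic gradient and its square together with a running maximum of the second-moment estimate, with an additional heavy-ball inertial term mu*(x_k - x_{k-1}). The function name, hyperparameter values, placement of the inertial term, and toy objective are illustrative assumptions.

```python
import numpy as np

def amsgrad_heavy_ball(grad_fn, x0, n_steps=500, lr=1e-2,
                       beta1=0.9, beta2=0.999, eps=1e-8, mu=0.5):
    """AMSGrad-type iteration with an extra heavy-ball inertial term.

    This is a generic sketch of the algorithm family, not the exact
    scheme analyzed in the paper; all names and defaults are assumptions.
    """
    x = np.asarray(x0, dtype=float).copy()
    x_prev = x.copy()
    m = np.zeros_like(x)       # first-moment (momentum) estimate
    v = np.zeros_like(x)       # second-moment estimate
    v_hat = np.zeros_like(x)   # running maximum of v (AMSGrad correction)
    for _ in range(n_steps):
        g = grad_fn(x)                              # stochastic gradient oracle
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        v_hat = np.maximum(v_hat, v)                # keep v_hat non-decreasing
        adaptive_step = lr * m / (np.sqrt(v_hat) + eps)
        x_new = x - adaptive_step + mu * (x - x_prev)   # extra inertial step
        x_prev, x = x, x_new
    return x

# Toy usage: noisy gradients of the non-convex objective f(x) = x^4 - 3x^2.
rng = np.random.default_rng(0)
noisy_grad = lambda x: 4 * x**3 - 6 * x + 0.1 * rng.standard_normal(x.shape)
print(amsgrad_heavy_ball(noisy_grad, np.array([2.0])))  # approaches a stationary point near +-1.22
```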
- Cristian Daniel Alecsa. The rate of convergence of optimization algorithms obtained via discretizations of heavy ball dynamical systems for convex optimization problems. Optimization, 71(13):3909–3939, 2022.
- An extension of the second-order dynamical system that models Nesterov's convex gradient method. Applied Mathematics & Optimization, pages 1–30, 2020.
- A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem. Numerical Algorithms, pages 1–28, 2019.
- Convergence rates of a momentum algorithm with bounded adaptive step size for nonconvex optimization. In Asian Conference on Machine Learning, pages 225–240. PMLR, 2020.
- Convergence and dynamical behavior of the Adam algorithm for nonconvex stochastic optimization. SIAM Journal on Optimization, 31(1):244–274, 2021.
- Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
- Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- Padam: Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint, 2018.
- On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations, 2018.
- A Kurdyka-Łojasiewicz property for stochastic optimization algorithms in a non-convex setting. arXiv preprint arXiv:2302.06447, 2023.
- A general system of differential equations to model first-order adaptive algorithms. Journal of Machine Learning Research, 21(129):1–42, 2020.
- Aaron Defazio. On the curved geometry of accelerated optimization. Advances in Neural Information Processing Systems, 32:1766–1775, 2019.
- Adaptivity without compromise: a momentumized, adaptive, dual averaged gradient method for stochastic optimization. The Journal of Machine Learning Research, 23(1):6429–6462, 2022.
- A simple convergence proof of Adam and AdaGrad. arXiv preprint arXiv:2003.02395, 2020.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
- Asymptotic study of stochastic adaptive algorithms in non-convex landscape. The Journal of Machine Learning Research, 23(1):10357–10410, 2022.
- Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pages 310–315. IEEE, 2015.
- Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
- Non-asymptotic analysis of adaptive stochastic gradient algorithms and applications. arXiv preprint arXiv:2303.01370, 2023.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Szilárd Csaba László. Convergence rates for an inertial algorithm of gradient type associated to a smooth non-convex minimization. Mathematical Programming, pages 1–45, 2020.
- A high probability analysis of adaptive SGD with momentum. arXiv preprint arXiv:2007.14294, 2020.
- Xuezhe Ma. Apollo: An adaptive parameter-wise diagonal quasi-Newton method for nonconvex stochastic optimization. arXiv preprint arXiv:2009.13586, 2020.
- Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24:451–459, 2011.
- Yurii E Nesterov. A method for solving the convex programming problem with convergence rate $\mathcal{O}(1/k^{2})$. In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
- iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.
- Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- Adaptive methods for nonconvex optimization. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018), 2018.
- On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
- A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. HAL preprint hal-03135145, version 1, 2021.
- On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147. PMLR, 2013.
- Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
- Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927–956, 2017.
- AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, pages 6677–6686. PMLR, 2019.
- Less Wright. Ranger - a synergistic optimizer. https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer, 2019.
- Adam-family methods for nonsmooth optimization with convergence guarantees. arXiv preprint arXiv:2305.03938, 2023.
- A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2955–2961, 2018.
- Matthew D Zeiler. AdaDelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
- AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33:18795–18806, 2020.
- A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11127–11135, 2019.