A Simple Convergence Proof of Adam and Adagrad
(2003.02395v3)
Published 5 Mar 2020 in stat.ML and cs.LG
Abstract: We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper-bound which is explicit in the constants of the problem, parameters of the optimizer, the dimension $d$, and the total number of iterations $N$. This bound can be made arbitrarily small, and with the right hyper-parameters, Adam can be shown to converge with the same rate of convergence $O(d\ln(N)/\sqrt{N})$. When used with the default parameters, Adam doesn't converge, however, and just like constant step-size SGD, it moves away from the initialization point faster than Adagrad, which might explain its practical success. Finally, we obtain the tightest dependency on the heavy ball momentum decay rate $\beta_1$ among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$.
The paper provides a streamlined convergence proof for Adam and Adagrad, establishing an expectation-based upper bound on the average squared gradient norm.
The paper demonstrates that with optimal hyper-parameter settings, Adam can achieve a convergence rate of O(d ln(N)/√N) comparable to Adagrad.
The paper refines the dependency on the momentum decay rate from O((1−β1)⁻³) to O((1−β1)⁻¹), offering clearer insights for nonconvex optimization.
A Simple Convergence Proof of Adam and Adagrad
In the paper "A Simple Convergence Proof of Adam and Adagrad," the authors provide a streamlined demonstration of convergence for the well-known adaptive optimization algorithms Adam and Adagrad. This proof is applicable to smooth, non-convex, stochastic optimization problems where the objective function possesses bounded gradients. The work notably addresses two key aspects of convergence—detailing an expectation-based upper bound on the average squared norm of the gradient over the trajectory, and tackling the critical issue of parameter choices to ensure effective convergence. The bound delineates explicit dependencies on constants inherent to the optimization problem, dimensions, optimizer parameters, and total iterations denoted as N.
The authors show that under an appropriate hyper-parameter configuration, particularly of α (step size) and β2 (exponential moving average decay rate for the squared gradients), Adam achieves a convergence rate similar to that of Adagrad, namely O(d ln(N)/√N). With its default parameters, however, Adam does not converge; like constant step-size SGD, it moves away from the initialization point faster than Adagrad does. Intriguingly, this characteristic might underpin Adam's observed empirical performance gains.
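As a concrete reference for where these hyper-parameters enter, below is a minimal NumPy sketch of the standard Adagrad and Adam updates. Details such as bias correction and the placement of the ε term may differ from the exact variants analyzed in the paper, so treat this as an illustration rather than a transcription of the paper's algorithms.

```python
import numpy as np

def adagrad_step(x, grad, state, alpha=0.1, eps=1e-8):
    """One Adagrad update: per-coordinate steps shrink with the accumulated squared gradients."""
    state["sum_sq"] = state.get("sum_sq", np.zeros_like(x)) + grad ** 2
    return x - alpha * grad / (np.sqrt(state["sum_sq"]) + eps)

def adam_step(x, grad, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and squared gradient (v)."""
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", np.zeros_like(x)) + (1 - beta1) * grad
    v = beta2 * state.get("v", np.zeros_like(x)) + (1 - beta2) * grad ** 2
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
    return x - alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Intuitively, pushing β2 toward 1 makes Adam's denominator average over an ever longer gradient history, and combined with a step size that shrinks like 1/√N its effective per-coordinate step comes close to Adagrad's; this is, roughly, the regime in which the Adagrad-like O(d ln(N)/√N) rate is recovered.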
Furthermore, the paper presents a significant improvement in understanding the role of the heavy ball momentum decay rate β1. The refinement adjusts the dependency from the former O((1−β1)⁻³) to O((1−β1)⁻¹) for non-convex Adam and Adagrad. This adjustment presents a more optimistic view regarding the adaptability and efficiency of momentum in these optimizers, especially in non-convex domains, making the analysis relevant for broader classes of machine learning problems, including those characterized by stochastic and sparse data environments.
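To see what this tightening means numerically, one can compare the two β1-dependent factors at common momentum values; this is a back-of-the-envelope illustration, not a computation from the paper.

```python
# Illustrative comparison of the beta1-dependent factors in the two kinds of bounds.
for beta1 in (0.5, 0.9, 0.99):
    old_factor = (1 - beta1) ** -3  # earlier analyses: O((1 - beta1)^-3)
    new_factor = (1 - beta1) ** -1  # this paper:       O((1 - beta1)^-1)
    print(f"beta1={beta1}: old factor ~{old_factor:,.0f}, new factor ~{new_factor:,.0f}")
```

At the common choice β1 = 0.9, the momentum-dependent factor drops from roughly 1000 to 10, which is why the new dependency is markedly tighter.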
The implications of these findings are twofold. Practically, they provide guidance on parameter selection to aid in leveraging the full potential of Adam and Adagrad, without resorting to variants such as AMSGrad for non-convex settings. Theoretically, the insight into momentum's behavior reinforces its rationale and utility in accelerating convergence, enhancing our comprehension of optimizer dynamics in non-convex landscapes.
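As a sketch of how such guidance might translate into practice, the snippet below ties Adam's hyper-parameters in PyTorch to a planned iteration budget N. The specific scalings used here (lr ∝ 1/√N and β2 = 1 − 1/N), as well as the base step size and the model, are illustrative assumptions in the spirit of the paper's "right hyper-parameters" regime, not a quote of its exact prescriptions.

```python
import torch

# Hypothetical illustration: choose Adam's hyper-parameters as a function of the
# planned number of iterations N. The scalings below are assumptions for
# illustration, not the paper's exact corollary.
N = 100_000                      # planned total number of iterations
base_lr = 1e-2                   # placeholder base step size
model = torch.nn.Linear(10, 1)   # placeholder model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=base_lr / N ** 0.5,           # step size shrinking like 1/sqrt(N)
    betas=(0.9, 1.0 - 1.0 / N),      # beta1 at a standard value; beta2 pushed toward 1
)
```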
Moving forward, this could pave the way for more refined analyses of adaptive step sizes and momentum factors in other first-order optimization algorithms suited to deep learning and other machine learning applications that require robust handling of high-dimensional, complex objective functions. Such advances could improve training processes and model performance across a wide range of AI applications, driving further interest in algorithm design informed by this deeper understanding of convergence properties.