
ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate (2411.02853v3)

Published 5 Nov 2024 in cs.LG and stat.ML

Abstract: Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., $\beta_2$, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.

Overview of Adam's Convergence Issues

Adam is a widely adopted adaptive gradient algorithm for training deep neural networks, combining the benefits of momentum and RMSProp. Its update rule involves maintaining estimates of the first moment (mean) $m_t$ and the second moment (uncentered variance) $v_t$ of the gradients:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

where $g_t$ is the gradient at step $t$, and $\beta_1, \beta_2 \in [0, 1)$ are exponential decay rates. After bias correction, the parameter update is:

$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
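For concreteness, a minimal NumPy sketch of this standard Adam step (illustrative only, with the usual default hyperparameters assumed) is:

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step at (1-indexed) iteration t, given gradient g."""
    m = beta1 * m + (1.0 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```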

Despite its empirical success, theoretical analyses have revealed potential convergence issues. Notably, convergence guarantees for Adam often require restrictive conditions. A key finding was that Adam can fail to converge even in simple convex settings for certain choices of hyperparameters, particularly $\beta_2$. Specifically, the effective learning rate can become undesirably large in some iterations, preventing convergence. Some analyses showed that convergence requires $\beta_2$ to be chosen based on problem parameters, diminishing Adam's adaptivity.

Limitations of Existing Solutions

Several variants of Adam have been proposed to address its theoretical shortcomings. AMSGrad, for example, modifies the second moment update to ensure a non-increasing effective learning rate by maintaining the maximum of past second moment estimates:

$\hat{v}_t^{\mathrm{AMS}} = \max(\hat{v}_{t-1}^{\mathrm{AMS}}, \hat{v}_t)$

$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t^{\mathrm{AMS}}} + \epsilon}$
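In code, the AMSGrad change amounts to maintaining one extra running maximum (a sketch, mirroring the Adam step above):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: like Adam, but normalize by the running max of v_hat."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    v_max = np.maximum(v_max, v_hat)           # enforces a non-increasing effective learning rate
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```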

While AMSGrad provides convergence guarantees in the online convex optimization setting, these guarantees often rely on the assumption that the stochastic gradients have uniformly bounded norms ($\|g_t\| \le G$ for all $t$). This assumption is considered impractical for many deep learning scenarios, where gradients can vary significantly and may not be uniformly bounded. Therefore, a need exists for an adaptive method that provably converges under more realistic assumptions, without requiring problem-dependent hyperparameter tuning like the original Adam.

The ADOPT Algorithm

The ADOPT algorithm is proposed to rectify the convergence issues associated with Adam under standard assumptions in stochastic non-convex optimization (Taniguchi et al., 5 Nov 2024). ADOPT introduces two primary modifications to the standard Adam update mechanism:

  1. Decoupled Second Moment Estimation: The estimate of the second moment $v_t$ is modified to remove the influence of the current gradient $g_t$. Instead, it likely utilizes past gradient information, such as $g_{t-1}^2$. A plausible update rule based on this description is $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_{t-1}^2$ (for $t > 1$, with appropriate initialization of $v_1$). This decoupling prevents $v_t$ from becoming small precisely when $g_t$ is large, a situation that can lead to excessively large steps in standard Adam.
  2. Modified Update Order: The order of applying the momentum update and the normalization (division by the square root of the second moment estimate) is changed. Instead of applying momentum to the raw gradient $g_t$ and then normalizing the resulting momentum vector $\hat{m}_t$, ADOPT likely normalizes the gradient first using the (bias-corrected) second moment estimate $\hat{v}_t$, and then applies the momentum update to this normalized gradient.

Combining these modifications, a potential formulation for the ADOPT update steps is:

  1. Compute the gradient: $g_t = \nabla f_t(\theta_t)$
  2. Update the second moment (using the previous gradient): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_{t-1}^2$ (with the $t = 1$ case handled by a suitable initialization)
  3. Bias-correct the second moment: $\hat{v}_t = v_t / (1 - \beta_2^t)$
  4. Normalize the current gradient: $\tilde{g}_t = g_t / (\sqrt{\hat{v}_t} + \epsilon)$
  5. Update the first moment (using the normalized gradient): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) \tilde{g}_t$
  6. Bias-correct the first moment: $\hat{m}_t = m_t / (1 - \beta_1^t)$
  7. Update the parameters: $\theta_{t+1} = \theta_t - \eta \hat{m}_t$

This revised structure ensures that the step direction incorporates gradient history through $m_t$, while the step-magnitude scaling, derived from $v_t$, is based on past gradient magnitudes, leading to more stable updates.
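The following is a minimal NumPy sketch of this potential formulation, not the authors' reference implementation (which is available in the linked repository). The `grad_fn` interface and the warm-start of $v$ with the first squared gradient are illustration choices; with such a warm start the exponential moving average is already unbiased, so the separate bias correction of step 3 becomes unnecessary.

```python
import numpy as np

def adopt_sketch(grad_fn, theta0, num_steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of the ADOPT-style steps described above (not the official implementation).

    grad_fn(theta, t) is assumed to return a stochastic gradient of the objective at theta.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)
    v = None       # second moment estimate, built from *past* gradients only
    g_prev = None  # gradient from the previous iteration

    for t in range(1, num_steps + 1):
        g = grad_fn(theta, t)
        if v is None:
            v = g ** 2                                   # warm start at t = 1
        else:
            v = beta2 * v + (1.0 - beta2) * g_prev ** 2  # decoupled from the current g_t
        g_tilde = g / (np.sqrt(v) + eps)                 # normalize before applying momentum
        m = beta1 * m + (1.0 - beta1) * g_tilde          # momentum on the normalized gradient
        m_hat = m / (1.0 - beta1 ** t)                   # bias-correct the first moment
        theta = theta - lr * m_hat                       # parameter update
        g_prev = g
    return theta
```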

Theoretical Analysis and Convergence Guarantees

A significant contribution of ADOPT is its theoretical convergence properties (Taniguchi et al., 5 Nov 2024). The authors prove that ADOPT achieves the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$ for the average squared gradient norm in non-convex stochastic optimization, i.e., $\frac{1}{T} \sum_{t=1}^T \mathbb{E}\!\left[\|\nabla f(\theta_t)\|^2\right] = \mathcal{O}(1/\sqrt{T})$.

Crucially, this convergence guarantee holds under standard assumptions (e.g., Lipschitz smooth objective function, bounded variance of stochastic gradients) and possesses two key advantages over previous methods:

  1. Independence from $\beta_2$: The convergence rate is attained for any choice of $\beta_2 \in [0, 1)$. This restores the adaptivity that was compromised in theoretical analyses of Adam, removing the need for problem-dependent tuning of $\beta_2$ to ensure convergence.
  2. No Bounded Noise Assumption: Unlike AMSGrad and other variants whose proofs rely on uniformly bounded gradient norms, ADOPT's convergence guarantee does not require this restrictive condition. It relies only on the more standard assumption that the stochastic gradients have bounded variance around the true gradient; one common formalization of these assumptions is sketched below.
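For a differentiable objective $f$ with stochastic gradients $g_t$, one common formalization (stated here for orientation; the paper's exact conditions may differ in detail) is:

$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y$ ($L$-smoothness),

$\mathbb{E}\!\left[\|g_t - \nabla f(\theta_t)\|^2\right] \le \sigma^2$ (bounded gradient-noise variance),

with no uniform bound assumed on $\|g_t\|$ itself.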

These theoretical results establish ADOPT as a robust alternative to Adam, providing guaranteed convergence at the optimal rate under realistic conditions, regardless of the choice of $\beta_2$.

Experimental Evaluation

The practical efficacy of ADOPT was evaluated across a diverse set of deep learning tasks (Taniguchi et al., 5 Nov 2024). The experiments included:

  • Image Classification: Tasks involving standard datasets like CIFAR.
  • Generative Modeling: Likely involving training Generative Adversarial Networks (GANs) or other generative models.
  • Natural Language Processing: Tasks potentially including sequence modeling or transformer-based architectures.
  • Deep Reinforcement Learning: Training agents in RL environments.

Across these varied domains, ADOPT reportedly demonstrated superior performance compared to Adam and its variants (potentially including AMSGrad, AdamW, etc.). The consistent improvements observed in these intensive numerical experiments suggest that the theoretical advantages of ADOPT translate into tangible benefits in practical deep learning applications.

Implementation Details

The ADOPT algorithm maintains the core structure of Adam, involving first and second moment estimates. The primary changes lie in how these estimates are updated and utilized.

  • Computational Overhead: The per-iteration cost of ADOPT is expected to be very similar to Adam's. It involves vector additions, scalar multiplications, element-wise squaring, division, and square roots, all of which are computationally efficient. The storage requirements are also comparable: the parameters $\theta$, the first moment vector $m$, and the second moment vector $v$. An additional vector may be needed to store $g_{t-1}^2$ or $g_{t-1}$, depending on the exact implementation of the decoupled $v_t$ update.
  • Hyperparameters: ADOPT retains the standard hyperparameters $\eta$ (learning rate), $\beta_1$ (momentum decay), and $\beta_2$ (second moment decay), plus $\epsilon$. The key theoretical benefit is that the choice of $\beta_2$ does not affect the convergence guarantee, although it may still affect empirical performance and convergence speed, much as $\beta_1$ does. Typical values used for Adam (e.g., $\beta_1 = 0.9$, $\beta_2 = 0.999$) can serve as starting points.
  • Availability: An implementation of ADOPT is provided by the authors and is available at https://github.com/iShohei220/adopt, facilitating its adoption and further evaluation by the research community.
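Assuming the repository exposes a PyTorch-style optimizer class with an Adam-like constructor (the exact import path and argument names below are assumptions and should be checked against the repository's README), usage would be a drop-in replacement for `torch.optim.Adam` along these lines:

```python
import torch
from adopt import ADOPT  # assumed import path; verify against the repository README

model = torch.nn.Linear(10, 1)
optimizer = ADOPT(model.parameters(), lr=1e-3)  # Adam-like arguments assumed

# Toy regression loop to illustrate the optimizer interface.
x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```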

Conclusion

ADOPT presents a modification to the Adam optimizer designed to overcome its known theoretical convergence limitations (Taniguchi et al., 5 Nov 2024). By decoupling the second moment estimate from the current gradient and adjusting the order of normalization and momentum updates, ADOPT achieves provable convergence at the optimal $\mathcal{O}(1/\sqrt{T})$ rate for non-convex stochastic optimization. Notably, this guarantee holds for any $\beta_2 \in [0, 1)$ and does not require the impractical assumption of uniformly bounded gradients. Empirical results across various domains suggest that ADOPT outperforms Adam and related methods, making it a promising alternative for training deep learning models.

Authors (10)
  1. Shohei Taniguchi (6 papers)
  2. Keno Harada (1 paper)
  3. Gouki Minegishi (6 papers)
  4. Yuta Oshima (4 papers)
  5. Seong Cheol Jeong (1 paper)
  6. Go Nagahara (1 paper)
  7. Tomoshi Iiyama (1 paper)
  8. Masahiro Suzuki (55 papers)
  9. Yusuke Iwasawa (43 papers)
  10. Yutaka Matsuo (128 papers)