
ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate (2411.02853v3)

Published 5 Nov 2024 in cs.LG and stat.ML

Abstract: Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., $\beta_2$, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.

Overview of Adam's Convergence Issues

Adam is a widely adopted adaptive gradient algorithm for training deep neural networks, combining the benefits of momentum and RMSProp. Its update rule involves maintaining estimates of the first moment (mean) $m_t$ and the second moment (uncentered variance) $v_t$ of the gradients:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

where $g_t$ is the gradient at step $t$, and $\beta_1, \beta_2 \in [0, 1)$ are exponential decay rates. After bias correction, the parameter update is:

$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
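For concreteness, a minimal NumPy sketch of this standard Adam step (illustrative only, with the usual default hyperparameters assumed) is:

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step at (1-indexed) iteration t, given gradient g."""
    m = beta1 * m + (1.0 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```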

Despite its empirical success, theoretical analyses have revealed potential convergence issues. Notably, convergence guarantees for Adam often require restrictive conditions. A key finding was that Adam can fail to converge even in simple convex settings for certain choices of hyperparameters, particularly $\beta_2$. Specifically, the effective learning rate can become undesirably large in some iterations, preventing convergence. Some analyses showed that convergence requires $\beta_2$ to be chosen based on problem parameters, diminishing Adam's adaptivity.

Limitations of Existing Solutions

Several variants of Adam have been proposed to address its theoretical shortcomings. AMSGrad, for example, modifies the second moment update to ensure a non-increasing effective learning rate by maintaining the maximum of past second moment estimates:

$\hat{v}_t^{\mathrm{AMS}} = \max(\hat{v}_{t-1}^{\mathrm{AMS}}, \hat{v}_t)$

$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t^{\mathrm{AMS}}} + \epsilon}$
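In code, the AMSGrad change amounts to maintaining one extra running maximum (a sketch, mirroring the Adam step above):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: like Adam, but normalize by the running max of v_hat."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    v_max = np.maximum(v_max, v_hat)           # enforces a non-increasing effective learning rate
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```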

While AMSGrad provides convergence guarantees in the online convex optimization setting, these guarantees often rely on the assumption that the stochastic gradients have uniformly bounded norms ($\|g_t\| \le G$ for all $t$). This assumption is considered impractical for many deep learning scenarios, where gradients can vary significantly and may not be uniformly bounded. Therefore, a need exists for an adaptive method that provably converges under more realistic assumptions, without requiring problem-dependent hyperparameter tuning like the original Adam.

The ADOPT Algorithm

The ADOPT algorithm is proposed to rectify the convergence issues associated with Adam under standard assumptions in stochastic non-convex optimization (Taniguchi et al., 5 Nov 2024). ADOPT introduces two primary modifications to the standard Adam update mechanism:

  1. Decoupled Second Moment Estimation: The estimate of the second moment $v_t$ is modified to remove the influence of the current gradient $g_t$. Instead, it likely utilizes past gradient information, such as $g_{t-1}^2$. A plausible update rule based on this description is $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_{t-1}^2$ (for $t > 1$, with appropriate initialization of $v_1$). This decoupling prevents $v_t$ from becoming small precisely when $g_t$ is large, a situation that can lead to excessively large steps in standard Adam.
  2. Modified Update Order: The order of applying the momentum update and the normalization (division by the square root of the second moment estimate) is changed. Instead of applying momentum to the raw gradient $g_t$ and then normalizing the resulting momentum vector $\hat{m}_t$, ADOPT likely normalizes the gradient first using the (bias-corrected) second moment estimate $\hat{v}_t$, and then applies the momentum update to this normalized gradient.

Combining these modifications, a potential formulation for the ADOPT update steps is:

  1. Compute the gradient: $g_t = \nabla f_t(\theta_t)$
  2. Update the second moment (using the previous gradient): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_{t-1}^2$ (with the $t = 1$ case handled by a suitable initialization)
  3. Bias-correct the second moment: $\hat{v}_t = v_t / (1 - \beta_2^t)$
  4. Normalize the current gradient: $\tilde{g}_t = g_t / (\sqrt{\hat{v}_t} + \epsilon)$
  5. Update the first moment (using the normalized gradient): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) \tilde{g}_t$
  6. Bias-correct the first moment: $\hat{m}_t = m_t / (1 - \beta_1^t)$
  7. Update the parameters: $\theta_{t+1} = \theta_t - \eta \hat{m}_t$

This revised structure ensures that the step direction incorporates gradient history through $m_t$, while the step-magnitude scaling, derived from $v_t$, is based on past gradient magnitudes, leading to more stable updates.
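The following is a minimal NumPy sketch of this potential formulation, not the authors' reference implementation (which is available in the linked repository). The `grad_fn` interface and the warm-start of $v$ with the first squared gradient are illustration choices; with such a warm start the exponential moving average is already unbiased, so the separate bias correction of step 3 becomes unnecessary.

```python
import numpy as np

def adopt_sketch(grad_fn, theta0, num_steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of the ADOPT-style steps described above (not the official implementation).

    grad_fn(theta, t) is assumed to return a stochastic gradient of the objective at theta.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)
    v = None       # second moment estimate, built from *past* gradients only
    g_prev = None  # gradient from the previous iteration

    for t in range(1, num_steps + 1):
        g = grad_fn(theta, t)
        if v is None:
            v = g ** 2                                   # warm start at t = 1
        else:
            v = beta2 * v + (1.0 - beta2) * g_prev ** 2  # decoupled from the current g_t
        g_tilde = g / (np.sqrt(v) + eps)                 # normalize before applying momentum
        m = beta1 * m + (1.0 - beta1) * g_tilde          # momentum on the normalized gradient
        m_hat = m / (1.0 - beta1 ** t)                   # bias-correct the first moment
        theta = theta - lr * m_hat                       # parameter update
        g_prev = g
    return theta
```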

Theoretical Analysis and Convergence Guarantees

A significant contribution of ADOPT is its theoretical convergence properties (Taniguchi et al., 5 Nov 2024). The authors prove that ADOPT achieves the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$ for the average squared gradient norm in non-convex stochastic optimization, i.e., $\frac{1}{T} \sum_{t=1}^T \mathbb{E}\!\left[\|\nabla f(\theta_t)\|^2\right] = \mathcal{O}(1/\sqrt{T})$.

Crucially, this convergence guarantee holds under standard assumptions (e.g., Lipschitz smooth objective function, bounded variance of stochastic gradients) and possesses two key advantages over previous methods:

  1. Independence from $\beta_2$: The convergence rate is attained for any choice of $\beta_2 \in [0, 1)$. This restores the adaptivity that was compromised in theoretical analyses of Adam, removing the need for problem-dependent tuning of $\beta_2$ to ensure convergence.
  2. No Bounded Noise Assumption: Unlike AMSGrad and other variants whose proofs rely on uniformly bounded gradient norms, ADOPT's convergence guarantee does not require this restrictive condition. It relies only on the more standard assumption that the stochastic gradients have bounded variance around the true gradient; one common formalization of these assumptions is sketched below.
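For a differentiable objective $f$ with stochastic gradients $g_t$, one common formalization (stated here for orientation; the paper's exact conditions may differ in detail) is:

$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y$ ($L$-smoothness),

$\mathbb{E}\!\left[\|g_t - \nabla f(\theta_t)\|^2\right] \le \sigma^2$ (bounded gradient-noise variance),

with no uniform bound assumed on $\|g_t\|$ itself.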

These theoretical results establish ADOPT as a robust alternative to Adam, providing guaranteed convergence at the optimal rate under realistic conditions, regardless of the choice of $\beta_2$.

Experimental Evaluation

The practical efficacy of ADOPT was evaluated across a diverse set of deep learning tasks (Taniguchi et al., 5 Nov 2024). The experiments included:

  • Image Classification: Tasks involving standard datasets like CIFAR.
  • Generative Modeling: Likely involving training Generative Adversarial Networks (GANs) or other generative models.
  • Natural Language Processing: Tasks potentially including sequence modeling or transformer-based architectures.
  • Deep Reinforcement Learning: Training agents in RL environments.

Across these varied domains, ADOPT reportedly demonstrated superior performance compared to Adam and its variants (potentially including AMSGrad, AdamW, etc.). The consistent improvements observed in these intensive numerical experiments suggest that the theoretical advantages of ADOPT translate into tangible benefits in practical deep learning applications.

Implementation Details

The ADOPT algorithm maintains the core structure of Adam, involving first and second moment estimates. The primary changes lie in how these estimates are updated and utilized.

  • Computational Overhead: The per-iteration cost of ADOPT is expected to be very similar to Adam's. It involves vector additions, scalar multiplications, element-wise squaring, division, and square roots, all of which are computationally efficient. The storage requirements are also comparable: the parameters $\theta$, the first moment vector $m$, and the second moment vector $v$. An additional vector may be needed to store $g_{t-1}^2$ or $g_{t-1}$, depending on the exact implementation of the decoupled $v_t$ update.
  • Hyperparameters: ADOPT retains the standard hyperparameters $\eta$ (learning rate), $\beta_1$ (momentum decay), and $\beta_2$ (second moment decay), plus $\epsilon$. The key theoretical benefit is that the choice of $\beta_2$ does not affect the convergence guarantee, although it may still affect empirical performance and convergence speed, much as $\beta_1$ does. Typical values used for Adam (e.g., $\beta_1 = 0.9$, $\beta_2 = 0.999$) can serve as starting points.
  • Availability: An implementation of ADOPT is provided by the authors and is available at https://github.com/iShohei220/adopt, facilitating its adoption and further evaluation by the research community.
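Assuming the repository exposes a PyTorch-style optimizer class with an Adam-like constructor (the exact import path and argument names below are assumptions and should be checked against the repository's README), usage would be a drop-in replacement for `torch.optim.Adam` along these lines:

```python
import torch
from adopt import ADOPT  # assumed import path; verify against the repository README

model = torch.nn.Linear(10, 1)
optimizer = ADOPT(model.parameters(), lr=1e-3)  # Adam-like arguments assumed

# Toy regression loop to illustrate the optimizer interface.
x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```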

Conclusion

ADOPT presents a modification to the Adam optimizer designed to overcome its known theoretical convergence limitations (Taniguchi et al., 5 Nov 2024). By decoupling the second moment estimate from the current gradient and adjusting the order of normalization and momentum updates, ADOPT achieves provable convergence at the optimal $\mathcal{O}(1/\sqrt{T})$ rate for non-convex stochastic optimization. Notably, this guarantee holds for any $\beta_2 \in [0, 1)$ and does not require the impractical assumption of uniformly bounded gradients. Empirical results across various domains suggest that ADOPT outperforms Adam and related methods, making it a promising alternative for training deep learning models.

Authors (10)
  1. Shohei Taniguchi (6 papers)
  2. Keno Harada (1 paper)
  3. Gouki Minegishi (6 papers)
  4. Yuta Oshima (4 papers)
  5. Seong Cheol Jeong (1 paper)
  6. Go Nagahara (1 paper)
  7. Tomoshi Iiyama (1 paper)
  8. Masahiro Suzuki (55 papers)
  9. Yusuke Iwasawa (43 papers)
  10. Yutaka Matsuo (128 papers)