Adan Optimization Algorithm

Updated 2 October 2025
  • The Adan Optimization Algorithm is an adaptive gradient method that reformulates classical Nesterov momentum so that future gradients are predicted from current and past gradient information.
  • It integrates per-coordinate adaptive learning rates and decoupled weight decay, yielding faster convergence and robust performance across vision, NLP, and reinforcement learning tasks.
  • The algorithm attains near-optimal stochastic gradient complexity bounds and reaches target accuracy in fewer training epochs, with stable behavior across a wide range of batch sizes.

The Adan Optimization Algorithm is an adaptive gradient-based method designed to accelerate and stabilize the training of deep neural networks through an innovative Nesterov momentum estimation. Adan reformulates classical Nesterov acceleration to eliminate the need for gradient evaluations at extrapolated points, integrates this approach within a per-coordinate adaptive framework, and applies it across diverse architectures in vision, language, and reinforcement learning domains. The algorithm introduces a fast and robust update rule, achieves near-optimal stochastic gradient complexity bounds, and demonstrates strong empirical performance over a broad range of model types and batch sizes (Xie et al., 2022).

1. Mathematical Foundations and Update Rule

The Adan algorithm is structured around the concept of Nesterov Momentum Estimation (NME), which predicts the "look-ahead" gradient via a correction term, leveraging information from previous iterations. Unlike standard Nesterov methods that require gradient computation at extrapolated points, Adan performs all updates at the current parameter set. The key update steps are:

  • Gradient-difference (look-ahead) correction:

g_k' = g_k + (1 - \beta_2)(g_k - g_{k-1})

  • First moment of the gradient, first moment of the gradient difference, and second moment of the corrected gradient:

m_k = (1 - \beta_1) m_{k-1} + \beta_1 g_k

v_k = (1 - \beta_2) v_{k-1} + \beta_2 (g_k - g_{k-1})

n_k = (1 - \beta_3) n_{k-1} + \beta_3 (g_k')^2

  • Adaptive step size:

\eta_k = \frac{\eta}{\sqrt{n_k} + \epsilon}

  • Parameter update with decoupled weight decay, similar to AdamW:

\theta_{k+1} = (1 + \lambda_k \eta)^{-1}\left[\theta_k - \eta_k \circ (m_k + (1 - \beta_2) v_k)\right]

where g_k is the stochastic gradient at θ_k, the β parameters are exponential decay rates for the respective moment estimates, λ_k is the weight-decay coefficient, and ∘ denotes element-wise (per-coordinate) multiplication.

This formulation enables Adan to normalize, stabilize, and accelerate per-coordinate updates, maintaining robustness to gradient noise and curvature variation. Adan leverages both first-order and second-order moment information and tightly integrates momentum correction with coordinate-wise adaptive scaling.
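
For concreteness, the following is a minimal NumPy sketch of a single Adan update on a flat parameter vector, following the equations above. The function name, the placeholder hyperparameter values, and the handling of the first iteration are illustrative choices, not the reference implementation; initialization of the moment buffers and other details of the released code are omitted.

```python
import numpy as np

def adan_step(theta, g, g_prev, m, v, n, lr,
              betas=(0.02, 0.08, 0.01), eps=1e-8, weight_decay=0.0,
              first_step=False):
    """One Adan update (illustrative sketch, not the reference implementation)."""
    beta1, beta2, beta3 = betas
    diff = np.zeros_like(g) if first_step else g - g_prev   # g_k - g_{k-1}

    g_corr = g + (1.0 - beta2) * diff          # corrected ("look-ahead") gradient g_k'

    m = (1.0 - beta1) * m + beta1 * g          # EMA of gradients
    v = (1.0 - beta2) * v + beta2 * diff       # EMA of gradient differences
    n = (1.0 - beta3) * n + beta3 * g_corr**2  # EMA of squared corrected gradients

    eta_k = lr / (np.sqrt(n) + eps)            # per-coordinate step size

    # Decoupled (proximal-style) weight decay, as in the update formula above
    theta = (theta - eta_k * (m + (1.0 - beta2) * v)) / (1.0 + weight_decay * lr)
    return theta, m, v, n
```

Here each beta is the weight on the newest term, matching the (1 − β) convention in the formulas above; implementations that follow Adam's interface may parameterize the complementary decay factors instead.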

2. Nesterov Momentum Estimation versus Classical Methods

In canonical Nesterov Accelerated Gradient (NAG), one first extrapolates along the momentum direction and then evaluates the gradient at the extrapolated point, an added computational burden in deep learning and backpropagation contexts. Adan’s NME sidesteps this by approximating the look-ahead gradient with a finite-difference correction built from gradients that are already available, g_k + (1 - β)(g_k - g_{k-1}); in the full algorithm the gradient term and the gradient-difference term receive separate decay rates β1 and β2. This estimation uses only information already computed, lowering overhead and avoiding the issues associated with extrapolated points. The resulting updates allow Adan to anticipate changes in the gradient trajectory from recent history, leading to improved prediction and smoother parameter updates.
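
As a toy illustration (a self-contained sketch with arbitrarily chosen numbers, not an experiment from the paper), the snippet below compares the gradient at an extrapolated point with the finite-difference prediction on a quadratic, taking the extrapolation along the previous step; the two coincide exactly here because the gradient is linear, while for general losses the prediction is an approximation.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T A x, so grad(x) = A @ x (linear in x).
A = np.diag([1.0, 10.0])

def grad(x):
    return A @ x

beta = 0.1
x_prev = np.array([1.0, 1.0])
x_curr = np.array([0.9, 0.7])            # pretend this came from the last update
step = x_curr - x_prev

g_prev, g_curr = grad(x_prev), grad(x_curr)

# What NAG would need: the gradient at an extrapolated ("look-ahead") point.
g_ahead = grad(x_curr + (1 - beta) * step)

# Adan-style prediction from gradients that were already computed.
g_pred = g_curr + (1 - beta) * (g_curr - g_prev)

print("look-ahead gradient:", g_ahead)
print("NME-style prediction:", g_pred)   # identical here because grad is linear
```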

3. Theoretical Guarantees and Complexity Bounds

Adan's convergence analysis establishes an upper bound of

\mathcal{O}(\epsilon^{-3.5})

stochastic gradient evaluations required to reach an ε-approximate first-order stationary point in non-convex stochastic problems—under Lipschitz-gradient and Lipschitz-Hessian assumptions. This matches the best-known lower bound for such problems.

With a restart strategy and appropriately chosen hyperparameters (β of order ε² and η also ∼ ε²), the bound is achieved for the dynamic objective that decouples the loss from the ℓ₂ weight decay term. Without a Lipschitz Hessian, Adan exhibits slightly slower convergence, O(ε⁻⁴). Compared to standard adaptive methods (which typically do not reach this bound), Adan's complexity rate is optimal for single-gradient-query stochastic optimization.

4. Empirical Performance on Deep Models

The algorithm was extensively evaluated on vision, NLP, and RL benchmarks:

  • Vision: On architectures such as ResNet, ConvNeXt, ViT, Swin, MAE, and DETR, Adan achieves accuracy superior or comparable to AdamW, LAMB, SGD with momentum, and SAM, often requiring only about half as many training epochs.
  • NLP: Experiments on GPT-2, Transformer-XL, BERT, and LSTM models demonstrate accelerated convergence and improved perplexity or accuracy metrics.
  • RL: In PPO-based MuJoCo environments, PPO-Adan attains higher average rewards than PPO-Adam under identical settings.

Adan achieves robustness across minibatch sizes, maintaining performance from batch sizes of 1k up to 32k.

5. Implementation and Integration

Adan’s design minimizes integration friction for practitioners:

  • It follows the interface conventions of Adam-type optimizers, making it a drop-in replacement in frameworks such as PyTorch and TensorFlow (see the usage sketch after this list).
  • Open-source code is available at https://github.com/sail-sg/Adan, used in multiple deep learning frameworks and projects.
  • The decoupled weight decay and NME are implemented internally and require no architectural changes to apply, allowing existing model codebases to benefit from Adan without adaptation.
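
A minimal usage sketch in PyTorch is shown below. It assumes the optimizer class from the sail-sg/Adan repository has been installed and is importable as `Adan`; the argument names follow Adam-style conventions, and the specific values and keyword names here are placeholders to be checked against the repository's README rather than recommended settings.

```python
import torch
from adan import Adan  # assumes the optimizer from github.com/sail-sg/Adan is installed

model = torch.nn.Linear(128, 10)

# Construct Adan where an Adam-style optimizer would normally be created.
# betas holds three decay rates (one per moment estimate); the values below
# are placeholders, not necessarily the recommended defaults.
optimizer = Adan(model.parameters(), lr=1e-3, betas=(0.98, 0.92, 0.99),
                 weight_decay=0.02)

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()  # standard training-loop usage; no other code changes needed
```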

6. Practical Implications: Robustness and Adaptivity

Adan delivers practical advantages in large-scale training:

  • Reduced training cost: Empirically, high-accuracy solutions require fewer epochs.
  • Reduced difficulty in optimizer selection: Adan generalizes well across convolutional, transformer, and self-supervised architectures, reducing the need for multiple optimizer trials.
  • Improved generalization: Decoupled weight decay improves stability and final test accuracy, especially important for large-scale overparameterized models.
  • Batch size flexibility: Effective across a wide range, supporting distributed and large-batch regimes.

7. Relationships to Contemporary Optimization Algorithms

Relative to AdaBelief, Padam, Adam, and AdaFamily variants:

  • Adan’s predictive momentum estimation offers faster and smoother convergence than the non-predictive (heavy-ball-style) moment updates used in Adam.
  • Adan’s decoupled weight decay mirrors AdamW, enhancing generalization.
  • Adan provides more robust learning curves compared to conventional adaptive optimizers, especially in early epochs and under noisy gradients.

Adan’s novel approach to momentum and per-iteration adaptation positions it as a widely applicable optimizer for diverse deep learning applications, merging speed, accuracy, and general robustness within a principled mathematical and empirical framework (Xie et al., 2022).

References

Xie, X., Zhou, P., Li, H., Lin, Z., & Yan, S. (2022). Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. arXiv:2208.06677.
