AdamW with EMA: Enhanced Deep Training

Updated 18 April 2026

AdamW with EMA is a combination of adaptive optimization and exponential moving averages that improves model stability and convergence.
EMA, applied to either model parameters or momenta, smooths out updates and reduces oscillatory behavior during training.
Empirical studies across vision, language, and tabular tasks demonstrate that EMA variants consistently enhance performance with minimal computational cost.

AdamW with Exponential Moving Average (EMA) denotes the combination of the AdamW optimizer, which decouples weight decay from adaptive moment estimation, with an exponential moving average applied to model parameters or, in some extensions, to optimizer momenta. The combination targets improved stability, generalization, and convergence in deep neural network training, and recurs in modern large-scale vision, language, and tabular learning. Recent variants and empirical studies have systematized the benefits of EMA in both its standard and advanced forms, clarifying its theoretical underpinnings and guiding practical deployment.

1. Definition and Basic Principles

AdamW is an adaptive gradient optimizer that decouples the weight decay regularization term from gradient-based updates. Formally, given gradient $g_t = \nabla_\theta \ell(\theta_{t-1})$ at step $t$ , the AdamW update is:

$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\ \hat m_t &= m_t / (1 - \beta_1^t), \quad \hat v_t = v_t / (1 - \beta_2^t) \\ \theta_t &= \theta_{t-1} - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon} + \lambda \theta_{t-1} \right) \end{aligned}$

where $\eta$ is the learning rate, $\lambda$ is the weight decay, and $\varepsilon$ is a stability constant.
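As a minimal sketch of the update above, the following pure-Python function applies one AdamW step to a list of scalar parameters. The function name `adamw_step` and the default hyperparameter values are illustrative assumptions, not taken from any particular library.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """Apply one AdamW update in place; t is the 1-based step count."""
    for i, g in enumerate(grad):
        # First and second moment estimates.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialised moments.
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        # Adaptive step plus decoupled weight decay on the previous weights.
        theta[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta[i])
    return theta, m, v

# Illustrative usage: two parameters, one step.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
adamw_step(theta, grad=[0.5, -0.5], m=m, v=v, t=1)
```

At the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate, plus the decay term proportional to its own value.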

EMA is a parameter-averaging method: a “shadow” copy of the weights, $\bar\theta_t$ , is maintained as

$\bar\theta_t = \alpha\,\bar\theta_{t-1} + (1-\alpha)\,\theta_t$

with $\alpha \in (0,1)$ as the EMA decay rate. At evaluation or checkpoint time, $\bar\theta_t$ is swapped in to compute predictions.
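Unrolling the recursion makes the weighting explicit. As a worked step, assuming the shadow copy starts from some initial value $\bar\theta_0$ :

$\bar\theta_t = \alpha^t\,\bar\theta_0 + (1-\alpha)\sum_{k=1}^{t} \alpha^{t-k}\,\theta_k$

The weight on each past iterate decays geometrically with its age, so the average effectively spans about $1/(1-\alpha)$ recent steps. For instance, an illustrative choice of $\alpha = 0.999$ corresponds to a horizon of roughly 1000 steps.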

When AdamW is combined with EMA, the optimizer benefits from both adaptive updates and trajectory-averaged weights, targeting flatter minima and improved generalization across domains (Gorishniy et al., 16 Apr 2026, Li et al., 2024).

2. EMA Interpretations and Theoretical Foundations

Several lines of recent work clarify the role of EMA within AdamW dynamics. Most notably, it has been shown that the decoupled weight decay of AdamW imparts an intrinsic EMA structure to the parameter sequence itself. Specifically, letting $\rho = \eta\lambda$ and defining $q_t = -\frac{1}{\lambda}\,\frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}$ , the update can be rewritten as

$\theta_t = (1-\rho)\,\theta_{t-1} + \rho\, q_t$

which is an exponential moving average of the transformed gradient estimates, with EMA timescale $\tau_{\text{iter}} = 1/(\eta\lambda)$ iterations. This equivalence motivates principled rules for scaling weight decay with respect to dataset and model size, and links the optimizer’s regularization effect directly to the implicit memory of past updates (Wang et al., 2024).
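The equivalence can be checked numerically. The sketch below compares the decoupled-decay form of the AdamW step from Section 1 with an EMA step whose coefficient is $\eta\lambda$ , so that the timescale is $1/(\eta\lambda)$ iterations. The names `direction` (standing for $\hat m_t/(\sqrt{\hat v_t}+\varepsilon)$ ), `rho`, and `q`, and all numeric values, are hypothetical choices made for illustration.

```python
def adamw_form(theta_prev, direction, lr, weight_decay):
    # theta_t = theta_{t-1} - lr * (direction + weight_decay * theta_{t-1})
    return theta_prev - lr * (direction + weight_decay * theta_prev)

def ema_form(theta_prev, direction, lr, weight_decay):
    rho = lr * weight_decay          # EMA coefficient, i.e. 1 / timescale
    q = -direction / weight_decay    # rescaled update direction
    return (1.0 - rho) * theta_prev + rho * q

lr, weight_decay = 1e-3, 0.1         # illustrative values only
theta_a = theta_b = 2.0
for direction in [1.0, -0.5, 0.25, 0.8]:
    theta_a = adamw_form(theta_a, direction, lr, weight_decay)
    theta_b = ema_form(theta_b, direction, lr, weight_decay)

# Implied EMA timescale in iterations.
timescale_iters = 1.0 / (lr * weight_decay)
```

Both forms produce the same trajectory up to floating-point rounding. With these values the implied timescale is about 10,000 iterations.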

Moreover, EMA regularization is widely understood to improve optimization trajectory stability, reduce oscillatory behavior, and drive the learned parameters to flatter optima. In theoretical analyses of quadratic models, for example, the variance of EMA-averaged parameters is strictly lower than that of the plain optimizer iterates. Variants such as “Switch EMA” (SEMA), where model weights are periodically overwritten with their EMA value, can further accelerate convergence and improve flatness by alternating fast exploration with slow averaging (Li et al., 2024).

3. Implementations: Parameter and Moment EMA

The simplest and most widespread use of EMA with AdamW is as a parameter averaging mechanism. The procedure is as follows:

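Initialize $\bar\theta_0 = \theta_0$ .
After every AdamW update to $\theta_t$ , update $\bar\theta_t = \alpha\,\bar\theta_{t-1} + (1-\alpha)\,\theta_t$ .
For validation, and at test time, use $\bar\theta_t$ in place of $\theta_t$ .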

Typical $\alpha$ values fall in […], with no warmup or schedule required (Gorishniy et al., 16 Apr 2026). The computational cost is minimal: one vector scale-and-add per step, and a single copy of parameters is maintained in parallel.
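The three-step procedure can be sketched as a small training loop. This is an illustrative toy, not a reference implementation: it uses a single scalar parameter, a noisy gradient of the loss $\tfrac{1}{2}\theta^2$ , and assumed values for the learning rate, weight decay, and an EMA decay of 0.99 that are not taken from the cited work.

```python
import math
import random

def adamw_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, m, v

def train_with_ema(steps=2000, alpha=0.99, seed=0):
    rng = random.Random(seed)
    theta, m, v = 5.0, 0.0, 0.0
    shadow = theta                                # initialise shadow = theta_0
    for t in range(1, steps + 1):
        grad = theta + rng.gauss(0.0, 1.0)        # noisy gradient of 0.5 * theta^2
        theta, m, v = adamw_step(theta, grad, m, v, t)
        shadow = alpha * shadow + (1.0 - alpha) * theta   # EMA after each step
    return theta, shadow

theta, shadow = train_with_ema()
eval_param = shadow   # evaluate and checkpoint with the shadow weights
```

The live weights `theta` keep driving training, while evaluation reads `shadow`. The only extra work per step is the single scale-and-add on the shadow copy.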

Recent contributions have proposed variants:

SEMA (“Switch EMA”): At the end of each training epoch, the current parameters $\theta_t$ are overwritten by their EMA, i.e., set $\theta_t \leftarrow \bar\theta_t$ , then resume both regular and EMA updates as before (Li et al., 2024). This alternating scheme further balances trajectory stability with the descent properties of AdamW.
AdEMAMix: Instead of operating solely on parameters, this method introduces a mixture of “fast” and “slow” EMA tracks of the raw gradients, which are bias-corrected and mixed before applying the usual AdamW step. This dual-EMA approach explicitly separates local responsiveness and long-term memory, capturing both immediate and distant past gradient information (Pagliardini et al., 2024).
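The SEMA switch can be sketched on top of any in-place optimizer step. In the hedged example below, `train_sema`, `ema_update`, and `toy_step` are hypothetical names, the decay of 0.99 is an assumed value, and plain gradient descent on $\tfrac{1}{2}x^2$ stands in for AdamW to keep the block short.

```python
def ema_update(shadow, theta, alpha):
    """In-place EMA of theta into shadow."""
    for i in range(len(theta)):
        shadow[i] = alpha * shadow[i] + (1.0 - alpha) * theta[i]

def train_sema(theta, optimizer_step, epochs, steps_per_epoch, alpha=0.99):
    """optimizer_step(theta) mutates theta in place (e.g. one AdamW update)."""
    shadow = list(theta)
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            optimizer_step(theta)
            ema_update(shadow, theta, alpha)
        # Switch: overwrite the live weights with their EMA at epoch end.
        # Any optimizer state (such as AdamW moments) is left untouched.
        theta[:] = shadow
    return theta, shadow

# Stand-in optimizer for illustration: gradient descent on 0.5 * x^2.
def toy_step(theta, lr=0.1):
    for i in range(len(theta)):
        theta[i] -= lr * theta[i]

theta, shadow = train_sema([4.0, -2.0], toy_step, epochs=3, steps_per_epoch=10)
```

After each switch the live weights and the shadow copy coincide, and the next epoch resumes from the averaged point.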

4. Empirical Results Across Domains

Extensive empirical studies substantiate the benefit of AdamW with EMA variants:

Tabular MLPs: AdamW plus EMA (with $\alpha$ in […]) yielded a +0.66% relative unified score improvement across 17 datasets, winning or tying baseline AdamW in 16/17 cases. Gains are clearest for vanilla MLPs, diminishing for deep ensembles or models with feature embeddings (Gorishniy et al., 16 Apr 2026).
Vision Classification: On ImageNet-1K (DeiT-S, 300 epochs), top-1 accuracy improved from 80.0% (AdamW) to 80.2% (AdamW+EMA) to 80.6% (AdamW+SEMA). On CIFAR-100 (ResNet-18, 200 epochs): 76.91% (AdamW), 77.16% (+EMA), 77.61% (+SEMA) (Li et al., 2024).
Large-Scale Language Modeling: AdEMAMix, a dual-EMA modification of AdamW, enabled a 1.3B-parameter LLM trained on 101B tokens to match the loss of an AdamW baseline trained on 197B tokens, meaning the baseline needed roughly 95% more tokens. Similarly rapid loss reductions were observed in smaller-scale Transformers, Mamba models, and on a variety of in-context evaluation tasks (Pagliardini et al., 2024).
Forgetting Experiments: AdEMAMix preserved injected batch-dependent gradient information over tens of thousands of steps, in contrast to AdamW, which forgets such information rapidly (Pagliardini et al., 2024).

5. Methodological Details and Hyperparameter Tuning

Standard AdamW with EMA

Parameter EMA: Update after every optimizer step. Decay rate $\alpha$ typically in […]. No warmup or schedule required; start from step 0.
Switch EMA (SEMA): Copy EMA weights to the model after every epoch. Do not reset optimizer moments.
Checkpointing/Evaluation: Always use the EMA-averaged parameters.
Cost: One additional parameter copy; negligible compute overhead (Gorishniy et al., 16 Apr 2026, Li et al., 2024).

AdEMAMix

Dual-gradient EMA: Maintain two gradient EMAs, a fast ( $m_1$ ) and slow ( $m_2$ ) branch, each with its own decay, then mix post bias-correction.
Mixing coefficient $\alpha$ (distinct from the parameter-EMA decay used earlier): Typical values […], linearly scheduled during training.
Warmup scheduling: Gradually ramp $\alpha$ and $\beta_3$ from initial values to final over $T_{\alpha,\beta_3}$ steps to avoid instability.
Memory: One additional gradient buffer. If memory is a concern, the fast EMA can be dropped entirely, which brings the cost back to that of AdamW at the price of slightly noisier updates (Pagliardini et al., 2024).
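A hedged sketch of a dual-EMA step in the spirit of AdEMAMix follows. Several choices here are assumptions made for illustration and are labeled in the comments: the slow-track decay `beta3=0.9999` and target mixing coefficient `alpha_final=5.0` are example values, only the fast track and the second moment are bias-corrected, and only the mixing coefficient is warmed up (linearly) while `beta3` is held constant. The helper names `alpha_warmup` and `ademamix_step` are hypothetical.

```python
import math

def alpha_warmup(t, alpha_final, warmup_steps):
    """Linear ramp of the mixing coefficient from 0 to alpha_final."""
    return min(t / warmup_steps, 1.0) * alpha_final

def ademamix_step(theta, grad, state, t, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha_final=5.0, warmup_steps=1000,
                  eps=1e-8, weight_decay=0.0):
    """One dual-EMA update in place; t is the 1-based step count."""
    alpha = alpha_warmup(t, alpha_final, warmup_steps)
    for i, g in enumerate(grad):
        # Fast and slow EMAs of the raw gradient, plus the second moment.
        state["m_fast"][i] = beta1 * state["m_fast"][i] + (1 - beta1) * g
        state["m_slow"][i] = beta3 * state["m_slow"][i] + (1 - beta3) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Assumption: bias-correct the fast track and v, not the slow track.
        m_fast_hat = state["m_fast"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        mixed = m_fast_hat + alpha * state["m_slow"][i]
        theta[i] -= lr * (mixed / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta[i])
    return theta

n = 2
state = {"m_fast": [0.0] * n, "m_slow": [0.0] * n, "v": [0.0] * n}
theta = [1.0, -1.0]
for t in range(1, 4):
    ademamix_step(theta, [0.3, -0.3], state, t)
```

Relative to AdamW, the only extra state is the `m_slow` buffer, which matches the memory note above. Setting `alpha_final=0.0` removes the slow track's contribution and recovers a plain AdamW step.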

Weight Decay Scaling

The equivalence of AdamW weight decay to an EMA timescale leads to the practical rule:

$\lambda = \frac{1}{\eta\,\tau_{\text{epoch}}\,M}$

where $\lambda$ is weight decay, $\eta$ the learning rate, $\tau_{\text{epoch}}$ the EMA timescale in epochs, and $M$ the number of gradient steps per epoch. Scaling recommendations are:

On $k\times$ larger datasets, reduce $\lambda$ by $k\times$ .
For width scaling under $\mu$-Param ( $\eta \propto 1/\text{width}$ ), increase $\lambda$ by $\lambda \propto \text{width}$ to keep timescale fixed (Wang et al., 2024).
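As a small worked example, the helpers below assume the relation $\tau_{\text{epoch}} = 1/(\eta\lambda M)$ , with $M$ the number of gradient steps per epoch, which follows from reading the AdamW step as an EMA with coefficient $\eta\lambda$ . The function names and all numeric values are hypothetical and chosen only to show how the scaling rules fall out.

```python
def weight_decay_for_timescale(lr, tau_epochs, steps_per_epoch):
    """Weight decay that gives an EMA timescale of tau_epochs epochs."""
    return 1.0 / (lr * tau_epochs * steps_per_epoch)

def timescale_in_epochs(lr, weight_decay, steps_per_epoch):
    """EMA timescale, in epochs, implied by lr and weight decay."""
    return 1.0 / (lr * weight_decay * steps_per_epoch)

# Baseline: illustrative learning rate and dataset size.
base = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0, steps_per_epoch=1000)

# 4x larger dataset at the same batch size: 4x more steps per epoch,
# so weight decay shrinks by 4x to keep the timescale at one epoch.
bigger_data = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0, steps_per_epoch=4000)

# 4x smaller learning rate (as when the rate shrinks with width):
# weight decay grows by 4x to keep the timescale fixed.
smaller_lr = weight_decay_for_timescale(lr=1e-3 / 4, tau_epochs=1.0, steps_per_epoch=1000)
```

Holding the timescale fixed ties weight decay inversely to both the learning rate and the number of steps per epoch, which is the content of the two scaling recommendations.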

6. Practical Integration and Implementation Considerations

AdamW with (parameter or moment) EMA is a drop-in extension for any pipeline. Key considerations are:

Maintain a shadow EMA buffer, updated each step; under SEMA, additionally copy it into the model at the end of each epoch.
For AdEMAMix, add an additional EMA buffer for slow gradients and implement schedulers for mixing.
Always evaluate/checkpoint with the EMA parameter set.
Memory use: one extra full parameter or moment buffer. Compute overhead: the per-step vector arithmetic is negligible compared with forward/backward operations (Li et al., 2024).
No impact on optimizer state beyond the additional buffer; AdamW moments continue uninterrupted under SEMA.

7. Key Empirical and Theoretical Takeaways

AdamW combined with EMA variants consistently improves optimization and generalization across vision, language, and tabular tasks, at negligible wall-clock or implementation cost. Parameter EMA yields systematic improvements for vanilla networks, while SEMA sharpens these benefits and accelerates convergence. The dual-gradient EMA design (AdEMAMix) further addresses the intrinsic trade-off in a single EMA tracker between rapid responsiveness to recent gradients and retention of distant past information. Careful analysis of EMA’s interaction with weight decay clarifies principled hyperparameter scaling as models and datasets grow. Across all variants, the central insight is that moving average mechanisms, when paired with adaptive optimizers and decoupled regularization, offer robust, theoretically motivated, and empirically validated paths to improved deep model training (Gorishniy et al., 16 Apr 2026, Li et al., 2024, Pagliardini et al., 2024, Wang et al., 2024).