AdamW with EMA: Enhanced Deep Training

Updated 18 April 2026
  • AdamW with EMA is a combination of adaptive optimization and exponential moving averages that improves model stability and convergence.
  • EMA, applied to either model parameters or momenta, smooths out updates and reduces oscillatory behavior during training.
  • Empirical studies across vision, language, and tabular tasks demonstrate that EMA variants consistently enhance performance with minimal computational cost.

AdamW with Exponential Moving Average (EMA) denotes the combination of the AdamW optimizer (adaptive moment estimation with decoupled weight decay) with an exponential moving average applied to model parameters or, in some extensions, to optimizer momenta. The combination targets improved stability, generalization, and convergence in deep neural network training, and recurs in modern large-scale vision, language, and tabular learning. Recent variants and empirical studies have systematized the benefits of EMA in both its standard and advanced forms, clarifying its theoretical underpinnings and guiding practical deployment.

1. Definition and Basic Principles

AdamW is an adaptive gradient optimizer that decouples the weight decay regularization term from gradient-based updates. Formally, given gradient $g_t = \nabla_\theta \ell(\theta_{t-1})$ at step $t$, the AdamW update is:

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\ \hat m_t &= m_t / (1 - \beta_1^t), \quad \hat v_t = v_t / (1 - \beta_2^t) \\ \theta_t &= \theta_{t-1} - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon} + \lambda\, \theta_{t-1} \right) \end{aligned}$$

where $\eta$ is the learning rate, $\lambda$ is the weight decay, and $\varepsilon$ is a stability constant.
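The update above can be sketched in a few lines of plain Python. This is a minimal illustration on flat lists of floats, not a reference implementation; the function name `adamw_step` and the default hyperparameter values are assumptions chosen for the example and are not taken from the text.

```python
import math


def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update on flat lists of floats; t is the 1-based step index.

    Defaults are illustrative assumptions, not values from the article.
    """
    new_theta, new_m, new_v = [], [], []
    for th, mi, vi, g in zip(theta, m, v, grad):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment EMA m_t
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment EMA v_t
        m_hat = mi / (1 - beta1 ** t)            # bias-corrected moments
        v_hat = vi / (1 - beta2 ** t)
        # Decoupled weight decay: lambda * theta sits outside the adaptive ratio.
        th = th - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * th)
        new_theta.append(th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


# Usage: a single step from zero-initialized moments.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adamw_step(theta, m, v, grad=[0.5, -0.5], t=1)
```

At the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` plus the decay term.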

EMA is a parameter-averaging method: a “shadow” copy of the weights, $\bar\theta_t$, is maintained as

$$\bar\theta_t = \alpha\,\bar\theta_{t-1} + (1-\alpha)\,\theta_t$$

with $\alpha \in (0,1)$ as the EMA decay rate. At evaluation or checkpoint time, $\bar\theta_t$ is swapped in to compute predictions.

When AdamW is combined with EMA, the optimizer benefits from both adaptive updates and trajectory-averaged weights, targeting flatter minima and improved generalization across domains (Gorishniy et al., 16 Apr 2026, Li et al., 2024).

2. EMA Interpretations and Theoretical Foundations

Several lines of recent work clarify the role of EMA within AdamW dynamics. Most notably, it has been shown that the decoupled weight decay of AdamW imparts an intrinsic EMA structure to the parameter sequence itself. Specifically, letting $u_t = \hat m_t / (\sqrt{\hat v_t} + \varepsilon)$ and defining $q_t = -u_t / \lambda$, the update can be rewritten as

$$\theta_t = (1 - \eta\lambda)\,\theta_{t-1} + \eta\lambda\, q_t$$

which is an exponential moving average of the transformed gradient estimates, with EMA timescale $\tau_{\text{iter}} = 1/(\eta\lambda)$ iterations. This equivalence motivates principled rules for scaling weight decay with respect to dataset and model size, and links the optimizer’s regularization effect directly to the implicit memory of past updates (Wang et al., 2024).
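The rewriting can be checked numerically. The sketch below compares the AdamW parameter update with an EMA written with coefficient $\eta\lambda$, where `u` stands for the normalized step $\hat m_t / (\sqrt{\hat v_t} + \varepsilon)$ and `q = -u / wd` is the rescaled target. The names `adamw_form`, `ema_form`, and `max_gap`, and the numeric values, are notation and inputs assumed for this example only.

```python
def adamw_form(theta, u, lr, wd):
    """Parameter update as written in the AdamW rule: theta - lr * (u + wd * theta)."""
    return theta - lr * (u + wd * theta)


def ema_form(theta, u, lr, wd):
    """Same update written as an EMA with coefficient lr * wd toward q = -u / wd."""
    q = -u / wd
    a = lr * wd
    return (1 - a) * theta + a * q


def max_gap(us, theta0, lr, wd):
    """Largest absolute difference between the two forms along a trajectory."""
    theta, gap = theta0, 0.0
    for u in us:
        a = adamw_form(theta, u, lr, wd)
        b = ema_form(theta, u, lr, wd)
        gap = max(gap, abs(a - b))
        theta = a
    return gap


# Arbitrary normalized steps; the two forms agree up to floating-point rounding.
gap = max_gap([0.3, -1.0, 0.7, 0.1], theta0=0.5, lr=0.01, wd=0.1)
timescale_iters = 1.0 / (0.01 * 0.1)  # about 1000 iterations for these values
```

With these illustrative values the implied memory is about 1000 optimizer steps, which is the quantity the scaling rules aim to hold fixed.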

Moreover, EMA regularization is widely understood to improve optimization trajectory stability, reduce oscillatory behavior, and drive the learned parameters to flatter optima. In theoretical analyses of quadratic models, the variance of EMA-averaged parameters is strictly lower than that of the plain optimizer iterates. Variants such as “Switch EMA” (SEMA), where model weights are periodically reset to their EMA value, can further accelerate convergence and favor flatter solutions by alternating fast exploration with slow averaging (Li et al., 2024).

3. Implementations: Parameter and Moment EMA

The simplest and most widespread use of EMA with AdamW is as a parameter averaging mechanism. The procedure is as follows:

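  • Initialize $\bar\theta_0 = \theta_0$.
  • After every AdamW update to $\theta_t$, update $\bar\theta_t = \alpha\,\bar\theta_{t-1} + (1-\alpha)\,\theta_t$.
  • For validation, and at test time, use $\bar\theta_t$ in place of $\theta_t$.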
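The three steps above can be sketched as a small helper that keeps the shadow copy next to the live weights. The class name `ParamEMA` is hypothetical, parameters are represented as a flat list of floats for simplicity, and the decay of 0.9 in the usage lines is only a toy value, not a recommendation.

```python
class ParamEMA:
    """Shadow copy of parameters updated as an exponential moving average."""

    def __init__(self, params, decay):
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must lie in (0, 1)")
        self.decay = decay
        self.shadow = list(params)  # initialize the shadow to the starting weights

    def update(self, params):
        """Call once after every optimizer step with the current weights."""
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
        return self.shadow


# Usage: the shadow trails the live weights and is what gets evaluated.
ema = ParamEMA([1.0, 2.0], decay=0.9)
live = [2.0, 2.0]            # weights after one optimizer step
eval_weights = ema.update(live)
```

In a real pipeline the same update would run over every tensor of the model, typically without gradient tracking.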

Typical $\alpha$ values fall in […], with no warmup or schedule required (Gorishniy et al., 16 Apr 2026). The computational cost is minimal: one vector scale-and-add per step, plus a single shadow copy of the parameters held alongside the model.

Recent contributions have proposed variants:

  • SEMA (“Switch EMA”): At the end of each training epoch, the current parameters $\theta_t$ are overwritten by their EMA, i.e., set $\theta_t \leftarrow \bar\theta_t$, then resume both regular and EMA updates as before (Li et al., 2024). This alternating scheme balances trajectory stability against the descent properties of AdamW.
  • AdEMAMix: Instead of operating solely on parameters, this method introduces a mixture of “fast” and “slow” EMA tracks of the raw gradients, which are bias-corrected and mixed before applying the usual AdamW step. This dual-EMA approach explicitly separates local responsiveness and long-term memory, capturing both immediate and distant past gradient information (Pagliardini et al., 2024).
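The switching step of SEMA can be sketched as a training loop that copies the shadow weights back into the live weights at each epoch boundary. This is a toy illustration: `train_with_sema` and `step_fn` are hypothetical names, and `step_fn` stands in for one optimizer update (AdamW in the setting described here) that owns its own moment state, so the switch never touches that state.

```python
def train_with_sema(theta, step_fn, steps_per_epoch, epochs, decay):
    """Run `epochs` epochs; after each one, overwrite live weights with their EMA."""
    shadow = list(theta)
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            theta = step_fn(theta)  # one optimizer update on the live weights
            shadow = [decay * s + (1 - decay) * p for s, p in zip(shadow, theta)]
        theta = list(shadow)        # the switch: theta <- EMA, then training resumes
    return theta, shadow


def toy_step(theta):
    """Stand-in update: gradient descent on 0.5 * x**2 with step size 0.1."""
    return [x - 0.1 * x for x in theta]


final_theta, final_shadow = train_with_sema(
    [1.0], toy_step, steps_per_epoch=2, epochs=1, decay=0.5)
```

Dropping the line marked as the switch recovers plain parameter EMA, where the shadow is only read at evaluation time.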

4. Empirical Results Across Domains

Extensive empirical studies substantiate the benefit of AdamW with EMA variants:

  • Tabular MLPs: AdamW plus EMA (with $\alpha$ in […]) yielded a +0.66% relative improvement in unified score across 17 datasets, matching or beating baseline AdamW in 16/17 cases. Gains are clearest for vanilla MLPs, diminishing for deep ensembles or models with feature embeddings (Gorishniy et al., 16 Apr 2026).
  • Vision Classification: On ImageNet-1K (DeiT-S, 300 epochs), top-1 accuracy improved from 80.0% (AdamW) to 80.2% (AdamW+EMA) to 80.6% (AdamW+SEMA). On CIFAR-100 (ResNet-18, 200 epochs): 76.91% (AdamW), 77.16% (+EMA), 77.61% (+SEMA) (Li et al., 2024).
  • Large-Scale Language Modeling: AdEMAMix, a dual-EMA modification of AdamW, enabled a 1.3B-parameter LLM trained on 101B tokens to match the loss of an AdamW baseline trained on 197B tokens (the baseline needs ≈95% more tokens). Similarly, rapid loss reductions were observed in smaller-scale Transformers, Mamba models, and on a variety of in-context evaluation tasks (Pagliardini et al., 2024).
  • Forgetting Experiments: AdEMAMix preserved injected batch-dependent gradient information over tens of thousands of steps, in contrast to AdamW, which forgets such information rapidly (Pagliardini et al., 2024).

5. Methodological Details and Hyperparameter Tuning

Standard AdamW with EMA

  • Parameter EMA: Update after every optimizer step. Decay rate $\alpha$ typically in […]. No warmup or schedule required; start from step 0.
  • Switch EMA (SEMA): Copy EMA weights to the model after every epoch. Do not reset optimizer moments.
  • Checkpointing/Evaluation: Always use the EMA-averaged parameters.
  • Cost: One additional parameter copy; negligible compute overhead (Gorishniy et al., 16 Apr 2026, Li et al., 2024).

AdEMAMix

  • Dual-gradient EMA: Maintain two gradient EMAs, a fast and a slow branch, each with its own decay, then mix after bias correction.
  • Mixing coefficient: Typical values […]–[…], linearly scheduled during training.
  • Warmup scheduling: Gradually ramp the mixing coefficient and the slow-branch decay from initial values to their final values over a warmup period to avoid instability.
  • Memory: One additional gradient buffer. Can drop the fast EMA entirely if memory is a concern, reducing to AdamW cost but with small added noise (Pagliardini et al., 2024).
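A dual-EMA step along these lines might look like the sketch below. It is an illustration under stated assumptions rather than the published algorithm: the names `ademamix_step`, `init_state`, `beta3`, `mix`, and `warmup_steps` and all default values are placeholders for this example, only the mixing coefficient is ramped (linearly) while the slow decay is held fixed, and only the fast EMA is bias-corrected. The exact correction and schedules should be checked against Pagliardini et al. (2024).

```python
import math


def init_state(n):
    """Optimizer state: step counter, fast and slow gradient EMAs, second moment."""
    return {"t": 0, "m_fast": [0.0] * n, "m_slow": [0.0] * n, "v": [0.0] * n}


def ademamix_step(theta, state, grad, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, mix=5.0, warmup_steps=1000, eps=1e-8,
                  weight_decay=0.0):
    """One step mixing a fast and a slow gradient EMA (all defaults are assumptions)."""
    state["t"] += 1
    t = state["t"]
    mix_t = mix * min(1.0, t / warmup_steps)  # linear ramp of the mixing coefficient
    out = []
    for i, (th, g) in enumerate(zip(theta, grad)):
        state["m_fast"][i] = beta1 * state["m_fast"][i] + (1 - beta1) * g
        state["m_slow"][i] = beta3 * state["m_slow"][i] + (1 - beta3) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m_fast"][i] / (1 - beta1 ** t)  # fast branch, bias-corrected
        v_hat = state["v"][i] / (1 - beta2 ** t)
        mixed = m_hat + mix_t * state["m_slow"][i]      # add the slow branch
        out.append(th - lr * (mixed / (math.sqrt(v_hat) + eps) + weight_decay * th))
    return out


# Usage: with mix=0 the step reduces to a plain Adam/AdamW step.
state = init_state(1)
theta = ademamix_step([1.0], state, [2.0], lr=0.1, mix=0.0)
```

The slow buffer `m_slow` is the one extra gradient-sized array relative to AdamW in this sketch.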

Weight Decay Scaling

The equivalence of AdamW weight decay to an EMA timescale leads to the practical rule:

$$\lambda = \frac{1}{\eta\,\tau_{\text{epoch}}\,M}$$

where $\lambda$ is weight decay, $\eta$ the learning rate, $\tau_{\text{epoch}}$ the EMA timescale in epochs, and $M$ the number of gradient steps per epoch. Scaling recommendations are:

  • On $k\times$ larger datasets, reduce $\lambda$ by $k\times$.
  • For width scaling under $\mu$-Param, which rescales the learning rate with width, increase $\lambda$ by the factor by which $\eta$ shrinks to keep the timescale fixed (Wang et al., 2024).
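The scaling rule can be turned into a small calculator. The sketch assumes the relation takes the form "timescale in iterations = 1 / (learning rate × weight decay)", divided by the steps per epoch to express it in epochs; the function names and the numeric inputs are hypothetical and chosen only to show the dataset-size scaling.

```python
def weight_decay_for_timescale(lr, tau_epochs, steps_per_epoch):
    """Weight decay that gives an EMA timescale of `tau_epochs` epochs."""
    return 1.0 / (lr * tau_epochs * steps_per_epoch)


def timescale_in_epochs(lr, weight_decay, steps_per_epoch):
    """Inverse relation: EMA timescale implied by a given weight decay."""
    return 1.0 / (lr * weight_decay * steps_per_epoch)


# Same learning rate and target timescale, dataset (steps per epoch) grown 10x.
wd_small = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0, steps_per_epoch=1000)
wd_large = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0, steps_per_epoch=10000)
ratio = wd_small / wd_large  # close to 10: weight decay shrinks with dataset size
```

The same functions cover the width-scaling case: if the learning rate is divided by some factor, holding `tau_epochs` fixed multiplies the returned weight decay by that factor.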

6. Practical Integration and Implementation Considerations

AdamW with (parameter or moment) EMA is a drop-in extension for any pipeline. Key considerations are:

  • Maintain a shadow EMA buffer, updated each step (parameter EMA) or switch at epoch end (SEMA).
  • For AdEMAMix, add an additional EMA buffer for slow gradients and implement schedulers for mixing.
  • Always evaluate/checkpoint with the EMA parameter set.
  • Memory use: one extra full parameter or moment buffer. Compute overhead: per-step vector arithmetic is negligible compared with the forward and backward passes (Li et al., 2024).
  • No impact on optimizer state beyond the additional buffer; AdamW moments continue uninterrupted under SEMA.
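One way to follow the "evaluate and checkpoint with the EMA parameter set" point without disturbing training is to swap the shadow weights in temporarily. The sketch below uses a context manager over a mutable list of parameters; `use_ema_weights` is a hypothetical helper, and a framework-based pipeline would swap tensors or state dictionaries instead.

```python
from contextlib import contextmanager


@contextmanager
def use_ema_weights(model_params, shadow):
    """Temporarily load EMA weights into `model_params`, restoring them on exit."""
    backup = list(model_params)
    model_params[:] = shadow         # in-place swap so existing references see it
    try:
        yield model_params
    finally:
        model_params[:] = backup     # training continues from the live weights


# Usage: evaluate or save inside the block, train outside it.
params = [1.0, 2.0]
shadow = [0.5, 1.5]
with use_ema_weights(params, shadow) as eval_params:
    snapshot = list(eval_params)     # what would be evaluated or checkpointed
```

Under SEMA the restore step matters less at epoch boundaries, since the live weights are overwritten by the EMA there anyway.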

7. Key Empirical and Theoretical Takeaways

AdamW with EMA variants consistently improves optimization and generalization in both vision and language tasks, at negligible wall-clock or implementation cost. Parameter EMA yields systematic improvements for vanilla networks, while SEMA sharpens these benefits and accelerates convergence. The dual-gradient EMA design (AdEMAMix) further addresses the intrinsic trade-off in a single EMA tracker—namely, between rapid responsiveness to recent gradients and retention of distant past information. Careful analysis of EMA’s interaction with weight decay clarifies principled hyperparameter scaling as models and datasets grow. Across all variants, the central insight is that moving average mechanisms, when paired with adaptive optimizers and decoupled regularization, offer robust, theoretically motivated, and empirically validated paths to improved deep model training (Gorishniy et al., 16 Apr 2026, Li et al., 2024, Pagliardini et al., 2024, Wang et al., 2024).
