
AdEMAMix Optimizer: Dual Momentum for Deep Learning

Updated 19 March 2026
  • AdEMAMix is an optimizer for large-scale deep learning that maintains two distinct EMA buffers (fast and slow) to overcome the limitations of a single EMA.
  • It mixes a fast-decay buffer (β₁ ≈ 0.9) with a slow-decay buffer (β₃ ≈ 0.9999), matching AdamW's validation loss with roughly half the training tokens in language modeling.
  • Empirical studies demonstrate improved convergence rates, reduced model forgetting, and superior downstream performance in language and vision tasks with minimal overhead.

AdEMAMix is an optimizer designed for large-scale deep learning, augmenting the AdamW algorithm with a two-scale momentum mechanism. By maintaining distinct short-term (“fast”) and long-term (“slow”) exponential moving averages (EMAs) of past gradients and mixing them, AdEMAMix directly addresses a fundamental limitation of single-EMA optimizers. Empirical evaluation demonstrates that this approach yields substantially improved convergence rates, greater data efficiency, reduced model forgetting, and better downstream performance in high-iteration, large-data settings such as language modeling and image classification (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025).

1. Motivation and Formulation

Momentum-based algorithms such as Adam and AdamW accumulate past gradients via a single EMA with decay rate $\beta_1$. A small $\beta_1$ makes the buffer reactive to recent gradients at the expense of forgetting history; a large $\beta_1$ incorporates long-term information but responds sluggishly to new gradients. No single $\beta_1$ can achieve both rapid responsiveness and long memory simultaneously.

AdEMAMix addresses this by maintaining two EMA buffers:

  • $m_1$ (“fast”) with standard decay rate $\beta_1 \approx 0.9$, capturing current curvature.
  • $m_2$ (“slow”) with a much higher decay ($\beta_3 \gg \beta_1$, e.g., $\beta_3 = 0.9999$), retaining information over thousands of iterations.

The AdEMAMix update, with momentum buffers $m_1, m_2$ and second moment $v$:

$$\begin{aligned}
g^{(t)} &= \nabla_\theta \ell(\theta^{(t-1)}) \\
m_1^{(t)} &\gets \beta_1 m_1^{(t-1)} + (1 - \beta_1)\, g^{(t)} \\
m_2^{(t)} &\gets \beta_3^{(t)} m_2^{(t-1)} + (1 - \beta_3^{(t)})\, g^{(t)} \\
v^{(t)} &\gets \beta_2 v^{(t-1)} + (1 - \beta_2)\, (g^{(t)})^2 \\
\hat m_1^{(t)} &= m_1^{(t)} / (1 - \beta_1^t), \quad \hat v^{(t)} = v^{(t)} / (1 - \beta_2^t)
\end{aligned}$$

The parameter update, with decoupled weight decay $\lambda$, is:

$$\theta^{(t)} \gets \theta^{(t-1)} - \eta \left( \frac{\hat m_1^{(t)} + \alpha^{(t)} m_2^{(t)}}{\sqrt{\hat v^{(t)} + \epsilon}} + \lambda\, \theta^{(t-1)} \right)$$

Here, $\alpha^{(t)}$ controls the mix between the fast and slow EMAs; $\alpha$ and $\beta_3$ are ramped up via schedules for stability in early training (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025).
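A minimal NumPy sketch of one such step may clarify the update. This is illustrative only: the function name, the dict-based state, and the default hyperparameters (chosen from the ranges discussed below) are my own, not the authors' reference implementation.

```python
import numpy as np

def ademamix_step(theta, g, state, t, *, lr=1e-3, b1=0.9, b2=0.999,
                  b3=0.9999, alpha=8.0, eps=1e-8, wd=0.0):
    """One AdEMAMix update (t is 1-based). `state` holds the fast EMA m1,
    the slow EMA m2, and the second moment v."""
    m1 = b1 * state["m1"] + (1 - b1) * g     # fast EMA: reactive
    m2 = b3 * state["m2"] + (1 - b3) * g     # slow EMA: long memory, not bias-corrected
    v = b2 * state["v"] + (1 - b2) * g**2    # second moment
    m1_hat = m1 / (1 - b1**t)                # bias-correct the fast EMA only
    v_hat = v / (1 - b2**t)
    theta = theta - lr * ((m1_hat + alpha * m2) / np.sqrt(v_hat + eps)
                          + wd * theta)      # decoupled weight decay
    state.update(m1=m1, m2=m2, v=v)
    return theta, state
```

Note that only $m_1$ is bias-corrected; the slow buffer enters the numerator raw, which is why $\alpha$ and $\beta_3$ are scheduled rather than fixed from step one.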

2. Connections to Accelerated Gradient Methods

AdEMAMix operationalizes the accelerated stochastic gradient descent (SGD) template in the noise-dominated regime by decoupling the momentum coefficient from the weight on each new gradient. In the classical accelerated-SGD paradigm, the update is:

$$m_t = \beta_{a,t}\, m_{t-1} + g_t, \quad w_{t+1} = w_t - \eta_{a,t}\, m_t - \alpha_{a,t}\, g_t$$

For $\beta_{a,t} = 1 - k/t$ and $\alpha_{a,t} = \Theta(t)$, provably accelerated convergence, $O(1/t^2 + \sigma/\sqrt{t})$, is attained in the high-noise regime (Morwani et al., 4 Feb 2025).
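One step of this template can be sketched as follows; the constants k, eta, and c are illustrative placeholders, not values from the paper.

```python
def accelerated_sgd_step(w, m, g, t, *, k=3.0, eta=1e-3, c=1e-4):
    """One step of the accelerated-SGD template: momentum decay
    beta_{a,t} = 1 - k/t (clipped at 0 for early steps) and a direct
    gradient term with alpha_{a,t} = c * t, the Theta(t) choice."""
    beta_t = max(0.0, 1.0 - k / t)
    alpha_t = c * t
    m = beta_t * m + g               # theory-style momentum: no (1 - beta) factor
    w = w - eta * m - alpha_t * g    # momentum step plus weighted fresh gradient
    return w, m
```

The key structural point is that the weight on the fresh gradient ($\alpha_{a,t}$) grows independently of the momentum decay, which a single-EMA optimizer cannot express.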

By setting $\beta_3^{(t)} \approx 1 - 1/t$ and $\alpha^{(t)} \propto t$, AdEMAMix mirrors this two-scale mechanism. Unlike Adam, Lion, or MARS, AdEMAMix's explicit two-EMA mixture with independent scheduling more faithfully implements the theoretical acceleration template (Morwani et al., 4 Feb 2025).

3. Implementation, Hyperparameters, and Schedules

Key aspects of AdEMAMix implementation include:

  • Decay Rates: $\beta_1 \approx 0.9$ (fast), $\beta_2 \approx 0.999$ (second moment), $\beta_3 \in [0.999, 0.99999]$ (slow).
  • Mixing Weight $\alpha$: Governs the slow EMA's influence; effective values are $4 \le \alpha \le 10$. Both $\alpha$ and $\beta_3$ are ramped up over the total training window $T$ with linear or “half-life” schedules, preventing instability in early iterations.
  • Step Size $\eta$: Standard AdamW schedules apply (warmup + cosine/linear decay).
  • Gradient Clipping: Norm clipping remains beneficial.
  • Memory Overhead: Maintaining $m_2$ doubles the first-moment state, unless $\beta_1 = 0$ (in which case $m_1$ is dropped).
  • Bias Correction: Only $m_1$ receives bias correction; the slow buffer $m_2$ is slowly ramped and not bias-corrected, avoiding cold-start issues.

Pseudocode mirrors AdamW, with the addition of the $m_2$ update and the mixing step. In practice, failing to schedule large $\alpha$ or $\beta_3$ values can cause unstable jumps (“explosions”) early in training; appropriate scheduler design remedies these issues (Pagliardini et al., 2024).
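As an illustration, the warmup schedules described above might be implemented as follows. This is a sketch under stated assumptions: the linear ramp for $\alpha$ follows the paper's description, while the $\beta_3$ ramp linearly interpolates the EMA half-life, one plausible way to realize a “half-life” schedule rather than the authors' exact code.

```python
import math

def alpha_schedule(t, t_warmup, alpha_final):
    """Linearly ramp the mixing weight alpha from 0 to alpha_final."""
    return alpha_final * min(1.0, t / t_warmup)

def beta3_schedule(t, t_warmup, beta_start=0.9, beta_end=0.9999):
    """Ramp beta3 so the EMA half-life grows linearly from that of
    beta_start to that of beta_end (avoids an abrupt jump in memory length)."""
    def half_life(beta):
        return math.log(0.5) / math.log(beta)  # steps for a gradient's weight to halve
    frac = min(1.0, t / t_warmup)
    h = (1 - frac) * half_life(beta_start) + frac * half_life(beta_end)
    return 0.5 ** (1.0 / h)
```

Interpolating in half-life space rather than directly in $\beta_3$ matters because $\beta_3 = 0.999$ and $\beta_3 = 0.9999$ differ by only $10^{-4}$ numerically but by an order of magnitude in memory length.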

4. Empirical Performance and Comparative Studies

Large-scale experiments substantiate the advantages of AdEMAMix:

  • Language Modeling: On the RedPajama v2 corpus, Transformer models with 110M, 335M, and 1.3B parameters attain the same validation loss as AdamW-trained models using roughly half the number of tokens. For example, a 1.3B-parameter model trained with AdEMAMix on 101B tokens matches AdamW’s performance at 197B tokens (AdamW needs ~95% more tokens).
  • Generalization and Slower Forgetting: In “forgetting” experiments (injecting a held-out batch at time $t_B$), AdEMAMix retains information about the batch over thousands of steps, whereas AdamW forgets rapidly.
  • Architectural Breadth: AdEMAMix halves the steps to convergence for Mamba on FineWeb and yields lower train/test loss for ViTs trained on ImageNet-21k. No systematic benefit is observed where data is extremely scarce or overfitting dominates.
  • Training Overhead: The addition of $m_2$ increases training time per step by less than 5%, offset by substantial token efficiency.
  • Model Switching: Switching from AdamW to AdEMAMix mid-training (with $m_2 \gets 0$) immediately improves convergence, especially with earlier switching.
  • In-Context Performance: On zero/few-shot tasks (HellaSwag, ARC, MMLU, PubMedQA, RewardBench), AdEMAMix outperforms AdamW baselines, sometimes by several percentage points.

Table: Empirical Comparison in Language Modeling and Vision Tasks (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025)

| Task | AdEMAMix | AdamW | Notes |
|---|---|---|---|
| LM, 1.3B, 101B tokens | matches AdamW @ 197B | 197B tokens needed | AdamW needs ~95% more tokens |
| Mamba, FineWeb | halves steps | — | $\alpha = 8$, $\beta_3 = 0.9999$ |
| ViT, ImageNet-21k | lower loss | higher loss | large-scale, data-rich; AdEMAMix advantage |
| ViT, ImageNet-1k (scarce) | no improvement | — | overfitting regime |

5. Comparison with Related Optimizers

AdEMAMix’s closest comparators include AdamW (single EMA), Lion (coordinate-wise sign momentum), MARS (aggregated gradients), and schedule-free Adam variants. Unlike these, AdEMAMix maintains two independently scheduled momenta, matching the two-timescale dynamics required for accelerated SGD in stochastic regimes.

  • AdamW: Ties the current gradient’s weight to the momentum term; cannot separate fast/slow behavior.
  • Lion and MARS: Implement variants of acceleration or sign normalization, but lack explicit two-buffer mixtures or direct theoretical matching to accelerated SGD.
  • Schedule-Free AdamW: Employs a single momentum buffer with time-scheduled $\beta$, but lacks stable acceleration in large-batch regimes.

Ablation studies show that dropping $m_1$ (setting $\beta_1 = 0$) suffices in small-batch/noisy regimes, but in large-batch scenarios both buffers are required to retain the acceleration and stability properties (Morwani et al., 4 Feb 2025).

6. Extensions: Simplified-AdEMAMix

Building on the dual-EMA design, "Simplified-AdEMAMix" collapses both buffers into a single theory-style EMA. The update:

$$m^{(t)} = \beta_1^{(t)} m^{(t-1)} + g^{(t)}$$

$$\theta^{(t)} = \theta^{(t-1)} - \eta^{(t)} \frac{m^{(t)} + \alpha\, g^{(t)}}{\sqrt{\hat{\nu}^{(t)} + \epsilon}}$$

Here, $\beta_1^{(t)}$ is warmed toward $1 - 1/t$ and $\alpha$ can be fixed or zero. Empirically, in both small- and large-batch regimes, setting $\alpha = 0$ recovers the full performance of the original AdEMAMix (even at scale), eliminating the need for two buffers and reducing implementation complexity. This establishes that the efficacy of AdEMAMix lies in flexibly scheduled momentum, not strictly in the duplication of EMA state (Morwani et al., 4 Feb 2025).
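A minimal sketch of this single-buffer update, assuming the equations above (the function name and dict-based state are my own). Note the momentum accumulates without a $(1 - \beta_1)$ factor, so the learning rate must be chosen with the resulting larger numerator in mind.

```python
import numpy as np

def simplified_ademamix_step(theta, g, state, t, *, lr=1e-3, b2=0.999,
                             alpha=0.0, eps=1e-8):
    """One Simplified-AdEMAMix update (t is 1-based): a single momentum
    buffer with scheduled decay beta1(t) = 1 - 1/t, plus an optional
    direct-gradient term weighted by alpha (alpha = 0 is reported to suffice)."""
    b1_t = 1.0 - 1.0 / t                   # warmed toward 1
    m = b1_t * state["m"] + g              # theory-style EMA: no (1 - b1) factor
    v = b2 * state["v"] + (1 - b2) * g**2
    v_hat = v / (1 - b2**t)                # bias-correct the second moment
    theta = theta - lr * (m + alpha * g) / np.sqrt(v_hat + eps)
    state.update(m=m, v=v)
    return theta, state
```

Compared with the full AdEMAMix step, this drops one EMA buffer and the fast-EMA bias correction; only the $\beta_1^{(t)}$ schedule carries the two-timescale behavior.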

7. Limitations, Open Questions, and Research Directions

AdEMAMix excels in high-iteration, high-data regimes. Its strengths are less pronounced in low-iteration or distribution-shifted settings; in such cases, tuning $\beta_3$ downward or reverting to AdamW may be advisable. Maintaining both buffers increases memory overhead and can introduce early-training instability without proper scheduling. Theoretical questions remain regarding the trade-offs between noise accumulation, generalization, and the influence of alternative memory kernels (e.g., power-law decay). These observations motivate continued investigation of multi-timescale momentum mechanisms beyond EMAs and deeper analysis of their generalization properties in diverse learning regimes (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025).
