
AdEMAMix Optimizer: Dual Momentum for Deep Learning

Updated 19 March 2026
  • AdEMAMix is an optimizer for large-scale deep learning that maintains two distinct EMA buffers (fast and slow) to overcome the limitations of a single EMA.
  • It mixes a fast-decay buffer (β₁ ≈ 0.9) with a slow-decay buffer (β₃ ≈ 0.9999), matching AdamW's validation loss with roughly half the training tokens in language modeling.
  • Empirical studies demonstrate improved convergence rates, reduced model forgetting, and superior downstream performance in language and vision tasks with minimal overhead.

AdEMAMix is an optimizer designed for large-scale deep learning, augmenting the AdamW algorithm with a two-scale momentum mechanism. By maintaining distinct short-term (“fast”) and long-term (“slow”) exponential moving averages (EMAs) of past gradients and mixing them, AdEMAMix directly addresses a fundamental limitation of single-EMA optimizers. Empirical evaluation demonstrates that this approach yields substantially improved convergence rates, greater data efficiency, reduced model forgetting, and better downstream performance in high-iteration, large-data settings such as language modeling and image classification (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025).

1. Motivation and Formulation

Momentum-based algorithms such as Adam and AdamW accumulate past gradients via a single EMA with decay rate $\beta_1$. A small $\beta_1$ makes the buffer reactive to recent gradients at the expense of forgetting history; a large $\beta_1$ incorporates long-term information but responds sluggishly to new gradients. No single $\beta_1$ can achieve both rapid responsiveness and long memory simultaneously.

AdEMAMix addresses this by maintaining two EMA buffers:

  • $m_1$ (“fast”) with standard decay rate $\beta_1 \approx 0.9$, capturing current curvature.
  • $m_2$ (“slow”) with a much higher decay ($\beta_3 \gg \beta_1$, e.g., $\beta_3 = 0.9999$), retaining information over thousands of iterations.

The AdEMAMix update, with momentum buffers $m_1, m_2$ and second moment $v$:

$$\begin{aligned}
g^{(t)} &= \nabla_\theta \ell(\theta^{(t-1)}) \\
m_1^{(t)} &\gets \beta_1 m_1^{(t-1)} + (1 - \beta_1)\, g^{(t)} \\
m_2^{(t)} &\gets \beta_3^{(t)} m_2^{(t-1)} + (1 - \beta_3^{(t)})\, g^{(t)} \\
v^{(t)} &\gets \beta_2 v^{(t-1)} + (1 - \beta_2)\, (g^{(t)})^2 \\
\hat m_1^{(t)} &= m_1^{(t)} / (1 - \beta_1^t), \quad \hat v^{(t)} = v^{(t)} / (1 - \beta_2^t)
\end{aligned}$$

The parameter update, with decoupled weight decay $\lambda$, is:

$$\theta^{(t)} \gets \theta^{(t-1)} - \eta \left( \frac{\hat m_1^{(t)} + \alpha^{(t)} m_2^{(t)}}{\sqrt{\hat v^{(t)} + \epsilon}} + \lambda\, \theta^{(t-1)} \right)$$

Here, $\alpha^{(t)}$ controls the mix between the fast and slow EMAs; $\alpha$ and $\beta_3$ are ramped up via schedules for stability in early training (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025).
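A minimal NumPy sketch of one such step may clarify the update. This is illustrative only: the function name, the dict-based state, and the default hyperparameters (chosen from the ranges discussed below) are my own, not the authors' reference implementation.

```python
import numpy as np

def ademamix_step(theta, g, state, t, *, lr=1e-3, b1=0.9, b2=0.999,
                  b3=0.9999, alpha=8.0, eps=1e-8, wd=0.0):
    """One AdEMAMix update (t is 1-based). `state` holds the fast EMA m1,
    the slow EMA m2, and the second moment v."""
    m1 = b1 * state["m1"] + (1 - b1) * g     # fast EMA: reactive
    m2 = b3 * state["m2"] + (1 - b3) * g     # slow EMA: long memory, not bias-corrected
    v = b2 * state["v"] + (1 - b2) * g**2    # second moment
    m1_hat = m1 / (1 - b1**t)                # bias-correct the fast EMA only
    v_hat = v / (1 - b2**t)
    theta = theta - lr * ((m1_hat + alpha * m2) / np.sqrt(v_hat + eps)
                          + wd * theta)      # decoupled weight decay
    state.update(m1=m1, m2=m2, v=v)
    return theta, state
```

Note that only $m_1$ is bias-corrected; the slow buffer enters the numerator raw, which is why $\alpha$ and $\beta_3$ are scheduled rather than fixed from step one.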

2. Connections to Accelerated Gradient Methods

AdEMAMix operationalizes the accelerated stochastic gradient descent (SGD) template in the noise-dominated regime by decoupling the momentum coefficient from the weight on each new gradient. In the classical accelerated-SGD paradigm, the update is:

$$m_t = \beta_{a,t}\, m_{t-1} + g_t, \quad w_{t+1} = w_t - \eta_{a,t}\, m_t - \alpha_{a,t}\, g_t$$

For $\beta_{a,t} = 1 - k/t$ and $\alpha_{a,t} = \Theta(t)$, provably accelerated convergence, $O(1/t^2 + \sigma/\sqrt{t})$, is attained in the high-noise regime (Morwani et al., 4 Feb 2025).
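One step of this template can be sketched as follows; the constants k, eta, and c are illustrative placeholders, not values from the paper.

```python
def accelerated_sgd_step(w, m, g, t, *, k=3.0, eta=1e-3, c=1e-4):
    """One step of the accelerated-SGD template: momentum decay
    beta_{a,t} = 1 - k/t (clipped at 0 for early steps) and a direct
    gradient term with alpha_{a,t} = c * t, the Theta(t) choice."""
    beta_t = max(0.0, 1.0 - k / t)
    alpha_t = c * t
    m = beta_t * m + g               # theory-style momentum: no (1 - beta) factor
    w = w - eta * m - alpha_t * g    # momentum step plus weighted fresh gradient
    return w, m
```

The key structural point is that the weight on the fresh gradient ($\alpha_{a,t}$) grows independently of the momentum decay, which a single-EMA optimizer cannot express.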

By setting $\beta_3^{(t)} \approx 1 - 1/t$ and $\alpha^{(t)} \propto t$, AdEMAMix mirrors this two-scale mechanism. Unlike Adam, Lion, or MARS, AdEMAMix's explicit two-EMA mixture with independent scheduling more faithfully implements the theoretical acceleration template (Morwani et al., 4 Feb 2025).

3. Implementation, Hyperparameters, and Schedules

Key aspects of AdEMAMix implementation include:

  • Decay Rates: $\beta_1 \approx 0.9$ (fast), $\beta_2 \approx 0.999$ (second moment), $\beta_3 \in [0.999, 0.99999]$ (slow).
  • Mixing Weight $\alpha$: Governs the slow EMA's influence; effective values are $4 \le \alpha \le 10$. Both $\alpha$ and $\beta_3$ are ramped up over the total training window $T$ with linear or “half-life” schedules, preventing instability in early iterations.
  • Step Size $\eta$: Standard AdamW schedules apply (warmup + cosine/linear decay).
  • Gradient Clipping: Norm clipping remains beneficial.
  • Memory Overhead: Maintaining $m_2$ doubles the first-moment state, unless $\beta_1 = 0$ (in which case $m_1$ is dropped).
  • Bias Correction: Only $m_1$ receives bias correction; the slow buffer $m_2$ is slowly ramped and not bias-corrected, avoiding cold-start issues.

Pseudocode mirrors AdamW, with the addition of the $m_2$ update and the mixing step. In practice, failing to schedule large $\alpha$ or $\beta_3$ values can cause unstable jumps (“explosions”) early in training; appropriate scheduler design remedies these issues (Pagliardini et al., 2024).
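As an illustration, the warmup schedules described above might be implemented as follows. This is a sketch under stated assumptions: the linear ramp for $\alpha$ follows the paper's description, while the $\beta_3$ ramp linearly interpolates the EMA half-life, one plausible way to realize a “half-life” schedule rather than the authors' exact code.

```python
import math

def alpha_schedule(t, t_warmup, alpha_final):
    """Linearly ramp the mixing weight alpha from 0 to alpha_final."""
    return alpha_final * min(1.0, t / t_warmup)

def beta3_schedule(t, t_warmup, beta_start=0.9, beta_end=0.9999):
    """Ramp beta3 so the EMA half-life grows linearly from that of
    beta_start to that of beta_end (avoids an abrupt jump in memory length)."""
    def half_life(beta):
        return math.log(0.5) / math.log(beta)  # steps for a gradient's weight to halve
    frac = min(1.0, t / t_warmup)
    h = (1 - frac) * half_life(beta_start) + frac * half_life(beta_end)
    return 0.5 ** (1.0 / h)
```

Interpolating in half-life space rather than directly in $\beta_3$ matters because $\beta_3 = 0.999$ and $\beta_3 = 0.9999$ differ by only $10^{-4}$ numerically but by an order of magnitude in memory length.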

4. Empirical Performance and Comparative Studies

Large-scale experiments substantiate the advantages of AdEMAMix:

  • Language Modeling: On the RedPajama v2 corpus, Transformer models with 110M, 335M, and 1.3B parameters attain the same validation loss as AdamW-trained models using roughly half the number of tokens. For example, a 1.3B-parameter model trained with AdEMAMix on 101B tokens matches AdamW’s performance at 197B tokens (AdamW needs ~95% more tokens).
  • Generalization and Slower Forgetting: In “forgetting” experiments (injecting a held-out batch at time $t_B$), AdEMAMix retains information about the batch over thousands of steps, whereas AdamW forgets rapidly.
  • Architectural Breadth: AdEMAMix halves the steps to convergence for Mamba on FineWeb and yields lower train/test loss for ViTs trained on ImageNet-21k. No systematic benefit is observed where data is extremely scarce or overfitting dominates.
  • Training Overhead: The addition of $m_2$ increases training time per step by less than 5%, offset by substantial token efficiency.
  • Model Switching: Switching from AdamW to AdEMAMix mid-training (with $m_2 \gets 0$) immediately improves convergence, especially with earlier switching.
  • In-Context Performance: On zero/few-shot tasks (HellaSwag, ARC, MMLU, PubMedQA, RewardBench), AdEMAMix outperforms AdamW baselines, sometimes by several percentage points.

Table: Empirical Comparison in Language Modeling and Vision Tasks (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025)

| Task | AdEMAMix | AdamW | Notes |
|---|---|---|---|
| LM, 1.3B, 101B tokens | matches AdamW @ 197B | 197B tokens needed | AdamW needs ~95% more tokens |
| Mamba, FineWeb | halves steps | — | $\alpha = 8$, $\beta_3 = 0.9999$ |
| ViT, ImageNet-21k | lower loss | higher loss | large-scale, data-rich; AdEMAMix advantage |
| ViT, ImageNet-1k (scarce) | no improvement | — | overfitting regime |

5. Comparison with Related Optimizers

AdEMAMix’s closest comparators include AdamW (single EMA), Lion (coordinate-wise sign momentum), MARS (aggregated gradients), and schedule-free Adam variants. Unlike these, AdEMAMix maintains two independently scheduled momenta, matching the two-timescale dynamics required for accelerated SGD in stochastic regimes.

  • AdamW: Ties the current gradient’s weight to the momentum term; cannot separate fast/slow behavior.
  • Lion and MARS: Implement variants of acceleration or sign normalization, but lack explicit two-buffer mixtures or direct theoretical matching to accelerated SGD.
  • Schedule-Free AdamW: Employs a single momentum buffer with time-scheduled $\beta$, but lacks stable acceleration in large-batch regimes.

Ablation studies show that dropping $m_1$ (setting $\beta_1 = 0$) suffices in small-batch/noisy regimes, but in large-batch scenarios both buffers are required to retain the acceleration and stability properties (Morwani et al., 4 Feb 2025).

6. Extensions: Simplified-AdEMAMix

Building on the dual-EMA design, "Simplified-AdEMAMix" collapses both buffers into a single theory-style EMA. The update:

$$m^{(t)} = \beta_1^{(t)} m^{(t-1)} + g^{(t)}$$

$$\theta^{(t)} = \theta^{(t-1)} - \eta^{(t)} \frac{m^{(t)} + \alpha\, g^{(t)}}{\sqrt{\hat{\nu}^{(t)} + \epsilon}}$$

Here, $\beta_1^{(t)}$ is warmed toward $1 - 1/t$ and $\alpha$ can be fixed or zero. Empirically, in both small- and large-batch regimes, setting $\alpha = 0$ recovers the full performance of the original AdEMAMix (even at scale), eliminating the need for two buffers and reducing implementation complexity. This establishes that the efficacy of AdEMAMix lies in flexibly scheduled momentum, not strictly in the duplication of EMA state (Morwani et al., 4 Feb 2025).
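A minimal sketch of this single-buffer update, assuming the equations above (the function name and dict-based state are my own). Note the momentum accumulates without a $(1 - \beta_1)$ factor, so the learning rate must be chosen with the resulting larger numerator in mind.

```python
import numpy as np

def simplified_ademamix_step(theta, g, state, t, *, lr=1e-3, b2=0.999,
                             alpha=0.0, eps=1e-8):
    """One Simplified-AdEMAMix update (t is 1-based): a single momentum
    buffer with scheduled decay beta1(t) = 1 - 1/t, plus an optional
    direct-gradient term weighted by alpha (alpha = 0 is reported to suffice)."""
    b1_t = 1.0 - 1.0 / t                   # warmed toward 1
    m = b1_t * state["m"] + g              # theory-style EMA: no (1 - b1) factor
    v = b2 * state["v"] + (1 - b2) * g**2
    v_hat = v / (1 - b2**t)                # bias-correct the second moment
    theta = theta - lr * (m + alpha * g) / np.sqrt(v_hat + eps)
    state.update(m=m, v=v)
    return theta, state
```

Compared with the full AdEMAMix step, this drops one EMA buffer and the fast-EMA bias correction; only the $\beta_1^{(t)}$ schedule carries the two-timescale behavior.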

7. Limitations, Open Questions, and Research Directions

AdEMAMix excels in high-iteration, high-data regimes. Its strengths are less pronounced in low-iteration or distribution-shifted settings; in such cases, tuning $\beta_3$ downward or reverting to AdamW may be advisable. Maintaining both buffers increases memory overhead and can introduce early-training instability without proper scheduling. Theoretical questions remain regarding the trade-offs between noise accumulation, generalization, and the influence of alternative memory kernels (e.g., power-law decay). These observations motivate continued investigation of multi-timescale momentum mechanisms beyond EMAs and deeper analysis of their generalization properties in diverse learning regimes (Pagliardini et al., 2024, Morwani et al., 4 Feb 2025).
