Adam with Model EMA Optimization

Updated 11 March 2026
  • Adam with Model EMA is a method that combines adaptive gradient updates with an exponential moving average to enhance optimizer stability and convergence in nonconvex settings.
  • The approach utilizes coordinate-wise clipping and adaptive scaling to achieve minimax-optimal rates even under heterogeneous noise and gradient conditions.
  • Theoretical analysis shows that a clipped Adam update combined with model EMA outperforms vanilla Adam and SGD, with convergence guarantees that hold for the deterministic EMA output rather than a randomly selected iterate.

Adam with Model Exponential Moving Average (EMA) refers to the combination of the Adam optimizer—a widely used adaptive gradient method in stochastic optimization for deep learning—with an exponential moving average applied directly to model parameters. This pairing, especially in nonconvex settings, merges the benefits of adaptivity, momentum, clipping, and online-to-batch conversion through model weight averaging, achieving minimax-optimal convergence rates for smooth and nonsmooth problems. Recent theoretical advances have established that employing a clipped version of Adam alongside a model EMA enables optimal convergence guarantees in both global and coordinate-wise nonconvex settings, and is superior to vanilla Adam or SGD under heterogeneous coordinate scales (Ahn et al., 2024). The EMA component can itself be understood via a physical analogy to damped harmonic motion, motivating further algorithmic generalizations and improved parameter schedules (Patsenker et al., 2023).

1. Algorithmic Foundations

The joint method is designed for stochastic nonconvex optimization. Let $F:\mathbb{R}^d \to \mathbb{R}$ be the objective, with stochastic gradients $g_t \approx \nabla F(x_t)$. The optimization proceeds by alternating between clipped Adam updates and maintenance of a model EMA shadow sequence.

Clipped Adam Update (coordinate-wise):

  • Exponential gradient and squared-gradient averages:

$m_{t,i} = \sum_{s=1}^t \beta_1^{t-s} g_{s,i}, \quad v_{t,i} = \sqrt{\sum_{s=1}^t \beta_2^{t-s} g_{s,i}^2}$

  • Scalar clip operator:

$\operatorname{clip}_D(u) = \min\Big(\frac{D}{|u|}, 1\Big)\, u$

  • Parameter update with smoothing constant $\varepsilon>0$:

$z_{t,i} = -\operatorname{clip}_D\Big(D\,\frac{m_{t-1,i}}{v_{t-1,i}+\varepsilon}\Big), \quad x_t = x_{t-1} + z_t$
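A minimal sketch of the clip operator and one coordinate-wise clipped Adam step, following the update above (the function names, the step interface, and the choice to pass the gradient in as an argument are illustrative, not from the source):

```python
import math

def clip_D(u, D):
    """Scalar clip operator: shrinks u so that |clip_D(u)| <= D."""
    if u == 0:
        return 0.0
    return min(D / abs(u), 1.0) * u

def clipped_adam_step(x, m, v2, g, beta1=0.9, beta2=0.81, D=1.0, eps=1e-8):
    """One coordinate-wise clipped Adam step.

    m holds the discounted gradient sums m_{t-1,i}; v2 holds the
    discounted sums of squared gradients, so v = sqrt(v2) matches the
    definition above. Returns updated (x, m, v2).
    """
    d = len(x)
    x_new = list(x)
    for i in range(d):
        v = math.sqrt(v2[i])
        # z_{t,i} = -clip_D(D * m_{t-1,i} / (v_{t-1,i} + eps))
        z = -clip_D(D * m[i] / (v + eps), D)
        x_new[i] = x[i] + z
    # refresh the discounted moment accumulators with the new gradient g
    m_new = [beta1 * m[i] + g[i] for i in range(d)]
    v2_new = [beta2 * v2[i] + g[i] ** 2 for i in range(d)]
    return x_new, m_new, v2_new
```

The default `beta2 = 0.81` follows the theory-favored choice $\beta_2 = \beta_1^2$ noted below.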

Model Exponential Moving Average:

  • EMA parameter with discount $\beta$:

$\tilde x_t = \frac{\beta-\beta^t}{1-\beta^t}\,\tilde x_{t-1} + \frac{1-\beta}{1-\beta^t}\, x_t, \quad \tilde x_0 = x_0$

Or, equivalently,

$\tilde x_t = \frac{1-\beta}{1-\beta^t} \sum_{s=1}^t \beta^{t-s} x_s$

At algorithm termination, $\tilde x_T$ is returned as the solution.
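The recursive and closed-form EMA expressions above are equivalent, which is easy to verify numerically; a small sketch (the iterate sequence is synthetic):

```python
def ema_recursive(xs, beta):
    """Debiased model EMA via the recursion above; xs = [x_1, ..., x_T]."""
    x_tilde = None
    for t, x in enumerate(xs, start=1):
        if t == 1:
            x_tilde = x  # at t=1 the recursion reduces to x_tilde_1 = x_1
        else:
            a = (beta - beta ** t) / (1 - beta ** t)
            b = (1 - beta) / (1 - beta ** t)
            x_tilde = a * x_tilde + b * x
    return x_tilde

def ema_closed_form(xs, beta):
    """Equivalent weighted average: (1-beta)/(1-beta^T) * sum_s beta^(T-s) x_s."""
    T = len(xs)
    s = sum(beta ** (T - t) * x for t, x in enumerate(xs, start=1))
    return (1 - beta) / (1 - beta ** T) * s
```

Both forms agree to floating-point precision on any sequence, and the normalization $1-\beta^t$ removes the initialization bias of a plain EMA.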

In practice, parameters are typically chosen as $\beta_1 \approx 0.9$, $\beta_2 \approx 0.999$ (with theory favoring $\beta_2 = \beta_1^2$), small $\varepsilon$, and a large EMA decay $\beta$ (Ahn et al., 2024).

2. Convergence Theory and Optimality

Convergence is quantified using a generalized $(\lambda,\epsilon)$-Goldstein stationarity notion:

$\inf_{p:\, \mathbb{E}_{y \sim p}[y] = x} \big\{ \mathbb{E}_{y \sim p}\|\nabla F(y)\| + \lambda\, \mathbb{E}_{y \sim p}\|y-x\|^2 \big\} \leq \epsilon$

2.1 Global Nonconvex Convergence

Under $G$-Lipschitz $F$, stochastic gradients with variance $\leq \sigma^2$, and initial gap $\Delta = F(x_0) - \inf F$, the algorithm achieves:

  • With $\epsilon \approx G+\sigma$:

$T = O\Big((G+\sigma)^2\,\Delta\,\epsilon^{-7/2}\Big)$

This rate is minimax-optimal for smooth and nonsmooth nonconvex problems (Ahn et al., 2024).
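To get a feel for the $\epsilon^{-7/2}$ dependence, a small arithmetic sketch (the hidden constant is set to 1 and all values are illustrative):

```python
def step_budget(G, sigma, Delta, eps):
    """T = O((G + sigma)^2 * Delta * eps^(-7/2)), with the constant set to 1."""
    return (G + sigma) ** 2 * Delta * eps ** (-3.5)

# Halving the target accuracy multiplies the step budget by 2**3.5 ~ 11.3,
# independently of G, sigma, and Delta.
ratio = step_budget(1.0, 1.0, 1.0, 0.05) / step_budget(1.0, 1.0, 1.0, 0.1)
```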

2.2 Coordinate-wise Adaptive Rates

For per-coordinate Lipschitz and noise parameters $G=(G_i)$, $\sigma=(\sigma_i)$, the coordinate-wise result is:

  • Stationarity in $\ell_1$:

$T = O\Big(\|G+\sigma\|_1^2\,\Delta\,\sqrt{d}\,\epsilon^{-7/2}\Big)$

An $O(d)$ speedup arises if only a few coordinates are “hard” ($\|G+\sigma\|_1 \approx \|G+\sigma\|_2$).
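The $\ell_1$-versus-$\ell_2$ gap behind this speedup is easy to illustrate (the vectors below are made up): with one dominant “hard” coordinate the two norms nearly coincide, while with uniform scales $\|\cdot\|_1 = \sqrt d\,\|\cdot\|_2$.

```python
import math

def l1(v):
    return sum(abs(x) for x in v)

def l2(v):
    return math.sqrt(sum(x * x for x in v))

d = 100
sparse = [10.0] + [0.01] * (d - 1)   # one hard coordinate dominates
uniform = [1.0] * d                   # all coordinates equally hard

sparse_ratio = l1(sparse) / l2(sparse)      # close to 1
uniform_ratio = l1(uniform) / l2(uniform)   # exactly sqrt(d) = 10
```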

3. Comparative Analysis: EMA-Augmented Adam vs. Alternatives

| Optimizer | Requires Model EMA | Rate (Smooth Nonconvex) | Key Property / Limitation |
|---|---|---|---|
| Adam (no EMA) | No | $O(\epsilon^{-4})$ | Suboptimal; random iterate output |
| Clipped Adam + EMA | Yes | $O(\epsilon^{-7/2})$ | Optimal; deterministic EMA output |
| SGD + Polyak averaging | No | $O(\epsilon^{-7/2})$ | No coordinate adaptation |
| Adagrad (scale-free FTRL) | No | $O(\sqrt{\sum_t g_t^2})$ | Step-size tuning needed |

Without model EMA, Adam attains suboptimal rates, and analysis typically forces random iterate selection, which increases output variance and is rarely used in practice. SGD with averaging also reaches optimal rates, but lacks per-coordinate scaling, effectively slowing progress on heteroskedastic objectives. Adaptive methods such as scale-free FTRL can be integrated but still require additional tuning (Ahn et al., 2024).

4. Structural Roles: Momentum, Discounting, and Adaptivity

  • Momentum ($\beta_1$): Interpreted as a discount factor on the sequence of observed gradients, implementing an online-to-nonconvex reduction where each step minimizes a discounted history of linearized losses.
  • EMA Discounting ($\beta$): Both the loss-regret framework and the model EMA employ the same discount parameter to weight history, making the optimizer both responsive and stable in nonstationary regimes.
  • Coordinate-wise Adaptivity ($\beta_2$ and normalization): Each coordinate adaptively tunes its own effective step size via per-coordinate second-moment normalization, which is crucial when gradient or noise scales are nonuniform across coordinates and yields the dimension-dependent speedup.
  • Gradient Clipping ($D$): Adds robustness to occasional large-magnitude gradients, enforcing bounded update steps.

5. Physical and Dynamical Analogies for EMA

Model EMA can be viewed, in continuous time, as a first-order low-pass filter, governed by the ODE

$\frac{dw_{\text{ema}}}{dt} = \frac{w - w_{\text{ema}}}{\tau}$

with $\tau$ as the time constant set by the EMA decay (Patsenker et al., 2023). A refined analogy introduces the framework of damped harmonic motion: the EMA weights (“mass” $m_2$) are attracted to the model weights (“mass” $m_1$) via a spring ($k$) and experience damping ($c_2$). This analogy provides an interpretation for stability and smoothness in the optimization trajectory and inspires generalizations such as BELAY, where feedback between the EMA and the model further enhances stability and convergence, especially under large learning rates or ill-conditioning.
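Discretizing the low-pass-filter ODE above with a forward-Euler step $\Delta t$ recovers the standard EMA recursion with decay $\alpha = 1 - \Delta t/\tau$; a minimal sketch (the numeric values are illustrative):

```python
def ema_step(w_ema, w, alpha):
    """Standard (undebiased) EMA update with decay alpha."""
    return alpha * w_ema + (1 - alpha) * w

def euler_step(w_ema, w, dt, tau):
    """Forward-Euler discretization of dw_ema/dt = (w - w_ema)/tau."""
    return w_ema + dt * (w - w_ema) / tau

# With alpha = 1 - dt/tau the two updates coincide exactly.
dt, tau = 0.1, 2.0
alpha = 1 - dt / tau  # 0.95
```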

A direct implication is that the EMA decay parameter $\alpha$ should scale inversely with training length, ensuring the averaging horizon is appropriate:

$1-\alpha \propto 1/T$

where $T$ is the expected number of optimization steps (Patsenker et al., 2023).
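A sketch of this schedule (the proportionality constant `c` is an assumption for illustration, not a value from the source): setting $1-\alpha = c/T$ makes the effective averaging window $1/(1-\alpha) = T/c$ a fixed fraction of training, regardless of run length.

```python
def ema_decay_for_horizon(T, c=10.0):
    """Pick alpha so that 1 - alpha = c / T, i.e. the averaging
    window 1/(1 - alpha) covers roughly the last T/c steps."""
    return 1.0 - c / T

alpha_short = ema_decay_for_horizon(1_000)     # 0.99   for a short run
alpha_long = ema_decay_for_horizon(100_000)    # 0.9999 for a long run
```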

6. Proof Techniques

The convergence proof proceeds in three main steps (Ahn et al., 2024):

  1. Discounted-online-to-nonconvex reduction: Small discounted regret of the clipped Adam optimizer implies bounds on the stationarity of the EMA output, controlling both gradient norm and output variance.
  2. Scale-free FTRL regret bound: Using per-coordinate adaptive step sizes, the regret is bounded as $O(D\sqrt{\sum_t \|g_t\|^2})$ globally or coordinate-wise.
  3. Parameter balancing: Careful algebraic selection of discount, clipping, and total steps, balancing bias, stochastic variance, and regret, yields the minimax-optimal sample complexity.

7. Practical Implementation and Extensions

In practice, Adam with model EMA is simple to implement: maintain a second set of “shadow” parameters that are updated according to the EMA recursion after every Adam step. In frameworks such as PyTorch, this amounts to a single additional memory buffer and an update rule involving the current decay constant. BELAY extends this further by treating the model and the EMA as coupled but independently governed masses, offering additional robustness and tunable feedback.
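A framework-agnostic sketch of this shadow-parameter pattern in plain Python (the class name, parameter dict, and training loop are illustrative; in PyTorch the same update would be applied to each tensor in the model's `state_dict`):

```python
class ModelEMA:
    """Maintains a shadow copy of parameters, updated after each optimizer step."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # the single extra memory buffer: one shadow value per parameter
        self.shadow = {name: float(value) for name, value in params.items()}

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * value

# toy usage: "training" moves w while the shadow copy trails behind it
params = {"w": 0.0, "b": 1.0}
ema = ModelEMA(params, decay=0.9)
for step in range(3):
    params["w"] += 1.0  # stand-in for an Adam update
    ema.update(params)
```

At the end of training, evaluation would use `ema.shadow` in place of the raw parameters, matching the role of $\tilde x_T$ above.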

Theoretical insights indicate that the widespread convention of combining Adam with model EMA in large-scale machine learning pipelines is well justified, particularly for large, nonconvex, heteroskedastic optimization problems (Ahn et al., 2024; Patsenker et al., 2023).
