Adam with Model EMA Optimization
- Adam with Model EMA is a method that combines adaptive gradient updates with an exponential moving average to enhance optimizer stability and convergence in nonconvex settings.
- The approach utilizes coordinate-wise clipping and adaptive scaling to achieve minimax-optimal rates even under heterogeneous noise and gradient conditions.
- Theoretical analysis shows that a clipped Adam update combined with model EMA attains minimax-optimal rates that vanilla Adam and SGD do not, with guarantees for a deterministic (EMA) output rather than a randomly selected iterate.
Adam with Model Exponential Moving Average (EMA) refers to the combination of the Adam optimizer—a widely used adaptive gradient method in stochastic optimization for deep learning—with an exponential moving average applied directly to model parameters. This pairing, especially in nonconvex settings, merges the benefits of adaptivity, momentum, clipping, and online-to-batch conversion through model weight averaging, achieving minimax-optimal convergence rates for smooth and nonsmooth problems. Recent theoretical advances have established that employing a clipped version of Adam alongside a model EMA enables optimal convergence guarantees in both global and coordinate-wise nonconvex settings, and is superior to vanilla Adam or SGD under heterogeneous coordinate scales (Ahn et al., 2024). The EMA component can itself be understood via a physical analogy to damped harmonic motion, motivating further algorithmic generalizations and improved parameter schedules (Patsenker et al., 2023).
1. Algorithmic Foundations
The joint method is designed for stochastic nonconvex optimization. Let $f : \mathbb{R}^d \to \mathbb{R}$ be the objective, with stochastic gradients $g_t$ satisfying $\mathbb{E}[g_t \mid x_{t-1}] = \nabla f(x_{t-1})$. The optimization proceeds by alternating between clipped Adam updates and maintenance of a model EMA shadow sequence.
Clipped Adam Update (coordinate-wise):
- Exponential gradient and squared-gradient averages:
$m_{t,i} = \beta_1\, m_{t-1,i} + (1-\beta_1)\, g_{t,i}, \qquad v_{t,i} = \beta_2\, v_{t-1,i} + (1-\beta_2)\, g_{t,i}^2$
- Scalar clip operator:
$\mathrm{clip}_D(u) = \min\Big(\frac{D}{|u|}, 1\Big)\, u$
- Parameter update with smoothing constant $\varepsilon > 0$:
$z_{t,i} = -\mathrm{clip}_D\Big(D\,\frac{m_{t-1,i}}{\sqrt{v_{t-1,i}}+\varepsilon}\Big), \quad x_t = x_{t-1} + z_t$
Model Exponential Moving Average:
- EMA parameter $\bar{x}_t$ with discount $\beta \in (0,1)$:
$\bar{x}_t = \beta\,\bar{x}_{t-1} + (1-\beta)\, x_t$
Or, equivalently,
$\bar{x}_t = \beta^t\,\bar{x}_0 + (1-\beta)\sum_{s=1}^{t}\beta^{t-s}\, x_s$
At algorithm termination, $\bar{x}_T$ is returned as the solution.
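The equivalence of the recursive and unrolled forms can be checked numerically; a minimal sketch with hypothetical iterate values:

```python
# Verify that the recursive EMA equals its unrolled weighted-sum form.
beta = 0.9
xs = [1.0, 2.0, 4.0, 8.0, 16.0]  # hypothetical iterate values x_1..x_5
x_bar0 = 0.0                     # EMA initialization

# Recursive form: x_bar_t = beta * x_bar_{t-1} + (1 - beta) * x_t
x_bar = x_bar0
for x in xs:
    x_bar = beta * x_bar + (1 - beta) * x

# Unrolled form: x_bar_T = beta^T * x_bar_0 + (1 - beta) * sum_s beta^(T-s) * x_s
T = len(xs)
unrolled = beta**T * x_bar0 + (1 - beta) * sum(
    beta**(T - s) * xs[s - 1] for s in range(1, T + 1)
)

assert abs(x_bar - unrolled) < 1e-12
```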
In practice, parameters are typically chosen as $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (with theory favoring $\beta_2 = \beta_1^2$), small $\varepsilon$, and a large EMA decay $\beta$ close to $1$ (Ahn et al., 2024).
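The full loop can be sketched in plain Python; this is a toy instantiation under illustrative assumptions (quadratic objective, hand-picked $D$, $\beta_1$, and EMA decay), not the exact algorithm or constants of Ahn et al. (2024):

```python
import math
import random

def clip_D(u, D):
    # Scalar clip operator: clip_D(u) = min(D / |u|, 1) * u
    return min(D / abs(u), 1.0) * u if u != 0 else 0.0

def clipped_adam_ema(grad, x0, T=2000, D=0.5, beta1=0.9, beta=0.99, eps=1e-8):
    """Clipped Adam updates plus a model EMA; returns the EMA iterate."""
    d = len(x0)
    x, x_bar = list(x0), list(x0)   # iterate and its EMA "shadow" copy
    m, v = [0.0] * d, [0.0] * d     # gradient / squared-gradient averages
    beta2 = beta1 ** 2              # theory-favored coupling beta2 = beta1^2
    for _ in range(T):
        g = grad(x)
        for i in range(d):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            x[i] -= clip_D(D * m[i] / (math.sqrt(v[i]) + eps), D)
            x_bar[i] = beta * x_bar[i] + (1 - beta) * x[i]
    return x_bar

# Toy run: noisy gradients of f(x) = 0.5 * ||x||^2, minimized at the origin.
random.seed(0)
noisy_grad = lambda x: [xi + random.gauss(0.0, 1.0) for xi in x]
sol = clipped_adam_ema(noisy_grad, [5.0, -3.0])
```

Note the EMA is updated after every parameter step, so the returned solution is the deterministic shadow sequence rather than a randomly chosen iterate.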
2. Convergence Theory and Optimality
Convergence is quantified using a generalized $(\delta, \varepsilon)$-Goldstein stationarity notion: a point $x$ is $(\delta, \varepsilon)$-stationary if some convex combination of gradients taken at points within distance $\delta$ of $x$ has norm at most $\varepsilon$,
$\min_{g \,\in\, \mathrm{conv}\{\nabla f(y) \,:\, \|y - x\| \le \delta\}} \|g\| \le \varepsilon.$
2.1 Global Nonconvex Convergence
Under Lipschitz assumptions (Lipschitz gradients in the smooth case, a Lipschitz objective in the nonsmooth case), stochastic gradients with variance $\sigma^2$, and initial gap $\Delta = f(x_0) - \inf f$, the algorithm achieves:
- With appropriately tuned $D$, $\beta$, and horizon $T$: the EMA output $\bar{x}_T$ is $(\delta, \varepsilon)$-Goldstein stationary in expectation after $T = O(\delta^{-1}\varepsilon^{-3})$ stochastic gradient evaluations (suppressing the dependence on $\Delta$ and $\sigma$); choosing $\delta$ proportional to $\varepsilon$ recovers the familiar $O(\varepsilon^{-4})$ complexity for finding $\varepsilon$-stationary points of smooth objectives.
These rates are minimax-optimal for nonsmooth and smooth nonconvex problems, respectively (Ahn et al., 2024).
2.2 Coordinate-wise Adaptive Rates
For per-coordinate Lipschitz and noise parameters $L_i$ and $\sigma_i$, $i = 1, \dots, d$, the coordinate-wise result is:
- Stationarity measured per coordinate (an $\ell_1$-type Goldstein criterion), with total complexity governed by sums of the per-coordinate constants $L_i$ and $\sigma_i^2$ rather than by $d$ times their worst-case values.
A dimension-dependent speedup, up to a factor on the order of $d$, arises if only a few coordinates are "hard" (large $L_i$ or $\sigma_i$).
3. Comparative Analysis: EMA-Augmented Adam vs. Alternatives
| Optimizer | Requires Model EMA | Rate (Smooth Nonconvex) | Key Limitation |
|---|---|---|---|
| Adam (no EMA) | No | Suboptimal | Random iterate selection |
| Clipped Adam + EMA | Yes | Optimal; deterministic EMA output | — |
| SGD + Polyak averaging | No | Optimal | No coordinate adaptation |
| Adagrad (scale-free FTRL) | No | Optimal with tuning | Step-size tuning needed |
Without model EMA, Adam attains suboptimal rates, and analysis typically forces random iterate selection, which increases output variance and is rarely used in practice. SGD with averaging also reaches optimal rates, but lacks per-coordinate scaling, effectively slowing progress on heteroskedastic objectives. Adaptive methods such as scale-free FTRL can be integrated but still require additional tuning (Ahn et al., 2024).
4. Structural Roles: Momentum, Discounting, and Adaptivity
- Momentum ($\beta_1$): Interpreted as a discount factor on the sequence of observed gradients, implementing an online-to-nonconvex reduction where each step minimizes a discounted history of linearized losses.
- EMA Discounting ($\beta$): Both the loss-regret framework and the model EMA employ the same discount parameter to weight history, making the optimizer both responsive and stable in nonstationary or varying regimes.
- Coordinate-wise Adaptivity ($\beta_2$ and normalization): Each coordinate adaptively tunes its own effective step size via per-coordinate second-moment normalization, crucial when gradient or noise scales are nonuniform across coordinates, leading to dimension-dependent speedups.
- Gradient Clipping ($D$): Adds robustness to occasional large-magnitude gradients and enforces bounded per-step movement.
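The effect of second-moment normalization can be illustrated numerically: two coordinates whose raw gradient scales differ by a factor of 1000 end up with normalized update directions of comparable magnitude. A minimal sketch with made-up values:

```python
import math

# Repeated gradient observations for two coordinates differing 1000x in scale.
grads = [(100.0, 0.1)] * 50

beta1, eps = 0.9, 1e-8
beta2 = beta1 ** 2
m = [0.0, 0.0]   # EMA of gradients
v = [0.0, 0.0]   # EMA of squared gradients
for g in grads:
    for i in range(2):
        m[i] = beta1 * m[i] + (1 - beta1) * g[i]
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2

# Normalized directions m_i / (sqrt(v_i) + eps) are nearly identical in
# magnitude, despite the 1000x difference in raw gradient scale.
ratios = [m[i] / (math.sqrt(v[i]) + eps) for i in range(2)]
```

The scale of each coordinate's gradient cancels between numerator and denominator, which is exactly why heterogeneous coordinate scales do not slow the per-coordinate effective step size.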
5. Physical and Dynamical Analogies for EMA
Model EMA can be viewed, in continuous time, as a first-order low-pass filter, governed by the ODE
$\dot{\bar{x}}(t) = \frac{1}{\tau}\big(x(t) - \bar{x}(t)\big),$
with time constant $\tau$ set by the EMA decay (roughly $\tau \approx 1/(1-\beta)$ steps) (Patsenker et al., 2023). A refined analogy introduces the framework of damped harmonic motion: the EMA weights act as one mass attracted to the model weights (a second mass) through a spring force and subject to damping. This analogy provides an interpretation for stability and smoothness in the optimization trajectory and inspires generalizations such as BELAY, in which feedback between the EMA and the model further enhances stability and convergence, especially under large learning rates or ill-conditioning.
A direct implication is that the EMA decay parameter should scale inversely with training length, ensuring the averaging horizon is appropriate:
$1 - \beta \propto \frac{1}{T},$
where $T$ is the expected number of optimization steps (Patsenker et al., 2023).
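The low-pass-filter view can be sanity-checked numerically: an EMA of noisy observations of a constant suppresses the noise while tracking the mean. A small sketch with illustrative numbers:

```python
import random
import statistics

random.seed(1)

def ema_trace(signal, beta):
    """Run the EMA recursion over a signal and return the filtered trace."""
    out, x_bar = [], signal[0]
    for x in signal:
        x_bar = beta * x_bar + (1 - beta) * x
        out.append(x_bar)
    return out

# Noisy observations of the constant 1.0; the EMA acts as a low-pass filter
# with time constant roughly 1 / (1 - beta) = 100 steps.
signal = [1.0 + random.gauss(0.0, 0.5) for _ in range(5000)]
smooth = ema_trace(signal, beta=0.99)

# Compare dispersion after a warm-up period for the filter to settle.
raw_std = statistics.pstdev(signal[1000:])
ema_std = statistics.pstdev(smooth[1000:])
```

With $\beta = 0.99$ the filter averages over roughly 100 steps, so the residual fluctuation of the smoothed trace is an order of magnitude below that of the raw signal, while the mean is preserved.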
6. Proof Techniques
The convergence proof proceeds in three main steps (Ahn et al., 2024):
- Discounted-online-to-nonconvex reduction: Small discounted regret of the clipped Adam optimizer implies bounds on the stationarity of the EMA output, controlling both gradient norm and output variance.
- Scale-free FTRL regret bound: Using per-coordinate adaptive step sizes, the (discounted) regret is bounded on the order of $D\sqrt{\sum_t \|g_t\|^2}$ globally, or $\sum_i D\sqrt{\sum_t g_{t,i}^2}$ coordinate-wise.
- Parameter balancing: Careful algebraic selection of discount, clipping, and total steps, balancing bias, stochastic variance, and regret, yields the minimax-optimal sample complexity.
7. Practical Implementation and Extensions
In practice, Adam with model EMA is straightforward to implement: maintain a second set of "shadow" parameters updated by the EMA recursion after every Adam step. In frameworks such as PyTorch, this amounts to a single additional memory buffer and an update rule involving the current decay constant. BELAY extends this further by treating both the model and the EMA as coupled but independently governed masses, offering additional robustness and tunable feedback.
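A framework-agnostic sketch of this shadow-parameter pattern (the `ModelEMA` class and its methods are illustrative, not a specific library API):

```python
class ModelEMA:
    """Maintain a 'shadow' copy of parameters, updated after each optimizer step."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [float(p) for p in params]  # the single extra buffer

    def update(self, params):
        # EMA recursion: shadow <- decay * shadow + (1 - decay) * current weights
        d = self.decay
        for i, p in enumerate(params):
            self.shadow[i] = d * self.shadow[i] + (1 - d) * float(p)

# Usage: after every optimizer step, fold the new weights into the shadow copy.
weights = [0.0, 0.0]
ema = ModelEMA(weights, decay=0.9)
for step in range(1, 6):
    weights = [w + 1.0 for w in weights]  # stand-in for an Adam update
    ema.update(weights)
```

At evaluation or deployment time, `ema.shadow` is read out in place of the raw weights, mirroring the theoretical prescription of returning the EMA iterate.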
Theoretical insights indicate the widespread convention of using both Adam and model EMA in large-scale machine learning pipelines is well-justified, particularly in large, nonconvex, and heteroskedastic optimization problems (Ahn et al., 2024, Patsenker et al., 2023).