
Adam-Style Optimizers

Updated 7 March 2026
  • Adam-style optimizers are algorithms that use exponential moving averages (EMAs) of the gradient and its elementwise square to adapt per-parameter step sizes and smooth the update direction with momentum.
  • They extend the standard Adam method by incorporating rectification, hybrid preconditioning, and architecture-aware scaling to improve convergence.
  • Practical benefits include robust performance across diverse tasks, improved accuracy on benchmarks, and mitigation of dynamical artifacts such as the epochal sawtooth effect.

Adam-style optimizers represent a broad class of adaptive gradient-based optimization algorithms fundamentally rooted in the Exponential Moving Average (EMA) of first- and second-moment gradient statistics. Originating with Adam, these methods extend and generalize the exponential-moment paradigm to modulate step sizes and directions per coordinate, yielding robust convergence across diverse deep learning workloads. Recent works refine, hybridize, and analyze Adam-style schemes, introducing architectural awareness, variance rectification, hybrid preconditioning, learning-to-optimize frameworks, and even automatic discovery. This article synthesizes the current technical landscape on Adam-style optimizers, focusing on principles, mathematical formulations, theoretical properties, empirical findings, newer variants, and dynamical artifacts such as the epochal sawtooth effect.

1. Core Principles and Mathematical Foundations

Adam-style optimizers combine per-parameter step-size adaptation via EMA of squared gradients with momentum-based smoothing of the raw gradient. The canonical Adam update, excluding weight decay for brevity, is given by:

\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1 - \beta_1^{t}), \qquad \hat v_t = v_t / (1 - \beta_2^{t}) \\
\theta_{t+1} &= \theta_t - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
\end{aligned}

where $g_t = \nabla \ell(\theta_t)$ is the stochastic gradient, $\alpha$ is the base learning rate, $\beta_1, \beta_2 \in (0, 1)$ are the first- and second-moment decay parameters, and $\epsilon$ ensures numerical stability (Liu et al., 2024).
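As a concrete reference, the update above can be sketched in a few lines of NumPy (a minimal illustration, not a production optimizer):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update; returns the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g        # EMA of the gradient (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2   # EMA of the squared gradient (second moment)
    m_hat = m / (1 - beta1 ** t)           # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 for a few hundred steps.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Because the bias-corrected ratio $\hat m_t / \sqrt{\hat v_t}$ is close to $\pm 1$ when gradients are consistent, the per-step displacement is approximately $\alpha$, independent of the gradient's magnitude.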

Numerous variants alter this template, introducing rectification (RAdam), learning-to-optimize controls (HyperAdam, MADA), isometric or architectural preconditioning (IsoAdam, CaAdam), trend estimation (AdamT), alternative moment statistics (AdaBelief, AdaMomentum), modified batch selection (AdamCB), averaging (Averaged Adam), and alternative second-moment update forms (e.g. Padam, EAdam).

2. Extensions, Variants, and Adaptive Mechanisms

Rectified Adam (RAdam) and Lookahead

RAdam introduces a rectification factor $r_t$ correcting the effective learning rate in the early stages, when the variance estimate $\hat v_t$ is unreliable. For $\rho_t > 4$, the normalized step is scaled by $r_t$, where $\rho_t = \rho_\infty - \frac{2 t \beta_2^t}{1 - \beta_2^t}$ and $\rho_\infty = \frac{2}{1 - \beta_2} - 1$, ensuring stable adaptation. For $\rho_t \le 4$, RAdam falls back to momentum SGD (Pasechnyuk et al., 2023).
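The rectification logic can be sketched as follows, reproducing the $\rho_t$ and $\rho_\infty$ formulas above; the closed form of $r_t$ follows the original RAdam paper and should be checked against it:

```python
import math

def radam_rectifier(t, beta2=0.999):
    """Return (use_adaptive, r_t) for step t.

    When rho_t <= 4 the variance estimate is deemed unreliable and the
    optimizer falls back to momentum SGD (use_adaptive = False).
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return False, 0.0
    # Variance rectification factor (closed form from the RAdam paper).
    r_t = math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                    / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
    return True, r_t
```

At $t=1$ with the default $\beta_2$, $\rho_t \approx 1$, so the adaptive branch is disabled; for large $t$, $r_t$ approaches $1$ from below.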

Lookahead optimizers maintain "slow" weights that are periodically interpolated toward the "fast" weights every $k$ steps, smoothing the trajectory and damping optimizer oscillations; e.g., Lookahead(RAdam) yields state-of-the-art empirical results on software engineering tasks (Pasechnyuk et al., 2023).
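A minimal sketch of the Lookahead wrapper around an arbitrary inner step; the interpolation coefficient and $k$ below are illustrative defaults, not values from the cited study:

```python
import numpy as np

def lookahead(inner_step, theta0, k=5, slow_alpha=0.5, outer_iters=20):
    """Lookahead wrapper: run k 'fast' steps, then interpolate slow -> fast."""
    slow = theta0.copy()
    fast = theta0.copy()
    for _ in range(outer_iters):
        for _ in range(k):
            fast = inner_step(fast)                  # any inner optimizer step
        slow = slow + slow_alpha * (fast - slow)     # slow-weight interpolation
        fast = slow.copy()                           # restart fast weights
    return slow

# Inner step: plain gradient descent on f(x) = x^2.
theta = lookahead(lambda th: th - 0.1 * 2 * th, np.array([1.0]))
```

The inner optimizer is a black box here, which is why Lookahead composes cleanly with RAdam or any other Adam-style method.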

Architectural/Connection-aware Adaptation

CaAdam introduces proxy-based scaling factors $S$ for the learning rate, based on layer depth, connection counts, or structural parameters:

\theta_t = \theta_{t-1} - \alpha S\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}

where $S$ is computed per-parameter as a deterministic function of local connectivity or depth proxies (Genet et al., 2024). Depth-based scaling and connection-based scaling lead to improved convergence and better minima (up to $+4.1\%$ accuracy on CIFAR-10 vs. Adam).
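A toy sketch of this idea; the inverse-depth scaling function below is an illustrative assumption, not CaAdam's actual formula (see Genet et al., 2024 for the real scaling schemes):

```python
import numpy as np

def depth_scale(layer_idx, n_layers):
    """Illustrative depth proxy: shallower layers get a larger multiplier.
    This function is an assumption for demonstration, not CaAdam's S."""
    return 1.0 + (n_layers - layer_idx) / n_layers

def scaled_adam_update(theta, m_hat, v_hat, layer_idx, n_layers,
                       alpha=1e-3, eps=1e-8):
    """Adam update with a per-layer multiplicative scale S."""
    S = depth_scale(layer_idx, n_layers)
    return theta - alpha * S * m_hat / (np.sqrt(v_hat) + eps)
```

The key design point is that $S$ is deterministic and computed once from the architecture, so the overhead over plain Adam is a single multiply per parameter.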

Hybridization and Meta-optimization

Meta-adaptive algorithms (MADA) parameterize a convex hull of multiple base optimizers (Adam, AMSGrad, Yogi, Adan, Lion) and perform hyper-gradient descent on the optimizer's coefficients $q$ during training (Ozkara et al., 2024). Learning-to-optimize frameworks such as HyperAdam dynamically ensemble updates produced by parallel Adam sub-processes with task-adaptive decay rates, controlled via an RNN (Wang et al., 2018). Such methods consistently outperform fixed-form Adam schemes, especially under sub-optimal hyperparameter initializations or across task boundaries.

Full-dimension and Isometric Preconditioning

IsoAdam incorporates isometric preconditioning: it normalizes updates to make them invariant to arbitrary invertible linear transformations of the input or output. The update for a weight matrix WW is:

W \leftarrow W - \alpha\, L_X^{-1/2} M R_G^{-1/2} / (\sqrt{\hat V} + \varepsilon)

where $L_X$ and $R_G$ are empirical covariances of the input and gradient, respectively (Jackson, 2023). HVAdam introduces a hidden vector $v_t$ estimating the "valley floor" of the loss landscape and preconditions adaptively in the direction of this floor, interpolating between SGD-like and Adam-like regimes via an incremental delay update mechanism (Zhang et al., 25 Nov 2025).

Confidence-based and Adaptive Selection

CAdam introduces a confidence gating mechanism: updates are only applied to coordinates where the sign of the momentum $m_{t,i}$ matches that of the gradient $g_{t,i}$ ($c_{t,i} = 1$ if $m_{t,i} g_{t,i} > 0$, zero otherwise). This results in improved adaptation to distributional shifts and robustness against noise in online learning (Wang et al., 2024).
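The confidence gate reduces to a coordinate-wise sign-agreement mask, sketched here:

```python
import numpy as np

def confidence_mask(m, g):
    """c_i = 1 where momentum and gradient agree in sign, else 0."""
    return (m * g > 0).astype(m.dtype)

m = np.array([0.5, -0.2, 0.1])
g = np.array([1.0, 0.3, -0.4])
mask = confidence_mask(m, g)   # only the first coordinate agrees in sign
```

A CAdam-style step would then multiply Adam's per-coordinate update by this mask, freezing coordinates whose momentum disagrees with the current gradient.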

AdamCB fuses Adam updates with combinatorial semi-bandit batch selection: non-uniform batch selection weights are updated via exponential weights, provably accelerating expected regret convergence by a factor of $K^{1/4}$ (where $K$ is the batch size) compared to uniform sampling (Kim et al., 7 Dec 2025).
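A generic exponential-weights (Exp3-style) sampler in the spirit of AdamCB's batch selection; the reward definition and learning rate $\eta$ here are illustrative assumptions, not the estimator from the cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(w):
    """Sample a batch index with probability proportional to its weight."""
    p = w / w.sum()
    return int(rng.choice(len(w), p=p)), p

def update_weights(w, idx, reward, p, eta=0.1):
    """Importance-weighted exponential update for the chosen batch."""
    w = w.copy()
    w[idx] *= np.exp(eta * reward / p[idx])  # divide by p for unbiasedness
    return w

w = np.ones(4)                               # four candidate batches
idx, p = sample_batch(w)
w = update_weights(w, idx, reward=0.5, p=p)  # e.g. reward = observed loss drop
```

Batches that yield larger rewards are sampled more often in later epochs, which is the variance-reduction mechanism the bandit analysis exploits.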

3. Convergence, Scaling Laws, and Theoretical Guarantees

Adam-style optimizers (including their advanced variants) generally achieve $O(\sqrt{T})$ regret in the online convex setting, and $O(\ln T / \sqrt{T})$ rates in smooth nonconvex settings under mild assumptions (Wang et al., 2021, Ozkara et al., 2024, Zhang et al., 25 Nov 2025). Several advances further tighten or generalize these bounds:

  • LaProp separates parameter-space momentum from adaptivity, eliminating the classical $(1-\beta_1)/\sqrt{1-\beta_2}$ coupling and providing regret bounds that hold for all $\beta_1, \beta_2$ (Ziyin et al., 2020).
  • MADA proves that interpolating among Adam, AMSGrad, Yogi, and Adan can reduce error bounds up to constants, providing an advantage over fixed-rule schemes (Ozkara et al., 2024).
  • Higher-order time-stepping schemes (e.g., IMEX-Trapezoidal Adam) convert Adam's dynamics into an ODE and apply second-order implicit-explicit discretizations, yielding lower errors and greater stability in stiff regimes (Bhattacharjee et al., 2024).
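The LaProp decoupling in the first bullet can be sketched as follows: the gradient is normalized by the second-moment estimate before momentum is accumulated, so the momentum lives in step space rather than gradient space (bias corrections omitted for brevity):

```python
import numpy as np

def laprop_step(theta, g, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """LaProp-style step: momentum is accumulated over *normalized*
    gradients, so beta1 and beta2 decouple (sketch, no bias correction)."""
    v = beta2 * v + (1 - beta2) * g ** 2                  # adaptivity, as in Adam
    m = beta1 * m + (1 - beta1) * g / (np.sqrt(v) + eps)  # momentum in step space
    return theta - alpha * m, m, v

# Minimize f(theta) = theta^2.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for _ in range(100):
    theta, m, v = laprop_step(theta, 2 * theta, m, v)
```

Contrast with Adam, where momentum is accumulated on raw gradients and only divided by $\sqrt{\hat v_t}$ afterward; swapping the order is exactly what removes the coupling between the two decay rates.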

The scaling law between batch size $B$ and optimal learning rate $\eta^*$ is non-monotonic for sign-gradient-based Adam variants: $\eta^*(B)$ increases as $\sqrt{B}$ up to $B_{\rm noise}$ (the noise-critical batch size), reaches a maximum, then decreases. This "surge" behavior provides rigorous guidance for batch-size and learning-rate scheduling, distinct from the linear scaling rule for SGD (Li et al., 2024).

4. Empirical Behaviors and Modern Performance Benchmarks

Adam-style optimizers dominate both standard deep learning and emerging scientific computing tasks. Extensive empirical studies demonstrate:

  • Adam and its descendants (AdaBelief, Padam, EAdam) are among the fastest for initial objective value reduction but may overfit quickly and exhibit oscillatory validation metrics in certain settings (e.g., EMNIST) (Zhu et al., 2021).
  • RAdam and Lookahead(RAdam) consistently outperform Adam across code-related tasks, achieving median relative improvements exceeding $15\%$ in documentation generation and $10\%$ in method name generation (Pasechnyuk et al., 2023).
  • CaAdam, with multiplicative/depth-based scaling, accelerates convergence and achieves up to a $+5.97\%$ improvement in CIFAR-100 accuracy relative to Adam, and reduces regression RMSE and the number of training epochs to convergence (Genet et al., 2024).
  • Averaged Adam (Polyak–Ruppert or exponential moving averaging of iterates) consistently outperforms plain Adam, reducing test errors by $10$–$30\%$ and boosting generalization in scientific ML, PINNs, deep BSDEs, and image classification (Dereich et al., 10 Jan 2025).
  • Hybrid evolutionary designs (combining sign and adaptive-moment terms, e.g., $0.73\,\mathrm{sign}(g) + 3.63\, m/(\sqrt{v}+\epsilon)$) outperform standard Adam, achieving a $+7.7\%$ test-accuracy gain on CIFAR-10 (Marfinetz, 5 Dec 2025).
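Iterate averaging, as used by Averaged Adam above, is cheap to bolt onto any optimizer; a sketch using an exponential moving average of the parameters (the decay value is an illustrative assumption):

```python
import numpy as np

def update_average(theta_avg, theta, decay=0.999):
    """Exponential moving average of iterates; deploy theta_avg, not theta."""
    return decay * theta_avg + (1 - decay) * theta

# Oscillating iterates around 0: the average is far more stable.
theta_avg = np.zeros(1)
for t in range(2000):
    theta = np.array([(-1.0) ** t])
    theta_avg = update_average(theta_avg, theta)
```

The averaged parameters change nothing about the optimization itself; only evaluation and deployment read `theta_avg`, which is why the technique composes with any Adam-style method at negligible cost.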

5. Dynamical Artifacts: The Epochal Sawtooth Effect

A prominent phenomenon inherent to Adam-style optimizers is the "Epochal Sawtooth Effect" (ESE), identified as a sharp batchwise loss drop at the start of each epoch followed by a gradual intra-epoch increase, yielding a sawtooth-shaped loss trajectory (Liu et al., 2024). The ESE arises from:

  • Quadratic accumulation and intra-epoch growth of the second moment $v_t$, leading to shrinking effective step sizes.
  • An initial spike in the momentum $m_t$ due to minibatch overlap and data reshuffling, which decays within the epoch.
  • The interaction between these moments causes a sharp initial drop in loss (when the numerator is large and the denominator is small), transitioning to a shallower climb as $v_t$ grows.
  • Smaller $\beta_2$ amplifies the ESE amplitude; larger batch sizes and the absence of shuffling/epoch structure suppress the effect.
  • ESE is not a mere overfitting artifact but rather an optimizer-induced loss pattern, especially manifest in high-capacity models.

To mitigate ESE, increase $\beta_2$ toward $1$, reduce $\beta_1$ if necessary, use larger batch sizes, or avoid epoch-based data shuffling; conversely, ESE can be exploited by optimization heuristics such as cyclical learning rates or SWA (Liu et al., 2024).
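The shrinking-step mechanism behind ESE can be seen in a toy trace of the second moment: with constant-magnitude gradients, $v_t$ grows toward its fixed point over an "epoch", so the effective step $\alpha/(\sqrt{v_t}+\epsilon)$ shrinks (a simplified illustration of the dynamics, not a reproduction of the cited experiments):

```python
import numpy as np

alpha, beta2, eps = 1e-3, 0.99, 1e-8
v, eff_steps = 0.0, []
for t in range(100):                  # one "epoch" of unit-magnitude gradients
    g = 1.0
    v = beta2 * v + (1 - beta2) * g ** 2
    eff_steps.append(alpha / (np.sqrt(v) + eps))
# eff_steps shrinks monotonically as v climbs toward its fixed point g^2,
# mirroring the intra-epoch flattening of the loss curve.
```

With $\beta_2$ closer to $1$, $v_t$ climbs more slowly and the intra-epoch contrast in effective step size is smaller, which is exactly the mitigation suggested above.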

6. Practical Guidelines, Limitations, and Design Trade-offs

Practical recommendations, distilled from rigorous benchmarks and large-scale deployment studies, include:

  • Defaults: $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ work reliably, but advanced architectures or learning-to-optimize schemes may tune these dynamically.
  • For code-related deep learning, Lookahead(RAdam) is the recommended default; for image or scientific ML tasks, Averaged Adam or hybrid sign-moment optimizers extend performance margins.
  • Confidence-aware (CAdam) or full-dimension (HVAdam) optimizers excel in online learning with nonstationary or noisy data, as evidenced by consistent AUC and GMV improvements in large-scale industrial recommender deployments (Wang et al., 2024).
  • Architectural scaling (CaAdam), dynamic selection (MADA), and meta-learned adaptation (HyperAdam) indicate that hand-crafted fixed optimizers are being superseded by adaptive, context-aware, and even automatically evolved schemes.
  • Limitations: Many new designs require extra memory or compute per iteration (e.g., isometric covariance estimation, ring buffers, meta-optimizer state) or meta-training effort. Such costs are generally offset by improved early convergence, sharper minima, or wall-clock reductions in required training epochs.

7. Emerging Directions

Adam-style methods are now a platform on which a wide array of theoretical, algorithmic, and empirical innovations are constructed. Trends include:

  • Increasing emphasis on meta-adaptation and hybridization (MADA, evolved optimizers, HyperAdam) to supplant "one-size-fits-all" rule-based schemes.
  • The emergence of invariant or structure-aligned preconditioning (IsoAdam, HVAdam) for deep and highly parameterized networks.
  • Novel analyses of dynamic loss behaviors (ESE), and their mitigation or exploitation via optimizer configuration.
  • Growing impact of adaptive batch selection and non-i.i.d. data handling (AdamCB, CAdam) for variance reduction and rapid adaptation.
  • A continuing theme: the optimizer design space is vast, with evolutionary and learned-search approaches already producing variants that outperform hand-crafted algorithms in challenging settings (Marfinetz, 5 Dec 2025, Ozkara et al., 2024).

It is evident that Adam-style optimization continues to provide a fertile foundation for both technical advancement and practical deployment in modern machine learning. The field is characterized by rapid iteration, synergy between empirical discovery and mathematical insight, and by cross-pollination between theory, architecture, and optimizer design.
