
AdamW Optimization in Deep Learning

Updated 2 February 2026
  • AdamW optimization is an adaptive algorithm that decouples weight decay from gradient updates, enhancing stability and convergence in deep neural networks.
  • It maintains bias-corrected first and second moment estimates and applies weight decay as an explicit, decoupled shrinkage term, yielding robust performance in high-dimensional settings.
  • Empirical benchmarks demonstrate that AdamW outperforms traditional methods like Adam and SGD in vision, language, and molecular modeling tasks.

AdamW is a first-order adaptive optimization algorithm for stochastic objective functions in deep learning. It modifies the original Adam optimizer by "decoupling" weight decay from the gradient update, yielding improvements in both theoretical properties and empirical performance across vision, language, molecular modeling, and other high-dimensional domains. AdamW is the de facto default optimizer for training large neural architectures such as Transformers, ConvNeXt, LLaMA, and atomistic foundation models. Its canonical update decomposes the parameter change into an adaptive, per-coordinate gradient step based on running first and second moments, plus an explicit weight-decay (shrinkage) term applied independently of the gradient.

1. Algorithm and Mathematical Formalism

AdamW maintains first-moment and second-moment estimators for the gradient at each step. Let $\theta_t$ be the model parameters, $g_t = \nabla_\theta \mathcal{L}(\theta_t)$ the stochastic gradient, $\eta_t$ the learning rate, $\beta_1, \beta_2 \in [0,1)$ the moment decay factors, and $\lambda_{wd}$ the weight decay coefficient. The AdamW update is

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, (g_t \odot g_t) \\
\hat{m}_t &= m_t/(1-\beta_1^t), \quad \hat{v}_t = v_t/(1-\beta_2^t) \\
\theta_{t+1} &= \theta_t - \eta_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta_t \lambda_{wd}\, \theta_t
\end{aligned}
$$

where $\epsilon$ is a small constant for numerical stability. Bias correction of both moment estimates counteracts their zero initialization and is important for good performance, particularly in early iterations. Weight decay is applied "decoupled," i.e., as a separate shrinkage term rather than folded into $g_t$.

This decoupling is crucial: in classic Adam, weight decay is added to the gradient, resulting in dynamically scaled regularization that can be inappropriate for ill-conditioned directions. In AdamW, weight decay is an explicit isotropic shrinkage, preserving the spectral flattening effect of adaptive steps and acting as a trust-region regularizer (Liu et al., 5 Dec 2025).
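For concreteness, the update above can be sketched as a minimal pure-Python step function (an illustrative toy, not a production optimizer; all names and hyperparameter values here are choices of this sketch):

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW step on a list of scalar parameters.

    theta, g, m, v are equal-length lists; t is the 1-based step count.
    Weight decay is applied as a separate shrinkage term, not mixed
    into the gradient (the "decoupled" formulation).
    """
    new_theta = []
    for i in range(len(theta)):
        m[i] = beta1 * m[i] + (1 - beta1) * g[i]
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
        m_hat = m[i] / (1 - beta1 ** t)      # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)      # bias-corrected second moment
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(theta[i] - step - lr * wd * theta[i])  # decoupled decay
    return new_theta, m, v

# Minimize f(x) = x^2 starting from x = 5.0
theta, m, v = [5.0], [0.0], [0.0]
for t in range(1, 2001):
    g = [2 * theta[0]]                       # gradient of x^2
    theta, m, v = adamw_step(theta, g, m, v, t, lr=0.05)
print(theta[0])  # converges near 0
```

Note that the decay term uses the raw parameter $\theta_t$, not the preconditioned gradient; moving `lr * wd * theta[i]` inside `g[i]` would recover coupled Adam-$\ell_2$ behavior.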

2. Theoretical Properties and Convergence

AdamW exhibits robust convergence properties analogous to stochastic gradient descent. Recent work establishes that for deep learning tasks in dimension $d$ and over $K$ iterations, AdamW achieves

$$
\frac{1}{K} \sum_{k=1}^K \mathbb{E}\big[\|\nabla f(x^k)\|_1\big] \leq O\!\left( \frac{\sqrt{d}\, C}{K^{1/4}} \right)
$$

where $C$ matches the scaling in the optimal SGD rate. Empirical studies indicate that in high-dimensional neural networks, $\|\nabla f(x)\|_1 = \Theta(\sqrt{d})\, \|\nabla f(x)\|_2$, so the $\ell_1$ convergence guarantee matches the best-known SGD rate in the $\ell_2$ norm up to constants (Li et al., 17 May 2025). No bounded-gradient assumption is needed; only a finite second moment of the gradient noise is required.
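The $\|\cdot\|_1 = \Theta(\sqrt{d})\,\|\cdot\|_2$ relation is easy to check numerically under an idealized gradient model (i.i.d. standard-normal coordinates, an assumption of this sketch; for such vectors the ratio concentrates near $\sqrt{2d/\pi}$):

```python
import math
import random

random.seed(0)

def norm_ratio(d, trials=20):
    """Average ||g||_1 / ||g||_2 over i.i.d. standard-normal vectors."""
    total = 0.0
    for _ in range(trials):
        g = [random.gauss(0.0, 1.0) for _ in range(d)]
        l1 = sum(abs(x) for x in g)
        l2 = math.sqrt(sum(x * x for x in g))
        total += l1 / l2
    return total / trials

for d in (100, 10_000):
    # Normalized by sqrt(d), the ratio stays near sqrt(2/pi) ~ 0.798
    print(d, norm_ratio(d) / math.sqrt(d))
```

Real network gradients are neither i.i.d. nor Gaussian, but the cited empirical studies report the same $\sqrt{d}$ scaling in practice.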

Continuous-time formulations yield a principled ODE view of AdamW, where the weight decay term controls update magnitude without contaminating adaptive momentum or variance estimates, implying sharply bounded, stable updates for hyperparameter choices with $\beta_2 > \beta_1$ (Gould et al., 2024). These analyses provide strong guidance for optimal tuning and architectural design.

3. Implicit Bias, Scale-Freeness, and Objective Geometry

AdamW fundamentally differs from Adam-$\ell_2$ (coupled regularization) in its objective geometry. AdamW can be interpreted as an approximation to the proximal gradient method for the composite objective $F(x) = f(x) + \frac{\lambda}{2}\|x\|_2^2$:

$$
x_t \approx x_{t-1} - \eta_t \lambda x_{t-1} - \eta_t \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}
$$

The explicit shrinkage via weight decay restores scale-freeness: per-coordinate rescaling of gradients leaves the update invariant (assuming $\epsilon = 0$), in contrast to Adam-$\ell_2$, where regularization interacts pathologically with per-coordinate scaling (Zhuang et al., 2022). Empirically, AdamW excels in settings with multi-scale gradients (deep, unnormalized networks) where scale-free updates are critical to convergence.
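This invariance can be verified directly. The toy sketch below (illustrative gradient values and scale factor of its own choosing) computes the bias-corrected adaptive direction with $\epsilon = 0$: rescaling all gradients cancels out, whereas folding the decay term $\lambda\theta$ into the gradients, as Adam-$\ell_2$ does, breaks the cancellation:

```python
import math

def adaptive_step(g_seq, beta1=0.9, beta2=0.999, eps=0.0):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps)
    after processing a sequence of gradients."""
    m = v = 0.0
    for t, g in enumerate(g_seq, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)

grads = [0.3, 0.1, 0.4]      # illustrative per-step gradients
theta, lam = 2.0, 0.1        # illustrative parameter and decay strength

# Decoupled (AdamW): the adaptive part is scale-free; decay is added outside.
u1 = adaptive_step(grads)
u2 = adaptive_step([1000.0 * g for g in grads])
print(abs(u1 - u2) < 1e-12)  # True: the rescaling cancels exactly

# Coupled (Adam-l2): lam * theta enters the normalized statistics, so the
# same rescaling now changes the resulting direction.
c1 = adaptive_step([g + lam * theta for g in grads])
c2 = adaptive_step([1000.0 * g + lam * theta for g in grads])
print(abs(c1 - c2) > 1e-3)   # True: scale-freeness is lost
```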

Deterministic analysis shows that full-batch AdamW with a non-increasing step-size schedule whose cumulative sum diverges converges to a KKT point of the original objective under the $\ell_\infty$ constraint $\|\theta\|_\infty \leq 1/\lambda$ (Xie et al., 2024). This geometric constraint illuminates AdamW's implicit bias, suggesting robustness in high-curvature or poorly scaled problems.
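The $1/\lambda$ cap can be illustrated on a toy one-dimensional problem (an illustrative sketch, not the analysis of Xie et al.). A loss with constant gradient $-1$ would push $\theta$ upward without bound under plain gradient descent; under AdamW the normalized step has magnitude $\approx \eta$, so the fixed point satisfies $\eta = \eta \lambda \theta^*$, i.e., $\theta^* = 1/\lambda$:

```python
import math

def run_adamw(grad_fn, theta=0.0, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=0.5, steps=5000):
    """Run scalar AdamW and return the final parameter value."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps) + lr * wd * theta
    return theta

# Constant gradient -1 drives theta upward; decoupled decay with
# wd = 0.5 caps the trajectory near 1/wd = 2.0.
theta_final = run_adamw(lambda th: -1.0, wd=0.5)
print(theta_final)  # ~2.0
```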

4. Practical Performance, Empirical Benchmarks, and Task-Dependent Effects

AdamW exhibits strong empirical performance for large-scale vision, language, and scientific domains.

  • Vision (ViT, ConvNeXt): AdamW consistently outperforms SGD in fine-tuning, especially under distribution shift and for models with embedding-layer gradient outliers. On CLIP ViT-B/16, AdamW improves OOD accuracy by +8.1% over SGD. Freezing the embedding layer is a memory-efficient workaround that closes much of the gap for SGD, suggesting that AdamW's primary advantage lies in controlling large first-layer updates (Kumar et al., 2022).
  • Atomistic modeling: Empirical benchmarks across molecular, crystalline, and interfacial tasks show AdamW and ScheduleFree achieve the best force RMSE and physical observable fidelity. Decoupled decay yields superior curvature conditioning; post-training L-BFGS refinement enhances anisotropy correction (Liu et al., 5 Dec 2025).
  • Memory scaling: AdamW has higher per-parameter state (16 bytes) compared to SGD (8–12 bytes). APOLLO methods compress memory to near-SGD levels via random low-rank projection, matching or exceeding AdamW's curve on LLaMA-7B and LLaMA-13B at large batch sizes (Zhu et al., 2024).

AdamW also generally requires much smaller learning rates relative to SGD, and best practice employs cosine or linear warm-up schedules.
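A common warm-up plus cosine schedule can be written in a few lines (a generic sketch; `base_lr`, `warmup_steps`, and `min_lr` are illustrative choices, not values from the cited papers):

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warm-up followed by cosine decay, per training step."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine-decay the remaining steps from base_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total))          # small warm-up value
print(lr_at(99, total))         # peak: base_lr
print(lr_at(total - 1, total))  # near min_lr
```

The returned value would multiply $\eta_t$ in the update of Section 1; deep-learning frameworks ship equivalent schedulers, so this sketch is only for exposition.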

5. Variants and Extensions: Stability, Acceleration, and Augmented Updates

Several extensions improve on AdamW by targeting stability, convergence speed, and variance reduction:

  • Aida: Adds exponent parameters $(p, q)$ generalizing second-moment normalization, breaking the tight $\eta$–$\epsilon$ coupling for local stability. Empirically, setups such as $(p,q)=(1,2)$ outperform vanilla AdamW's $(2,1)$ on Transformer and Swin-Transformer benchmarks by $\sim 3\%$. Stability at the origin requires nonzero weight decay for $q>1, p>1$ (Zhang et al., 2021).
  • MARS-AdamW: Integrates STORM-style variance reduction via scaled recursive momentum, reducing the number of tokens required for GPT-2 training by $\sim 50\%$ and improving downstream zero-shot accuracy by $+1.9\%$ (Yuan et al., 2024).
  • AdaPlus: Merges AdamW's decoupled decay, Nadam’s Nesterov momentum, and AdaBelief’s precise curvature-based step sizing. AdaPlus often matches or exceeds momentum SGD and AdamW across vision, language, and GAN training with no added hyperparameters (Guan, 2023).
  • Weight-predicted AdamW: Introduces future weight prediction for forward/backward passes, boosting convergence rates and final accuracy by $0.08$–$0.74\%$ on image tasks and lowering perplexity by up to $9.2$ on PTB LSTM (Guan, 2023).
  • APOLLO: Approximates AdamW’s per-element scaling via low-rank random projections, achieving similar generalization at batch sizes and memory scale otherwise unattainable for AdamW (Zhu et al., 2024).
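As a rough illustration of the variance-reduction ingredient, the STORM-style recursive estimator that MARS-type variants build on can be sketched as follows (a generic toy on a 1-D quadratic with synthetic noise; this is the textbook STORM recursion, not the exact MARS-AdamW update):

```python
import random

random.seed(1)

def noisy_grad(x, noise=1.0):
    """Plain stochastic gradient of f(x) = x^2 / 2 with additive noise."""
    return x + random.gauss(0.0, noise)

def storm_estimate(x_seq, a=0.1):
    """STORM-style recursive momentum along a sequence of iterates.

    The correction term re-evaluates the gradient at the previous
    iterate with the *same* noise sample, so shared noise cancels and
    the estimator's variance falls well below a single sample's.
    """
    d = prev_x = None
    for x in x_seq:
        xi = random.gauss(0.0, 1.0)       # shared noise sample
        g_cur = x + xi                    # gradient sample at current iterate
        if d is None:
            d = g_cur
        else:
            g_prev = prev_x + xi          # same sample, previous iterate
            d = g_cur + (1 - a) * (d - g_prev)
        prev_x = x
    return d

trials = 2000
xs = [1.0] * 50   # stationary iterates: the true gradient is exactly 1.0
plain_var = sum((noisy_grad(1.0) - 1.0) ** 2 for _ in range(trials)) / trials
storm_var = sum((storm_estimate(xs) - 1.0) ** 2 for _ in range(trials)) / trials
print(plain_var > 5 * storm_var)  # True: the recursion suppresses gradient noise
```

In MARS-AdamW, a scaled version of this low-variance estimate replaces the raw gradient feeding the moment updates of Section 1.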

6. Implementation and Tuning Guidelines

Across these studies, the following recommendations emerge:

  • Default hyperparameters $(\beta_1, \beta_2)=(0.9, 0.999)$, $\epsilon=10^{-8}$, $\lambda_{wd} \in [10^{-4}, 10^{-2}]$ are safe for most Transformer, ViT, and LLM workloads.
  • Decoupled weight decay is essential; never incorporate regularization into the gradient with adaptive methods.
  • Learning rate scheduling—cosine decay, linear warm-up, or step down—is necessary due to AdamW's sensitivity.
  • For architectures with explicit normalization (batch/layer/qk-norm), implicit meta-adaptive effects further enhance stability; 2-Adam and k-Adam generalizations are effective (Gould et al., 2024).
  • For large-scale models, memory-efficient AdamW variants (APOLLO) or freezing the largest-gradient layers during fine-tuning yield substantial resource savings.

7. Context, Limitations, and Future Directions

AdamW’s success is attributed to its theoretically principled decoupling of weight decay, its scale-freeness, and its explicit geometric adaptation to high-dimensional, poorly scaled losses. While proximal interpretations and KKT-constrained analyses provide a clear mathematical basis, stochastic gradient dynamics and nonconvex convergence rates remain only partially characterized. Although AdamW achieves the optimal $O(K^{-1/4})$ complexity in high dimensions, the cost of adaptivity scales with $\sqrt{d}$ in the $\ell_1$ norm.

Future directions focus on variance reduction integration (MARS), memory compression (APOLLO), flexible stability (Aida-type exponents), and meta-optimization with layered normalization. The impact of architectural features (embedding outliers, normalization layers) and optimization-state tuning (layer-freezing, low-rank sketches) remains an active area, especially for foundation model fine-tuning and resource-constrained large-scale training.
