
AdamW: Weight-Decay Scaling Rule

Updated 21 October 2025
  • The paper establishes scaling laws that guide the decoupled weight decay parameter in AdamW, ensuring optimal generalization and robust hyperparameter transfer.
  • The central quantity is the optimizer's EMA timescale, which couples learning rate and weight decay and makes optimizer behavior predictable across varying scales.
  • Practical scaling of weight decay with model width, dataset size, and batch size is shown to enhance training stability and improve performance in deep neural networks.

The weight-decay scaling rule for AdamW refers to the mathematical principles and empirical laws governing the selection and adjustment of the decoupled weight decay parameter $\lambda$ when training deep neural networks, particularly as model size, dataset size, batch size, and architecture scale. The rule ensures that AdamW operates in its generalization-optimal regime and that learning rate, weight decay, and optimizer timescale interact predictably under scaling, facilitating robust hyperparameter transfer and principled tuning.

1. Decoupled Weight Decay in AdamW

AdamW introduced the concept of decoupled weight decay, wherein the decay term is subtracted separately from the adaptive gradient update, in contrast to the coupled $L_2$ regularization that modifies the loss gradient directly (Loshchilov et al., 2017). The AdamW update rule is

$$\theta_{t+1} = \theta_t - \eta_t \left[ \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right]$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates, $\eta_t$ and $\alpha$ are learning rate factors, and $\lambda$ is the decoupled weight decay coefficient.

This formulation ensures uniform regularization of all parameters, decouples hyperparameter tuning for $\alpha$ and $\lambda$, and enables stable and improved generalization performance relative to Adam with coupled $L_2$ regularization (Loshchilov et al., 2017, Zhuang et al., 2022). Decoupled decay enhances empirical robustness, particularly in scale-invariant architectures (Kosson et al., 2023), and matches SGD-like generalization in certain regimes (Ding et al., 2023).
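
To make the decoupling concrete, here is a minimal NumPy sketch of a single AdamW step next to a coupled Adam + $L_2$ step; the function names, default constants, and the state-passing convention are illustrative assumptions rather than code from the cited papers.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.1, eta=1.0):
    """One decoupled (AdamW) step: the decay term lam * theta is applied
    directly to the parameters, outside the adaptive rescaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    theta = theta - eta * (alpha * m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v

def adam_l2_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=0.1):
    """Coupled Adam + L2: the decay term is folded into the gradient and is
    therefore rescaled by the adaptive denominator."""
    grad = grad + lam * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The only difference is where $\lambda\theta$ enters: outside the adaptive rescaling in the first function, inside the gradient (and hence divided by $\sqrt{\hat{v}_t}+\epsilon$) in the second.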

2. Weight-Decay Scaling Laws Across Model and Dataset Size

Recent research establishes that the optimizer's timescale, defined as the exponential moving average (EMA) integration timescale over the weights, is the key invariant under scaling (Wang et al., 22 May 2024, Bergsma et al., 19 May 2025). The timescale in iterations is

$$\tau_{\text{iter}} = \frac{1}{\eta \lambda}$$

In terms of dataset and batch size, the practical scaling timescale is

$$t_{\text{epoch}} = \frac{B}{\eta \lambda D}$$

where $B$ is the batch size and $D$ is the number of training examples. Maintaining a constant $t_{\text{epoch}}$ yields the following scaling rules as $B$ and $D$ grow:

  • To keep $t_{\text{epoch}}$ fixed across settings, $\lambda$ should scale linearly with $B$ for fixed $N$ and $D$ (Bergsma et al., 19 May 2025): $\lambda_{\text{opt}} \propto B$
  • For increasing dataset size $D$, $\lambda$ should decrease (for fixed $B$ and $\eta$) so as to preserve the optimizer's effective timescale; a short numerical sketch of both rules follows this list.
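
As a numerical sketch of the two rules above (with made-up baseline values, and assuming the learning rate $\eta$ is held fixed), one can re-solve $\lambda$ so that $t_{\text{epoch}} = B/(\eta\lambda D)$ stays constant when $B$ or $D$ changes:

```python
def rescale_weight_decay(lam_base, B_base, D_base, B_new, D_new):
    """Return the weight decay that preserves t_epoch = B / (eta * lam * D)
    when batch size and/or dataset size change (learning rate eta fixed)."""
    # Constant t_epoch  =>  B_base / (lam_base * D_base) = B_new / (lam_new * D_new)
    return lam_base * (B_new / B_base) * (D_base / D_new)

# Illustrative baseline: lam = 0.1 at B = 256, D = 1e6 examples.
lam = rescale_weight_decay(lam_base=0.1, B_base=256, D_base=1e6,
                           B_new=1024, D_new=4e6)
# 4x batch and 4x data cancel, so lam stays 0.1; with D fixed,
# 4x batch alone would give lam = 0.4 (lambda_opt proportional to B).
print(lam)
```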

When scaling model width $d$ (using maximal-update parameterization, $\mu$P), learning rates for matrix parameters are set as $\eta_2 \propto 1/d$, and practical weight decay should scale as $\lambda_2 \propto \sqrt{d}$ to preserve steady-state sublayer gain invariance across widths (Fan et al., 17 Oct 2025). This scaling ensures that the root-mean-square norm (and top singular value) of each weight matrix scales as $\sqrt{\eta/\lambda} \cdot d^{0.75}$, preserving network functionality across model width.
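
A minimal sketch of the width-transfer arithmetic, assuming the $\mu$P-style rules just stated ($\eta_2 \propto 1/d$, $\lambda_2 \propto \sqrt{d}$); the baseline width and hyperparameter values are illustrative:

```python
import math

def mup_transfer(eta2_base, lam2_base, d_base, d_new):
    """Transfer matrix-parameter hyperparameters across widths under the
    stated rules: eta_2 ~ 1/d, lambda_2 ~ sqrt(d)."""
    width_ratio = d_new / d_base
    eta2_new = eta2_base / width_ratio              # eta_2 proportional to 1/d
    lam2_new = lam2_base * math.sqrt(width_ratio)   # lambda_2 proportional to sqrt(d)
    return eta2_new, lam2_new

# Illustrative: hyperparameters tuned at width 256, transferred to width 4096.
eta2, lam2 = mup_transfer(eta2_base=1e-2, lam2_base=0.1, d_base=256, d_new=4096)
print(eta2, lam2)  # 0.000625, 0.4
```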

3. Stability, Regularization, and Rotational Equilibrium

Contemporary analyses highlight that weight decay, in the presence of modern normalization, primarily regulates the steady-state scale of weight vectors; that is, it establishes a "rotational equilibrium" in which the magnitude and angular update per step are constant (Kosson et al., 2023). For scale-invariant architectures, the RMS norm $\widehat{\|w\|}$ settles to

$$\widehat{\|w\|} \approx \sqrt{\frac{\eta C}{2\lambda}}$$

where $C$ is the weight vector dimension. The equilibrium angular update

$$\widehat{\eta_r} \approx \sqrt{2\eta \lambda \frac{1-\beta_1}{1+\beta_1}}$$

governs the neuron's effective learning rate. Properly scaled $\lambda$ yields homogeneous update rates and obviates extensive learning rate warmup (Kosson et al., 2023, Fan et al., 17 Oct 2025).
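
The sketch below simply evaluates the two equilibrium expressions for illustrative hyperparameters; the specific values of $\eta$, $\lambda$, $C$, and $\beta_1$ are assumptions chosen for the example:

```python
import math

def rotational_equilibrium(eta, lam, C, beta1=0.9):
    """Evaluate the equilibrium RMS norm and the equilibrium angular update
    per step for a scale-invariant weight vector of dimension C."""
    norm_eq = math.sqrt(eta * C / (2.0 * lam))                       # steady-state ||w||
    ang_eq = math.sqrt(2.0 * eta * lam * (1 - beta1) / (1 + beta1))  # radians per step
    return norm_eq, ang_eq

norm_eq, ang_eq = rotational_equilibrium(eta=1e-3, lam=0.1, C=1024)
print(norm_eq, ang_eq)  # ~2.26 and ~0.0032 for these illustrative values
```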

4. Mini-Batch Regimes and Generalization Bounds

The permissible range of $\lambda$ must be scaled according to batch size $B$ and data size $n$, especially for generalization in the stochastic regime (Tang et al., 13 Oct 2025). For AdamW, near-zero test error is achieved when

$$\lambda_{\text{AdamW}} \sim \widetilde{O}\!\left(\frac{B^2}{n} \wedge 1\right)$$

This regime ensures that regularization is sufficient to suppress noise memorization, whereas Adam with coupled decay requires a much smaller $\lambda$ bounded by model initialization statistics, $\lambda_{\text{Adam}} \sim \sigma_0^{q-2}$, with $\sigma_0$ the initialization variance and $q$ the activation order (Tang et al., 13 Oct 2025).
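
As a back-of-the-envelope illustration of the $B^2/n \wedge 1$ ceiling (ignoring constants and logarithmic factors; the batch and dataset sizes below are arbitrary):

```python
def adamw_lambda_ceiling(B, n):
    """Order-of-magnitude ceiling on lambda from the B^2/n ∧ 1 bound,
    ignoring constants and logarithmic factors."""
    return min(B * B / n, 1.0)

print(adamw_lambda_ceiling(B=256, n=1_000_000))   # ~0.066: small batch, large data
print(adamw_lambda_ceiling(B=4096, n=1_000_000))  # 1.0: the bound saturates at 1
```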

5. Practical Prescriptions and Transfer

Standard practice now recommends tuning $\lambda$ so that the optimizer's EMA timescale $\tau_{\text{iter}}$ is held fixed, thus yielding robust transfer across datasets and model scales (Wang et al., 22 May 2024, Bergsma et al., 19 May 2025):

$$\lambda = \frac{1}{\eta \tau_{\text{iter}}}$$

Empirical validation in foundation model pretraining (ResNet, ViT, GPT, Llama) confirms that fixing the timescale (and scaling $\lambda$ accordingly) preserves optimal base learning rates and stable convergence as models, datasets, and batch sizes are varied (Wang et al., 22 May 2024). For architectures with sublayer normalization, zero-shot transfer across widths is achieved by scaling $\lambda_2 \propto \sqrt{d}$ for matrix parameters (Fan et al., 17 Oct 2025).
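
In practice this reduces to choosing a target timescale and solving for $\lambda$ before constructing the optimizer. The sketch below wires this into `torch.optim.AdamW`; the target timescale, learning rate, and placeholder model are assumptions, and this is one plausible wiring rather than a reference implementation from the cited work.

```python
import torch
import torch.nn as nn

def make_adamw(params, lr, tau_iter):
    """Build AdamW with weight decay chosen so that the EMA timescale
    tau_iter = 1 / (lr * weight_decay) is held at the requested value."""
    weight_decay = 1.0 / (lr * tau_iter)
    return torch.optim.AdamW(params, lr=lr, weight_decay=weight_decay)

model = nn.Linear(512, 512)            # placeholder model for illustration
opt = make_adamw(model.parameters(), lr=3e-4, tau_iter=50_000)
print(opt.defaults["weight_decay"])    # ~0.067 for these illustrative values
```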

6. Extensions and Alternative Approaches

Adaptive, model-oriented decay rules (e.g., Amos, SPD, CWD) further refine $\lambda$ by dynamically coupling it to statistics of the gradient, parameter drift, or architecture-specific scale (Tian et al., 2022, Tian et al., 3 Nov 2024, Chen et al., 14 Oct 2025):

  • Scheduled Weight Decay (SWD) employs a gradient-norm-aware schedule: the penalty is stronger when overall gradient magnitude is high (Xie et al., 2020).
  • Weight norm control (AdamWN) generalizes weight decay by targeting the norm of weights to arbitrary schedules, offering finer control over parameter scale independent of loss-based updates (Loshchilov, 2023).
  • Selective Projection Decay (SPD) regularizes only those layers with inconsistent gradient behavior, preserving pre-trained initialization for foundation model fine-tuning (Tian et al., 3 Nov 2024).
  • Cautious Weight Decay (CWD) applies decay only to coordinates whose sign aligns with the update, maintaining objective fidelity and inducing Pareto-optimal stationary points (Chen et al., 14 Oct 2025); a minimal sketch of this masking idea follows this list.
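
To illustrate the last bullet, here is a minimal sketch of the cautious-masking idea as described above; the exact sign convention and the way the adaptive update is passed in are my assumptions, not code from the CWD paper.

```python
import numpy as np

def cautious_decay_step(theta, adaptive_update, eta, lam):
    """One update with cautious weight decay: decay is applied only on
    coordinates where the parameter sign agrees with the sign of the adaptive
    update (the precise sign convention here is an assumption)."""
    mask = (np.sign(theta) == np.sign(adaptive_update)).astype(theta.dtype)
    return theta - eta * (adaptive_update + lam * mask * theta)

theta = np.array([0.5, -0.3, 0.2])
update = np.array([0.1, -0.2, -0.4])   # e.g. alpha * m_hat / (sqrt(v_hat) + eps)
print(cautious_decay_step(theta, update, eta=1e-2, lam=0.1))
```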

7. Common Misconceptions and Limitations

Contrary to legacy SGD practice, tuning $\lambda$ independently of the learning rate $\eta$ is not always optimal for AdamW. In AdamW, the effective regularization depends on the product $\eta\lambda$ and on the optimizer's timescale (for both memory integration and steady-state scale), not on $\lambda$ alone (Loshchilov et al., 2017, Zhuang et al., 2022, Wang et al., 22 May 2024). Failing to scale $\lambda$ with batch size or model width leads to misaligned update magnitudes, degrading hyperparameter transfer and generalization (Bergsma et al., 19 May 2025, Fan et al., 17 Oct 2025). Furthermore, these scaling laws may require adaptation for architectures or optimizer families (e.g., Lion, Sophia) whose weight-decay dynamics differ from AdamW's (Wang et al., 22 May 2024).

Summary Table: AdamW Weight-Decay Scaling Formulas

Regime / Scaling Law | Formula | Interpretation / Context
EMA timescale (iterations) | $\tau_{\text{iter}} = 1/(\eta\lambda)$ | Weights are an EMA over recent updates
Dataset scaling (epochs) | $t_{\text{epoch}} = B/(\eta\lambda D)$ | Invariance under $B$, $D$ scaling
Model width (matrix params) | $\lambda_2 \propto \sqrt{d}$ | Preserves sublayer gain in scale-invariant nets
Mini-batch regime | $\lambda_{\text{AdamW}} \sim \widetilde{O}(B^2/n \wedge 1)$ | Ensures robust regularization vs. noise
Rotational equilibrium norm | $\widehat{\|w\|} \approx \sqrt{\eta C / (2\lambda)}$ | Steady-state weight vector scale

These rules collectively formalize a principled approach to setting AdamW's weight decay across common training scenarios, ensuring generalization-optimal behavior, stable dynamics, and robust transfer across compute regimes.
