AdamW: Weight-Decay Scaling Rule
- Recent work establishes scaling laws that guide the choice of the decoupled weight decay parameter in AdamW, ensuring generalization-optimal behavior and robust hyperparameter transfer.
- Central to the rule is an EMA timescale that ties learning rate and weight decay together, leading to predictable optimizer behavior across varying scales.
- Practical scaling of weight decay with model width, dataset size, and batch size is shown to enhance training stability and improve performance in deep neural networks.
The weight-decay scaling rule for AdamW refers to the mathematical principles and empirical laws governing the selection and adjustment of the decoupled weight decay parameter when training deep neural networks—particularly as model size, dataset size, batch size, and architecture scale. The rule ensures that AdamW operates in its generalization-optimal regime and that learning rate, weight decay, and optimizer timescale interact predictably under scaling, facilitating robust hyperparameter transfer and principled tuning.
1. Decoupled Weight Decay in AdamW
AdamW introduced the concept of decoupled weight decay, wherein the decay term is subtracted directly from the weights, separately from the adaptive gradient update, in contrast to coupled $L_2$ regularization, which modifies the loss gradient directly (Loshchilov et al., 2017). The AdamW update rule is
$$\theta_{t+1} = \theta_t - \eta_t \left( \frac{\alpha\, \hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\, \theta_t \right).$$
Here, $\hat m_t$ and $\hat v_t$ are the bias-corrected first and second moment estimates, $\eta_t$ and $\alpha$ are learning rate factors (the schedule multiplier and the base step size), and $\lambda$ is the decoupled weight decay coefficient.
This formulation regularizes all parameters uniformly, decouples the tuning of the learning rate $\alpha$ and the weight decay $\lambda$, and enables stable and improved generalization relative to Adam with coupled $L_2$ regularization (Loshchilov et al., 2017, Zhuang et al., 2022). Decoupled decay enhances empirical robustness, particularly in scale-invariant architectures (Kosson et al., 2023), and matches SGD-like generalization in certain regimes (Ding et al., 2023).
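To make the decoupling concrete, here is a minimal NumPy sketch of a single update step in each scheme (function and variable names are illustrative, following the notation of the rule above); the only difference is where the $\lambda\theta$ term enters.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.1, eta=1.0):
    """One AdamW step: the decay term eta * lam * theta is applied outside
    the adaptive rescaling by sqrt(v_hat) (decoupled weight decay)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)            # bias-corrected first moment
    v_hat = v / (1 - beta2**t)            # bias-corrected second moment
    theta = theta - eta * (alpha * m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v

def adam_l2_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=0.1, eta=1.0):
    """One Adam step with coupled L2: the decay enters the gradient and is
    therefore rescaled per coordinate by sqrt(v_hat)."""
    grad = grad + lam * theta             # coupled L2 term
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```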
2. Weight-Decay Scaling Laws Across Model and Dataset Size
Recent research establishes that the optimizer's timescale, defined as the exponential-moving-average (EMA) integration timescale over the weights, is the key invariant under scaling (Wang et al., 22 May 2024, Bergsma et al., 19 May 2025). The timescale in iterations is $\tau_{\text{iter}} = \frac{1}{\eta\lambda}$. Expressed relative to dataset and batch size, the practical timescale is $\tau_{\text{epoch}} = \frac{B}{\eta\lambda N}$, where $B$ is the batch size and $N$ is the number of training examples. Maintaining a constant $\tau_{\text{epoch}}$ yields the following scaling rules as $N$ and $B$ grow:
- To keep $\tau_{\text{epoch}}$ fixed across settings, $\lambda$ should scale linearly with the batch size $B$ for fixed $\eta$ and $N$ (Bergsma et al., 19 May 2025): $\lambda \propto B$, i.e. $\lambda = \frac{B}{\eta N \tau_{\text{epoch}}}$.
- For increasing dataset size $N$, $\lambda$ should decrease (as $1/N$, for fixed $\eta$ and $B$) so as to preserve the optimizer’s effective timescale; both rules are sketched in the snippet below.
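As a concrete sketch of both rules (the helper name and the numeric values are illustrative, not taken from the cited papers), the function below solves $\tau_{\text{epoch}} = B/(\eta\lambda N)$ for $\lambda$:

```python
def weight_decay_for_timescale(tau_epoch, batch_size, num_examples, lr):
    """Return the AdamW weight decay that keeps the EMA timescale
    tau_epoch = B / (lr * lambda * N) fixed (illustrative helper)."""
    return batch_size / (lr * num_examples * tau_epoch)

# A baseline run and two rescaled runs sharing tau_epoch = 8 epochs.
lam_base = weight_decay_for_timescale(8.0, batch_size=256, num_examples=1_000_000, lr=3e-4)
lam_bigB = weight_decay_for_timescale(8.0, batch_size=512, num_examples=1_000_000, lr=3e-4)  # 2x batch -> 2x lambda
lam_bigN = weight_decay_for_timescale(8.0, batch_size=256, num_examples=4_000_000, lr=3e-4)  # 4x data  -> lambda / 4
print(lam_base, lam_bigB, lam_bigN)
```

Holding $\tau_{\text{epoch}}$ fixed in this way reproduces the linear-in-$B$ and inverse-in-$N$ behavior described above.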
When scaling model width under maximal-update parameterization (µP), learning rates for matrix-like parameters are reduced with width (in standard µP, $\eta_2 \propto 1/\text{fan\_in}$), and the practical weight decay must be set jointly with the learning rate to preserve steady-state sublayer gain invariance across widths (Fan et al., 17 Oct 2025). Because AdamW's steady-state weight scale is governed by the ratio $\eta/\lambda$ (see the rotational-equilibrium analysis in Section 3), the operative relation for matrix-like parameters is that $\sqrt{\eta_2/\lambda_2}$ should track the µP width scaling, so that the root-mean-square norm (and top singular value) of each weight matrix retains its µP dependence on width, preserving network functionality across model widths.
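A small bookkeeping sketch under these assumptions (the helpers are hypothetical; they combine the standard µP learning-rate rule with the equilibrium-norm expression of Section 3 below, and assume the target RMS norm scales as $1/\sqrt{\text{fan\_in}}$):

```python
import math

def equilibrium_rms(lr, weight_decay):
    """Steady-state RMS weight norm sqrt(lr / (2 * weight_decay)) from the
    rotational-equilibrium analysis (Section 3); momentum factors omitted."""
    return math.sqrt(lr / (2.0 * weight_decay))

def check_width_transfer(base_lr, base_wd, base_fanin, fanin, wd):
    """Rescale the matrix-parameter learning rate with width as in muP
    (lr ~ 1/fan_in) and compare the implied equilibrium RMS norm against an
    assumed 1/sqrt(fan_in) target for a candidate weight decay `wd`."""
    lr = base_lr * base_fanin / fanin                       # muP: lr ~ 1/fan_in
    target = equilibrium_rms(base_lr, base_wd) * math.sqrt(base_fanin / fanin)
    return {"lr": lr, "rms_actual": equilibrium_rms(lr, wd), "rms_target": target}

# Widen fan_in 4x and inspect several candidate weight decays.
for wd in (0.05, 0.1, 0.2):
    print(wd, check_width_transfer(base_lr=1e-3, base_wd=0.1,
                                   base_fanin=1024, fanin=4096, wd=wd))
```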
3. Stability, Regularization, and Rotational Equilibrium
Contemporary analyses highlight that weight decay, in the presence of modern normalization, primarily regulates the steady-state scale of weight vectors: it establishes a "rotational equilibrium" in which the weight magnitude and the angular update per step are constant (Kosson et al., 2023). For scale-invariant weight vectors, the norm settles to $\lVert w \rVert_2 \approx \sqrt{\frac{\eta C}{2\lambda}}$ (an RMS norm of roughly $\sqrt{\eta/(2\lambda)}$), where $C$ is the weight vector dimension. The equilibrium angular update $\eta_r \approx \sqrt{2\eta\lambda}$ governs the neuron’s effective learning rate. Properly scaled $\lambda$ yields homogeneous update rates across the network and obviates extensive learning rate warmup (Kosson et al., 2023, Fan et al., 17 Oct 2025).
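The equilibrium prediction can be checked with a toy simulation (an illustrative sketch, not from the cited papers): feeding AdamW isotropic Gaussian noise in place of real gradients drives the RMS weight norm toward roughly $\sqrt{\eta/(2\lambda)}$, regardless of the initialization scale.

```python
import torch

torch.manual_seed(0)
dim, lr, wd = 1024, 1e-3, 0.2
w = torch.nn.Parameter(torch.randn(dim))      # RMS ~ 1 at init, far from equilibrium
opt = torch.optim.AdamW([w], lr=lr, weight_decay=wd)

# EMA timescale is 1 / (lr * wd) = 5000 steps; run ~10 timescales.
for _ in range(50_000):
    w.grad = torch.randn(dim)                 # surrogate noise gradient
    opt.step()

measured = w.detach().pow(2).mean().sqrt().item()
predicted = (lr / (2 * wd)) ** 0.5            # should land close to this value
print(f"RMS norm: measured {measured:.4f}, predicted {predicted:.4f}")
```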
4. Mini-Batch Regimes and Generalization Bounds
The permissible range of $\lambda$ must be scaled with the batch size $B$ and the dataset size $N$, especially for generalization in the stochastic mini-batch regime (Tang et al., 13 Oct 2025). For AdamW, near-zero test error is achieved when $\lambda$ lies within a batch- and dataset-dependent range that keeps the regularization strong enough to suppress noise memorization. Adam with coupled $L_2$ decay, by contrast, tolerates only much smaller coefficients, bounded by the model initialization statistics, namely the initialization variance and the activation order (Tang et al., 13 Oct 2025).
5. Practical Prescriptions and Transfer
Standard practice now recommends tuning $\lambda$ so that the optimizer’s EMA timescale is held fixed, thus yielding robust transfer across datasets and model scales (Wang et al., 22 May 2024, Bergsma et al., 19 May 2025):
$$\tau_{\text{epoch}} = \frac{B}{\eta\lambda N} = \text{const.} \quad\Longrightarrow\quad \lambda \propto \frac{B}{\eta N}.$$
Empirical validation in foundation model pretraining (ResNet, ViT, GPT, Llama) confirms that fixing the timescale (and scaling $\lambda$ accordingly) preserves optimal base learning rates and stable convergence as models, datasets, and batch sizes are varied (Wang et al., 22 May 2024). For architectures with sublayer normalization, zero-shot transfer across widths is achieved by co-scaling $\lambda$ with the learning rate for matrix parameters (Fan et al., 17 Oct 2025).
6. Extensions and Alternative Approaches
Adaptive, model-oriented decay rules (e.g., Amos, SPD, CWD) further refine $\lambda$ by dynamically coupling it to statistics of the gradient, parameter drift, or architecture-specific scales (Tian et al., 2022, Tian et al., 3 Nov 2024, Chen et al., 14 Oct 2025):
- Scheduled Weight Decay (SWD) employs a gradient-norm-aware schedule: the penalty is stronger when overall gradient magnitude is high (Xie et al., 2020).
- Weight norm control (AdamWN) generalizes weight decay by steering the weight norm toward an arbitrary target schedule rather than toward zero, offering finer control over parameter scale independent of loss-based updates (Loshchilov, 2023).
- Selective Projection Decay (SPD) regularizes only those layers with inconsistent gradient behavior, preserving pre-trained initialization for foundation model fine-tuning (Tian et al., 3 Nov 2024).
- Cautious Weight Decay (CWD) applies decay only to coordinates whose sign aligns with the update, maintaining objective fidelity and inducing Pareto-optimal stationary points (Chen et al., 14 Oct 2025).
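The cautious masking in the last bullet can be sketched as follows (illustrative only: the helper name is invented, and the mask convention, decaying only coordinates where the parameter's sign agrees with the sign of the subtracted adaptive update so that decay and update push the coordinate the same way, is one concrete reading of the alignment criterion).

```python
import torch

def cautious_decay_step(param: torch.Tensor, update: torch.Tensor,
                        lr: float, wd: float) -> torch.Tensor:
    """Apply the adaptive update everywhere, but decay only those coordinates
    where sign(param) == sign(update), i.e. where shrinking toward zero moves
    the coordinate in the same direction as the descent step. Illustrative
    sketch of the masking idea, not a reference implementation."""
    mask = (torch.sign(param) == torch.sign(update)).to(param.dtype)
    return param - lr * (update + wd * mask * param)

# Toy usage: a stand-in Adam-style update direction on a small parameter vector.
p = torch.tensor([0.5, -0.8, 0.3, -0.1])
u = torch.tensor([0.2, 0.4, -0.6, -0.3])      # stands in for m_hat / sqrt(v_hat)
print(cautious_decay_step(p, u, lr=1e-2, wd=0.1))
```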
7. Common Misconceptions and Limitations
Contrary to legacy SGD practice, tuning $\lambda$ independently of the learning rate is not always optimal for AdamW. In AdamW, the effective regularization is governed by the product $\eta\lambda$ through the optimizer's timescale (which sets both the memory of the weight EMA and the steady-state weight scale), not by $\lambda$ alone (Loshchilov et al., 2017, Zhuang et al., 2022, Wang et al., 22 May 2024). Failing to scale $\lambda$ with batch size or model width leads to misaligned update magnitudes, degrading hyperparameter transfer and generalization (Bergsma et al., 19 May 2025, Fan et al., 17 Oct 2025). Furthermore, these scaling laws may require adaptation for architectures or optimizer families (e.g., Lion, Sophia) whose weight-decay dynamics differ from AdamW's (Wang et al., 22 May 2024).
Summary Table: AdamW Weight-Decay Scaling Formulas
| Regime / Scaling Law | Formula | Interpretation / Context |
|---|---|---|
| EMA timescale (iterations) | $\tau_{\text{iter}} = \frac{1}{\eta\lambda}$ | Weights are an EMA over recent updates |
| Dataset scaling (epochs) | $\tau_{\text{epoch}} = \frac{B}{\eta\lambda N}$ held constant ($\lambda \propto B/(\eta N)$) | Invariance under $N$, $B$ scaling |
| Model width (matrix params) | $\eta_2 \propto 1/\text{fan\_in}$, with $\lambda_2$ set so $\sqrt{\eta_2/\lambda_2}$ follows the µP width scaling | Preserves sublayer gain in scale-invariant nets |
| Mini-batch regime | $\lambda$ within a $B$- and $N$-dependent admissible range | Ensures robust regularization vs. noise |
| Rotational equilibrium norm | $\lVert w \rVert_2 \approx \sqrt{\eta C / (2\lambda)}$ | Steady-state weight vector scale |
These rules collectively formalize a principled approach to setting AdamW's weight decay across common training scenarios, ensuring generalization-optimal behavior, stable dynamics, and robust transfer across compute regimes.