Adam with Corrected Weight Decay (AdamC)
Adam with Corrected Weight Decay (AdamC) is a family of optimizer modifications and design principles that address deficiencies of traditional weight decay when used with adaptive gradient methods such as Adam. The motivation for AdamC arises from empirical and theoretical findings that naïve application of L2 regularization within Adam, or even decoupled weight decay (as in AdamW), can lead to undesirable optimization dynamics, unstable training, and suboptimal generalization—especially in settings involving normalization layers, learning rate schedules, and large-scale training. AdamC encompasses both specific algorithmic corrections and a broader framework for rigorously integrating weight decay and regularization within adaptive optimizers.
1. Motivation and Historical Background
Adaptive gradient optimizers, notably Adam, have become standard in deep neural network training due to their fast convergence and resilience to gradient scaling. However, researchers identified that Adam, when combined with conventional L2 regularization, can underperform compared to stochastic gradient descent (SGD) with weight decay, often showing inferior generalization, especially in deep architectures and fine-tuning scenarios (Loshchilov et al., 2017). This discrepancy is rooted in the interaction between Adam’s per-parameter scaling (via exponential moving averages of squared gradients) and the regularization term, leading to ineffective or inconsistent penalization of weight norms (Zhang et al., 2018).
AdamW was proposed as a partial solution, decoupling weight decay from the adaptive gradient update (Loshchilov et al., 2017). Nevertheless, this remedy does not address all pathologies, particularly in the presence of learning rate schedules and normalization. Recent research has exposed further shortcomings and prompted the introduction of algorithmic corrections now grouped under the "AdamC" (Adam with Corrected Weight Decay) label (Defazio, 2 Jun 2025).
2. Deficiencies of Standard Weight Decay in Adam
2.1. Coupled Regularization
Standard L2 regularization in Adam is implemented by augmenting the loss with a quadratic penalty, which, when differentiated, adds a term proportional to the weights to the gradient. However, Adam applies per-parameter adaptive scaling, so the penalty’s impact varies depending on the parameter’s past gradient history, a dramatic departure from its behavior in SGD. For parameters with large historical gradients, the effective regularization is weak; for those with small gradients, it may be too strong. This coupling undermines the regularization’s intended effect and disrupts tuning (Loshchilov et al., 2017).
2.2. Norm and Learning Rate Interactions
When using normalization layers (such as BatchNorm or LayerNorm), the functional impact of the weight vector norm disappears, and only the direction (orientation) matters. However, weight decay in Adam (even with decoupling as in AdamW) shrinks the norm depending on the learning rate, creating an undesirable dependence: with weight decay coefficient $\lambda$ and learning rate $\gamma_t$ at step $t$, the equilibrium gradient norm for a normalized layer is

$$\|\nabla L(x_t)\| \;\propto\; \sqrt{\frac{2\lambda}{\gamma_t}}\,\|x_t\|,$$

which increases sharply as the learning rate decreases during schedule annealing (Defazio, 2 Jun 2025). This dynamic leads to late-stage gradient explosions ("tail blow-up") and unstable optimization.
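The scaling can be seen from a heuristic steady-state argument, sketched here for plain (non-preconditioned) gradient steps with decoupled decay; the cited analysis treats Adam's adaptive updates, but the qualitative dependence on the learning rate is the same. For a scale-invariant layer the gradient is orthogonal to the weights, so one gradient step changes the squared norm by

$$\|x_t - \gamma_t \nabla L(x_t)\|^2 - \|x_t\|^2 = \gamma_t^2 \|\nabla L(x_t)\|^2,$$

while the decay factor $(1 - \gamma_t \lambda)$ shrinks it by approximately $2\gamma_t\lambda\|x_t\|^2$. Balancing growth against shrinkage at equilibrium gives

$$\gamma_t^2 \|\nabla L(x_t)\|^2 \approx 2\gamma_t\lambda\|x_t\|^2 \quad\Longrightarrow\quad \frac{\|\nabla L(x_t)\|}{\|x_t\|} \approx \sqrt{\frac{2\lambda}{\gamma_t}},$$

so halving the learning rate at fixed $\lambda$ raises the equilibrium gradient-to-weight-norm ratio by a factor of $\sqrt{2}$.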
3. Algorithmic Innovations in AdamC
3.1. Decoupled Weight Decay (AdamW and Beyond)
Decoupled weight decay, as introduced in AdamW, separates the regularization step from the adaptive gradient scaling:

$$x_{t+1} = x_t - \gamma_t \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda x_t \right),$$

where $\hat m_t$ and $\hat v_t$ are the bias-corrected first- and second-moment estimates. This structure ensures the decay term applies directly and uniformly across parameters, independent of local gradient scaling (Loshchilov et al., 2017). It allows for independent tuning of the learning rate and decay coefficient, facilitating more stable training (Bjorck et al., 2020).
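As an illustration, the following minimal single-step sketch (PyTorch tensor operations assumed; the function adam_step and its defaults are illustrative rather than taken from the cited papers) contrasts coupled L2 regularization with decoupled decay:

```python
import torch

def adam_step(p, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01, decoupled=True):
    """One Adam update on parameter tensor `p` (moments `m`, `v` updated in place)."""
    if not decoupled:
        # Coupled L2: the penalty gradient passes through the adaptive scaling,
        # so its strength depends on each parameter's gradient history.
        grad = grad + weight_decay * p

    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first-moment EMA
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment EMA
    m_hat = m / (1 - beta1 ** step)                      # bias correction
    v_hat = v / (1 - beta2 ** step)

    update = m_hat / (v_hat.sqrt() + eps)
    if decoupled:
        # Decoupled (AdamW-style) decay: added outside the adaptive scaling,
        # so every parameter is shrunk uniformly.
        update = update + weight_decay * p
    p.sub_(lr * update)
    return p
```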
3.2. Gradient-Norm-Aware Decay Scheduling
Further refinements recognize that a constant weight decay coefficient still leads to excessive final gradient norms or instability (Xie et al., 2020). Scheduled Weight Decay (SWD) introduces a normalized schedule,

$$\lambda_t = \frac{\lambda}{\sqrt{\bar v_t}},$$

where $\bar v_t$ is the mean squared gradient estimate (the second-moment estimate averaged over parameters). This adaptation ensures that the effect of weight decay is appropriately scaled throughout training, promoting flat minima and improved generalization.
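A minimal sketch of one such normalized schedule, under the assumption that the decay coefficient is simply divided by the square root of the mean second-moment estimate (the helper below is an illustration, not the reference SWD implementation):

```python
import torch

def scheduled_weight_decay(base_decay: float, v_hat: torch.Tensor, eps: float = 1e-8) -> float:
    """Normalize the decay coefficient by the mean squared-gradient estimate."""
    v_mean = v_hat.mean().item()      # average second-moment estimate over parameters
    return base_decay / (v_mean ** 0.5 + eps)
```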
3.3. Learning Rate–Schedule Corrected Decay
AdamC specifically addresses the pathological gradient blow-up observed with learning rate schedules and normalization layers (Defazio, 2 Jun 2025). Its correction sets the decay coefficient to

$$\hat\lambda_t = \lambda\,\frac{\gamma_t}{\gamma_{\max}},$$

where $\gamma_{\max}$ is the maximum (peak) learning rate of the schedule, applied as

$$x_{t+1} = x_t - \gamma_t \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \hat\lambda_t x_t \right).$$

This ensures that the steady-state gradient-to-weight-norm ratio is constant and not a function of the decaying learning rate:

$$\frac{\|\nabla L(x_t)\|}{\|x_t\|} \;\propto\; \sqrt{\frac{2\hat\lambda_t}{\gamma_t}} = \sqrt{\frac{2\lambda}{\gamma_{\max}}}.$$

This correction eliminates late-training gradient norm spikes and results in smoother, more stable convergence.
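In implementation terms, the correction amounts to one extra scaling per step. A minimal sketch, assuming a PyTorch-style optimizer that exposes param_groups (the helper corrected_decay and the names base_decay, lr_max, and schedule are illustrative):

```python
def corrected_decay(base_decay: float, lr: float, lr_max: float) -> float:
    """AdamC-style correction: scale the decay coefficient with the LR schedule.

    At lr == lr_max the decay equals base_decay; as the schedule anneals the
    learning rate, the decay shrinks proportionally, keeping the steady-state
    gradient-to-weight-norm ratio constant.
    """
    return base_decay * lr / lr_max

# Example usage inside a training loop (optimizer and schedule assumed to exist):
# for step, lr in enumerate(schedule):
#     for group in optimizer.param_groups:
#         group["lr"] = lr
#         group["weight_decay"] = corrected_decay(base_decay=0.1, lr=lr, lr_max=max(schedule))
```

When combined with AdamW's decoupled update, which multiplies the decay by the learning rate, this schedule makes the per-step decay contribution proportional to $\gamma_t^2/\gamma_{\max}$, matching the corrected update shown above.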
3.4. Weight Norm Control
Moving beyond decay-to-zero, weight norm control ("AdamWN") treats decoupled weight decay as a special case (target weight norm of zero) and generalizes regularization by steering the weight norm toward an explicit target value, governed by a target-norm ratio and a schedule rate (Loshchilov, 2023). This yields more interpretable regularization, straightforward transfer across batch sizes, and stability under various training schedules.
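A minimal sketch of the norm-control idea, assuming the goal is simply to move a parameter tensor's norm a scheduled fraction of the way toward an explicit target (an illustration of the concept, not the AdamWN reference implementation):

```python
import torch

@torch.no_grad()
def control_weight_norm(p: torch.Tensor, target_norm: float, rate: float) -> None:
    """Relax the norm of `p` toward `target_norm` at the given per-step rate."""
    current = p.norm()
    if current > 0:
        desired = current + rate * (target_norm - current)
        p.mul_(desired / current)
```

With target_norm = 0 this reduces to multiplying the weights by (1 - rate), i.e., ordinary decoupled weight decay, which is the sense in which AdamW is a special case of norm control.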
4. Theoretical and Empirical Analysis
4.1. Theoretical Perspectives
Extensive theoretical work demonstrates that properly decoupled and scheduled weight decay restores convergence guarantees in nonsmooth, nonconvex optimization (Ding et al., 2023). The Adam family with fully decoupled decay provably converges to stationary points of the regularized problem, matching SGD’s asymptotic behavior and closing prior gaps in the theory. In contrast, empirical and analytical studies show that without such corrections, Adam (even with weight decay) may converge to radically different (and sometimes memorizing) minima than SGD, leading to a persistent generalization gap (Zou et al., 2021).
4.2. Empirical Outcomes
Experimental studies show that AdamC variants (decoupled, schedule-corrected, or norm-controlled):
- Lower final test error, closing or eliminating the Adam/SGD generalization gap in vision and language settings (Loshchilov et al., 2017, Xie et al., 2020, Ding et al., 2023).
- Maintain robust training dynamics across a wide hyperparameter range and with warmup or non-monotonic learning rate schedules (Xie et al., 2020, Loshchilov, 2023, Defazio, 2 Jun 2025).
- Preserve stable weight and gradient norms and avoid end-of-training instability, especially in long-duration LLM training (Defazio, 2 Jun 2025).
5. Extensions and Related Directions
5.1. Selective and Structured Regularization
In foundation model fine-tuning, indiscriminate weight decay can cause either overfitting or catastrophic forgetting. Selective Projection Decay (SPD) (Tian et al., 3 Nov 2024) improves upon AdamC by activating decay for a layer only when its unconstrained deviation impedes loss reduction, yielding better in-distribution and out-of-distribution generalization.
5.2. Adaptive and Model-Informed Schedules
Amos (Tian et al., 2022) introduces per-variable, theory-driven scaling of both learning rates and decay coefficients, informed by model architecture and real-time gradient statistics. This yields faster, memory-efficient convergence and reduces the need for manual schedule tuning. Adaptive and scheduled strategies continue to receive attention for robust, transfer-friendly optimization.
5.3. Rotational Equilibrium and Angular Control
Recent insights highlight that, in scale-invariant (normalized) networks, weight decay functionally controls the rate of angular change (rotation) of neuron weight vectors rather than simply shrinking norms. Directly controlling the angular update rate may further improve stability and reduce the need for learning rate warmup (Kosson et al., 2023).
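For instance, the per-step rotation of a neuron's weight vector can be monitored directly; a small sketch (the function angular_update is an illustrative helper, not taken from the cited work):

```python
import torch

def angular_update(w_before: torch.Tensor, w_after: torch.Tensor) -> float:
    """Angle, in radians, between a weight vector before and after one update."""
    cos = torch.dot(w_before.flatten(), w_after.flatten()) / (
        w_before.norm() * w_after.norm() + 1e-12
    )
    return torch.acos(cos.clamp(-1.0, 1.0)).item()
```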
6. Practical Implementation and Deployment
AdamC techniques have been incorporated into all major deep learning frameworks, most notably through the widespread adoption of AdamW for decoupled decay. The corrected and scheduled variants (AdamC, SWD) can generally be implemented as drop-in replacements by modifying the weight decay coefficient per step, optionally scaling it by gradient statistics or the learning rate schedule (Xie et al., 2020, Defazio, 2 Jun 2025). Reference implementations and code are available for AdamW, SWD, Amos, SPD, and AdamC.
Key implementation considerations (a combined sketch follows the list):
- Apply weight decay only to designated parameter groups (e.g., exclude biases and normalization scales).
- For learning-rate schedule correction, use the currently scheduled learning rate and its initial maximum to compute decay scaling.
- In norm control, periodically rescale or project weights to match the desired norm.
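The sketch below combines the first two considerations, assuming a PyTorch model and an AdamW-style optimizer (the helper build_param_groups and the names base_decay and lr_max are illustrative):

```python
import torch

def build_param_groups(model: torch.nn.Module, weight_decay: float):
    """Apply weight decay only to weight matrices, never to biases or norm scales."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)  # biases and LayerNorm/BatchNorm scale/shift vectors
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# optimizer = torch.optim.AdamW(build_param_groups(model, weight_decay=0.1), lr=1e-3)
#
# Schedule-corrected decay, updated once per step (lr_max is the schedule's peak value):
# for group in optimizer.param_groups:
#     if group["weight_decay"] > 0:
#         group["weight_decay"] = base_decay * group["lr"] / lr_max
```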
7. Limitations and Future Directions
While AdamC addresses key pathologies in adaptive optimizer regularization, certain nonconvex optimization landscapes remain problematic: Adam with any regularization may still diverge from SGD’s solution, as shown in memorization-oriented tasks (Zou et al., 2021). Further, some domains (e.g., language modeling) may show mixed results for scheduled or adaptive decay. The field is evolving toward more selective, structured, and model-informed regularization schedules, and toward direct control of functional quantities (such as angular update rate) rather than parameter norms alone.
Adam with Corrected Weight Decay (AdamC) encompasses algorithmic solutions that deliver stable, interpretable, and tunable regularization for adaptive optimizers. By addressing the deficiencies of L2 regularization and naïve decay schedules, AdamC variants ensure robust convergence, eliminate schedule-induced instabilities, and achieve generalization competitive with or superior to SGD in a wide range of deep learning settings.