C-AdamW: Enhanced AdamW Optimizer
- C-AdamW is an enhanced variant of AdamW that employs cautious masking to enforce descent on each parameter update.
- It uses a binary masking mechanism to filter updates based on the alignment between the gradient and update direction, ensuring monotonic loss decrease.
- Empirical evaluations indicate that C-AdamW accelerates convergence and improves generalization, making it effective for large-scale deep learning tasks.
C-AdamW broadly refers to a family of AdamW variants and enhancements distinguished by modifications to the update direction, stability mechanism, or masking—sometimes specifically denoting "Cautious AdamW," but also used in the literature for extensions built on AdamW's core principles. The defining characteristic is adaptation or augmentation of the conventional AdamW optimizer, which employs decoupled weight decay for explicit regularization, adaptive per-parameter learning rates, and momentum-based updates. C-AdamW and its relatives seek to improve theoretical convergence properties, optimization monotonicity, empirical convergence speed, final generalization, or computational efficiency.
1. Foundational Structure: AdamW and Its Cautious Variants
AdamW update dynamics are given by

$$
w_{t+1} = w_t - \eta_t \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_t \right),
$$

where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected EMA estimates of the gradient and its square, $\lambda$ is the decoupled weight decay, and $\eta_t$ is the learning rate.
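For reference, the bias-corrected moments follow the standard Adam recursions, with $g_t$ the stochastic gradient and $\beta_1, \beta_2$ the usual EMA coefficients:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}.
$$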
C-AdamW ("Cautious AdamW" (Liang et al., 25 Nov 2024)) introduces a masking mechanism to update only those parameter coordinates for which the update direction and current gradient align: with default , being the AdamW direction, and denoting elementwise multiplication. This guarantees for all coordinates, ensuring each masked update is a descent step, thus decreasing the loss monotonically given small enough learning rates.
2. Theoretical Properties and Stability Analysis
The introduction of the cautious mask preserves a weakened Hamiltonian/Lyapunov structure

$$
H(w, s) = \mathcal{L}(w) + K(s),
$$

where $\mathcal{L}$ is the loss and $K(s)$ represents the kinetic energy of the momentum dynamics, with $s$ the optimizer's momentum state. The cautious variant ensures

$$
\frac{\mathrm{d}}{\mathrm{d}t} H(w_t, s_t) \le 0
\qquad \text{and} \qquad
\frac{\mathrm{d}}{\mathrm{d}t} \mathcal{L}(w_t) \le 0,
$$

holding for each time step in the continuous-time limit. Thus, both energy and loss decrease along trajectories, and the theoretical convergence guarantees of the original momentum method (including AdamW) carry over to C-AdamW for standard choices of the masking function when step sizes are sufficiently small (Liang et al., 25 Nov 2024).
Additionally, the requirement that the masking function satisfies $x\,\phi(x) \ge 0$ elementwise (as holds for the default $\phi(x) = \mathbb{1}[x > 0]$) is sufficient for descent. Empirically, using per-coordinate masking together with a rescaling of the learning rate to compensate for masked coordinates (e.g., dividing the masked update by the mean of the mask, $\phi(u_t \odot g_t) \odot u_t / (\overline{\phi}_t + \epsilon)$) maintains optimization efficiency.
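A one-line check of the descent claim (a continuous-time heuristic, ignoring weight decay and treating $g_t = \nabla \mathcal{L}(w_t)$ as the full gradient): with $\dot{w}_t = -\eta \, \phi(u_t \odot g_t) \odot u_t$,

$$
\frac{\mathrm{d}}{\mathrm{d}t} \mathcal{L}(w_t)
= \nabla \mathcal{L}(w_t)^{\top} \dot{w}_t
= -\eta \sum_i \phi\big(u_{t,i}\, g_{t,i}\big)\, u_{t,i}\, g_{t,i}
\;\le\; 0,
$$

since every summand has the form $x\,\phi(x)$ with $x = u_{t,i}\, g_{t,i}$, which is nonnegative for the default mask.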
3. Algorithmic Implementation
C-AdamW is implemented as a minimal modification to the standard AdamW. In deep learning frameworks such as PyTorch, this can be achieved with one additional line per parameter group:
```python
# Inside the AdamW step, after computing the raw update direction `u` for parameter `p`
# from the current gradient `g`:
m = (u * g > 0).to(g.dtype)                    # cautious mask: keep coords where u and g agree
p.add_(u * m / (m.mean() + eps), alpha=-lr)    # masked update, rescaled by the active fraction
```
This simple insertion is effective in both small-scale and large-scale runs, and learning rate rescaling is recommended to avoid excessively conservative steps that can arise if too many coordinates are masked.
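For concreteness, a self-contained sketch of a single C-AdamW step on one tensor is shown below. This is not the reference implementation: the function name `cadamw_step`, the state-dictionary layout, the toy usage loop, and the placement of the decoupled weight-decay term are illustrative choices; only the final two lines correspond to the masking insertion described above.

```python
import torch

def cadamw_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999),
                eps=1e-8, weight_decay=1e-2):
    """One C-AdamW update for a single tensor `p` (sketch, not a full torch.optim.Optimizer)."""
    beta1, beta2 = betas
    state["step"] += 1
    t = state["step"]
    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]

    # Standard AdamW moment updates with bias correction.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    u = (exp_avg / (1 - beta1 ** t)) / ((exp_avg_sq / (1 - beta2 ** t)).sqrt() + eps)

    # Decoupled weight decay, as in AdamW.
    p.mul_(1 - lr * weight_decay)

    # Cautious masking: keep only coordinates where the update direction and the
    # current gradient agree in sign, then rescale by the active fraction.
    m = (u * grad > 0).to(grad.dtype)
    p.add_(u * m / (m.mean() + eps), alpha=-lr)


# Toy usage on a simple quadratic objective (illustrative hyperparameters).
torch.manual_seed(0)
p = torch.randn(10, requires_grad=True)
state = {"step": 0, "exp_avg": torch.zeros_like(p), "exp_avg_sq": torch.zeros_like(p)}
print("initial loss:", float((p ** 2).sum()))
for _ in range(200):
    loss = (p ** 2).sum()
    loss.backward()
    with torch.no_grad():
        cadamw_step(p, p.grad, state, lr=0.1)
    p.grad = None
print("final loss:", float((p ** 2).sum()))  # should be much smaller than the initial loss
```

In practice this logic would be folded into a `torch.optim.Optimizer` subclass or patched into an existing AdamW implementation, exactly as the one-line insertion above suggests.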
4. Empirical Evaluation and Performance
Empirical studies across vision (MAE/ImageNet-1K) and language modeling (LLaMA pretraining) tasks demonstrate that C-AdamW consistently accelerates convergence relative to standard AdamW. In LLaMA pretraining, speedups of up to 1.47× in sample efficiency were observed, together with consistently lower validation perplexity and improved downstream evaluation scores (Liang et al., 25 Nov 2024). In computer vision, C-AdamW achieves lower final evaluation loss on masked autoencoding tasks.
The methodology is robust across model sizes (from 60M to 1B parameters for LLMs) and across data domains. Furthermore, analogous performance boosts are observed when the cautious masking principle is applied to other momentum-based optimizers such as Lion, yielding C-Lion with gains of similar magnitude.
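The same masking principle can be grafted onto Lion's sign-based update in the same one-line fashion. The sketch below is illustrative only and is not the reference C-Lion implementation; the name `clion_step`, its argument defaults, and the state handling are assumptions:

```python
import torch

def clion_step(p, grad, momentum, lr=1e-4, betas=(0.9, 0.99),
               weight_decay=1e-2, eps=1e-8):
    """One cautious-Lion (C-Lion) update for a single tensor `p` (sketch)."""
    beta1, beta2 = betas
    # Lion's update direction: sign of an interpolation between momentum and gradient.
    u = torch.sign(momentum * beta1 + grad * (1 - beta1))
    # Cautious mask, identical in spirit to the C-AdamW mask.
    m = (u * grad > 0).to(grad.dtype)
    p.mul_(1 - lr * weight_decay)                 # decoupled weight decay
    p.add_(u * m / (m.mean() + eps), alpha=-lr)   # masked, rescaled sign update
    # Lion tracks an EMA of the gradient as its momentum state.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)
```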
5. Comparisons with Related Optimizers and Extensions
Empirical and theoretical comparisons show that while C-AdamW preserves the key properties and ease of use of AdamW, the mask ensures strict per-step descent, potentially leading to improved sample efficiency. In modern LLM pretraining, AdamW remains a robust baseline (Semenov et al., 1 Sep 2025), but C-AdamW offers improved convergence under the same hyperparameters and requires only minimal adjustment (potentially just a slight rescaling of the learning rate).
Alternative AdamW extensions, such as AdaPlus (which incorporates Nesterov momentum and precise stepsize adjustment) (Guan, 2023), show that the AdamW framework is amenable to efficient augmentation. Meanwhile, weight prediction–boosted AdamW ("C-AdamW" in (Guan, 2023)) leverages future weight estimates for gradient calculation, consistently yielding 0.34–0.74% higher top-1 accuracy and lower validation loss on standard image classification and language modeling tasks.
6. Practical Implications, Limitations, and Use Cases
C-AdamW is immediately applicable to large-scale pretraining regimes, transformer models, and deep vision architectures due to its operational simplicity. Because it preserves the monotonic decrease of both the loss $\mathcal{L}$ and the energy function $H$, it is notably advantageous in settings where stability and predictable training dynamics are paramount.
The absence of new hyperparameters ensures compatibility with standard AdamW configurations, and its lightweight integration (single line of code) causes negligible computational overhead. The masking function's conservatism, however, may reduce optimization aggressiveness when the update directions and gradients are frequently misaligned; the learning rate must be properly rescaled—especially if the proportion of active coordinates is small.
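One practical way to gauge how conservative the mask is on a given workload is to log the fraction of coordinates it keeps each step; the helper below is illustrative (the name `mask_stats` is introduced here, not part of any published implementation):

```python
import torch

def mask_stats(u: torch.Tensor, g: torch.Tensor) -> float:
    """Return the fraction of coordinates the cautious mask keeps."""
    m = (u * g > 0).to(g.dtype)
    return m.mean().item()

# Example: a mostly misaligned update keeps few coordinates, signalling conservative steps.
u = torch.tensor([1.0, -1.0, 1.0, -1.0])
g = torch.tensor([-1.0, -1.0, -1.0, 1.0])
print(mask_stats(u, g))  # 0.25 -> only one of four coordinates survives the mask
```

Because the applied update is divided by the mean of the mask, a persistently small active fraction translates into proportionally larger steps on the surviving coordinates, which is worth monitoring alongside the learning rate.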
Theoretical guarantees rely on sufficiently small step sizes, so aggressive learning rates may require empirical validation. The performance gains have been validated in mid- to large-scale settings, but further empirical scrutiny may be needed on extremely large or highly nonconvex optimization problems.
7. Summary Table: Core Features
| Variant | Key Modification | Reported Improvement | PyTorch Integration |
|---|---|---|---|
| AdamW | Decoupled weight decay, Adam-style adaptive updates | Baseline | Standard AdamW |
| C-AdamW | Cautious masking of the update | Up to 1.47× faster convergence | One-line masking/scaling |
| AdaPlus | Nesterov momentum, precise stepsize adjustment | 1.85–2.0% higher | Momentum, EMA submodule |
| WeightPred-AdamW | Forward/backward pass with predicted weights | 0.34–0.74% higher top-1 accuracy | Pre-step prediction routine |
The table summarizes distinguishing algorithmic features and the scale of empirically observed improvements.
In conclusion, C-AdamW constitutes a theoretically justified and empirically validated enhancement of AdamW, with a core design that enforces per-coordinate descent and preserves key dynamical invariants. Its minimal implementation overhead, broad empirical benefit, and strong theoretical guarantees in the Hamiltonian and Lyapunov frameworks recommend it as a robust optimization strategy for deep and large-scale neural network training (Liang et al., 25 Nov 2024).