Weight Prediction–Boosted AdamW
- Published results demonstrate that integrating weight prediction into AdamW accelerates convergence and enhances generalization with minimal computational overhead.
- Weight Prediction–Boosted AdamW is defined by forecasting future weight updates using current gradient estimates to perform lookahead computations during training.
- Empirical results show improved performance in image classification, NLP, and generative modeling with notable gains in accuracy, convergence speed, and robustness.
Weight Prediction–Boosted AdamW is a class of optimization techniques that augment the standard AdamW optimizer with explicit weight prediction steps, aiming to enhance convergence rate and generalization performance in deep neural network (DNN) training. This methodology, sometimes referred to as "AdamW with weight prediction," leverages the structure of AdamW and recent theoretical insights connecting its update rule to exponential moving averages (EMAs) and to proximal optimization frameworks. Weight prediction can be integrated in various ways, notably as a plug-in step before each gradient computation, and can be coupled with modern scaling rules for weight decay that support robust large-scale training.
1. Mathematical Foundations and Standard AdamW
AdamW decouples weight decay from the adaptive gradient steps found in standard Adam, yielding the following update at step $t$:

$$w_t = w_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, w_{t-1} \right),$$

where $\eta$ is the learning rate, $\lambda$ is the weight decay factor, and $\hat{m}_t$, $\hat{v}_t$ are the bias-corrected first and second moment estimates of the gradients, respectively. This structure enables AdamW to act as a first-order approximation to a proximal gradient method for a composite regularized objective and grants it the property of scale-freeness, i.e., invariance of parameter updates to rescaling of gradients across coordinates (Zhuang et al., 2022).
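To make the update concrete, a minimal single-parameter-array AdamW step can be sketched as follows (a NumPy illustration; the function name `adamw_step` and the default hyperparameters are illustrative, not taken from the cited papers):

```python
import numpy as np

def adamw_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    """One AdamW update on parameters w given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g           # EMA of gradients (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2      # EMA of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * w)  # decoupled weight decay
    return w, m, v
```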
AdamW’s update can be reinterpreted as an exponential moving average (EMA) of past updates, where the effective EMA timescale in iterations is

$$\tau_{\text{iter}} = \frac{1}{\eta \lambda},$$

and, measured in epochs,

$$\tau_{\text{epoch}} = \frac{\tau_{\text{iter}}}{M} = \frac{1}{\eta \lambda M},$$

with $M$ denoting the number of iterations per epoch (Wang et al., 22 May 2024). Thus, tuning the product $\eta\lambda$ precisely controls the breadth of historical influence on the current weights.
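The EMA interpretation can be seen by unrolling the update above; writing $u_t = \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ for the adaptive step (a sketch of the reasoning, not the full argument of Wang et al.):

$$w_t = (1 - \eta\lambda)\, w_{t-1} - \eta\, u_t = (1 - \eta\lambda)^t\, w_0 - \eta \sum_{i=1}^{t} (1 - \eta\lambda)^{\,t-i}\, u_i,$$

so past adaptive steps are discounted geometrically at rate $(1 - \eta\lambda)$, i.e., averaged over a characteristic window of roughly $1/(\eta\lambda)$ iterations.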
2. Weight Prediction: Formal Definition and Algorithmic Integration
Weight prediction augments AdamW by forecasting parameter values $s$ steps ahead according to the optimizer’s update rule, then using these future weights for both the forward and backward passes:

$$\hat{w}_{t+s} = w_t - s\,\eta\, \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon},$$

where $\hat{m}_{t+1}$, $\hat{v}_{t+1}$ are computed using the current gradient as a proxy for future updates (Guan, 2023, Guan et al., 2023). The workflow is as follows:
- Cache current weights $w_t$.
- Compute predicted future weights $\hat{w}_{t+s}$.
- Use $\hat{w}_{t+s}$ for the forward and backward passes, calculating gradients at this "lookahead" point.
- Restore $w_t$ and apply the standard AdamW update using the newly computed gradients.
This procedure is an efficient single-step approximation to full extragradient methods, offering the benefits of "lookahead" optimization with minimal computational and memory overhead: per step it adds only element-wise prediction arithmetic and a single cached copy of the weights (Guan et al., 2023). A sketch of one such step is given below.
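The following is a minimal NumPy sketch of one AdamW+WP step following the four stages above. The function name `adamw_wp_step`, the closure-style `grad_fn` interface, and the reuse of the previous step's cached moments as the prediction proxy (keeping the sketch to one gradient evaluation per step) are assumptions, not the authors' reference implementation:

```python
import numpy as np

def adamw_wp_step(w, grad_fn, state, s=2, eta=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, lam=1e-2):
    """One AdamW training step with s-step weight prediction (sketch).

    grad_fn(w) returns the loss gradient at weights w (one forward/backward).
    state carries the running moments m, v and the step counter t across calls.
    """
    m, v, t = state["m"], state["v"], state["t"]

    # 1. Cache the current weights w_t.
    w_cached = w.copy()

    # 2. Predict weights s steps ahead, reusing the cached moments as a proxy
    #    for the next s update directions.
    if t > 0:
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w_pred = w - s * eta * m_hat / (np.sqrt(v_hat) + eps)
    else:
        w_pred = w  # no optimizer history yet: plain AdamW on the first step

    # 3. Forward/backward at the predicted ("lookahead") weights.
    g = grad_fn(w_pred)

    # 4. Restore w_t and apply the standard AdamW update with the new gradient.
    t += 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w_cached - eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * w_cached)

    state.update(m=m, v=v, t=t)
    return w

# Initialize once before training:
# state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}
```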
3. Theoretical Rationale and Effect on Optimization Dynamics
By employing gradients at predicted future weights, the optimizer anticipates the trajectory of weight evolution, effectively aligning parameter updates with the direction in which the optimizer is implicitly headed. This approach helps:
- Reduce gradient staleness and misalignment due to inertia or momentum,
- Accelerate convergence by mitigating lag between the step direction and the likely near-future parameter space,
- Enhance generalization by smoothing the optimization trajectory, analogously to extragradient or lookahead methods.
The equivalence of AdamW's dynamics to an EMA further contextualizes weight prediction: both mechanisms exploit the underlying trend in parameter updates, aiming to harness historical information for more robust progression along the loss landscape (Wang et al., 22 May 2024).
4. Empirical Performance and Scaling Considerations
Extensive empirical studies have demonstrated that AdamW with weight prediction ("AdamW+WP") yields systematically improved results across a variety of tasks and architectures:
- Image classification (CIFAR-10): Top-1 accuracy gains of approximately 0.4–0.8% over baseline AdamW (e.g., DenseNet-121: 93.97% (AdamW) → 94.39% (AdamW+WP) (Guan, 2023, Guan et al., 2023)).
- Natural language processing: Lower perplexity and higher BLEU/dev accuracy on recurrent and transformer models.
- Generative modeling: Substantially improved FID (e.g., WGAN: 97.02 (AdamW) → 74.94 (AdamW+WP) (Guan et al., 2023)).
- Convergence speed: Faster reduction in loss early and late in training; more robust improvement after learning rate decay.
- Robustness: Performance gains are not sensitive to the precise prediction step $s$, with the optimal $s$ typically between 1 and 3.
- Resource usage: Negligible additional computation/memory compared to classical extragradient methods, which require a full extra gradient evaluation per update.
The strategy is also validated on large-scale settings (Llama, StableLM pretraining), demonstrating stability of optimal EMA timescales and, by extension, robustness of the weight prediction scheme (Wang et al., 22 May 2024).
5. Compatibility with Scaling Rules: Model and Dataset Size
Recent theoretical and empirical results show that the optimal EMA timescale in AdamW, when measured in epochs, is approximately invariant to model and dataset size (Wang et al., 22 May 2024). This key observation yields a practical prescription for weight decay:
$$\lambda = \frac{B}{\eta\, N\, \tau_{\text{epoch}}},$$

where $N$ is the dataset size and $B$ is the batch size (so $M = N/B$), with $\tau_{\text{epoch}}$ chosen within the empirically optimal range reported by (Wang et al., 22 May 2024). As a consequence:
- Scaling up the dataset ($N$ increases): $\lambda$ should decrease ($\lambda \propto 1/N$ at fixed $\eta$, $B$, and $\tau_{\text{epoch}}$).
- Scaling up the model (fan-in or width increases): $\lambda$ should increase proportionally, matching the inverse scaling of the learning rate, e.g., $\lambda \propto \text{fan-in}$ if $\eta \propto 1/\text{fan-in}$, so that $\eta\lambda$ (and hence the EMA timescale) stays fixed.
- Consistent EMA timescale: This ensures that the effective "memory window" over which updates are averaged remains stable, allowing for reliable hyperparameter transfer and better stability as models and datasets grow.
Weight prediction thus benefits from, and integrates naturally with, these scaling laws. Selecting $\eta$ and $\lambda$ to preserve the optimal EMA timescale ensures both the preservation of output statistics and the stability of weight prediction steps across scales.
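As an illustration of this prescription, a small helper can compute $\lambda$ from the target timescale. The function name `weight_decay_for_timescale` and the example numbers (including the target of 2 epochs) are placeholders for tuning, not recommendations from the cited papers:

```python
def weight_decay_for_timescale(eta, dataset_size, batch_size, tau_epochs):
    """Weight decay giving an EMA timescale of `tau_epochs` epochs.

    From tau_epoch = B / (eta * lambda * N), i.e. lambda = B / (eta * N * tau_epoch).
    """
    iters_per_epoch = dataset_size / batch_size            # M = N / B
    return 1.0 / (eta * iters_per_epoch * tau_epochs)      # lambda = 1 / (eta * M * tau_epoch)

# Doubling the dataset at fixed eta, B, and tau_epochs halves the weight decay.
lam_1x = weight_decay_for_timescale(eta=1e-3, dataset_size=1_000_000,
                                    batch_size=512, tau_epochs=2.0)
lam_2x = weight_decay_for_timescale(eta=1e-3, dataset_size=2_000_000,
                                    batch_size=512, tau_epochs=2.0)
assert abs(lam_2x - lam_1x / 2) < 1e-12
```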
6. Practical Implementation and Integration
Implementing weight prediction–boosted AdamW requires minimal code modification: before every data batch, compute the predicted weights from the base optimizer's update rule, in pseudocode:

```python
# Predict s steps ahead from the (bias-corrected) moment estimates; run the
# forward/backward pass at w_pred, then restore w_t before the actual update.
w_pred = w_t - s * eta * m_t_plus_1 / (sqrt(v_t_plus_1) + eps)
```
The prediction step $s$ typically takes values 1–3. AdamW's learning rate and weight decay should be set according to the EMA timescale rule above. The same principle generalizes to other optimizers (SGDM, AdaBelief, etc.) by using the appropriate base update rule. An end-to-end integration with a framework optimizer is sketched below.
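For concreteness, the following sketch wraps weight prediction around a standard PyTorch `torch.optim.AdamW` instance by reading its `exp_avg`, `exp_avg_sq`, and `step` state entries. The helper names (`predict_weights`, `restore_weights`) and the omission of the decay term in the prediction, mirroring the pseudocode above, are assumptions rather than the authors' released code:

```python
import torch

@torch.no_grad()
def predict_weights(model, optimizer, s=2):
    """Cache current params and move them to s-step lookahead values in place."""
    cached = [p.detach().clone() for p in model.parameters()]
    for group in optimizer.param_groups:
        eta, eps = group["lr"], group["eps"]
        beta1, beta2 = group["betas"]
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            if "exp_avg" not in state:
                continue  # no optimizer history yet (first step): skip prediction
            t = float(state["step"])
            m_hat = state["exp_avg"] / (1 - beta1 ** t)
            v_hat = state["exp_avg_sq"] / (1 - beta2 ** t)
            p.add_(-s * eta * m_hat / (v_hat.sqrt() + eps))
    return cached

@torch.no_grad()
def restore_weights(model, cached):
    """Copy the cached pre-prediction weights back into the model."""
    for p, w in zip(model.parameters(), cached):
        p.copy_(w)

# Per batch:
#   cached = predict_weights(model, optimizer, s=2)   # move to lookahead weights
#   loss = criterion(model(x), y)                     # forward at lookahead point
#   optimizer.zero_grad(); loss.backward()            # gradients at lookahead point
#   restore_weights(model, cached)                    # back to w_t
#   optimizer.step()                                  # standard AdamW update
```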
7. Extensions, Limitations, and Related Strategies
Weight prediction–boosted AdamW can be further extended and analyzed in light of several related concepts:
- Norm control: Explicitly setting weight norms (as in AdamWN) can offer more direct regularization than decay, potentially improving generalization (Loshchilov, 2023).
- Proximal view and scale-freeness: AdamW's scale invariance and its proximal formulation promote stability in the presence of heterogeneous or poorly conditioned gradients (Zhuang et al., 2022).
- Alternative weighting strategies: As shown in the weighted adaptive gradient method framework (WAGMF), the weighting assigned to past gradients can be generalized beyond the EMA/exponential scheme of AdamW (e.g., linear weighting in WADA), potentially further improving convergence and regret bounds (Zhong et al., 2021).
- Hyperparameter transfer: Empirical scaling rules (muP and extensions) enable transfer of optimizer settings across model sizes and architectures, a crucial property for the practical deployment of weight prediction–boosted AdamW at scale (Wang et al., 22 May 2024, Fan et al., 17 Oct 2025).
A plausible implication is that future optimization algorithms may combine weight prediction, adaptive non-exponential weighting of past updates, and explicit norm or rotation constraints to further improve large-scale learning dynamics.
Table: Key Components in Weight Prediction–Boosted AdamW
| Component | Description | Formula/Algorithm Location |
|---|---|---|
| Predicted weights $\hat{w}_{t+s}$ | $s$-step lookahead based on the base optimizer's update rule | Section 2 (see above) |
| Gradient computation | Use gradients at $\hat{w}_{t+s}$ | Forward/backward on predicted point |
| Actual update | Standard AdamW step using the predicted-point gradients | AdamW($w_t$, gradients at $\hat{w}_{t+s}$) |
| Weight decay scaling | EMA timescale $\tau_{\text{epoch}}$ determines $\lambda = B/(\eta N \tau_{\text{epoch}})$ | Section 5 |
| Empirical hyperparameter | $s \in \{1, 2, 3\}$; performance robust to the choice of $s$ | (Guan, 2023, Guan et al., 2023, Wang et al., 22 May 2024) |
Weight prediction–boosted AdamW offers a principled, empirically validated pathway to improve the convergence and generalization of AdamW in both moderate and massively scaled DNN training, especially when harmonized with theoretical scaling rules for adaptive hyperparameters.