Warmup-Stable-Decay (WSD) Schedule
- The Warmup-Stable-Decay (WSD) schedule is a learning rate strategy that segments training into warmup, stable plateau, and decay phases to enhance both initial stability and efficient convergence.
- The constant learning rate in the plateau phase enables rapid progress along low-curvature directions, while the final decay phase sharpens performance by reducing oscillations.
- Empirical and theoretical studies demonstrate that WSD improves final model performance and compute efficiency in large-scale training, including transformer pre-training and certified robustness.
The Warmup-Stable-Decay (WSD) schedule is a three-phase learning rate scheduling strategy that has become prominent in large-scale neural network and transformer pre-training, as well as certified robust optimization, due to its empirical and theoretical effectiveness across diverse domains. WSD divides training into a gradual warmup period, an extended constant plateau (“stable” phase), and a final sharp decay (“cooldown” or “annealing” phase). This structure both simplifies scheduling (removing the need to preset the total token/step budget) and improves efficiency and stability, especially in settings with large batches or models.
1. Formulation and Mathematical Structure
WSD is formulated by partitioning the training horizon into three segments: a warmup phase of length $T_{\mathrm{warmup}}$, a stable phase until step $T_{\mathrm{stable}}$, and a decay phase. In canonical form (Hu et al., 9 Apr 2024, Wen et al., 7 Oct 2024), the learning rate at training step $t$ is defined as:

$$
\eta(t) =
\begin{cases}
\dfrac{t}{T_{\mathrm{warmup}}}\,\eta_{\max}, & t \le T_{\mathrm{warmup}} \\[4pt]
\eta_{\max}, & T_{\mathrm{warmup}} < t \le T_{\mathrm{stable}} \\[4pt]
f(t - T_{\mathrm{stable}})\,\eta_{\max}, & t > T_{\mathrm{stable}}
\end{cases}
$$

where:
- $T_{\mathrm{warmup}}$ is the warmup duration,
- $T_{\mathrm{stable}}$ marks the end of the stable plateau,
- $\eta_{\max}$ is the target (maximum) learning rate,
- $f(\cdot) \in (0, 1]$ is a monotonic decay function (commonly exponential, linear, cosine, or power-law (Luo et al., 17 Mar 2025, Bergsma et al., 21 Feb 2025)).
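For concreteness, the piecewise definition can be written as a small Python helper; this is an illustrative sketch, and the function and argument names are not taken from any of the cited codebases:

```python
def wsd_lr(step: int, eta_max: float, t_warmup: int, t_stable: int,
           t_total: int, decay=lambda frac: 1.0 - frac) -> float:
    """Learning rate eta(t) under the three-phase WSD schedule.

    `decay` maps the normalized decay progress
    frac = (t - T_stable) / (T_total - T_stable) to a multiplier in [0, 1];
    the default is a linear cooldown.
    """
    if step <= t_warmup:                               # warmup: linear ramp to eta_max
        return eta_max * step / max(t_warmup, 1)
    if step <= t_stable:                               # stable: constant plateau
        return eta_max
    frac = (step - t_stable) / max(t_total - t_stable, 1)
    return eta_max * decay(min(frac, 1.0))             # decay: f(t - T_stable) * eta_max
```

Swapping the `decay` argument reproduces the exponential, cosine, or power-law variants mentioned above.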
Variants include exponential warmup (Kim et al., 2021), rule-of-thumb linear warmup for Adam (Ma et al., 2019), adaptive warmup based on loss monitoring, and post-hoc checkpoint merging to emulate decay (Tian et al., 23 Jul 2025). The decay phase is typically short (e.g., 10% of tokens (Hu et al., 9 Apr 2024, Wen et al., 7 Oct 2024)) and its shape (cooldown) can significantly affect the final performance (Dremov et al., 2 Aug 2025).
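In practice such a schedule can be attached to an optimizer through a step-indexed multiplier; a minimal PyTorch sketch, assuming illustrative phase lengths and a 1-sqrt cooldown (the model, optimizer, and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(512, 512)                     # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # base lr acts as eta_max

T_WARMUP, T_STABLE, T_TOTAL = 2_000, 90_000, 100_000  # ~10% of steps for cooldown

def wsd_factor(step: int) -> float:
    """Multiplier on the base lr: linear warmup, plateau, 1-sqrt cooldown."""
    if step <= T_WARMUP:
        return step / T_WARMUP
    if step <= T_STABLE:
        return 1.0
    frac = (step - T_STABLE) / (T_TOTAL - T_STABLE)
    return max(1.0 - frac ** 0.5, 0.0)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=wsd_factor)

for step in range(1, T_TOTAL + 1):
    # ... forward pass, loss.backward(), opt.step(), opt.zero_grad() ...
    sched.step()   # advance the schedule once per optimizer step
```

Because the stable phase is open-ended, `T_STABLE` and `T_TOTAL` can be revised late in training, or the run can be checkpointed at the plateau and decayed separately; this is the schedule's compute-agnostic property noted above.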
2. Mechanistic Insights and Loss Landscape
Multiple recent theoretical analyses highlight the geometric “river valley” structure of the loss surface in LLM pretraining (Wen et al., 7 Oct 2024, Liu et al., 6 Jul 2025). The stable phase’s high (constant) learning rate enables rapid traversal along the flat “river” direction (low Hessian curvature), but causes oscillatory movement in steep “hill” directions. Only during the decay phase do these oscillations subside, “revealing” the underlying improvement. The validation loss thus drops sharply at decay onset; the loss curve appears flat or even elevated during the plateau, then plunges as learning rate falls.
This behavior is formalized in the valley–river model, with rapid equilibration in “valley” directions and slow descent along “river” directions (Liu et al., 6 Jul 2025). The strong “Mpemba point” refers to an optimal plateau learning rate that pre-equilibrates fast modes, allowing for maximally accelerated convergence in the decay phase.
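The loss dynamics described above can be reproduced in a self-contained toy experiment: noisy gradient descent on a two-dimensional “river valley” with one sharp and one gently sloped direction. This is a deliberately simplified illustration, not a model taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
SHARP, SLOPE = 100.0, 0.01       # steep "hill" curvature in x, gentle "river" slope in y

def loss(w):
    x, y = w
    return 0.5 * SHARP * x**2 + SLOPE * (100.0 - y)   # offset keeps the toy loss positive

def grad(w):
    x, _ = w
    return np.array([SHARP * x, -SLOPE])

T_WARMUP, T_STABLE, T_TOTAL, ETA_MAX = 100, 4000, 5000, 0.015

def wsd(step):                    # linear warmup, plateau, linear decay to zero
    if step <= T_WARMUP:
        return ETA_MAX * step / T_WARMUP
    if step <= T_STABLE:
        return ETA_MAX
    return ETA_MAX * (1.0 - (step - T_STABLE) / (T_TOTAL - T_STABLE))

w, losses = np.array([1.0, 0.0]), []
for t in range(1, T_TOTAL + 1):
    g = grad(w) + rng.normal(0.0, 1.0, size=2)        # noisy minibatch gradient
    w -= wsd(t) * g
    losses.append(loss(w))

print(f"plateau loss (mean over last 500 stable steps): {np.mean(losses[T_STABLE-500:T_STABLE]):.4f}")
print(f"final loss   (mean over last 100 decay steps):  {np.mean(losses[-100:]):.4f}")
# The plateau average stays elevated because of oscillations along the sharp x
# direction; the cooldown damps them and the measured loss drops sharply.
```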
Recent stochastic analyses (Functional Scaling Laws, FSL (Li et al., 23 Sep 2025)) show that WSD schedules boost the “intrinsic time” available for optimization, with the stable phase allowing more efficient risk reduction before noise is suppressed sharply by decay.
3. Empirical Evidence and Comparative Performance
Empirical studies spanning convolutional networks, LLMs, and certified robust models consistently support WSD’s superiority in final performance and compute efficiency:
- Representation Stabilization: In CNNs with large batch sizes, linear warmup prevents instability in deeper layers, keeping CCA similarity high and improving validation accuracy; freezing deep layers during warmup achieves similar effects (Gotmare et al., 2018).
- LLM Pretraining: WSD schedules outperform cosine and step decay, with rapid loss drops at decay onset. The optimal compute-efficient data:model ratio is much higher than prior guidelines (e.g., 192:1 vs. Chinchilla’s 20:1) (Hu et al., 9 Apr 2024).
- Loss Curve Prediction: Multi-power law models accurately predict WSD-style loss curves and derive schedules slightly outperforming standard WSD or cosine decay (Luo et al., 17 Mar 2025).
- Certified Robustness: In IBP-based training, improved initialization and batch norm dramatically reduce warmup length, achieving state-of-the-art verified robustness with 20–60× fewer epochs (Shi et al., 2021).
- Linear Decay-to-Zero: When decaying learning rates all the way to zero (rather than to a fraction), compute savings up to 60% are observed for compute-optimal token regimes (Bergsma et al., 21 Feb 2025).
| Schedule Type | Final Loss (lower is better) | Typical Compute Efficiency | Stability (large batch) |
|---|---|---|---|
| Cosine (decay to 10%) | Medium | Moderate | Susceptible |
| WSD (plateau+decay) | Superior | High | Stable |
| D2Z (linear to zero) | Best (under high TPP) | Highest | High |
| WSM (merge strategy) | Best (across tasks) | High | High |
| SF (schedule-free) | Near WSD/D2Z | High | Best at scale |
4. Phase-Specific Mechanisms
Warmup Phase:
Early ramp-up controls sharpness by letting transient instability (“loss catapults”) reduce the largest Hessian eigenvalue (Kalra et al., 13 Jun 2024). This self-stabilization ensures deeper layers do not diverge, enabling robust use of higher target learning rates $\eta_{\max}$ without failure (Gotmare et al., 2018, Ma et al., 2019). In certified training, batch norm and tailored initialization can eliminate the need for extended warmup (Shi et al., 2021). For Adam, initializing the second moment with the first squared gradient (GI-Adam) yields built-in warmup (Kalra et al., 13 Jun 2024, Ma et al., 2019).
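As a rough, hedged illustration of the GI-Adam idea, the sketch below seeds Adam's second-moment estimate with the first squared gradient so that early updates are already normalized by a realistic gradient scale; the exact update rules in the cited work may differ in details such as bias correction:

```python
import numpy as np

def gi_adam(grad_fn, w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal Adam-style loop whose second moment starts at g_0^2 instead of 0."""
    m = np.zeros_like(w)
    v = grad_fn(w) ** 2            # gradient initialization: v_0 = g_0^2
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)                 # bias-correct m only; v is seeded
        w = w - lr * m_hat / (np.sqrt(v) + eps)
    return w

# Example: minimize the quadratic 0.5 * ||w||^2 (its gradient is w itself).
print(gi_adam(lambda w: w, w=np.array([5.0, -3.0]), steps=300))
```

Seeding $v$ at a realistic gradient scale tames the earliest preconditioned updates, which is the sense in which this acts like a built-in warmup.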
Stable Phase:
Holding the learning rate constant at $\eta_{\max}$ enables fast progress along low-curvature directions. In “river valley” landscapes, the optimizer covers ground quickly but with significant oscillations orthogonal to the main valley, hiding gains until decay (Wen et al., 7 Oct 2024, Liu et al., 6 Jul 2025).
Decay/Cooldown Phase:
A sharp drop in learning rate minimizes oscillations, consolidating gains from the plateau (Hu et al., 9 Apr 2024). The shape of this decay (linear, sqrt, cosine, or power-law) controls the bias-variance tradeoff; “sqrt” and “lowered linear” shapes with parameter $0.7$ outperform others (Dremov et al., 2 Aug 2025). Final model quality is also sensitive to cooldown tuning and to AdamW hyperparameters (higher settings are reported to help). Empirical loss landscape visualizations confirm “river valley” descent during cooldown.
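The decay shapes discussed here can be expressed as functions of the normalized cooldown progress $s \in [0, 1]$; the sketch below gives common parameterizations. The forms are illustrative (the exact “lowered linear” definition in the cited work may differ, and the $0.7$ exponent is used only as a default echoing the parameter mentioned above):

```python
import math

def cooldown(s: float, shape: str = "1-sqrt", p: float = 0.7) -> float:
    """Multiplier on eta_max as a function of cooldown progress s in [0, 1]."""
    if shape == "linear":       # straight line from 1 to 0
        return 1.0 - s
    if shape == "1-sqrt":       # concave shape: most of the drop happens early
        return 1.0 - math.sqrt(s)
    if shape == "cosine":       # half-cosine from 1 to 0
        return 0.5 * (1.0 + math.cos(math.pi * s))
    if shape == "power":        # power-law-style cooldown with exponent p
        return (1.0 - s) ** p
    raise ValueError(f"unknown cooldown shape: {shape}")

# Compare the shapes halfway through the cooldown:
for name in ("linear", "1-sqrt", "cosine", "power"):
    print(f"{name:8s} -> {cooldown(0.5, name):.3f}")
```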
5. Practical Recommendations and Optimizations
- Adopt linear or exponential warmup for stability (rule-of-thumb: roughly $2(1-\beta_2)^{-1}$ steps for Adam (Ma et al., 2019)).
- Prefer prolonged stable phases for large-scale training or when the total compute budget is not fixed in advance; checkpoint before entering decay. In continual pretraining, fully converged checkpoints yield better adaptation (Gupta et al., 2023).
- The decay phase should span roughly the final 10% of training, using shapes that empirically balance bias and variance (e.g., sqrt or $0.7$-lowered linear).
- In certified training, exploit improved initialization and BN to reduce warmup epochs (Shi et al., 2021).
- For efficiency, consider merging checkpoints post-training to emulate decay (WSM), yielding systematic improvements over scheduled decay (Tian et al., 23 Jul 2025); a minimal sketch follows this list.
- In scenarios with unpredictable or extendable compute budgets, WSD (and WSD-S (Wen et al., 7 Oct 2024)) is preferred over cosine—compute-agnostic, checkpoint-flexible, and robust.
- When gradient noise dominates (high tokens-per-parameter or small batch), decaying LR to near-zero (D2Z) is optimal (Bergsma et al., 21 Feb 2025).
- Schedule-free methods (SF-AdamW) remove decay phases entirely and match performance by implicit weight averaging, but may be sensitive to batch size and momentum hyperparameters (Song et al., 14 Jul 2025).
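As a companion to the WSM item above, here is a minimal sketch of post-hoc checkpoint merging via uniform weight averaging of stable-phase checkpoints; the actual merging rule and checkpoint selection in Tian et al. (23 Jul 2025) may differ, and the paths and filenames are illustrative:

```python
import torch

def merge_checkpoints(paths):
    """Uniformly average floating-point weights from several stable-phase
    checkpoints to emulate a decayed model without an explicit cooldown."""
    n = len(paths)
    merged = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() / n for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float() / n
    return merged

# Checkpoints saved periodically late in the stable phase (illustrative names):
# merged_state = merge_checkpoints(["ckpt_80k.pt", "ckpt_85k.pt", "ckpt_90k.pt"])
# model.load_state_dict(merged_state)
```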
6. Theory–Practice Synthesis and Future Directions
Recent mathematical models (functional scaling laws (Li et al., 23 Sep 2025), multi-power law (Luo et al., 17 Mar 2025), Lyapunov-based SGDM analysis (Kondo et al., 5 Aug 2025)) explicitly show that WSD schedules optimize not only final risk but entire loss curve dynamics by maximizing “intrinsic time” during the plateau and minimizing noise via decay. The formal link between checkpoint merging and decay (Tian et al., 23 Jul 2025) and the thermodynamic analogy to the Mpemba effect (a strong plateau accelerates the cooldown) (Liu et al., 6 Jul 2025) give principled guidelines for tuning.
Machine learning practitioners are advised to favor WSD-style schedules with adaptive warmup, extended plateau, optimized shape for cooldown, and checkpoint strategies where possible. This offers superior convergence, reduced compute, and robust stability across architectures, datasets, and scale regimes.
7. Controversies and Open Considerations
While WSD has become widely adopted, certain aspects remain empirically rather than theoretically justified:
- The optimal decay shape and plateau height are influenced by landscape geometry (local curvature, sharpness, and river directions) and may require empirical tuning (Liu et al., 6 Jul 2025, Dremov et al., 2 Aug 2025).
- Decay-free methods such as model merging (WSM) and schedule-free optimization (SF) are emerging as alternatives, potentially obviating explicit decay phases (Tian et al., 23 Jul 2025, Song et al., 14 Jul 2025).
- Automatic schedule discovery via loss curve surrogates (multi-power law, FSL) may supplant hand-tuned strategies in the future (Luo et al., 17 Mar 2025, Li et al., 23 Sep 2025).
In summary, the WSD schedule is a robust, theoretically and empirically justified strategy for large-scale deep learning, resolving early instability, accelerating “river” progress, and consolidating gains for superior final generalization. Its influence spans neural architecture, continual pretraining, certified robust optimization, and large language modeling, and continues to shape contemporary practice in learning rate schedule design.