Warmup-Stable-Decay (WSD) Paradigm
- WSD is a learning rate scheduling paradigm that segments training into warmup, stable, and decay phases to improve model convergence and stability.
- It employs a gradual warmup to prevent early instability, a prolonged stable phase to maximize effective learning, and a rapid decay phase to suppress gradient noise.
- Empirical and theoretical analyses show that WSD outperforms cosine decay by enhancing scaling law efficiency and optimizing the bias–variance trade-off.
Warmup-Stable-Decay (WSD) is a learning rate scheduling paradigm for optimizing deep networks—including LLMs, vision transformers, and kernel regression systems—structured into three sequential phases: a warmup interval during which the learning rate increases from a low (often zero) value to a target peak; a stable phase maintaining the maximum learning rate for most steps; and a final decay or cooldown phase where the learning rate is annealed, often sharply, to facilitate convergence. WSD has emerged as a robust, theoretically justified alternative to schedules such as cosine decay, and is empirically shown to improve convergence dynamics, stability, loss minimization, and scaling law efficiency in both data- and compute-limited regimes.
1. Mechanics and Theoretical Rationale
The essential mechanics of WSD rely on the partitioning of training into warmup, stable, and decay periods, summarized below and sketched in code after the list:
- Warmup phase: The learning rate is ramped from zero or a small initial value to its target peak $\eta_{\max}$ over a predetermined interval of $T_{\mathrm{warm}}$ steps, typically using a linear schedule, $\eta_t = \eta_{\max}\, t / T_{\mathrm{warm}}$ for $t \le T_{\mathrm{warm}}$. This modulates the effective step size and prevents large, unstable parameter updates—particularly in sensitive deeper layers (Gotmare et al., 2018, Kalra et al., 13 Jun 2024, Alimisis et al., 3 Oct 2025).
- Stable phase: Maintains the peak learning rate for the majority of training to maximize “intrinsic time” (cumulative effective learning progress), enhancing fast progress along low-curvature (“river”) directions of the loss landscape (Wen et al., 7 Oct 2024, Li et al., 23 Sep 2025). The choices of plateau height, initial warmup duration, and stable-phase length are critical; analytical conditions for the optimal plateau (the strong Mpemba point) exist for the valley–river model (Liu et al., 6 Jul 2025).
- Decay (cooldown) phase: Initiates a rapid reduction in learning rate (“annealing”)—often exponential, linear, or inverse-time shaped—at a scheduled step to eliminate oscillations along “sharp” or “mountain” directions, suppress gradient noise, and descend into a well-conditioned basin of the loss surface (Wen et al., 7 Oct 2024, Dremov et al., 2 Aug 2025).
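As a concrete reference for the three phases, the following is a minimal sketch of a WSD schedule as a pure function of the step index, assuming linear warmup and a linear cooldown; the peak learning rate and phase fractions are illustrative placeholders, not values prescribed by the cited works.

```python
def wsd_lr(step: int,
           total_steps: int,
           peak_lr: float = 3e-4,      # illustrative peak, not a prescribed value
           warmup_frac: float = 0.01,  # warmup length as a fraction of training
           decay_frac: float = 0.10) -> float:
    """Piecewise WSD schedule: linear warmup -> constant plateau -> linear cooldown."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly from zero to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate for the bulk of training.
        return peak_lr
    # Decay (cooldown): anneal linearly from the peak toward zero.
    progress = (step - decay_start) / decay_steps
    return peak_lr * max(0.0, 1.0 - progress)
```

In practice the function would be queried once per optimizer step, for example to set the learning rate of each parameter group.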
The theoretical support for WSD leverages several lines of analysis:
- Smoothness and Convergence Theory: A generalized smoothness condition of the form $\|\nabla^2 f(x)\| \le L_0 + L_1\,(f(x) - f^\star)$ ties local curvature, and hence the admissible step size, to sub-optimality: early in training the loss is high, curvature is large, and only small steps are safe, while the safe step size grows as the loss falls. This motivates a schedule that ramps the effective step size up through the early (high-loss) regime (Alimisis et al., 3 Oct 2025).
- Loss Landscape Conditioning: Warmup triggers “catapult” events that forcibly reduce sharpness (top Hessian eigenvalue) via transient loss spikes, guiding optimization into flatter regions and enabling safe, high learning rates subsequently (Kalra et al., 13 Jun 2024).
- Functional Scaling Laws (FSL): The FSL framework models risk as a sum of an approximation error, a full-batch progress term that improves monotonically with the “intrinsic time” $\tau_t = \sum_{s \le t} \eta_s$, and a convolution-type noise term; a prolonged stable high-LR phase extends $\tau$, while the decay phase cleans up accumulated noise (Li et al., 23 Sep 2025). A small numerical illustration of intrinsic time follows this list.
- Mpemba Effect in Valley–River Model: Starting with a “hot” (high-LR) plateau then quenching (decaying) the LR yields faster convergence (“strong Mpemba point”), explained by analytical conditions on the amplitude of slowest relaxation modes in the Fokker–Planck operator (Liu et al., 6 Jul 2025).
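As a rough numerical illustration of the intrinsic-time argument referenced in the FSL bullet above, the snippet below compares the cumulative learning rate of a WSD schedule against a cosine schedule at the same peak and step budget. Equating intrinsic time with the cumulative learning rate is a simplification of the FSL quantity, and all constants here are assumptions chosen for illustration.

```python
import math

PEAK, TOTAL, WARMUP, DECAY = 3e-4, 10_000, 100, 1_000  # illustrative constants

def wsd(step: int) -> float:
    """Linear warmup, constant plateau, linear cooldown."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    if step < TOTAL - DECAY:
        return PEAK
    return PEAK * (TOTAL - step) / DECAY

def cosine(step: int) -> float:
    """Cosine decay from PEAK to zero over the same budget."""
    return 0.5 * PEAK * (1 + math.cos(math.pi * step / TOTAL))

tau_wsd = sum(wsd(s) for s in range(TOTAL))
tau_cos = sum(cosine(s) for s in range(TOTAL))
print(f"cumulative LR  WSD: {tau_wsd:.3f}   cosine: {tau_cos:.3f}")
# The long plateau gives WSD a substantially larger cumulative-LR budget; in the
# FSL picture this extends progress along the slow directions, while the short
# cooldown removes the noise accumulated at the high learning rate.
```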
2. Empirical Findings and Training Dynamics
Extensive empirical studies confirm several characteristic features and advantages of WSD schedules:
- Loss Evolution: During the stable phase, validation loss remains elevated due to oscillatory behavior; the decay phase induces a sharp drop, revealing underlying progress made along flat directions (Wen et al., 7 Oct 2024, Dremov et al., 2 Aug 2025).
- Layerwise Stability: CCA (Canonical Correlation Analysis) shows warmup especially stabilizes deeper, discriminative layers. Freezing final layers mimics the effect of warmup, confirming its protective role against instability in high-LR, large-batch regimes (Gotmare et al., 2018).
- Scaling Law Efficiency: WSD schedules facilitate efficient empirical measurement of data-model scaling laws, with MiniCPM experiments identifying a much higher compute-optimal data-to-model ratio (192×) than previous “Chinchilla Optimal” estimates (20×), enabling more data-efficient training of small models (Hu et al., 9 Apr 2024).
- Bias–Variance Trade-off: The cooldown phase shape mediates the bias–variance balance—shapes that keep the learning rate higher for longer during cooldown provide more exploration (higher variance, lower bias, better model souping), whereas sharper decays suppress variance but risk settling in suboptimal minima (Dremov et al., 2 Aug 2025).
- Optimizer Interactions: Untuned linear or exponential warmup for Adam-based optimizers is empirically validated to match or outperform more algorithmically complex approaches (e.g., RAdam), and GI-Adam initialization can further automate warmup benefits (Ma et al., 2019, Kalra et al., 13 Jun 2024).
3. Practical Implementation and Schedule Selection
Implementation of WSD schedules typically follows these phases:
| Phase | Learning Rate Rule | Objective |
|---|---|---|
| Warmup | Linear or exponential | Stability, safe ramp-up |
| Stable | Constant | Fast progress, "explore" |
| Decay | Linear, exponential, or sqrt/inv-time | Fine-tuning, bias-variance balance |
- Warmup duration is frequently set by “rule-of-thumb” formulas (e.g., untuned linear warmup over roughly $2/(1-\beta_2)$ steps for Adam, or exponential warmup with the matching time constant $(1-\beta_2)^{-1}$) (Ma et al., 2019); a schedule-construction sketch follows this list.
- The stable phase dominates runtime (often 80–90% of steps) and should not be truncated prematurely (Hu et al., 9 Apr 2024, Li et al., 23 Sep 2025).
- Decay phase length and shape are critical; “sqrt” or “lowered linear 0.7” shapes best balance bias and variance, and higher AdamW $\beta_2$ values during cooldown increase stability (Dremov et al., 2 Aug 2025).
- Practical recommendations include checkpointing at stable-LR, then branching into decay (WSD-S scheme), allowing flexible adaptation to different compute budgets and facilitating continual or domain-adaptive training (Wen et al., 7 Oct 2024).
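A hedged sketch of how the table and rules of thumb above could be combined into a single schedule builder is shown below. The $2/(1-\beta_2)$ warmup heuristic follows the rule of thumb cited above, while the exact functional forms of the “sqrt” and inverse-time cooldown shapes, the function names, and the defaults are assumptions for illustration and may differ from the cited papers' definitions.

```python
import math

def untuned_warmup_steps(beta2: float = 0.999) -> int:
    """Rule-of-thumb linear warmup length for Adam-style optimizers: about 2 / (1 - beta2)."""
    return int(2 / (1 - beta2))

def wsd_with_shape(step: int, total_steps: int, peak_lr: float,
                   warmup_steps: int, decay_steps: int,
                   decay_shape: str = "linear") -> float:
    """WSD schedule with a configurable cooldown shape (linear, sqrt-style, or inverse-time)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                      # warmup: linear ramp
        return peak_lr * step / warmup_steps
    if step < decay_start:                       # stable plateau
        return peak_lr
    p = (step - decay_start) / decay_steps       # cooldown progress in [0, 1)
    if decay_shape == "linear":                  # straight line to zero
        return peak_lr * (1 - p)
    if decay_shape == "sqrt":                    # "1 - sqrt" style: gentle early, steep late
        return peak_lr * (1 - math.sqrt(p))
    if decay_shape == "inv_time":                # inverse-time style tail (ends at peak_lr / 10)
        return peak_lr / (1 + 9 * p)
    raise ValueError(f"unknown decay_shape: {decay_shape}")

# Example: a 100k-step run with ~2k warmup steps (beta2 = 0.999) and a 10% sqrt cooldown.
w = untuned_warmup_steps(0.999)                  # -> 2000
lr = wsd_with_shape(step=95_000, total_steps=100_000, peak_lr=3e-4,
                    warmup_steps=w, decay_steps=10_000, decay_shape="sqrt")
```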
4. Advanced Variants, Alternatives, and Model Averaging
Recent developments question the necessity of an explicit online decay phase:
- WSM: Warmup-Stable-Merge (Tian et al., 23 Jul 2025): Here, model merging of periodically checkpointed states after stable-phase training can emulate decay strategies as weighted averages of model weights, $\theta_{\text{merged}} = \sum_i w_i\, \theta_{t_i}$ with $\sum_i w_i = 1$, with the weights $w_i$ designed to match gradient decay sequences (a minimal merging sketch follows this list). Extensive benchmarking shows WSM delivers higher performance than online decay schedules (WSD), particularly when merge duration is optimized.
- Linear Decay to Zero (D2Z) (Bergsma et al., 21 Feb 2025): Empirical and theoretical analysis (via AdamW’s EMA interpretation) reveals that linear decay to zero outperforms partial decay (e.g., cosine to 10%), offering improved bias reduction and variance suppression; D2Z provides compute savings over standard schedules, especially at high tokens-per-parameter.
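To make WSM-style merging concrete, here is a minimal sketch of weighted checkpoint averaging over NumPy parameter dictionaries. The checkpoint format, the uniform default weights, and the function name are assumptions; in practice the weight sequence would be chosen to emulate a target decay schedule, as described above.

```python
from typing import Dict, List, Optional
import numpy as np

Checkpoint = Dict[str, np.ndarray]  # parameter name -> weight array

def merge_checkpoints(checkpoints: List[Checkpoint],
                      weights: Optional[List[float]] = None) -> Checkpoint:
    """Weighted average of stable-phase checkpoints: theta_merged = sum_i w_i * theta_i."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)  # uniform merge
    total = sum(weights)
    weights = [w / total for w in weights]                     # normalize so the w_i sum to 1
    merged: Checkpoint = {}
    for name in checkpoints[0]:
        merged[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return merged

# Example: merge the last three stable-phase checkpoints with increasing weights,
# loosely mimicking the tail of an annealing schedule (the weights are illustrative).
# merged = merge_checkpoints([ckpt_a, ckpt_b, ckpt_c], weights=[0.2, 0.3, 0.5])
```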
5. Interactions with Scaling, Initialization, and Feature Alignment
- Weight Decay vs. Maximal Update Parameterization (μP) (Kosson et al., 21 Oct 2025): Independent weight decay, rather than μP learning rate rules, governs stable feature update dynamics across widths after early training; μP effectively serves as an implicit warmup mechanism, with the main stabilizing factor being the product $\eta\lambda$ of learning rate and weight-decay coefficient.
- Metric-Based Update Control (Kosson et al., 31 Oct 2024): Monitoring the update $\ell_2$-norm, angular change, and Relative Representation Change (RRC) enables adaptive warmup and reduces fixed warmup requirements (sketched below). Optimizer modifications (e.g., LionA/LionAR) can normalize updates such that angular and RRC metrics are controlled, further reducing early training instabilities.
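A minimal sketch, assuming NumPy weight snapshots, of the first two update metrics named above: the relative $\ell_2$ update norm and the angular change between consecutive weight vectors. Computing the Relative Representation Change would additionally require layer activations on a probe batch and is omitted; the function name and example values are illustrative assumptions.

```python
import numpy as np

def update_metrics(w_prev: np.ndarray, w_next: np.ndarray, eps: float = 1e-12):
    """Relative l2 update norm and angular change (radians) between weight snapshots."""
    delta = w_next - w_prev
    rel_norm = np.linalg.norm(delta) / (np.linalg.norm(w_prev) + eps)
    cos = np.dot(w_prev.ravel(), w_next.ravel()) / (
        np.linalg.norm(w_prev) * np.linalg.norm(w_next) + eps)
    angle = float(np.arccos(np.clip(cos, -1.0, 1.0)))
    return rel_norm, angle

# Large relative norms or angles early in training would signal that a longer
# warmup (or a smaller step) is still needed for this layer.
w0 = np.random.randn(1024)
w1 = w0 - 1e-3 * np.random.randn(1024)   # stand-in for one optimizer update
print(update_metrics(w0, w1))
```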
6. Loss Landscape and Scaling Law Perspectives
- River Valley Landscape (Wen et al., 7 Oct 2024, Liu et al., 6 Jul 2025, Dremov et al., 2 Aug 2025): The “river valley” metaphor models loss surfaces as valleys with steep sides (fast modes) and a flat river (slow global modes). WSD’s stable phase maximizes progression along the river, while the decay phase reduces oscillations and ensures effective descent into the basin. Mathematically, SGD dynamics can be decomposed into “river” (primary progress) and “hill” (oscillation due to large LR and noise) components; a toy simulation of this decomposition follows this list.
- Functional Scaling Law Analysis (Li et al., 23 Sep 2025): The optimal schedule balances intrinsic time (by maximizing stable LR period) versus final risk minimization (by aggressive noise suppression during decay). Quantitative risk decomposition via FSL provides explicit guidelines for hyperparameter settings, with rapid decay phases dominating final error terms.
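The toy simulation below (referenced in the river-valley bullet above) illustrates the decomposition on a deliberately simple 2D surrogate: a flat “river” direction x and a steep “hill” direction y, with Gaussian gradient noise standing in for SGD stochasticity. The quadratic loss, noise scale, and schedule constants are all assumptions chosen for illustration, not a model taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = 0.1, 100.0                         # flat "river" (x) vs steep "hill" (y) curvature

def loss(x, y):
    return 0.5 * (A * x**2 + B * y**2)

def grad(x, y):
    return np.array([A * x, B * y])

theta = np.array([10.0, 1.0])             # start off-center in the valley, far along the river
steps, decay_start, peak_lr = 5_000, 4_000, 1.5e-2
history = []
for t in range(steps):
    # Plateau at peak_lr, then linear cooldown (warmup is omitted for this quadratic toy).
    lr = peak_lr if t < decay_start else peak_lr * (steps - t) / (steps - decay_start)
    noise = rng.normal(0.0, 1.0, size=2)  # crude stand-in for minibatch gradient noise
    theta = theta - lr * (grad(*theta) + noise)
    history.append(loss(*theta))

print(f"mean loss, late stable phase: {np.mean(history[decay_start - 200:decay_start]):.4f}")
print(f"mean loss, end of cooldown:   {np.mean(history[-200:]):.4f}")
# At the plateau LR the steep y-direction keeps bouncing across the valley and holds the
# loss up, while x drifts slowly along the river; the cooldown damps the oscillation and
# the loss drops sharply, mirroring the WSD loss curves described above.
```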
7. Practical Recommendations and Outlook
- Optimal training with WSD requires deliberate hyperparameter choices, though the schedule is robust rather than fragile; the plateau height, for instance, can be chosen via analytical conditions such as the strong Mpemba point.
- Most training should reside in the stable phase for efficiency; the decay phase should be tuned (typically 10–20% of total steps, shape depending on bias–variance trade-off); model merging and interpolative checkpointing offer new avenues for decay-free finishing (Tian et al., 23 Jul 2025).
- Theoretical and empirical analyses affirm WSD’s superiority over constant learning rate, pure decay, or cosine schedules for LLM pretraining, scaling law estimation, and transfer across model sizes.
WSD thus represents both a practical and theoretically grounded framework for large-scale model optimization, incorporating insights from modern loss landscape analysis, SGD stochasticity, scaling law efficiency, and advanced techniques in model averaging and optimizer design.