Warmup-Stable-Decay LRS

Updated 8 January 2026
  • WSD LRS is a three-phase adaptive learning rate scheduler that comprises a linear warmup, a constant stable phase, and an annealed decay phase to enhance neural model training.
  • Its design is theoretically grounded in the river valley model, balancing bias and variance for effective exploration of complex loss landscapes.
  • Empirical benchmarks demonstrate that decay functions like square-root yield consistent performance gains, making WSD LRS a robust choice for scalable language models.

The Warmup-Stable-Decay (WSD) Learning Rate Scheduler is a three-phase adaptive learning rate strategy that has become foundational in the training of large-scale language models (LLMs), particularly transformers. WSD divides training into a brief warmup, a long stable phase at a constant learning rate, and a cooldown (decay) phase during which the learning rate is annealed to a low value. The structure, mathematical formulation, theoretical justification, practical implications, and empirical tradeoffs of WSD have been studied extensively in the recent literature, providing a unified perspective on scheduler design and optimization in deep learning.

1. Mathematical Structure and Phase Definition

WSD is defined via three contiguous training phases:

  • Warmup: Learning rate increases linearly from zero (or a small initial value) to a peak value $\eta_{\max}$ over $T_{\rm warm}$ steps,

$$\eta(t) = \eta_{\max}\,\frac{t}{T_{\rm warm}}, \qquad 0 \leq t < T_{\rm warm}$$

  • Stable: Learning rate is held constant,

$$\eta(t) = \eta_{\max}, \qquad T_{\rm warm} \leq t < T_{\rm stab}$$

  • Decay/Cooldown: Learning rate is annealed smoothly to zero or a minimum value,

$$\eta(t) = \eta_{\max}\, f\!\left(\frac{t - T_{\rm stab}}{T_{\rm tot} - T_{\rm stab}}\right), \qquad T_{\rm stab} \leq t \leq T_{\rm tot}$$

where $f(\cdot)$ is a monotonically decreasing decay function, such as linear, cosine, exponential, square-root, or custom power-law variants. This template supports both single-phase runs and more advanced multi-budget (“branching”) strategies (Dremov et al., 2 Aug 2025, Wen et al., 2024, Tian et al., 23 Jul 2025, Hu et al., 2024, Li et al., 23 Sep 2025).
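For concreteness, the piecewise schedule above can be written as a short Python function. This is a minimal sketch; the function name, step counts, and the default 1-sqrt decay are illustrative choices rather than a reference implementation.

```python
import math

def wsd_lr(step, max_lr, warmup_end, stable_end, total_steps,
           decay_fn=lambda tau: 1.0 - math.sqrt(tau)):
    """Piecewise WSD learning rate: linear warmup, constant plateau, annealed cooldown."""
    if step < warmup_end:
        # Warmup: linear ramp from 0 to max_lr over the first warmup_end steps.
        return max_lr * step / warmup_end
    if step < stable_end:
        # Stable: hold the peak learning rate constant.
        return max_lr
    # Decay: normalize progress through the cooldown window to tau in [0, 1] and apply f.
    tau = (step - stable_end) / (total_steps - stable_end)
    return max_lr * decay_fn(min(tau, 1.0))

# Illustrative 100k-step run: 1% warmup, plateau to step 90k, 10% cooldown with 1-sqrt decay.
schedule = [wsd_lr(t, 3e-4, 1_000, 90_000, 100_000) for t in range(100_000)]
```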

Phase Timing and Hyperparameters

Typical settings emphasize a minimal warmup fraction (0.5–2%), a long plateau (80–90%), and a final decay of 10–20% of total steps. The decay function can be selected from several families, with square-root and 1-sqrt laws often empirically dominating (Dremov et al., 2 Aug 2025, Tian et al., 23 Jul 2025, Hu et al., 2024).

| Phase | Interval | Typical Fraction | Main Role |
|---|---|---|---|
| Warmup | $[0, T_{\rm warm})$ | 0.5–2% | Stabilization |
| Stable | $[T_{\rm warm}, T_{\rm stab})$ | 80–90% | Exploration |
| Decay | $[T_{\rm stab}, T_{\rm tot}]$ | 10–20% | Exploitation / final convergence |

2. Theoretical Justification and Loss Landscape

The rationale for WSD arises from both empirical phenomena and formal analysis of optimization in ill-conditioned, anisotropic loss landscapes characteristic of modern LLM pretraining. The “river valley” theory posits that the loss surface comprises sharp “valley” directions (with large Hessian eigenvalues) and flat “river” directions (with small eigenvalues), yielding disparate timescales for optimization.

  • Stable phase (high constant LR): Large steps cause rapid diffusion and exploration along the wide river manifold, while valley directions equilibrate quickly. This phase accumulates most progress along the loss landscape’s principal directions.
  • Cooldown/decay phase: By reducing the learning rate, the stochastic variance in sharp (“hill”) directions contracts, focusing optimization tightly into the loss basin and enabling sharp drops in validation loss (Wen et al., 2024, Dremov et al., 2 Aug 2025, Liu et al., 6 Jul 2025).

This theoretical grounding is supported by SDE modeling and principal component visualizations that highlight how performance gains and loss drops are realized almost exclusively during the final decay, as the optimizer transitions from broad exploration to fine convergence (Wen et al., 2024, Dremov et al., 2 Aug 2025, Li et al., 23 Sep 2025).

3. Decay Functional Forms and Bias-Variance Tradeoff

The cooldown phase's effect is highly sensitive to the specific decay function. Common choices include:

  • Linear: $f_{\rm linear}(\tau) = 1-\tau$
  • Cosine: $f_{\rm cosine}(\tau) = \frac{1}{2}\left(1+\cos(\pi\tau)\right)$
  • Quadratic: $f_{\rm square}(\tau) = 1-\tau^2$
  • Square-root: $f_{\rm sqrt}(\tau) = 1-\sqrt{\tau}$
  • Exponential: $f_{\rm exp}(\tau) = \exp(-\alpha\tau)$

The choice of $f(\cdot)$ reflects a fundamental bias–variance tradeoff in the resulting model ensemble (Dremov et al., 2 Aug 2025):

  • Aggressive (high-variance, low-bias) shapes (e.g., mirror-cosine) yield a wide spread in final model quality, occasionally reaching better minima but with less reproducibility.
  • Gentle (high-bias, low-variance) shapes (e.g., linear) yield highly repeatable but potentially suboptimal solutions.
  • Intermediate shapes (notably square-root decay or lowered linear) balance these tendencies and deliver the lowest validation perplexity, achieving 1–2 point gains (2–3% loss improvement) relative to naive linear decay on large transformer training runs (Dremov et al., 2 Aug 2025, Tian et al., 23 Jul 2025).

Empirically, 1-sqrt or square-root decay schedules are preferred in modern fine-tuning and LLM pretraining setups (Tian et al., 23 Jul 2025).
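As a brief illustration, the decay shapes listed above can be expressed as functions of the normalized cooldown progress $\tau \in [0,1]$; the exponential rate $\alpha = 4$ below is an arbitrary illustrative value.

```python
import math

# Decay shapes f(tau), with tau in [0, 1] the normalized cooldown progress.
DECAY_SHAPES = {
    "linear":      lambda tau: 1.0 - tau,
    "cosine":      lambda tau: 0.5 * (1.0 + math.cos(math.pi * tau)),
    "quadratic":   lambda tau: 1.0 - tau ** 2,
    "1-sqrt":      lambda tau: 1.0 - math.sqrt(tau),  # listed as "square-root" above
    "exponential": lambda tau: math.exp(-4.0 * tau),  # alpha = 4.0 chosen arbitrarily
}
```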

4. Optimizer Interactions and Training Dynamics

The interplay between the learning rate schedule and optimizer hyperparameters becomes critical during cooldown. Specifically:

  • AdamW’s second-moment parameter $\beta_2$ is identified as a key driver of cooldown performance. Increasing $\beta_2$ (lengthening the EMA half-life) smooths gradient noise and can improve validation perplexity by up to 2 points. Retuning $\beta_1$ and $\beta_2$ together can move runs from worst to best within the same decay shape (Dremov et al., 2 Aug 2025); a minimal retuning sketch follows this list.
  • Batch size can be increased during cooldown, with mild gains, though matching the optimizer’s token half-life to batch dynamics may degrade final loss.
  • Weight decay has nuanced effects: resetting or removing it may harm the best-performing decay shapes but occasionally helps particularly high-variance ones.
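The following is a minimal sketch of retuning the AdamW moment coefficients when the cooldown begins, assuming a standard PyTorch optimizer; the stand-in model and the specific beta values are illustrative placeholders.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

def enter_cooldown(optimizer, new_betas=(0.9, 0.974)):
    """Lengthen the second-moment EMA half-life when the decay phase begins."""
    for group in optimizer.param_groups:
        group["betas"] = new_betas

# Called once, at the step where the stable phase ends and the cooldown starts:
# enter_cooldown(optimizer)
```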

Consecutive step cosine similarities, gradient norm shrinkage, and aligned directional derivatives are observed during decay, consistent with rapid “funneling” into locally convex minima and confirming the theoretical “river valley” model (Hu et al., 2024, Dremov et al., 2 Aug 2025).

5. Empirical Benchmarks and Scaling Law Implications

Comparative studies demonstrate WSD’s strong practical performance across data/model scales and learning regimes:

  • Benchmarks: On transformer models (e.g., MiniCPM) and standard tasks, WSD with well-chosen decay shapes matches or outperforms cosine and linear schedulers, and needs only about 10% of total steps for decay to realize the full final loss drop (Hu et al., 2024, Tian et al., 23 Jul 2025).
  • Scaling Laws: WSD’s structure is theoretically justified and empirically verified to facilitate compute/data/model scaling explorations. In particular, WSD enables $O(m)$ studies of the data–model law, where data-to-model ratios an order of magnitude higher than the Chinchilla prescription are found optimal under WSD (Hu et al., 2024, Li et al., 23 Sep 2025).
  • Continual/Domain Adaptation: WSD and its variants (e.g., WSD-S) support continual training and modular checkpoint usage across multiple budgets, outperforming cyclic or cosine-rewarm methods and simplifying code (Wen et al., 2024).

6. Model Merging and Recent Alternatives

Recent work establishes a formal equivalence between WSD’s decay phase and model averaging (merging) methods. Instead of decaying the learning rate online, one may maintain a constant learning rate and retrospectively form convex combinations of model checkpoints, assigning weights that emulate any target decay law (Tian et al., 23 Jul 2025):

$$\hat\theta_{n+k} = \sum_{j=0}^{k} c_j\,\theta_{n+j}$$

where the sequence $w_i = \sum_{j=i}^{k} c_j$ corresponds to the effective learning rate profile. The WSM (Warmup-Stable and Merge) framework outperforms WSD baselines by focusing optimization on the merge duration, and the evidence suggests that the critical hyperparameter is the merging window rather than the decay law per se (Tian et al., 23 Jul 2025).
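A minimal sketch of this retrospective merging is shown below, assuming checkpoints saved as PyTorch state dicts; the uniform coefficients in the commented usage line are one illustrative choice of $c_j$, and other weightings emulate other effective decay laws.

```python
import torch

def merge_checkpoints(state_dicts, coeffs):
    """Form theta_hat = sum_j c_j * theta_j over a window of checkpoints."""
    assert abs(sum(coeffs) - 1.0) < 1e-6, "merge coefficients should sum to 1"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return merged

# Example with hypothetical checkpoint paths: uniform averaging over the last four
# stable-phase checkpoints.
# merged = merge_checkpoints([torch.load(p) for p in checkpoint_paths], [0.25] * 4)
```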

7. Implementation, Best Practices, and Practical Guidelines

Integrating WSD into PyTorch or similar training frameworks requires recomputing $\eta(t)$ at each step, prior to the optimizer step; a minimal integration sketch is given after the hyperparameter table below. Key best practices include (Tian et al., 23 Jul 2025, Dremov et al., 2 Aug 2025):

  • Warm up for a small fraction of total steps (0.5–2%), sufficient for gradient stabilization.
  • Hold ηmax\eta_{\max} until late training, devoting 10–20% of tokens to cooldown.
  • Deploy square-root or 1-sqrt decay as the default; these yield robust bias–variance balance.
  • Tune AdamW’s $\beta_2$ higher (e.g., $0.95^{0.5} \approx 0.974$) in cooldown.
  • For fine-tuning, use a short decay over 5–10% of new steps with aggressive annealing.
  • If training must be extended post-decay, revert to the last stable checkpoint or switch to decay-free/merging schedulers for continuous operation.
| Hyperparameter | Typical Value / Range | Role |
|---|---|---|
| Warmup length | 0.5–2% of steps | Early stabilization |
| Stable phase | 80–90% of steps | Exploration, progress |
| Decay/cooldown | 10–20% of steps | Convergence, exploitation |
| AdamW $\beta_2$ | $0.95^{0.5} \approx 0.974$ | Stabilize variance |
| Decay shape | 1-sqrt or square-root | Bias–variance tradeoff |
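A minimal PyTorch integration sketch using `torch.optim.lr_scheduler.LambdaLR`, following the phase fractions in the table above; the stand-in model, step counts, and base learning rate are illustrative placeholders.

```python
import math
import torch

total_steps = 100_000
warmup_end = int(0.01 * total_steps)   # ~1% warmup
stable_end = int(0.90 * total_steps)   # final 10% reserved for cooldown

def wsd_factor(step):
    """Multiplicative factor applied to the base learning rate at each optimizer step."""
    if step < warmup_end:
        return step / warmup_end                      # linear warmup
    if step < stable_end:
        return 1.0                                    # stable plateau
    tau = (step - stable_end) / (total_steps - stable_end)
    return 1.0 - math.sqrt(min(tau, 1.0))             # 1-sqrt cooldown

model = torch.nn.Linear(512, 512)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=wsd_factor)

# Per-step loop skeleton: eta(t) is recomputed at every call to scheduler.step().
# for step, batch in enumerate(loader):
#     loss = compute_loss(model, batch)
#     loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```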

The warmup–stable–decay paradigm, with its explicit separation of optimization roles by subphase, remains foundational. Recent advances in model merging (WSM) and river valley landscape theory inform nuanced adaptations and underline WSD’s ongoing central role in the calibration of high-performance LLMs (Dremov et al., 2 Aug 2025, Wen et al., 2024, Li et al., 23 Sep 2025, Liu et al., 6 Jul 2025, Hu et al., 2024, Tian et al., 23 Jul 2025).
