Warmup–Stable–Decay (WSD) LR Schedule
- WSD is a three-phase learning rate schedule that segments training into a warmup phase to prevent divergence, a stable plateau to maximize learning, and a decay phase to fine-tune convergence.
- Its formulation enables precise control over warmup, plateau, and decay durations, leveraging high-dimensional loss landscape geometry and scaling laws for optimal performance.
- Empirical studies show that WSD outperforms traditional cosine or direct-decay schedules in compute efficiency, loss convergence, and flexible checkpoint management.
The Warmup–Stable–Decay (WSD) Pattern is a three-phase learning rate (LR) scheduling paradigm that has become fundamental in large-scale training of deep models, especially LLMs and transformers. It decomposes the learning rate trajectory into (1) a warmup phase to suppress divergence, (2) a sustained plateau (“stable”) at the peak learning rate, and (3) a cooldown or decay phase that anneals the LR toward zero or a small floor. WSD outperforms traditional cosine or direct-decay schedules in compute/data efficiency, loss convergence, and practical flexibility. Its mechanistic foundations are now understood: WSD exploits the geometric structure of high-dimensional loss surfaces (“valley–river” landscapes), formal scaling-law optimality, and even statistical physics analogies such as the Mpemba effect.
1. Mathematical Formulation and Variants
The canonical WSD learning rate schedule with total steps $T$ is defined by three parameters: the number of warmup steps $T_w$, plateau (“stable”) steps $T_s$, and decay steps $T_d$, with $T_w + T_s + T_d = T$. Let $\eta_{\max}$ denote the plateau learning rate and $\eta_{\min}$ the terminal value. The schedule is

$$\eta(t) = \begin{cases} \eta_{\max}\, t / T_w, & 0 \le t < T_w \\ \eta_{\max}, & T_w \le t < T_w + T_s \\ \eta_{\min} + (\eta_{\max} - \eta_{\min})\, f\!\left(\dfrac{t - T_w - T_s}{T_d}\right), & T_w + T_s \le t \le T \end{cases}$$

with $f$ a monotonically decreasing function satisfying $f(0) = 1$ and $f(1) = 0$ (or a small floor), common choices being exponential, cosine, or power/polynomial decay (Hu et al., 2024, Song et al., 14 Jul 2025, Belloni et al., 13 Jan 2026).
Tabulated examples of decay functions:

| Decay type | $f(u)$ | Typical behavior |
|-----------------|------------------------------|---------------------------------------|
| Linear decay | $1-u$ | Uniform decrease to zero |
| Cosine decay | $\tfrac{1}{2}(1+\cos \pi u)$ | Slow – fast – slow; smooth tail-off |
| Exponential | $e^{-cu}$ | Fastest at start of decay |
| Power | $(1-u)^p$ | Tunable shape (exponent $p$ set for optimality) |
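The piecewise schedule above can be sketched in a few lines of Python. The phase fractions, peak LR, and exponential rate constant below are illustrative defaults chosen for the sketch, not values prescribed by the cited papers:

```python
import math

def wsd_lr(t, total, warmup_frac=0.05, decay_frac=0.1,
           eta_max=3e-4, eta_min=0.0, decay="cosine"):
    """WSD learning rate at step t (0 <= t < total); fractions are illustrative."""
    t_w = int(warmup_frac * total)          # warmup steps
    t_d = int(decay_frac * total)           # decay steps
    t_s = total - t_w - t_d                 # stable (plateau) steps
    if t < t_w:                             # linear ramp up to eta_max
        return eta_max * (t + 1) / t_w
    if t < t_w + t_s:                       # constant plateau
        return eta_max
    u = (t - t_w - t_s) / t_d               # progress through decay, in [0, 1)
    if decay == "linear":
        f = 1.0 - u
    elif decay == "cosine":
        f = 0.5 * (1.0 + math.cos(math.pi * u))
    elif decay == "exponential":
        f = math.exp(-5.0 * u)              # rate constant chosen arbitrarily
    else:                                   # power decay with tunable exponent
        f = (1.0 - u) ** 2
    return eta_min + (eta_max - eta_min) * f
```

With `total=10000` the schedule ramps over the first 500 steps, holds `eta_max` through step 9000, then anneals over the final 1000 steps.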
2. Theoretical Foundations and Scaling Laws
WSD’s three-phase structure is motivated by functional scaling law (FSL) theory and convex/nonconvex optimization analysis. Under the FSL framework, for teacher–student kernel regression and SGD with data/model power-law spectra, the optimal LR schedule for hard tasks (large capacity exponent relative to source exponent) is WSD: a long plateau at the largest stable learning rate, followed by a short, vanishing polynomial decay whose exponent is set by the spectral exponents (Li et al., 6 Feb 2026, Li et al., 23 Sep 2025, Luo et al., 17 Mar 2025).
Key results:
- The “stable” phase builds up intrinsic time (aggregate step magnitude) required for learning the signal.
- By delaying decay, one concentrates all loss-reduction into the final phase, which the SGD forgetting kernel renders effective for noise suppression.
- The decay window may shrink sublinearly in the total budget $T$, so for large $T$, almost all steps are spent at peak LR (Li et al., 6 Feb 2026).
- Empirically, losses predicted by FSL for WSD match actual LLM pretraining loss curves to within small absolute loss units (Li et al., 23 Sep 2025).
For norm-constrained optimizers and optimization under suboptimality-dependent smoothness, WSD arises naturally: the step size transitions from a ramp (“warmup”), through a plateau (peak), and finally decays as $1/t$ once the suboptimality gap is small (Riabinin et al., 5 Feb 2026).
3. Geometric and Dynamical Mechanisms: The Valley–River Model
Modern analyses interpret WSD through the “valley–river” or “river valley” landscape, where:
- Optimization trajectories decompose into stiff “valley” (hill) directions with large Hessian eigenvalues and flat “river” directions with near-zero curvature; the Hessian spectrum splits into a few large outliers and a bulk near zero.
- SGD/Langevin dynamics equilibrate quickly in the stiff valley directions, while slow progress and heavy stochastic oscillations persist in the river directions at high LR (Liu et al., 6 Jul 2025, Wen et al., 2024).
- During the stable phase, the iterate oscillates above the true valley floor but moves rapidly downstream along the river. When the LR decays, these oscillations collapse and the parameters fall onto the valley floor, causing a pronounced loss drop (the “loss cliff” phenomenon).
The “Mpemba effect” analogy further clarifies the role of the strong plateau: like a hotter system cooling faster when quenched, choosing the “strong Mpemba point” maximally accelerates convergence during decay by canceling the slowest mode (Liu et al., 6 Jul 2025).
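The valley–river mechanism can be illustrated with a minimal two-dimensional toy: a stiff quadratic “hill” direction `x` plus a gently sloped “river” direction `y`. All constants (curvature, slope, learning rates, noise scale) are arbitrary choices for the demonstration, not values from the cited papers. At the high LR the iterate oscillates in `x` while drifting along `y`; halving into the low-LR phase collapses the oscillation, producing the loss cliff:

```python
import random

def run(steps=3000, eta_high=0.18, eta_low=0.018, noise=1.0, seed=0):
    """Noisy GD on L(x, y) = 0.5*k*x**2 + s*y: k is the stiff hill
    curvature, s the shallow river slope. Returns the loss trajectory."""
    rng = random.Random(seed)
    k, s = 10.0, 0.05
    x, y = 1.0, 0.0
    losses = []
    for t in range(steps):
        eta = eta_high if t < steps // 2 else eta_low   # decay at midpoint
        gx = k * x + noise * rng.gauss(0, 1)            # noisy hill gradient
        gy = s                                          # deterministic river drift
        x -= eta * gx
        y -= eta * gy
        losses.append(0.5 * k * x * x + s * y)
    return losses
```

During the high-LR half, the hill coordinate sits in a noisy stationary oscillation (its loss contribution hovers well above zero) while the river term falls slowly and linearly; right after the LR drop, the oscillation variance shrinks by roughly `(eta_low/eta_high)**2` and the measured loss steps down sharply.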
4. Practical Structure, Tuning, and Empirical Observations
WSD schedules in practice use:
- Warmup: a small initial fraction of steps, with a linear or polynomial ramp to $\eta_{\max}$, to avoid divergence and let optimizer statistics (e.g., AdamW’s moment estimates) stabilize.
- Stable phase: the majority of steps at constant $\eta_{\max}$. The peak learning rate can be chosen by estimating the extreme Hessian eigenvalues (Lanczos/PCA) and setting $\eta_{\max}$ just below the stability threshold (Liu et al., 6 Jul 2025).
- Decay: the final fraction of steps; the schedule may be cosine, polynomial, or exponential, decaying to a small floor or to zero (Hu et al., 2024, Song et al., 14 Jul 2025).
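The curvature-based choice of $\eta_{\max}$ can be sketched with power iteration on Hessian-vector products, a simpler stand-in for the Lanczos procedure the papers use. The $2/\lambda_{\max}$ bound applied below is the classical gradient-descent stability threshold for a quadratic, used here as an illustrative rule, not the cited papers’ exact prescription:

```python
import random

def top_eigenvalue(hvp, dim, iters=100):
    """Estimate the largest (positive) Hessian eigenvalue via power
    iteration, given a Hessian-vector-product callable hvp."""
    rng = random.Random(0)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    lam = 0.0
    for _ in range(iters):
        w = hvp(v)
        lam = sum(wi * wi for wi in w) ** 0.5   # ||Hv|| -> lambda_max
        v = [wi / lam for wi in w]              # renormalize
    return lam

def peak_lr(hvp, dim, safety=0.5):
    """Set eta_max a safety factor below the 2/lambda_max stability bound."""
    return safety * 2.0 / top_eigenvalue(hvp, dim)
```

For a toy diagonal Hessian `diag(10, 1, 0.1)`, the estimate converges to 10 and `peak_lr` returns 0.1 with the default safety factor.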
Characteristics:
- The loss remains nearly flat during the plateau, with most actual progress in the river (low-curvature) direction being hidden by hill-direction variance (Dremov et al., 2 Aug 2025, Wen et al., 2024).
- Upon entering decay, the loss drops sharply. Decay shape and duration significantly influence final perplexity; square-root and “lowered linear” shapes are reported as optimal for the bias–variance trade-off (Dremov et al., 2 Aug 2025).
- The phase fractions are robust across scales and models (Belloni et al., 13 Jan 2026).
Optimizing batch size for WSD: the $E(S)$ theory reveals the existence of a minimum batch size $B_{\min}$ (below which no progress is made) and an optimal batch size $B_{\mathrm{opt}}$ (minimizing total token consumption), both increasing as training proceeds (Zhou et al., 8 Jan 2026).
5. Applications and Extensions: Scaling, Domain Adaptation, and Checkpoint Management
WSD’s “forkable” architecture enables practical flexibility:
- Continuous training: Any plateau checkpoint can be decayed later, obviating the need for schedules tied to total compute a priori (Wen et al., 2024, Hu et al., 2024).
- Domain adaptation: Mixing domain-relevant data in decay improves downstream generalization relative to post-hoc fine-tuning (Hu et al., 2024).
- Scaling laws: training a single long plateau per model size and decaying from multiple token levels covers a grid of model/data pairs efficiently, permitting data–model scaling-law measurement with a number of runs linear, rather than multiplicative, in the grid dimensions (Hu et al., 2024).
- WSD-S: Single-branch variants that sequence successive decays for multiple budgets, matching per-budget tuned cosine baselines (Wen et al., 2024).
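The compute savings from forking decays off one plateau can be made concrete with a toy cost accounting. The token budgets (1, 2, …, n units) and the assumption that each decay branch costs a fixed fraction of its budget are simplifications for illustration, not the papers’ exact bookkeeping:

```python
def scaling_law_cost(m_models, n_data_points, decay_frac=0.1):
    """Relative training cost of measuring an (m x n) model/data grid.

    Cosine: every grid point needs a separate full run to its budget.
    WSD: per model, one stable run to the largest budget, plus n short
    decay branches forked from plateau checkpoints.
    """
    budgets = list(range(1, n_data_points + 1))   # token budgets in units
    cosine = m_models * sum(budgets)              # m*n independent full runs
    wsd = m_models * (max(budgets) + decay_frac * sum(budgets))
    return cosine, wsd
```

For a 3-model, 5-budget grid this gives 45 cost units for per-point cosine runs versus 19.5 for WSD forking, and the gap widens as the grid grows.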
Recent developments in checkpoint-averaging (WSM) reinterpret the decay phase as model merging, demonstrating that any desired decay can be emulated by merging a window of stable-phase checkpoints, improving both performance and generalization (Tian et al., 23 Jul 2025).
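The merging step itself is just a (possibly weighted) parameter average over a window of stable-phase checkpoints; a minimal sketch, with checkpoints represented as dicts of parameter lists. How to choose the weights so the merge emulates a particular decay shape is the contribution of the WSM work and is not reproduced here:

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of stable-phase checkpoints (dicts of parameter
    lists); uniform weights by default."""
    n = len(checkpoints)
    if weights is None:
        weights = [1.0 / n] * n
    merged = {}
    for key in checkpoints[0]:
        dim = len(checkpoints[0][key])
        merged[key] = [
            sum(w * ckpt[key][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(dim)
        ]
    return merged
```

Because merging is applied post hoc to saved plateau checkpoints, different decay emulations can be compared without rerunning any training.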
6. Comparative Analysis and Limitations
Empirical studies show WSD outperforms cosine decay for the same compute, especially for large LLMs and in hard-signal regimes (Li et al., 6 Feb 2026, Li et al., 23 Sep 2025, Luo et al., 17 Mar 2025, Belloni et al., 13 Jan 2026). However, WSD schedules require a manual or heuristic trigger to start decay, and stable-phase checkpoints do not reveal final performance until decay is applied, complicating convergence diagnostics (Song et al., 14 Jul 2025).
Some alternatives such as Schedule-Free AdamW (Song et al., 14 Jul 2025) and WSM checkpoint merging (Tian et al., 23 Jul 2025) offer similar or improved efficiency and flexibility by avoiding explicit decay phases.
7. Generalization Beyond Transformers and Universal Dynamics
WSD’s effectiveness and underlying river–valley geometry are not specific to transformer LMs. Experiments on standard CNNs (CIFAR-10) reveal qualitatively identical optimizer path features (two dominant directions, sharpness increase during decay, similar quasi-convexity indices) (Belloni et al., 13 Jan 2026). This suggests that the WSD pattern reflects universal geometric properties of high-dimensional loss landscapes in deep learning.
In summary, the Warmup–Stable–Decay pattern is a theoretically justified, empirically validated, and practically flexible learning rate strategy that exploits timescale separation in optimization dynamics, optimally balances signal extraction and noise forgetting, and adapts naturally to scaling and checkpointing workflows (Liu et al., 6 Jul 2025, Li et al., 6 Feb 2026, Wen et al., 2024, Belloni et al., 13 Jan 2026, Hu et al., 2024, Li et al., 23 Sep 2025, Song et al., 14 Jul 2025, Dremov et al., 2 Aug 2025, Luo et al., 17 Mar 2025, Zhou et al., 8 Jan 2026, Tian et al., 23 Jul 2025, Riabinin et al., 5 Feb 2026).