Warmup-Stable-Decay Scheduling
- WSD scheduling is a learning-rate paradigm that divides training into a linear warmup, a constant plateau, and a controlled decay phase, a structure that improves training dynamics.
- It leverages the loss landscape by initially promoting exploration with high learning rates and then reducing variance during a rapid cooldown phase.
- The approach supports flexible compute allocation, effective checkpoint reuse, and facilitates domain adaptation in transformer and LLM training.
The Warmup-Stable-Decay (WSD) learning rate scheduling paradigm is designed to optimize the training dynamics of large-scale neural architectures, notably transformers, by splitting training into three distinct continuous phases: a linear warmup ramp, a long plateau (“stable” phase) at peak rate, and a short, controlled annealing or decay (“cooldown”) phase. WSD originated as a compute-agnostic alternative to classic cosine annealing, offering flexibility in compute budget allocation, checkpoint reuse, and principled integration of domain adaptation and model scaling studies (Wen et al., 2024, Dremov et al., 2 Aug 2025). Its widespread adoption in LLM pretraining pipelines reflects both empirical and theoretical advances elucidating its optimization structure, generalization behavior, and practical trade-offs.
1. Formal Definition and Parameterization
The WSD schedule is characterized by the following phases and parameterization:
- Warmup: Linear ramp from zero to the peak rate $\eta_{\max}$ over $T_w$ steps, $\eta(t) = \eta_{\max}\, t / T_w$,
- Stable Plateau: Constant learning rate $\eta(t) = \eta_{\max}$ for $T_s$ steps,
- Cooldown (Decay): Monotonically decreasing schedule $\eta(t) = \eta_{\max}\, f(\tau)$ over $T_d$ steps, where $\tau \in [0, 1]$ is the elapsed fraction of the cooldown and $f$ is nonincreasing with $f(0) = 1$.
Canonical decay functions include:
- Linear: $f(\tau) = 1 - \tau$
- Square: $f(\tau) = (1 - \tau)^2$
- Sqrt: $f(\tau) = 1 - \sqrt{\tau}$
- Cosine: $f(\tau) = \tfrac{1}{2}\bigl(1 + \cos(\pi \tau)\bigr)$
- Lowered linear: $f(\tau) = \alpha (1 - \tau)$ with a lowered starting factor $\alpha < 1$ (Dremov et al., 2 Aug 2025, Hu et al., 2024, Wen et al., 2024, Tian et al., 23 Jul 2025, Li et al., 23 Sep 2025)
The decay phase typically occupies roughly $8\%$ or more of total steps, depending on task hardness and performance sensitivity (Dremov et al., 2 Aug 2025, Wen et al., 2024, Hu et al., 2024, Li et al., 23 Sep 2025). The schedule is continuous at phase boundaries.
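For concreteness, the schedule above can be written as a few lines of Python. This is a minimal sketch under the parameterization just given; the step counts, peak rate, and default sqrt shape in the example are illustrative choices rather than values prescribed by the cited papers.

```python
import math

# Learning rate at step t for a warmup-stable-decay schedule.
# t_warm, t_stable, t_decay are the phase lengths; shape is the cooldown f(tau).
def wsd_lr(t, t_warm, t_stable, t_decay, eta_max,
           shape=lambda tau: 1.0 - math.sqrt(tau)):
    if t < t_warm:                                    # linear warmup: 0 -> eta_max
        return eta_max * (t + 1) / t_warm
    if t < t_warm + t_stable:                         # constant plateau
        return eta_max
    tau = (t - t_warm - t_stable) / max(t_decay - 1, 1)
    return eta_max * shape(min(tau, 1.0))             # monotone cooldown

# Example: 10,000 steps with 5% warmup, 85% plateau, and a 10% sqrt cooldown.
schedule = [wsd_lr(t, 500, 8500, 1000, 3e-4) for t in range(10_000)]
```

Because $f(0) = 1$, the plateau-to-cooldown transition is continuous, matching the continuity requirement stated above.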
2. Training Dynamics and Loss Landscape Perspective
WSD’s distinct loss curve arises from the interplay between high learning rate exploration and final-phase annealing. During the stable phase, large step sizes induce oscillations in the “hill” directions of the loss landscape while enabling rapid traversal downstream along the “river valley”—a one-dimensional manifold characterized by low curvature (Wen et al., 2024, Dremov et al., 2 Aug 2025). The rapid cooldown suppresses oscillations, projecting the iterate closer to the true minimum and thereby revealing accumulated optimization progress. This mechanism is formally modeled:
- Gradient flow in the valley: along the downstream (river) direction, the iterate approximately follows the gradient flow $\dot{x} = -\nabla_x L(x)$, so progress accumulates even while the hill directions oscillate.
- SGD stationary variance (hill-loss): for a hill direction with curvature $a$ and gradient-noise scale $\sigma$, the stationary variance scales with the learning rate, $\mathbb{E}[y^2] \approx \eta \sigma^2 / (2a)$.
- Cooldown phase: driving $\eta \to 0$ suppresses the hill oscillations, reducing variance and producing the sharp drop in observed loss (Wen et al., 2024, Dremov et al., 2 Aug 2025).
In these coordinates (a global downstream river direction and local hill directions), the loss landscape exhibits empirical “river valley” geometry, with optimal decay shapes striking the best balance between exploration and exploitation (Dremov et al., 2 Aug 2025).
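To make the variance claim concrete, consider a one-dimensional quadratic hill model; the notation below is a schematic reconstruction consistent with the river-valley picture rather than the full analysis in the cited papers.

```latex
% Hill direction: L_hill(y) = (a/2) y^2 with curvature a > 0; SGD sees the
% noisy gradient g_t = a y_t + \sigma \xi_t and takes steps of size \eta.
\begin{aligned}
y_{t+1} &= (1 - \eta a)\, y_t - \eta \sigma \xi_t,
  \qquad \xi_t \sim \mathcal{N}(0, 1),\\
\mathbb{E}\!\left[y_\infty^2\right]
  &= \frac{\eta^2 \sigma^2}{1 - (1 - \eta a)^2}
  \;\approx\; \frac{\eta \sigma^2}{2a}
  \qquad (\eta a \ll 1).
\end{aligned}
```

The stationary hill variance is linear in $\eta$, so annealing $\eta \to 0$ during cooldown removes it, which is exactly the sharp loss drop described above.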
3. Cooldown Shape, Bias-Variance Trade-offs, and Model Selection
Cooldown shape governs a critical bias-variance trade-off in final model quality:
- Aggressive decay (linear, square): Promotes exploration, higher variance, but can yield inconsistent or suboptimal fits.
- Conservative decay (lowered linear, small $\alpha$): Yields low variance but high bias toward the pre-cooldown state.
- Intermediate shapes (sqrt, cosine): Empirically minimize combined bias and variance, producing the best perplexity and generalization performance (Dremov et al., 2 Aug 2025).
Table: Cooldown shape vs. bias-variance regime
| Shape | Exploration | Bias | Variance |
|---|---|---|---|
| Linear/Square | High | Low | High |
| Lowered Linear | Low | High | Low |
| Sqrt/Cosine | Balanced | Minimal | Minimal |
Empirical recommendations favor sqrt decay or lowered-linear variants, which are robust across model and dataset scales, with a 1.5 PPL improvement over pure linear or square decays (Dremov et al., 2 Aug 2025).
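The qualitative trade-off is easy to reproduce in a toy setting. The script below (pure Python, a noisy one-dimensional quadratic, arbitrary constants, and an assumed floor of $0.25$ for the lowered-linear shape) only illustrates how the mean and spread of the final loss depend on the cooldown shape; it is not the experimental protocol of the cited work.

```python
import math
import random

# SGD on a noisy 1-D quadratic L(y) = y^2 / 2: a constant-rate stable phase
# followed by a cooldown with shape f(tau); reports mean/std of the final loss.
def final_loss(shape, stable_steps=500, decay_steps=200,
               eta_max=0.2, sigma=1.0, seed=0):
    rng = random.Random(seed)
    y = 1.0
    for t in range(stable_steps + decay_steps):
        if t < stable_steps:
            eta = eta_max                                # stable plateau
        else:
            tau = (t - stable_steps) / (decay_steps - 1)
            eta = eta_max * shape(tau)                   # cooldown
        y -= eta * (y + sigma * rng.gauss(0.0, 1.0))     # noisy gradient step
    return 0.5 * y * y

shapes = {
    "linear":  lambda tau: 1.0 - tau,
    "square":  lambda tau: (1.0 - tau) ** 2,
    "sqrt":    lambda tau: 1.0 - math.sqrt(tau),
    "cosine":  lambda tau: 0.5 * (1.0 + math.cos(math.pi * tau)),
    "lowered": lambda tau: 0.25 * (1.0 - tau),           # assumed floor
}

for name, f in shapes.items():
    runs = [final_loss(f, seed=s) for s in range(300)]
    mean = sum(runs) / len(runs)
    std = (sum((r - mean) ** 2 for r in runs) / len(runs)) ** 0.5
    print(f"{name:8s} mean={mean:.4f} std={std:.4f}")
```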
4. Optimizer Hyperparameter Effects During Cooldown
AdamW’s exponential moving average coefficients ($\beta_1$, $\beta_2$) interact substantially with the cooldown regime:
- Raising $\beta_2$ to $0.99$ (with $\beta_1$ lowered toward $0.5$) improves final perplexity; full tuning of both betas can induce swings comparable to scheduler shape effects.
- The ordering of cooldown shapes by bias-variance remains stable under reasonable hyperparameters.
- Weight decay ($0.1$ is standard) and batch size have secondary effects; aggressive upsampling or disabling weight decay produces marginal gains or setbacks depending on decay shape (Dremov et al., 2 Aug 2025).
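In PyTorch, AdamW reads its betas from the optimizer's parameter groups, so they can be switched when the decay phase begins. A minimal sketch; the pre-cooldown values and the helper name are illustrative, while $\beta_2 = 0.99$ follows the recommendation above.

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

def enter_cooldown(opt, betas=(0.9, 0.99)):
    # param_groups hold the betas, so updating them mid-run takes effect
    # on the next optimizer step.
    for group in opt.param_groups:
        group["betas"] = betas

enter_cooldown(optimizer)  # call at the first step of the decay phase
```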
5. Theoretical Foundation and Functional Scaling Laws
The WSD schedule has rigorous SDE and kernel-regression analyses, culminating in the Functional Scaling Law (FSL), which predicts the evolution of the population risk as a functional of the entire learning-rate schedule (Li et al., 23 Sep 2025). The law is parameterized by the model size, batch size, peak and final learning rates, and the intrinsic time accumulated during cooldown. WSD achieves optimal scaling exponents and removes logarithmic factors found in direct decay schedules (e.g. exponential), especially in compute- or data-limited regimes (Li et al., 23 Sep 2025). FSL supports zero-shot prediction and optimization of loss trajectories for unseen schedules and is consistently validated across model sizes and architectures.
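One quantity worth pinning down is the cooldown's intrinsic time. Assuming the usual cumulative-learning-rate convention (an assumption for illustration, not a formula reproduced from the paper), it is:

```latex
% Intrinsic time of the decay phase: cumulative learning rate over cooldown.
S_d \;=\; \sum_{t \,\in\, \text{decay}} \eta_t
   \;\approx\; \eta_{\max}\, T_d \int_0^1 f(\tau)\, \mathrm{d}\tau
% e.g. a linear shape f(\tau) = 1 - \tau gives S_d \approx \eta_{\max} T_d / 2.
```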
6. Applications in LLM Training and Domain Adaptation
WSD is widely employed for training transformers and small LLMs, including MiniCPM, where it facilitates efficient scaling-law studies (data–model scaling) and continuous/incremental training. It enables checkpoint reuse: a stable-phase checkpoint plus a fixed-length decay matches full-length cosine baselines. In MiniCPM, a decay phase of 10% of total steps completes convergence and enables mixing in new data (domain adaptation or SFT) strictly during the decay phase (Hu et al., 2024); a minimal sketch follows the list below.
Empirical advantages:
- No precommitment to total compute.
- Checkpoint agnostic: any stable checkpoint can be finalized via short decay.
- Domain adaptation and data-mix-in handled precisely during cooldown.
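The sketch below shows this reuse pattern end to end on a toy model with synthetic data; the sqrt shape and the restriction of new-domain batches to the decay phase follow the text, while the model, data, and constants are illustrative.

```python
import copy
import math
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

# Pretend a long stable-phase run produced this checkpoint.
stable_ckpt = {"model": copy.deepcopy(model.state_dict()),
               "opt": copy.deepcopy(opt.state_dict())}

def finalize(ckpt, decay_steps=100, eta_max=1e-2, new_data_frac=0.3):
    """Finalize a stable-phase checkpoint with a short sqrt cooldown."""
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["opt"])
    for t in range(decay_steps):
        tau = t / max(decay_steps - 1, 1)
        for g in opt.param_groups:
            g["lr"] = eta_max * (1.0 - math.sqrt(tau))   # sqrt cooldown
        mix_new = torch.rand(()).item() < new_data_frac  # decay-only data mix
        x = torch.randn(32, 8)                           # synthetic batch
        y = x.sum(dim=1, keepdim=True) + (0.5 if mix_new else 0.0)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

finalize(stable_ckpt)
```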
7. Extensions, Model Merging, and Practical Guidelines
Research on WSM (Warmup-Stable and Merge) demonstrates that WSD’s decay phase can be replaced by merging a tail of recent constant-LR checkpoints, yielding superior performance (Tian et al., 23 Jul 2025). Theoretical correspondences map monotone decay schedules to principled merging weights, supporting linear, cosine, and inverse-sqrt schemes. Key guidelines:
- Merge duration (tail window) is the dominant factor: longer tail merging yields steadier improvement.
- Decay phase or merging window should occupy at least roughly $8\%$ of the total budget for optimal annealing benefits.
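A minimal sketch of the merging step itself, assuming uniform weights over the tail window (the schedule-derived weightings from the cited correspondences would replace the uniform default):

```python
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Weighted average of model state dicts (floating-point tensors assumed)."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n  # uniform tail merge by default
    return {key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}

# Example: average the last three constant-LR checkpoints of a toy model.
ckpts = [torch.nn.Linear(4, 4).state_dict() for _ in range(3)]
merged = merge_checkpoints(ckpts)
```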
Practical recommendations:
- Cooldown length: roughly $8\%$ of the total step budget or more.
- Shape: sqrt ($f(\tau) = 1 - \sqrt{\tau}$) or lowered-linear.
- AdamW $\beta_2$: increase to $0.99$ during cooldown.
- Decay window/merge duration: Extend as far as resource constraints allow.
WSD, underpinned by landscape modeling, scaling law theory, and robust empirical validation, remains a foundational tool for LLM, SLM, and transformer optimization, checkpoint management, and data-efficient scaling research (Dremov et al., 2 Aug 2025, Wen et al., 2024, Hu et al., 2024, Tian et al., 23 Jul 2025, Li et al., 23 Sep 2025).