
Warmup-Stable-Decay Scheduling

Updated 7 January 2026
  • WSD scheduling is a learning rate paradigm that divides training into a linear warmup, a constant plateau, and a controlled decay phase, improving training dynamics.
  • It leverages the loss landscape by initially promoting exploration with high learning rates and then reducing variance during a rapid cooldown phase.
  • The approach supports flexible compute allocation, effective checkpoint reuse, and facilitates domain adaptation in transformer and LLM training.

The Warmup-Stable-Decay (WSD) learning rate scheduling paradigm is designed to optimize the training dynamics of large-scale neural architectures, notably transformers, by splitting training into three distinct continuous phases: a linear warmup ramp, a long plateau (“stable” phase) at peak rate, and a short, controlled annealing or decay (“cooldown”) phase. WSD originated as a compute-agnostic alternative to classic cosine annealing, offering flexibility in compute budget allocation, checkpoint reuse, and principled integration of domain adaptation and model scaling studies (Wen et al., 2024, Dremov et al., 2 Aug 2025). Its widespread adoption in LLM pretraining pipelines reflects both empirical and theoretical advances elucidating its optimization structure, generalization behavior, and practical trade-offs.

1. Formal Definition and Parameterization

The WSD schedule is characterized by the following phases and parameterization:

  • Warmup: Linear ramp from zero to peak rate $\eta_{\max}$ over $T_w$ steps,

\eta(t) = \eta_{\max} \frac{t}{T_w} \qquad 0 \leq t \leq T_w

  • Stable Plateau: Constant learning rate for $T_s$ steps,

\eta(t) = \eta_{\max} \qquad T_w < t \leq T - T_d

  • Cooldown (Decay): Monotonically decreasing schedule shape $S(x)$ over $T_d$ steps, where

x = \frac{t - (T - T_d)}{T_d}, \quad x \in [0, 1]

\eta(t) = \eta_{\max} S(x)

Canonical decay functions include:

  • Linear: $S_{\rm lin}(x) = 1 - x$
  • Square: $S_{\rm sq}(x) = 1 - x^2$
  • Sqrt: $S_{\rm sqrt}(x) = 1 - \sqrt{x}$
  • Cosine: $S_{\rm cos}(x) = \frac{1 + \cos(\pi x)}{2}$
  • Lowered linear: $S_\alpha(x) = (1 - \alpha) + \alpha(1 - x)$, $\alpha \in [0, 1]$ (Dremov et al., 2 Aug 2025, Hu et al., 2024, Wen et al., 2024, Tian et al., 23 Jul 2025, Li et al., 23 Sep 2025)

The decay phase typically occupies $8$–$20\%$ of total steps, depending on task hardness and performance sensitivity (Dremov et al., 2 Aug 2025, Wen et al., 2024, Hu et al., 2024, Li et al., 23 Sep 2025). The schedule is continuous at phase boundaries.
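
The piecewise definition above maps directly onto a single schedule function. The following is a minimal sketch: the function name, default warmup/decay fractions, and the `shape` argument are illustrative choices rather than anything prescribed by the cited papers.

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.1,
           shape="sqrt", alpha=0.7):
    """Learning rate at `step` under a Warmup-Stable-Decay schedule."""
    t_w = int(warmup_frac * total_steps)   # warmup length T_w
    t_d = int(decay_frac * total_steps)    # cooldown length T_d
    if step <= t_w:                        # linear warmup: 0 -> peak
        return peak_lr * step / max(t_w, 1)
    if step <= total_steps - t_d:          # stable plateau at the peak rate
        return peak_lr
    # Cooldown: x runs from 0 to 1 over the final T_d steps.
    x = (step - (total_steps - t_d)) / max(t_d, 1)
    shapes = {
        "linear":         1 - x,
        "square":         1 - x ** 2,
        "sqrt":           1 - math.sqrt(x),
        "cosine":         (1 + math.cos(math.pi * x)) / 2,
        "lowered_linear": (1 - alpha) + alpha * (1 - x),  # ends at (1 - alpha) * peak_lr
    }
    return peak_lr * shapes[shape]
```

Note that the lowered-linear shape ends at $(1-\alpha)\,\eta_{\max}$ rather than zero, which is precisely the conservative behavior discussed in Section 3.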

2. Training Dynamics and Loss Landscape Perspective

WSD’s distinct loss curve arises from the interplay between high learning rate exploration and final-phase annealing. During the stable phase, large step sizes induce oscillations in the “hill” directions of the loss landscape while enabling rapid traversal downstream along the “river valley”—a one-dimensional manifold characterized by low curvature (Wen et al., 2024, Dremov et al., 2 Aug 2025). The rapid cooldown suppresses oscillations, projecting the iterate closer to the true minimum and thereby revealing accumulated optimization progress. This mechanism is formally modeled:

  • Gradient flow in the valley: $\frac{d}{dt}\theta = -\nabla L(\theta)$
  • SGD stationary variance (hill-loss): Proportional to $\frac{1}{2}(d-1)\eta\sigma^2$
  • Cooldown phase: Suppresses $\eta$, reducing variance and producing the sharp drop in observed loss (Wen et al., 2024, Dremov et al., 2 Aug 2025).

Loss landscape coordinates $(e_1, e_2)$, the global downstream and local-gradient directions, exhibit an empirical “river valley” geometry, with optimal decay shapes striking the best balance between exploration and exploitation (Dremov et al., 2 Aug 2025).
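
A toy simulation makes the variance mechanism concrete. The setup below is an assumed illustration, not the papers' exact model: noisy SGD on a one-dimensional quadratic “hill” settles at a stationary loss that scales roughly linearly with $\eta$, so cutting the learning rate during cooldown collapses the oscillations and produces the characteristic sharp drop.

```python
import numpy as np

rng = np.random.default_rng(0)

def stationary_hill_loss(eta, sigma=1.0, steps=20_000):
    """Average loss of noisy SGD on the quadratic 0.5 * y**2, after burn-in."""
    y, losses = 0.0, []
    for _ in range(steps):
        grad = y + sigma * rng.standard_normal()  # noisy gradient of 0.5 * y**2
        y -= eta * grad
        losses.append(0.5 * y * y)
    return float(np.mean(losses[steps // 2:]))

for eta in (0.1, 0.01):
    print(f"eta={eta:<5} stationary hill-loss ~ {stationary_hill_loss(eta):.4f}")
# The stationary loss shrinks roughly tenfold when eta drops tenfold,
# mirroring the eta-proportional hill-loss variance described above.
```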

3. Cooldown Shape, Bias-Variance Trade-offs, and Model Selection

The cooldown shape $S(x)$ governs a critical bias-variance trade-off in final model quality:

  • Aggressive decay (linear, square): Promotes exploration, higher variance, but can yield inconsistent or suboptimal fits.
  • Conservative decay (lowered linear, small $\alpha$): Yields low variance, high bias toward the pre-cooldown state.
  • Intermediate shapes ($S_{\rm sqrt}$, $S_{0.7}$): Empirically minimize combined bias and variance, producing the best perplexity and generalization performance (Dremov et al., 2 Aug 2025).

Table: Cooldown shape vs. bias-variance regime

Shape            | Exploration | Bias    | Variance
Linear / Square  | High        | Low     | High
Lowered linear   | Low         | High    | Low
Sqrt / α ≈ 0.7   | Balanced    | Minimal | Minimal

Empirical recommendations favor the sqrt decay ($1 - \sqrt{x}$) or lowered-linear ($\alpha \approx 0.7$), robust across model and dataset scales, with an $\approx 1.5$ PPL improvement over pure linear or square decays (Dremov et al., 2 Aug 2025).

4. Optimizer Hyperparameter Effects During Cooldown

AdamW’s exponential moving average coefficients ($\beta_1$, $\beta_2$) interact substantially with the cooldown regime:

  • Raising $\beta_2$ to $0.99$ (with $p \approx 0.3$–$0.5$) improves final perplexity by $\sim 0.2$; full tuning of both betas can induce swings comparable to scheduler shape effects (see the sketch after this list).
  • The ordering of cooldown shapes by bias-variance remains stable under reasonable $\beta$ hyperparameters.
  • Weight decay ($0.1$ is standard) and batch size have secondary effects; aggressive upsampling or disabling weight decay produces marginal gains or setbacks depending on decay shape (Dremov et al., 2 Aug 2025).
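
The sketch below shows one way to apply the $\beta_2$ adjustment when training enters the cooldown phase, assuming a PyTorch AdamW optimizer; the placeholder model, the starting betas, and the `enter_cooldown` helper are illustrative, not taken from the cited work.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

def enter_cooldown(optimizer, beta2=0.99):
    """Raise AdamW's second-moment coefficient at the start of the cooldown."""
    for group in optimizer.param_groups:
        beta1, _ = group["betas"]
        group["betas"] = (beta1, beta2)  # slower second-moment EMA during decay

# In the training loop (cooldown_start is whatever step begins the decay phase):
# if step == cooldown_start:
#     enter_cooldown(optimizer)
```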

5. Theoretical Foundation and Functional Scaling Laws

The WSD schedule has rigorous SDE and kernel regression analyses, culminating in the Functional Scaling Law (FSL) that predicts the evolution of population risk for general learning rate schedules (LRSs) (Li et al., 23 Sep 2025):

\mathbb{E}[R_K] - \frac{\sigma^2}{2} \approx M^{-s\beta} + T^{-s} + \frac{\sigma^2}{B} \left[b + (a-b) \frac{\min\{M,T_2^{1/\beta}\}}{T_2}\right]

Here, $M$ is the model size, $B$ the batch size, $a$ the peak rate, $b$ the final rate, and $T_2$ the cooldown intrinsic time. WSD achieves optimal scaling exponents and removes logarithmic factors found in direct decay schedules (e.g., exponential), especially in compute- or data-limited regimes (Li et al., 23 Sep 2025). FSL supports zero-shot prediction and optimization of loss trajectories for unseen schedules and is consistently validated across model sizes and architectures.
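
As a numeric aid, the sketch below transcribes the FSL expression above into a function; prefactors are taken as unity and the argument names simply mirror the symbols in the text, so it is a shape-level surrogate rather than a calibrated predictor.

```python
def fsl_excess_risk(M, T, T2, B, a, b, s, beta, sigma2):
    """Excess risk E[R_K] - sigma^2 / 2 under the FSL surrogate (prefactors omitted)."""
    approx_term = M ** (-s * beta)    # model-size-limited error
    opt_term = T ** (-s)              # optimization error over intrinsic time T
    noise_term = (sigma2 / B) * (b + (a - b) * min(M, T2 ** (1.0 / beta)) / T2)
    return approx_term + opt_term + noise_term
```

In this form, increasing the cooldown intrinsic time $T_2$ drives the noise term toward $\sigma^2 b / B$, consistent with the guidance below to extend the decay window when resources allow.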

6. Applications in LLM Training and Domain Adaptation

WSD is widely employed for training transformers and small LLMs, including MiniCPM, where it facilitates efficient scaling law studies (data–model scaling) and continuous/incremental training. It enables checkpoint reuse: a stable-phase checkpoint plus a fixed-length decay matches full-length cosine baselines. In MiniCPM, a 10\% decay phase completes convergence and allows new data (for domain adaptation or SFT) to be mixed in strictly during the decay phase (Hu et al., 2024).

Empirical advantages:

  • No precommitment to total compute.
  • Checkpoint agnostic: any stable checkpoint can be finalized via short decay.
  • Domain adaptation and data-mix-in handled precisely during cooldown.

7. Extensions, Model Merging, and Practical Guidelines

Research on WSM (Warmup-Stable and Merge) demonstrates that WSD’s decay phase can be replaced by merging a tail of recent constant-LR checkpoints, yielding superior performance (Tian et al., 23 Jul 2025). Theoretical correspondences map monotone decay schedules to principled merging weights, supporting linear, cosine, and inverse-sqrt schemes. Key guidelines:

  • Merge duration (tail window) is the dominant factor: longer tail merging yields steadier improvement.
  • Decay phase or merging window should occupy $8$–$20\%$ of the total budget for optimal annealing benefits.
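
A minimal sketch of the tail-merging idea, assuming PyTorch state dicts and uniform weights by default (a monotone-decay-derived weighting, per the correspondence above, could be passed instead); the function name and usage line are illustrative.

```python
import torch

def merge_tail_checkpoints(state_dicts, weights=None):
    """Average the last k constant-LR checkpoints into a single merged model."""
    k = len(state_dicts)
    if weights is None:
        weights = [1.0 / k] * k  # uniform merge over the tail window
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage (tail_paths are the most recent constant-LR checkpoints):
# model.load_state_dict(merge_tail_checkpoints([torch.load(p) for p in tail_paths]))
```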

Practical recommendations:

  • Cooldown length: $8$–$20\%$ of total steps.
  • Shape: Sqrt ($1 - \sqrt{x}$) or lowered-linear ($\alpha \approx 0.7$).
  • AdamW $\beta_2$: Increase to $0.99$ during cooldown.
  • Decay window/merge duration: Extend as far as resource constraints allow.

WSD, underpinned by landscape modeling, scaling law theory, and robust empirical validation, remains a foundational tool for LLM, SLM, and transformer optimization, checkpoint management, and data-efficient scaling research (Dremov et al., 2 Aug 2025, Wen et al., 2024, Hu et al., 2024, Tian et al., 23 Jul 2025, Li et al., 23 Sep 2025).
