Warmup-Stable-Decay LRS

Updated 8 January 2026
  • WSD LRS is a three-phase adaptive learning rate scheduler that comprises a linear warmup, a constant stable phase, and an annealed decay phase to enhance neural model training.
  • Its design is theoretically grounded in the river valley model, balancing bias and variance for effective exploration of complex loss landscapes.
  • Empirical benchmarks demonstrate that decay functions like square-root yield consistent performance gains, making WSD LRS a robust choice for scalable language models.

The Warmup-Stable-Decay (WSD) Learning Rate Scheduler is a three-phase adaptive learning rate strategy that has become foundational in the training of large-scale language models (LLMs), particularly transformers. WSD divides training into a brief warmup, a long stable phase at a constant learning rate, and a cooldown (decay) phase during which the learning rate is annealed to a low value. The structure, mathematical formulation, theoretical justification, practical implications, and empirical tradeoffs of WSD have been studied extensively in the recent literature, providing a unified perspective on scheduler design and optimization in deep learning.

1. Mathematical Structure and Phase Definition

WSD is defined via three contiguous training phases:

  • Warmup: Learning rate increases linearly from zero (or a small initial value) to a peak value $\eta_{\max}$ over $T_{\rm warm}$ steps,

$$\eta(t) = \eta_{\max}\,\frac{t}{T_{\rm warm}}, \qquad 0 \leq t < T_{\rm warm}$$

  • Stable: Learning rate is held constant,

$$\eta(t) = \eta_{\max}, \qquad T_{\rm warm} \leq t < T_{\rm stab}$$

  • Decay/Cooldown: Learning rate is annealed smoothly to zero or a minimum value,

$$\eta(t) = \eta_{\max}\, f\!\left(\frac{t - T_{\rm stab}}{T_{\rm tot} - T_{\rm stab}}\right), \qquad T_{\rm stab} \leq t \leq T_{\rm tot}$$

where $f(\cdot)$ is a monotonically decreasing decay function, such as linear, cosine, exponential, square-root, or custom power-law variants. This template supports both single-phase runs and more advanced multi-budget (“branching”) strategies (Dremov et al., 2 Aug 2025, Wen et al., 2024, Tian et al., 23 Jul 2025, Hu et al., 2024, Li et al., 23 Sep 2025).
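For concreteness, the piecewise schedule above can be written as a short Python function. This is a minimal sketch; the function name, step counts, and the default 1-sqrt decay are illustrative choices rather than a reference implementation.

```python
import math

def wsd_lr(step, max_lr, warmup_end, stable_end, total_steps,
           decay_fn=lambda tau: 1.0 - math.sqrt(tau)):
    """Piecewise WSD learning rate: linear warmup, constant plateau, annealed cooldown."""
    if step < warmup_end:
        # Warmup: linear ramp from 0 to max_lr over the first warmup_end steps.
        return max_lr * step / warmup_end
    if step < stable_end:
        # Stable: hold the peak learning rate constant.
        return max_lr
    # Decay: normalize progress through the cooldown window to tau in [0, 1] and apply f.
    tau = (step - stable_end) / (total_steps - stable_end)
    return max_lr * decay_fn(min(tau, 1.0))

# Illustrative 100k-step run: 1% warmup, plateau to step 90k, 10% cooldown with 1-sqrt decay.
schedule = [wsd_lr(t, 3e-4, 1_000, 90_000, 100_000) for t in range(100_000)]
```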

Phase Timing and Hyperparameters

Typical settings emphasize a minimal warmup fraction (0.5–2%), a long plateau (80–90%), and a final decay of 10–20% of total steps. The decay function can be selected from several families, with square-root and 1-sqrt laws often empirically dominating (Dremov et al., 2 Aug 2025, Tian et al., 23 Jul 2025, Hu et al., 2024).

| Phase | Interval | Typical Fraction | Main Role |
|---|---|---|---|
| Warmup | $[0, T_{\rm warm})$ | 0.5–2% | Stabilization |
| Stable | $[T_{\rm warm}, T_{\rm stab})$ | 80–90% | Exploration |
| Decay | $[T_{\rm stab}, T_{\rm tot}]$ | 10–20% | Exploitation / final convergence |

2. Theoretical Justification and Loss Landscape

The rationale for WSD arises from both empirical phenomena and formal analysis of optimization in ill-conditioned, anisotropic loss landscapes characteristic of modern LLM pretraining. The “river valley” theory posits that the loss surface comprises sharp “valley” directions (with large Hessian eigenvalues) and flat “river” directions (with small eigenvalues), yielding disparate timescales for optimization.

  • Stable phase (high constant LR): Large steps cause rapid diffusion and exploration along the wide river manifold, while valley directions equilibrate quickly. This phase accumulates most progress along the loss landscape’s principal directions.
  • Cooldown/decay phase: By reducing the learning rate, the stochastic variance in sharp (“hill”) directions contracts, focusing optimization tightly into the loss basin and enabling sharp drops in validation loss (Wen et al., 2024, Dremov et al., 2 Aug 2025, Liu et al., 6 Jul 2025).

This theoretical grounding is supported by SDE modeling and principal component visualizations that highlight how performance gains and loss drops are realized almost exclusively during the final decay, as the optimizer transitions from broad exploration to fine convergence (Wen et al., 2024, Dremov et al., 2 Aug 2025, Li et al., 23 Sep 2025).

3. Decay Functional Forms and Bias-Variance Tradeoff

The cooldown phase's effect is highly sensitive to the specific decay function. Common choices include:

  • Linear: $f_{\rm linear}(\tau) = 1-\tau$
  • Cosine: $f_{\rm cosine}(\tau) = \frac{1}{2}\left(1+\cos(\pi\tau)\right)$
  • Quadratic: $f_{\rm square}(\tau) = 1-\tau^2$
  • Square-root: $f_{\rm sqrt}(\tau) = 1-\sqrt{\tau}$
  • Exponential: $f_{\rm exp}(\tau) = \exp(-\alpha\tau)$

The choice of $f(\cdot)$ reflects a fundamental bias–variance tradeoff in the resulting model ensemble (Dremov et al., 2 Aug 2025):

  • Aggressive (high-variance, low-bias) shapes (e.g., mirror-cosine) yield a wide spread in final model quality, occasionally reaching better minima but with less reproducibility.
  • Gentle (high-bias, low-variance) shapes (e.g., linear) yield highly repeatable but potentially suboptimal solutions.
  • Intermediate shapes (notably square-root decay or lowered linear) balance these tendencies and deliver the lowest validation perplexity, achieving 1–2 point gains (2–3% loss improvement) relative to naive linear decay on large transformer training runs (Dremov et al., 2 Aug 2025, Tian et al., 23 Jul 2025).

Empirically, 1-sqrt or square-root decay schedules are preferred in modern fine-tuning and LLM pretraining setups (Tian et al., 23 Jul 2025).
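As a brief illustration, the decay shapes listed above can be expressed as functions of the normalized cooldown progress $\tau \in [0,1]$; the exponential rate $\alpha = 4$ below is an arbitrary illustrative value.

```python
import math

# Decay shapes f(tau), with tau in [0, 1] the normalized cooldown progress.
DECAY_SHAPES = {
    "linear":      lambda tau: 1.0 - tau,
    "cosine":      lambda tau: 0.5 * (1.0 + math.cos(math.pi * tau)),
    "quadratic":   lambda tau: 1.0 - tau ** 2,
    "1-sqrt":      lambda tau: 1.0 - math.sqrt(tau),  # listed as "square-root" above
    "exponential": lambda tau: math.exp(-4.0 * tau),  # alpha = 4.0 chosen arbitrarily
}
```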

4. Optimizer Interactions and Training Dynamics

The interplay between the learning rate schedule and optimizer hyperparameters becomes critical during cooldown. Specifically:

  • AdamW’s second-moment parameter $\beta_2$ is identified as a key driver of cooldown performance. Increasing $\beta_2$ (lengthening the EMA half-life) smooths gradient noise and can improve validation perplexity by up to 2 points. Retuning $\beta_1$ and $\beta_2$ together can move runs from worst to best within the same decay shape (Dremov et al., 2 Aug 2025); a minimal retuning sketch follows this list.
  • Batch size can be increased during cooldown, with mild gains, though matching the optimizer’s token half-life to batch dynamics may degrade final loss.
  • Weight decay has nuanced effects: resetting or removing it may harm the best-performing decay shapes but occasionally helps particularly high-variance ones.
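The following is a minimal sketch of retuning the AdamW moment coefficients when the cooldown begins, assuming a standard PyTorch optimizer; the stand-in model and the specific beta values are illustrative placeholders.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

def enter_cooldown(optimizer, new_betas=(0.9, 0.974)):
    """Lengthen the second-moment EMA half-life when the decay phase begins."""
    for group in optimizer.param_groups:
        group["betas"] = new_betas

# Called once, at the step where the stable phase ends and the cooldown starts:
# enter_cooldown(optimizer)
```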

Consecutive step cosine similarities, gradient norm shrinkage, and aligned directional derivatives are observed during decay, consistent with rapid “funneling” into locally convex minima and confirming the theoretical “river valley” model (Hu et al., 2024, Dremov et al., 2 Aug 2025).

5. Empirical Benchmarks and Scaling Law Implications

Comparative studies demonstrate WSD’s strong practical performance across data/model scales and learning regimes:

  • Benchmarks: On transformer models (e.g., MiniCPM) and standard tasks, WSD with well-chosen decay shapes matches or outperforms cosine and linear schedulers, and needs only about 10% of total steps for decay to realize the full final loss drop (Hu et al., 2024, Tian et al., 23 Jul 2025).
  • Scaling Laws: WSD’s structure is theoretically justified and empirically verified to facilitate compute/data/model scaling explorations. In particular, WSD enables $O(m)$ studies of the data–model law, where data-to-model ratios an order of magnitude higher than the Chinchilla prescription are found optimal under WSD (Hu et al., 2024, Li et al., 23 Sep 2025).
  • Continual/Domain Adaptation: WSD and its variants (e.g., WSD-S) support continual training and modular checkpoint usage across multiple budgets, outperforming cyclic or cosine-rewarm methods and simplifying code (Wen et al., 2024).

6. Model Merging and Recent Alternatives

Recent work establishes a formal equivalence between WSD’s decay phase and model averaging (merging) methods. Instead of decaying the learning rate online, one may maintain a constant learning rate and retrospectively form convex combinations of model checkpoints, assigning weights that emulate any target decay law (Tian et al., 23 Jul 2025):

$$\hat\theta_{n+k} = \sum_{j=0}^{k} c_j\,\theta_{n+j}$$

where the sequence $w_i = \sum_{j=i}^{k} c_j$ corresponds to the effective learning rate profile. The WSM (Warmup-Stable and Merge) framework outperforms WSD baselines by focusing optimization on the merge duration, and the evidence suggests that the critical hyperparameter is the merging window rather than the decay law per se (Tian et al., 23 Jul 2025).
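A minimal sketch of this retrospective merging is shown below, assuming checkpoints saved as PyTorch state dicts; the uniform coefficients in the commented usage line are one illustrative choice of $c_j$, and other weightings emulate other effective decay laws.

```python
import torch

def merge_checkpoints(state_dicts, coeffs):
    """Form theta_hat = sum_j c_j * theta_j over a window of checkpoints."""
    assert abs(sum(coeffs) - 1.0) < 1e-6, "merge coefficients should sum to 1"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return merged

# Example with hypothetical checkpoint paths: uniform averaging over the last four
# stable-phase checkpoints.
# merged = merge_checkpoints([torch.load(p) for p in checkpoint_paths], [0.25] * 4)
```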

7. Implementation, Best Practices, and Practical Guidelines

Integrating WSD into PyTorch or similar training frameworks requires recomputing $\eta(t)$ at each step, prior to the optimizer step; a minimal integration sketch is given after the hyperparameter table below. Key best practices include (Tian et al., 23 Jul 2025, Dremov et al., 2 Aug 2025):

  • Warm up for a small fraction of total steps (0.5–2%), sufficient for gradient stabilization.
  • Hold ηmax\eta_{\max} until late training, devoting 10–20% of tokens to cooldown.
  • Deploy square-root or 1-sqrt decay as the default; these yield robust bias–variance balance.
  • Tune AdamW’s $\beta_2$ higher (e.g., $0.95^{0.5} \approx 0.974$) in cooldown.
  • For fine-tuning, use a short decay over 5–10% of new steps with aggressive annealing.
  • If training must be extended post-decay, revert to the last stable checkpoint or switch to decay-free/merging schedulers for continuous operation.
| Hyperparameter | Typical Value / Range | Role |
|---|---|---|
| Warmup length | 0.5–2% of steps | Early stabilization |
| Stable phase | 80–90% of steps | Exploration, progress |
| Decay/cooldown | 10–20% of steps | Convergence, exploitation |
| AdamW $\beta_2$ | $0.95^{0.5} \approx 0.974$ | Stabilize variance |
| Decay shape | 1-sqrt or square-root | Bias–variance tradeoff |
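A minimal PyTorch integration sketch using `torch.optim.lr_scheduler.LambdaLR`, following the phase fractions in the table above; the stand-in model, step counts, and base learning rate are illustrative placeholders.

```python
import math
import torch

total_steps = 100_000
warmup_end = int(0.01 * total_steps)   # ~1% warmup
stable_end = int(0.90 * total_steps)   # final 10% reserved for cooldown

def wsd_factor(step):
    """Multiplicative factor applied to the base learning rate at each optimizer step."""
    if step < warmup_end:
        return step / warmup_end                      # linear warmup
    if step < stable_end:
        return 1.0                                    # stable plateau
    tau = (step - stable_end) / (total_steps - stable_end)
    return 1.0 - math.sqrt(min(tau, 1.0))             # 1-sqrt cooldown

model = torch.nn.Linear(512, 512)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=wsd_factor)

# Per-step loop skeleton: eta(t) is recomputed at every call to scheduler.step().
# for step, batch in enumerate(loader):
#     loss = compute_loss(model, batch)
#     loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```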

The warmup–stable–decay paradigm, with its explicit separation of optimization roles by subphase, remains foundational. Recent advances in model merging (WSM) and river valley landscape theory inform nuanced adaptations and underline WSD’s ongoing central role in the calibration of high-performance LLMs (Dremov et al., 2 Aug 2025, Wen et al., 2024, Li et al., 23 Sep 2025, Liu et al., 6 Jul 2025, Hu et al., 2024, Tian et al., 23 Jul 2025).
