Fine-Grained Warmup-Stable-Decay (FG-WSD)

Updated 13 December 2025
  • FG-WSD is a family of schedulers that precisely controls the warmup, stable, and decay phases in large-scale transformer training.
  • It divides training into finely controlled stages with explicit data-curriculum adjustments that balance exploration and exploitation.
  • Empirical results demonstrate that FG-WSD achieves lower validation perplexity and improved performance on reasoning benchmarks.

Fine-Grained Warmup–Stable–Decay (FG-WSD) is a family of learning-rate and data-curriculum schedulers for large-scale neural network training, especially prominent in transformer-based LLMs. Built as an extension of the traditional Warmup–Stable–Decay (WSD) schedule, FG-WSD introduces explicit, tunable control over every phase (warmup, stable, decay/cooldown) and refines both the functional forms of the learning-rate decay and the stagewise progression of data quality. Recent research has established FG-WSD as a near-optimal strategy under theoretical scaling laws and in empirical evaluations, particularly for reasoning-centric LLMs (Dremov et al., 2 Aug 2025, Yang et al., 6 Dec 2025, Li et al., 23 Sep 2025).

1. Formal Definition and Schedule Structure

FG-WSD divides training into discrete, fine-controlled stages:

  • Warmup: Initial phase (e.g., 0–1% of total steps/tokens) where the learning rate is increased monotonically from zero to a maximum $\eta_{\max}$.
  • Stable Plateau: One or more subphases with fixed learning rate $\eta_{\max}$, during which the data mixture may transition from diverse/general to high-quality samples.
  • Decay (Cooldown): Final phase(s) where the learning rate decays smoothly to a low value $\eta_{\min}$ over a prescribed functional form.

Formally, for a total of $T$ steps, partitioned as $T = T_w + T_{s,1} + \dots + T_{s,k} + T_d$, with $T_w$ (warmup), $T_{s,i}$ (stable phase $i$), and $T_d$ (decay/cooldown):

  • Warmup ($0 \leq t < T_w$): $\eta(t) = \eta_{\max} \cdot (t / T_w)$
  • Stable ($T_w \leq t < T_w + \sum_j T_{s,j}$): $\eta(t) = \eta_{\max}$
  • Decay ($T_* \leq t < T$): $\eta(t) = \eta_{\max} \cdot s\!\left(\frac{t - T_*}{T_d}\right)$

where $T_*$ is the onset of decay and $s(x)$, $x \in [0,1]$, is the chosen decay/cooldown shape (Dremov et al., 2 Aug 2025, Yang et al., 6 Dec 2025, Li et al., 23 Sep 2025).
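
For concreteness, the piecewise schedule above can be expressed as a small step-to-learning-rate function. The sketch below is a minimal Python rendering, assuming a single stable plateau and a square-root cooldown by default; the phase fractions and the helper name are illustrative, not prescribed by the cited papers.

```python
import math

def fg_wsd_lr(t, total_steps, eta_max, eta_min=0.0,
              warmup_frac=0.01, decay_frac=0.10,
              shape=lambda x: 1.0 - math.sqrt(x)):
    """Illustrative FG-WSD learning rate at step t.

    warmup_frac / decay_frac are assumed fractions of total_steps;
    `shape` is the cooldown function s(x) on x in [0, 1].
    """
    T_w = int(warmup_frac * total_steps)        # warmup length
    T_d = int(decay_frac * total_steps)         # decay length
    T_star = total_steps - T_d                  # onset of decay

    if t < T_w:                                 # warmup: linear ramp to eta_max
        return eta_max * t / max(T_w, 1)
    if t < T_star:                              # stable plateau(s): constant eta_max
        return eta_max
    x = (t - T_star) / max(T_d, 1)              # normalized decay progress in [0, 1]
    return max(eta_min, eta_max * shape(x))     # cooldown via shape s(x)
```

Multiple stable sub-phases change only the data mixture, not the learning rate, so they do not appear in this function.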

A distinguishing feature of FG-WSD is the explicit possibility of subdividing the stable phase and associating each segment with a distinct data mixture. For example, Nanbeige4-3B uses:

| Stage | Tokens (×10¹²) | Learning Rate | Data Mixture |
| --- | --- | --- | --- |
| Warmup | 0.1 | $0 \to 4.5 \times 10^{-4}$ | Full |
| Diversity-Enriched Stable | 12.4 | $4.5 \times 10^{-4}$ | Diverse (0.525 HQ) |
| High-Quality Stable | 6.5 | $4.5 \times 10^{-4}$ | 100% HQ |
| Decay | 4.0 | $4.5 \times 10^{-4} \to 1.5 \times 10^{-6}$ | HQ or mixed |
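
One way to make such a staged recipe operational is to encode it as a configuration object. The sketch below mirrors the table; field names such as hq_weight are hypothetical, and only the numeric values come from the table above.

```python
# Hypothetical stage configuration mirroring the Nanbeige4-3B table above.
# Field names are illustrative; the numeric values are taken from the table.
FG_WSD_STAGES = [
    {"name": "warmup",                    "tokens": 0.1e12,
     "lr": (0.0, 4.5e-4),                 "hq_weight": None},   # full mixture
    {"name": "diversity_enriched_stable", "tokens": 12.4e12,
     "lr": 4.5e-4,                        "hq_weight": 0.525},
    {"name": "high_quality_stable",       "tokens": 6.5e12,
     "lr": 4.5e-4,                        "hq_weight": 1.0},
    {"name": "decay",                     "tokens": 4.0e12,
     "lr": (4.5e-4, 1.5e-6),              "hq_weight": 1.0},    # HQ or mixed
]
```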

2. Cooldown Shape Families and Bias–Variance Trade-Off

In the decay phase, FG-WSD allows arbitrary monotone functions $s(x)$ to define the learning-rate trajectory. Commonly employed shapes include (a code sketch follows the list):

  • Linear: $s_{\text{lin}}(x) = 1 - x$
  • Lowered Linear: $s_{\text{LL},c}(x) = 1 - cx$, $c \in (0,1)$
  • Polynomial (Power-law): $s_p(x) = (1-x)^p$, $p > 0$
  • Exponential: $s_{\exp,\alpha}(x) = \exp(-\alpha x)$, $\alpha > 0$
  • Cosine: $s_{\cos}(x) = \frac{1+\cos(\pi x)}{2}$
  • Mirror-cosine: $s_{\text{mcos}}(x) = 2(1-x) - \frac{1+\cos(\pi x)}{2}$
  • Square-root: $s_{\text{sqrt}}(x) = 1 - \sqrt{x}$
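
As referenced above, here is a minimal Python sketch of these shape functions, taking the normalized decay progress $x \in [0, 1]$ and returning a multiplier on $\eta_{\max}$; the parameter defaults (c, p, alpha) are illustrative.

```python
import numpy as np

def cooldown_shape(x, kind="sqrt", c=0.7, p=2.0, alpha=5.0):
    """Cooldown shapes listed above; x is the normalized decay progress in [0, 1]."""
    if kind == "linear":
        return 1.0 - x
    if kind == "lowered_linear":        # s_{LL,c}(x) = 1 - c*x
        return 1.0 - c * x
    if kind == "poly":                  # s_p(x) = (1 - x)^p
        return (1.0 - x) ** p
    if kind == "exp":                   # s_{exp,alpha}(x) = exp(-alpha*x)
        return np.exp(-alpha * x)
    if kind == "cosine":                # (1 + cos(pi*x)) / 2
        return 0.5 * (1.0 + np.cos(np.pi * x))
    if kind == "mirror_cosine":         # 2(1-x) - (1 + cos(pi*x)) / 2
        return 2.0 * (1.0 - x) - 0.5 * (1.0 + np.cos(np.pi * x))
    if kind == "sqrt":                  # 1 - sqrt(x)
        return 1.0 - np.sqrt(x)
    raise ValueError(f"unknown shape: {kind}")
```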

Each shape modulates the optimizer's exploration–exploitation dynamics. Aggressive shapes (mirror-cosine, square) maintain higher learning rates, promoting exploration and producing lower bias but higher run-to-run variance. Conservative shapes (linear, gentle polynomials) emphasize early exploitation, reducing variance at the potential cost of suboptimal convergence (higher bias) (Dremov et al., 2 Aug 2025).

Empirical analysis on transformers demonstrates that the bias–variance decomposition of the final validation loss is tightly controlled by this functional choice. The sum of bias and variance is minimized for intermediate shapes, notably square-root and lowered-linear with $c \approx 0.7$, yielding optimal perplexity in single-run settings.
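
As a reading aid for this bias–variance framing (not a procedure taken from the cited papers), the bias and variance of the final validation loss can be estimated from repeated runs with different seeds. A minimal sketch, assuming per-seed final losses and an externally chosen reference loss:

```python
import numpy as np

def bias_variance(final_losses, reference_loss):
    """Illustrative bias/variance estimate over repeated runs of one cooldown shape.

    final_losses: per-seed final validation losses for the shape under study.
    reference_loss: assumed proxy for the attainable loss (e.g., the best
    multi-run average across shapes); this choice is an assumption.
    """
    losses = np.asarray(final_losses, dtype=float)
    bias = losses.mean() - reference_loss    # systematic gap to the reference
    variance = losses.var(ddof=1)            # run-to-run spread
    return bias, variance

# Usage: compare shapes by the sum bias + variance, as discussed above.
# bias, var = bias_variance([2.412, 2.405, 2.409], reference_loss=2.398)
```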

3. Data Curriculum and Stagewise Mixture Progression

FG-WSD further generalizes traditional schedulers by aligning the stable-phase segments with a fine-grained curriculum in data quality. In the Nanbeige4-3B regime:

  • Stable Phase I: Data is drawn from a specifically balanced mixture favoring diversity. The upsampled high-quality subset $H'$ (6.5 T tokens) is paired with a medium-quality slice $M'$ (6.5 T tokens), giving a mixture probability $w_H^{(1)} = 0.525$ for HQ data.
  • Stable Phase II: Switch exclusively to high-quality data ($w_H^{(2)} = 1$).
  • Decay: Optionally, the high-quality focus is maintained or relaxed per task needs (Yang et al., 6 Dec 2025).

Piecewise or continuously increasing $w_H^{(k)}$, the probability of sampling HQ data, establishes a "quality-progressive" training curriculum, potentially further subdivided for more granular control.
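
To make the stagewise mixture concrete, the sketch below draws each training example's source (high-quality vs. other) according to the current stage's $w_H^{(k)}$. The cumulative token boundaries and helper names are illustrative, loosely derived from the staged recipe above rather than taken from the published configuration.

```python
import random

# Illustrative cumulative token boundaries and HQ sampling probabilities w_H^(k),
# loosely based on the staged recipe above (warmup omitted for brevity).
STAGES = [
    (12.5e12, 0.525),   # Stable Phase I: diversity-enriched mixture
    (19.0e12, 1.0),     # Stable Phase II: 100% high-quality
    (23.0e12, 1.0),     # Decay: HQ focus maintained (may be relaxed per task)
]

def sample_source(tokens_seen):
    """Return 'HQ' or 'other' for the next sample, given cumulative tokens seen."""
    for boundary, w_hq in STAGES:
        if tokens_seen < boundary:
            return "HQ" if random.random() < w_hq else "other"
    return "HQ"  # past the final boundary, default to high-quality data
```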

4. Functional Scaling Law (FSL) and Theoretical Analysis

Under the Functional Scaling Law (FSL) formalism (Li et al., 23 Sep 2025), FG-WSD emerges as theoretically near-optimal for large-scale pretraining. FSL models training via an SDE with time-varying step size and batch size, establishing that:

  • The risk/loss at the end of training decomposes into model approximation error, a full-batch decay term, and an SGD noise convolution term:

$$E[R_K] - \frac{1}{2}\sigma^2 \;\sim\; M^{-s\beta} + T^{-s} + \frac{\sigma^2}{B}\left[b + (a-b)\,\frac{\min\{M,\, T_2^{1/\beta}\}}{T_2}\right]$$

where $T_2$ is the decay-phase "intrinsic time", $a, b$ are $\eta_{\max}, \eta_{\min}$, and $s, \beta$ are the data-difficulty and model-capacity exponents (evaluated numerically in the sketch after this list).

  • FG-WSD removes suboptimal $\log D$ factors present in exponential decay, yielding improved scaling exponents, especially in the "easy regime" ($s > 1 - 1/\beta$) and for compute-limited settings.
  • Plateau (stable phase) occupies 80–95% of training, with a short, aggressive decay (5–15%) concentrating the final convergence. This timing optimizes both convergence and generalization (Li et al., 23 Sep 2025).
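
For intuition, the scaling expression above can be evaluated numerically. The sketch below plugs in assumed values for $M$, $T$, $T_2$, $B$, $a$, $b$, $s$, and $\beta$ (all illustrative) to see how the decay-phase intrinsic time $T_2$ enters the SGD-noise term.

```python
def fsl_excess_risk(M, T, T2, B, a, b, s, beta, sigma2=1.0):
    """Evaluate the FSL-style excess-risk expression quoted above.

    All inputs are illustrative: M = model size, T = total intrinsic time,
    T2 = decay-phase intrinsic time, B = batch size, a/b = eta_max/eta_min,
    s/beta = data-difficulty and model-capacity exponents.
    """
    approx_term = M ** (-s * beta)     # model approximation error
    decay_term = T ** (-s)             # full-batch decay term
    noise_term = (sigma2 / B) * (b + (a - b) * min(M, T2 ** (1.0 / beta)) / T2)
    return approx_term + decay_term + noise_term

# Example with hypothetical numbers:
# print(fsl_excess_risk(M=1e9, T=1e6, T2=1e5, B=1024, a=4.5e-4, b=1.5e-6, s=0.5, beta=2.0))
```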

5. Empirical Evidence and Ablation Results

Comparative ablations validate that FG-WSD consistently outperforms vanilla WSD schedulers, especially on reasoning-heavy benchmarks. For a 1B-parameter model (1T tokens), benchmark improvements are pronounced:

| Scheduler | GSM8k | CMath | BBH | MMLU | CMMLU | MMLU-Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla WSD | 27.1 | 34.5 | 29.3 | 49.2 | 50.3 | 16.87 |
| Fine-Grained WSD | 34.3 | 39.5 | 31.6 | 50.6 | 51.9 | 18.64 |

Final validation perplexity is minimized by square-root and lowered-linear schedules, with bias+variance sums lowest for these shapes (Yang et al., 6 Dec 2025, Dremov et al., 2 Aug 2025).

6. Practical Implementation and Optimization Recipes

  • For single runs, square-root ($s_{\text{sqrt}}(x) = 1 - \sqrt{x}$) or lowered-linear with $c \approx 0.7$ are the preferred decay functions.
  • In model-averaging regimes ("soups"), aggressive high-variance shapes are recommended, with subsequent retuning of optimizer hyperparameters.
  • During cooldown, increasing AdamW $\beta_2$ to 0.99–0.999 delivers consistent perplexity gains of 0.1–0.2 points, with $\beta_1$ fixed at 0.9 (see the sketch below).
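
A minimal PyTorch sketch of the cooldown-phase optimizer adjustment referenced above, assuming an existing AdamW optimizer; the step at which it is applied and the exact $\beta_2$ value are illustrative.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=4.5e-4, betas=(0.9, 0.95))

def enter_cooldown(optimizer, beta2=0.999):
    """Raise AdamW beta2 at the start of the decay phase (beta1 stays unchanged)."""
    for group in optimizer.param_groups:
        beta1, _ = group["betas"]
        group["betas"] = (beta1, beta2)

# Example: once training reaches the decay onset T_*, switch beta2.
# if step == decay_onset_step:
#     enter_cooldown(optimizer, beta2=0.999)
```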

Hyperparameter recommendations align warmup and decay fractions with both model and data scaling. Batch size, weight decay, and optimizer state also interact nontrivially with the bias–variance properties induced by FG-WSD.

7. Loss Landscape and Optimization Dynamics

Visualization of loss in joint weight-space directions ("global" training path and "local" optimizer steps) reveals that:

  • Early in decay, the parameter trajectory explores "river valleys" in the loss landscape with multiple orthogonal ridges.
  • Aggressive cooldown shapes extend exploration along these valleys before contraction to a final basin, supporting low-bias but high-variance outcomes.
  • Conservative shapes rapidly commit to a specific basin, lowering variance but potentially increasing bias (Dremov et al., 2 Aug 2025).

These dynamics confirm the theoretical predictions of the exploration–exploitation trade-off induced by fine-grained decay scheduling.


FG-WSD thus constitutes a flexible, theoretically validated paradigm for orchestrating both learning rate and data curriculum at high resolution. Its deployment is central in modern transformer training—manifesting clear improvements in both sample efficiency and generalization, particularly when reasoning or high-quality language understanding are prioritized (Dremov et al., 2 Aug 2025, Yang et al., 6 Dec 2025, Li et al., 23 Sep 2025).
