Fine-Grained Warmup-Stable-Decay (FG-WSD)

Updated 13 December 2025

FG-WSD is a family of schedulers that precisely controls the warmup, stable, and decay phases in large-scale transformer training.
It divides training into fine-tuned stages with explicit data curriculum adjustments to balance exploration and exploitation.
Empirical results demonstrate that FG-WSD achieves lower validation perplexity and improved performance on reasoning benchmarks.

Fine-Grained Warmup–Stable–Decay (FG-WSD) is a family of learning-rate and data-curriculum schedulers for large-scale neural network training, especially prominent in transformer-based LLMs. Built as an extension of the traditional Warmup–Stable–Decay (WSD) schedules, FG-WSD introduces explicit, tunable control over every phase—warmup, stable, decay/cooldown—and refines both the functional forms of the learning rate decay and the stagewise progression of data quality. Recent research has established FG-WSD as a near-optimal strategy under both theoretical scaling laws and empirical performance, particularly for reasoning-centric LLMs (Dremov et al., 2 Aug 2025, Yang et al., 6 Dec 2025, &&&2&&&).

1. Formal Definition and Schedule Structure

FG-WSD divides training into discrete, fine-controlled stages:

Warmup: Initial phase (e.g., 0–1% of total steps/tokens) where the learning rate is increased monotonically from zero to a maximum $\eta_{\max}$ .
Stable Plateau: One or more subphases with fixed learning rate $\eta_{\max}$ , during which the data mixture may transition from diverse/general to high-quality samples.
Decay (Cooldown): Final phase(s) where the learning rate decays smoothly to a low value $\eta_{\min}$ , over a prescribed functional form.

Formally, for total $T$ steps, partitioned as $T = T_w + T_{s,1} + \dots + T_{s,k} + T_d$ , with $T_w$ (warmup), $T_{s,i}$ (stable-phase $i$ ), $T_d$ (decay/cooldown):

Warmup ( $0 \leq t < T_w$ ): $\eta(t) = \eta_{\max} \cdot (t / T_w)$
Stable ( $T_w \leq t < T_w + \sum_j T_{s,j}$ ): $\eta(t) = \eta_{\max}$
Decay ( $T_* \leq t < T$ ): $\eta(t) = \eta_{\max} \cdot s\left(\frac{t - T_*}{T_d}\right)$

where $T_*$ is the onset of decay and $s(x)$ , $x \in [0,1]$ , is the chosen decay/cooldown shape (Dremov et al., 2 Aug 2025, Yang et al., 6 Dec 2025, Li et al., 23 Sep 2025).

A distinguishing feature of FG-WSD is the explicit possibility of subdividing the stable phase and associating each segment with a distinct data mixture. For example, Nanbeige4-3B uses:

Stage	Tokens (×10¹²)	Learning Rate	Data Mixture
Warmup	0.1	$0 \to 4.5 \times 10^{-4}$	Full
Diversity-Enriched Stable	12.4	$4.5 \times 10^{-4}$	Diverse (0.525 HQ)
High-Quality Stable	6.5	$4.5 \times 10^{-4}$	100% HQ
Decay	4.0	$4.5 \times 10^{-4} \to 1.5 \times 10^{-6}$	HQ or mixed

2. Cooldown Shape Families and Bias–Variance Trade-Off

In the decay phase, FG-WSD allows arbitrary monotone functions $s(x)$ to define the learning rate trajectory. Commonly employed shapes include:

Linear: $s_{\text{lin}}(x) = 1 - x$
Lowered Linear: $s_{\text{LL},c}(x) = 1 - c x$ , $c \in (0,1)$
Polynomial (Power-law): $s_p(x) = (1-x)^p$ , $p > 0$
Exponential: $s_{\exp,\alpha}(x) = \exp(-\alpha x)$ , $\alpha > 0$
Cosine: $s_{\cos}(x) = \frac{1+\cos(\pi x)}{2}$
Mirror-cosine: $s_{\text{mcos}}(x) = 2(1-x) - (1+\cos(\pi x))/2$
Square-root: $s_{\text{sqrt}}(x) = 1 - \sqrt{x}$

Each shape modulates the optimizer's exploration–exploitation dynamics. Aggressive shapes (mirror-cosine, square) maintain higher learning rates, promoting exploration and producing lower bias but higher run-to-run variance. Conservative shapes (linear, gentle polynomials) emphasize early exploitation, reducing variance at the potential cost of suboptimal convergence (higher bias) (Dremov et al., 2 Aug 2025).

Empirical analysis on transformers demonstrates that bias–variance decomposition of final validation loss is tightly controlled by this functional choice. The sum of bias and variance is minimized for intermediate shapes, notably square-root and lowered-linear with $c \approx 0.7$ , yielding optimal perplexity in single-run settings.

3. Data Curriculum and Stagewise Mixture Progression

FG-WSD further generalizes traditional schedulers by aligning the stable-phase segments with a fine-grained curriculum in data quality. In the Nanbeige4-3B regime:

Stable Phase I: Data is drawn from a specifically balanced mixture favoring diversity. The upsampled high-quality subset $H'$ (6.5 T tokens) is paired with a medium-quality $M'$ slice (6.5 T tokens), giving a mixture probability $w_H^{(1)} = 0.525$ for HQ data.
Stable Phase II: Switch exclusively to high-quality ( $w_H^{(2)} =1$ ).
Decay: Optionally, the high-quality focus is maintained or relaxed per task needs (Yang et al., 6 Dec 2025).

Piecewise or continuously increasing $w_H^{(k)}$ —the probability of sampling HQ—establishes a "quality-progressive" training curriculum, potentially further subdivided for more granular control.

4. Functional Scaling Law (FSL) and Theoretical Analysis

Under the Functional Scaling Law (FSL) formalism (Li et al., 23 Sep 2025), FG-WSD emerges as theoretically near-optimal for large-scale pretraining. FSL models training via an SDE with time-varying step size and batch, establishing that:

The risk/loss at the end of training decomposes into model approximation error, a full-batch decay term, and an SGD noise convolution term:

$E[R_K] - \frac{1}{2}\sigma^2 \sim M^{-s\beta} + T^{-s} + \frac{\sigma^2}{B}\left[b + (a-b)\frac{\min\{M, T_2^{1/\beta}\}}{T_2}\right]$

where $T_2$ is the decay-phase "intrinsic time", $a, b$ are $\eta_{\max}, \eta_{\min}$ , and $s, \beta$ are data difficulty and model capacity exponents.

FG-WSD removes suboptimal $\log D$ factors present in exponential decay, yielding improved scaling exponents—especially in the "easy regime" ( $s > 1-1/\beta$ ) and for compute-limited settings.
Plateau (stable phase) occupies 80–95% of training, with a short, aggressive decay (5–15%) concentrating the final convergence. This timing optimizes both convergence and generalization (Li et al., 23 Sep 2025).

5. Empirical Evidence and Ablation Results

Comparative ablations validate that FG-WSD consistently outperforms vanilla WSD schedulers, especially on reasoning-heavy benchmarks. For a 1B-parameter model (1T tokens), benchmark improvements are pronounced:

Scheduler	GSM8k	CMath	BBH	MMLU	CMMLU	MMLU-Pro
Vanilla WSD	27.1	34.5	29.3	49.2	50.3	16.87
Fine-Grained WSD	34.3	39.5	31.6	50.6	51.9	18.64

Final validation perplexity is minimized by square-root and lowered-linear schedules, with bias+variance sums lowest for these shapes (Yang et al., 6 Dec 2025, Dremov et al., 2 Aug 2025).

6. Practical Implementation and Optimization Recipes

For single runs, square-root ( $s_{\text{sqrt}}(x) = 1 - \sqrt{x}$ ) or lowered-linear with $c \approx 0.7$ are preferred for the decay function.
In model-averaging regimes ("soups"), aggressive high-variance shapes are recommended, with subsequent retuning of optimizer hyperparameters.
During cooldown, increasing AdamW $\beta_2$ to $0.99$–$0.999$ delivers consistent perplexity gains of $0.1$–$0.2$ points; $\beta_1$ fixed at $0.9$.

Hyperparameter recommendations align warmup and decay fractions with both model and data scaling. Batch size, weight decay, and optimizer state also interact nontrivially with the bias–variance properties induced by FG-WSD.

7. Loss Landscape and Optimization Dynamics

Visualization of loss in joint weight-space directions ("global" training path and "local" optimizer steps) reveals that:

Early in decay, the parameter trajectory explores "river valleys" in the loss landscape with multiple orthogonal ridges.
Aggressive cooldown shapes extend exploration along these valleys before contraction to a final basin, supporting low-bias but high-variance outcomes.
Conservative shapes rapidly commit to a specific basin, lowering variance but potentially increasing bias (Dremov et al., 2 Aug 2025).

These dynamics confirm the theoretical predictions of the exploration–exploitation trade-off induced by fine-grained decay scheduling.

FG-WSD thus constitutes a flexible, theoretically validated paradigm for orchestrating both learning rate and data curriculum at high resolution. Its deployment is central in modern transformer training—manifesting clear improvements in both sample efficiency and generalization, particularly when reasoning or high-quality language understanding are prioritized (Dremov et al., 2 Aug 2025, Yang et al., 6 Dec 2025, Li et al., 23 Sep 2025).