Fine-Grained Warmup-Stable-Decay Scheduler

Updated 17 December 2025

FG-WSD is a refined learning-rate scheduler that stages a data quality curriculum within the stable phase to improve convergence and specialization.
It decouples learning rate adjustments from data mixture transitions, fostering robust early exploration followed by precise late-stage refinement.
Empirical results on Nanbeige4-3B show faster convergence and enhanced performance on benchmarks requiring advanced reasoning and mathematical precision.

Fine-Grained Warmup–Stable–Decay (FG-WSD) is an enhancement of the standard Warmup–Stable–Decay (WSD) learning-rate schedule, designed to optimize large-scale LLM pre-training by decoupling learning-rate control from data-mixture progression. FG-WSD introduces a curriculum over data quality within the stable phase, facilitating early exploration and late-stage refinement. This methodology, introduced and empirically validated in the context of Nanbeige4-3B pre-training, yields improved stability, faster convergence, and elevated performance, particularly on benchmarks requiring advanced reasoning and mathematical capabilities (Yang et al., 6 Dec 2025). Theoretical analysis under the Functional Scaling Law (FSL) framework establishes optimality regimes and guides hyperparameter selection for FG-WSD (Li et al., 23 Sep 2025).

1. Conceptual Foundation and Motivations

FG-WSD extends the conventional WSD schedule, which comprises three sequential phases: a short linear warmup to peak learning rate (LR), a flat LR plateau (stable phase) with a fixed data mixture, and a smooth LR decay to a floor value. FG-WSD modifies the stable plateau by partitioning it into multiple sub-stages; in each successive sub-stage, the data mixture is shifted further toward higher-quality data. This temporal decoupling ensures that changes in data mixture occur independently of LR adjustments, in contrast to conventional WSD where these transitions are often synchronized.

The rationale is twofold: varying data quality under a constant LR enables robust exploration followed by precise specialization, and uncoupling data and LR transitions yields a smoother optimization trajectory. The approach is particularly well-suited to pre-training regimes where dataset quality can be stratified and prioritized in a staged fashion (Yang et al., 6 Dec 2025).

2. Mathematical and Algorithmic Formulation

FG-WSD schedules both the learning rate $\eta(t)$ and the data sampling mixture $M(t)$ as piecewise functions of training progress, indexed either by tokens processed or by intrinsic time in the FSL formalism (Li et al., 23 Sep 2025). The formulation for Nanbeige4-3B utilizes the following notation:

$T_\text{total}$ : Total training tokens (23T)
$T_w$ : Warmup tokens (0.1T)
$T_d$ : Diversity-enriched stable tokens (12.4T)
$T_h$ : High-quality stable tokens (6.5T)
$T_\text{dec}$ : Decay tokens (4T)
$\eta_\text{max}$ : Peak LR ( $4.5 \times 10^{-4}$ )
$\eta_\text{final}$ : Final LR ( $1.5 \times 10^{-6}$ )
$D_\text{full}$ , $D_\text{div}$ , $D_\text{HQ}$ : Full corpus, diversity-enriched subset, and high-quality up-sampled subset

The schedule is given piecewise:

$0 \leq t \leq T_w$ : $\eta(t) = \eta_\text{max} \cdot \frac{t}{T_w}$ ; sample from $D_\text{full}$
$T_w < t \leq T_w+T_d$ : $\eta(t) = \eta_\text{max}$ ; sample from $D_\text{div}$
$T_w+T_d < t \leq T_w+T_d+T_h$ : $\eta(t) = \eta_\text{max}$ ; sample from $D_\text{HQ}$
$T_w+T_d+T_h < t \leq T_\text{total}$ : $\eta(t) = \eta_\text{max}(1 - t'/T_\text{dec}) + \eta_\text{final}(t'/T_\text{dec})$ , $t' = t - (T_w + T_d + T_h)$ ; sample from $D_\text{HQ}$

Because both the LR and the data source are piecewise functions of the token count alone, these staged transitions can be stated directly as pseudocode, which aids transparency and reproducibility (Yang et al., 6 Dec 2025).
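The piecewise schedule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the technical report: the names `FGWSDConfig` and `fg_wsd`, and the string labels used for the data sources, are hypothetical; only the token budgets and learning rates come from the notation listed above.

```python
from dataclasses import dataclass
from typing import Tuple

T = 1e12  # one trillion tokens


@dataclass(frozen=True)
class FGWSDConfig:
    """Token budgets and learning rates quoted above for Nanbeige4-3B."""
    warmup_tokens: float = 0.1 * T        # T_w
    diverse_tokens: float = 12.4 * T      # T_d
    high_quality_tokens: float = 6.5 * T  # T_h
    decay_tokens: float = 4.0 * T         # T_dec
    lr_max: float = 4.5e-4                # eta_max
    lr_final: float = 1.5e-6              # eta_final


def fg_wsd(t: float, cfg: FGWSDConfig = FGWSDConfig()) -> Tuple[float, str]:
    """Return (learning rate, data-source label) after t training tokens."""
    stable_start = cfg.warmup_tokens
    hq_start = stable_start + cfg.diverse_tokens
    decay_start = hq_start + cfg.high_quality_tokens

    if t <= stable_start:   # linear warmup on the full corpus
        return cfg.lr_max * t / cfg.warmup_tokens, "D_full"
    if t <= hq_start:       # stable phase, diversity-enriched data
        return cfg.lr_max, "D_div"
    if t <= decay_start:    # stable phase, high-quality data
        return cfg.lr_max, "D_HQ"
    # Linear decay from lr_max to lr_final; the data source stays D_HQ.
    frac = min((t - decay_start) / cfg.decay_tokens, 1.0)
    return cfg.lr_max * (1.0 - frac) + cfg.lr_final * frac, "D_HQ"


if __name__ == "__main__":
    for tokens_in_trillions in (0.05, 5.0, 15.0, 21.0, 23.0):
        lr, source = fg_wsd(tokens_in_trillions * T)
        print(f"{tokens_in_trillions:5.2f}T  lr={lr:.3e}  source={source}")
```

Note that the switch from $D_\text{div}$ to $D_\text{HQ}$ happens while the LR is held at $\eta_\text{max}$, which is the decoupling of data-mixture and LR transitions that distinguishes FG-WSD from vanilla WSD.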

3. Theoretical Analysis and Scaling Properties

The FG-WSD paradigm is analytically grounded within the Functional Scaling Law (FSL) framework, wherein the learning dynamics of mini-batch SGD are modeled via stochastic differential equations with explicit incorporation of LR scheduling (Li et al., 23 Sep 2025). The expected population risk is characterized as:

$E[R(v_t)] = \tfrac{1}{2}\sigma^2 + M^{-s\beta} + \frac{1}{t^s} + \int_0^t \mathcal{K}(t-r)[e(r) + \sigma^2]\gamma(r)dr$

Here, $M$ is the number of effective features, $s$ and $\beta$ are problem-dependent exponents, and $\gamma(r)$ encodes the LR schedule. For FG-WSD, the schedule is introduced as a sequence of blockwise-constant or linearly decaying LRs aligned with data-mixture transitions.

Closed-form scaling rules emerge for both data-limited and compute-limited regimes, allowing optimal choices for LR, batch size, decay-phase fraction, and model width. A key result is that, for hard learning regimes ( $s < 1 - 1/\beta$ ), the asymptotically optimal decay phase becomes vanishingly brief, justifying empirical practices. For easy regimes, the decay-phase fraction is constant and independent of data scale. These analyses legitimize the allocation of a substantial stable phase and a curriculum on data quality within FG-WSD (Li et al., 23 Sep 2025).
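The regime condition quoted above is simple enough to check in code. The sketch below only encodes the inequality $s < 1 - 1/\beta$ from the text; the function name `fsl_regime` is hypothetical, the example exponents are illustrative rather than fitted values, and assigning the boundary case $s = 1 - 1/\beta$ to the easy regime is an assumption, since the text states only the strict inequality.

```python
def fsl_regime(s: float, beta: float) -> str:
    """Classify a task as 'hard' or 'easy' using the condition s < 1 - 1/beta."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Assumption: the boundary s == 1 - 1/beta is treated as 'easy'.
    return "hard" if s < 1.0 - 1.0 / beta else "easy"


if __name__ == "__main__":
    # Illustrative exponents only; real values would be fit to loss curves.
    for s, beta in [(0.3, 2.0), (0.8, 2.0)]:
        print(f"s={s}, beta={beta}: {fsl_regime(s, beta)} regime")
```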

4. Empirical Validation and Hyperparameterization

Empirical studies within Nanbeige4-3B and associated 1B-parameter ablations substantiate the effectiveness of FG-WSD. Ablation results on reasoning-intensive benchmarks clearly demarcate gains attributable to data-curriculum staging (see table below):

Schedule	GSM8K	CMath	BBH	MMLU	CMMLU	MMLU-Pro
Vanilla WSD	27.1	34.5	29.3	49.2	50.3	16.87
FG-WSD	34.3	39.5	31.6	50.6	51.9	18.64
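The per-benchmark gains implied by the table can be recomputed directly. The short script below uses only the scores listed above; the variable names are illustrative, and differences are rounded to two decimals to suppress floating-point noise.

```python
benchmarks = ["GSM8K", "CMath", "BBH", "MMLU", "CMMLU", "MMLU-Pro"]
vanilla_wsd = [27.1, 34.5, 29.3, 49.2, 50.3, 16.87]
fg_wsd_scores = [34.3, 39.5, 31.6, 50.6, 51.9, 18.64]

# Absolute gain of FG-WSD over vanilla WSD, in points.
gains = {
    name: round(new - old, 2)
    for name, old, new in zip(benchmarks, vanilla_wsd, fg_wsd_scores)
}

# List benchmarks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:9s} +{gain:.2f}")
```

The two math benchmarks, GSM8K and CMath, show the largest differences, while the general-knowledge benchmarks (MMLU, CMMLU) move by less than two points.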

Absolute gains reach +7.2 points (GSM8K), with highest impact on mathematically rigorous tasks. Implementation in Nanbeige4-3B utilizes the following hyperparameters:

Stage	Tokens	Learning Rate	Data Source
Warmup	0.1 T	$0 \rightarrow 4.5 \times 10^{-4}$ (linear)	Full corpus (23T)
Diversity-Stable	12.4 T	$4.5 \times 10^{-4}$ (constant)	$D_\text{div}$ ( $\sim$ 12.5T)
High-Qual-Stable	6.5 T	$4.5 \times 10^{-4}$ (constant)	$D_\text{HQ}$ (6.5T)
Decay	4 T	$4.5 \times 10^{-4} \rightarrow 1.5 \times 10^{-6}$	$D_\text{HQ}$

The stable stage is subdivided into sub-phases, with data selection refined as training advances. Context length was extended (up to 64K) during decay, without modifying the scheduler.
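As a quick consistency check on the stage table, the token budgets can be accumulated into stage boundaries and expressed as fractions of the total. The sketch below uses only the token counts from the table; the variable names are illustrative.

```python
# (stage name, tokens in trillions), in training order, from the table above.
stages = [
    ("Warmup", 0.1),
    ("Diversity-Stable", 12.4),
    ("High-Qual-Stable", 6.5),
    ("Decay", 4.0),
]

total = sum(tokens for _, tokens in stages)

start = 0.0
for name, tokens in stages:
    end = start + tokens
    share = tokens / total
    print(f"{name:17s} {start:5.1f}T -> {end:5.1f}T  ({share:6.2%} of budget)")
    start = end

print(f"Total: {total:.1f}T")
```

The four stages add up to the stated 23T budget, with the two stable sub-stages together taking roughly 82% of the tokens and the decay phase roughly 17%.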

5. Benefits, Dynamics, and Observed Effects

FG-WSD demonstrates specific and reproducible benefits in a range of pre-training contexts:

Accelerates initial convergence by permitting broad corpus exposure at a constant peak LR.
Enhances final reasoning accuracy, especially on benchmarks requiring chain-of-thought or mathematical sophistication.
Decouples shifts in data sampling from learning-rate transitions, yielding a more stable and interpretable loss curve throughout training.
Empirical superiority over warmup-cosine-decay and vanilla WSD was observed on Math (AIME), Science (GPQA), Tool Use (BFCL), and alignment (Arena-Hard) tasks.

This suggests that the added curriculum mechanism in the stable phase is crucial for extracting maximal signal from high-quality subsets, without losing the advantages of wide-corpus exploration in early training.

6. Extensions, Optimization, and Practical Guidance

The FSL framework offers a surrogate loss model that, once fit to loss curves under a given LR schedule, can predict and optimize loss trajectories under FG-WSD and variants. Practitioners can numerically backpropagate through the surrogate, optimizing token-by-token LR subject to schedule constraints. This supports schedule search and fine-tuning beyond static rule-based strategies (Li et al., 23 Sep 2025).

Optimal schedule selection depends on whether the regime is data-limited or compute-limited, and on task difficulty parameters $(s, \beta)$ . The framework provides explicit guidance for tuning the fraction of tokens allocated to warmup, stable, and decay stages, along with the degree of data-quality refinement per sub-stage. For hard tasks, a very brief decay is often justified, while the warmup length only weakly affects leading-order risk.

A plausible implication is that FG-WSD acts as a general, plug-in scheduler, retaining the implementation simplicity of WSD while bringing the performance benefits of explicit data curricula and staged specialization, with both theoretical justifications and empirical validation in state-of-the-art small LLM regimes (Yang et al., 6 Dec 2025, Li et al., 23 Sep 2025).


References

Yang et al. (2025). Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models.

Li et al. (2025). Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws.
