
Fine-Grained Warmup-Stable-Decay Scheduler

Updated 17 December 2025
  • FG-WSD is a refined learning-rate scheduler that stages a data quality curriculum within the stable phase to improve convergence and specialization.
  • It decouples learning rate adjustments from data mixture transitions, fostering robust early exploration followed by precise late-stage refinement.
  • Empirical results on Nanbeige4-3B show faster convergence and enhanced performance on benchmarks requiring advanced reasoning and mathematical precision.

Fine-Grained Warmup–Stable–Decay (FG-WSD) is an enhancement of the standard Warmup–Stable–Decay (WSD) learning-rate schedule, designed to optimize large-scale LLM pre-training by decoupling learning-rate control from data-mixture progression. FG-WSD introduces a curriculum over data quality within the stable phase, facilitating early exploration and late-stage refinement. This methodology, introduced and empirically validated in the context of Nanbeige4-3B pre-training, yields improved stability, faster convergence, and elevated performance, particularly on benchmarks requiring advanced reasoning and mathematical capabilities (Yang et al., 6 Dec 2025). Theoretical analysis under the Functional Scaling Law (FSL) framework establishes optimality regimes and guides hyperparameter selection for FG-WSD (Li et al., 23 Sep 2025).

1. Conceptual Foundation and Motivations

FG-WSD extends the conventional WSD schedule, which comprises three sequential phases: a short linear warmup to the peak learning rate (LR), a flat LR plateau (the stable phase) with a fixed data mixture, and a smooth LR decay to a floor value. FG-WSD partitions the stable plateau into multiple sub-stages, and each successive sub-stage shifts the data mixture toward higher-quality data. The data-mixture changes inside the plateau therefore occur while the LR is held constant, whereas in conventional WSD these transitions are often synchronized with LR changes.

The rationale is twofold: varying data quality under a constant LR enables robust exploration followed by precise specialization, and uncoupling data and LR transitions yields a smoother optimization trajectory. The approach is particularly well-suited to pre-training regimes where dataset quality can be stratified and prioritized in a staged fashion (Yang et al., 6 Dec 2025).

2. Mathematical and Algorithmic Formulation

FG-WSD schedules both the learning rate $\eta(t)$ and the data sampling mixture $M(t)$ as piecewise functions of training progress, indexed either by tokens processed or by intrinsic time in the FSL formalism (Li et al., 23 Sep 2025). The formulation for Nanbeige4-3B utilizes the following notation:

  • $T_\text{total}$: Total training tokens (23T)
  • $T_w$: Warmup tokens (0.1T)
  • $T_d$: Diversity-enriched stable tokens (12.4T)
  • $T_h$: High-quality stable tokens (6.5T)
  • $T_\text{dec}$: Decay tokens (4T)
  • $\eta_\text{max}$: Peak LR ($4.5 \times 10^{-4}$)
  • $\eta_\text{final}$: Final LR ($1.5 \times 10^{-6}$)
  • $D_\text{full}$, $D_\text{div}$, $D_\text{HQ}$: Full corpus, diversity-enriched subset, and high-quality up-sampled subset

The schedule is given piecewise:

  • $0 \leq t \leq T_w$: $\eta(t) = \eta_\text{max} \cdot \frac{t}{T_w}$; sample from $D_\text{full}$
  • $T_w < t \leq T_w + T_d$: $\eta(t) = \eta_\text{max}$; sample from $D_\text{div}$
  • $T_w + T_d < t \leq T_w + T_d + T_h$: $\eta(t) = \eta_\text{max}$; sample from $D_\text{HQ}$
  • $T_w + T_d + T_h < t \leq T_\text{total}$: $\eta(t) = \eta_\text{max}(1 - t'/T_\text{dec}) + \eta_\text{final}(t'/T_\text{dec})$, where $t' = t - (T_w + T_d + T_h)$; sample from $D_\text{HQ}$

The corresponding pseudocode succinctly encapsulates these staged transitions over both LR and data source, ensuring methodological transparency and reproducibility (Yang et al., 6 Dec 2025).
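The piecewise schedule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `fg_wsd`, the string labels for the data sources, and the choice to measure `t` in trillions of tokens are assumptions made for this example; only the token budgets and learning rates come from the notation list.

```python
from typing import Tuple

# Token budgets in trillions of tokens, taken from the notation list above.
T_W, T_D, T_H, T_DEC = 0.1, 12.4, 6.5, 4.0
T_TOTAL = T_W + T_D + T_H + T_DEC          # 23T in total
ETA_MAX, ETA_FINAL = 4.5e-4, 1.5e-6


def fg_wsd(t: float) -> Tuple[float, str]:
    """Return (learning rate, data-source label) after t trillion tokens.

    The labels "D_full", "D_div", "D_HQ" are illustrative placeholders for
    whatever sampler a real training loop would switch between.
    """
    if not 0.0 <= t <= T_TOTAL + 1e-9:
        raise ValueError(f"t={t} is outside the training budget [0, {T_TOTAL}]")
    if t <= T_W:                            # linear warmup on the full corpus
        return ETA_MAX * t / T_W, "D_full"
    if t <= T_W + T_D:                      # stable sub-stage 1: diversity-enriched data
        return ETA_MAX, "D_div"
    if t <= T_W + T_D + T_H:                # stable sub-stage 2: high-quality data
        return ETA_MAX, "D_HQ"
    # Decay: linear interpolation from ETA_MAX to ETA_FINAL, still on D_HQ.
    frac = min((t - (T_W + T_D + T_H)) / T_DEC, 1.0)
    return ETA_MAX * (1.0 - frac) + ETA_FINAL * frac, "D_HQ"
```

Note that the switch from the diversity-enriched to the high-quality subset happens while the function still returns the peak LR, which is the decoupling the schedule is built around. A training loop would call `fg_wsd(tokens_seen / 1e12)` once per step and pass the two results to its optimizer and data loader respectively.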

3. Theoretical Analysis and Scaling Properties

The FG-WSD paradigm is analytically grounded within the Functional Scaling Law (FSL) framework, wherein the learning dynamics of mini-batch SGD are modeled via stochastic differential equations with explicit incorporation of LR scheduling (Li et al., 23 Sep 2025). The expected population risk is characterized as:

$$E[R(v_t)] = \tfrac{1}{2}\sigma^2 + M^{-s\beta} + \frac{1}{t^s} + \int_0^t \mathcal{K}(t-r)\,[e(r) + \sigma^2]\,\gamma(r)\,dr$$

Here, $M$ is the number of effective features, $s$ and $\beta$ are problem-dependent exponents, and $\gamma(r)$ encodes the LR schedule. For FG-WSD, the schedule enters as a sequence of blockwise-constant or linearly decaying LRs aligned with data-mixture transitions.

Closed-form scaling rules emerge for both data-limited and compute-limited regimes, allowing optimal choices for LR, batch size, decay-phase fraction, and model width. A key result is that, for hard learning regimes ($s < 1 - 1/\beta$), the asymptotically optimal decay phase becomes vanishingly brief, justifying empirical practices. For easy regimes, the decay-phase fraction is constant and independent of data scale. These analyses legitimize the allocation of a substantial stable phase and a curriculum on data quality within FG-WSD (Li et al., 23 Sep 2025).
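The regime condition quoted above can be turned into a small helper. This is only a sketch: the function name `fsl_regime` is hypothetical, and treating the boundary case $s = 1 - 1/\beta$ as "easy" and requiring positive exponents are assumptions of this example, since the text only states the strict inequality for the hard regime.

```python
def fsl_regime(s: float, beta: float) -> str:
    """Classify a task as "hard" or "easy" from the exponents (s, beta).

    Uses the condition stated in the text: hard when s < 1 - 1/beta.
    Everything else, including the boundary, is labeled "easy" here
    (an assumption of this sketch).
    """
    if s <= 0 or beta <= 0:
        raise ValueError("this sketch assumes positive exponents s and beta")
    return "hard" if s < 1.0 - 1.0 / beta else "easy"
```

For instance, `fsl_regime(0.3, 2.0)` compares 0.3 against a threshold of 0.5 and lands in the hard regime, where the analysis favors a very short decay phase.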

4. Empirical Validation and Hyperparameterization

Empirical studies within Nanbeige4-3B and associated 1B-parameter ablations substantiate the effectiveness of FG-WSD. Ablation results on reasoning-intensive benchmarks clearly demarcate gains attributable to data-curriculum staging (see table below):

| Schedule | GSM8K | CMath | BBH | MMLU | CMMLU | MMLU-Pro |
|---|---|---|---|---|---|---|
| Vanilla WSD | 27.1 | 34.5 | 29.3 | 49.2 | 50.3 | 16.87 |
| FG-WSD | 34.3 | 39.5 | 31.6 | 50.6 | 51.9 | 18.64 |
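The per-benchmark differences implied by the table above can be checked with a short script. The variable names are illustrative; the scores are copied from the table.

```python
benchmarks = ["GSM8K", "CMath", "BBH", "MMLU", "CMMLU", "MMLU-Pro"]
vanilla_wsd = [27.1, 34.5, 29.3, 49.2, 50.3, 16.87]
fg_wsd_scores = [34.3, 39.5, 31.6, 50.6, 51.9, 18.64]

# Absolute gain of FG-WSD over vanilla WSD, rounded to hide float noise.
gains = {
    name: round(fg - base, 2)
    for name, base, fg in zip(benchmarks, vanilla_wsd, fg_wsd_scores)
}
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name:<9} +{gain:.2f}")
print("largest gain:", largest)
```

The two math benchmarks (GSM8K and CMath) show the largest differences, while the knowledge-oriented benchmarks (MMLU, CMMLU, MMLU-Pro) move by less than two points.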

Absolute gains reach +7.2 points (GSM8K), with highest impact on mathematically rigorous tasks. Implementation in Nanbeige4-3B utilizes the following hyperparameters:

| Stage | Tokens | Learning Rate | Data Source |
|---|---|---|---|
| Warmup | 0.1T | $0 \rightarrow 4.5 \times 10^{-4}$ (linear) | Full corpus (23T) |
| Diversity-Stable | 12.4T | $4.5 \times 10^{-4}$ (constant) | $D_\text{div}$ ($\sim$12.5T) |
| High-Qual-Stable | 6.5T | $4.5 \times 10^{-4}$ (constant) | $D_\text{HQ}$ (6.5T) |
| Decay | 4T | $4.5 \times 10^{-4} \rightarrow 1.5 \times 10^{-6}$ | $D_\text{HQ}$ |

Sub-phases subdivide the stable stage, and data selection is refined as training advances. Context length extension (up to 64K) was implemented during decay, but without modifying the scheduler.
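As a quick sanity check on the token budget, the stage lengths from the hyperparameter table can be summed and converted to fractions of the run. The dictionary and variable names below are illustrative; only the token counts come from the table.

```python
# Stage lengths in trillions of tokens, as listed in the hyperparameter table.
stages = {
    "Warmup": 0.1,
    "Diversity-Stable": 12.4,
    "High-Qual-Stable": 6.5,
    "Decay": 4.0,
}

total = sum(stages.values())
fractions = {name: tokens / total for name, tokens in stages.items()}
stable_fraction = fractions["Diversity-Stable"] + fractions["High-Qual-Stable"]

end = 0.0
for name, tokens in stages.items():
    end += tokens  # cumulative token count at which this stage finishes
    print(f"{name:<17} ends at {end:5.1f}T ({100 * fractions[name]:4.1f}% of budget)")
```

The stages add up to the stated 23T budget, with the two stable sub-stages together covering a little over four fifths of the run and the decay phase a bit more than one sixth.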

5. Benefits, Dynamics, and Observed Effects

FG-WSD demonstrates specific and reproducible benefits in a range of pre-training contexts:

  • Accelerates initial convergence by permitting broad corpus exposure while the LR is held constant at its peak value.
  • Enhances final reasoning accuracy, especially on benchmarks requiring chain-of-thought or mathematical sophistication.
  • Decouples shifts in data sampling from learning-rate transitions, yielding a more stable and interpretable loss curve throughout training.
  • Empirical superiority over warmup-cosine-decay and vanilla WSD was observed on Math (AIME), Science (GPQA), Tool Use (BFCL), and alignment (Arena-Hard) tasks.

This suggests that the added curriculum mechanism in the stable phase is crucial for extracting maximal signal from high-quality subsets, without losing the advantages of wide-corpus exploration in early training.

6. Extensions, Optimization, and Practical Guidance

The FSL framework offers a surrogate loss model that, once fit to loss curves under a given LR schedule, can predict and optimize loss trajectories under FG-WSD and variants. Practitioners can numerically backpropagate through the surrogate, optimizing token-by-token LR subject to schedule constraints. This supports schedule search and fine-tuning beyond static rule-based strategies (Li et al., 23 Sep 2025).

Optimal schedule selection depends on whether the regime is data-limited or compute-limited, and on the task difficulty parameters $(s, \beta)$. The framework provides explicit guidance for tuning the fraction of tokens allocated to warmup, stable, and decay stages, along with the degree of data-quality refinement per sub-stage. For hard tasks, a very brief decay is justified, while the warmup length only weakly affects the leading-order risk.

A plausible implication is that FG-WSD acts as a general, plug-in scheduler, retaining the implementation simplicity of WSD while bringing the performance benefits of explicit data curricula and staged specialization, with both theoretical justifications and empirical validation in state-of-the-art small LLM regimes (Yang et al., 6 Dec 2025, Li et al., 23 Sep 2025).
