Two-Stage SFT-RL Pipelines

Updated 14 April 2026
  • Two-stage SFT-RL pipelines combine supervised fine-tuning on expert demonstrations with reinforcement learning to refine LLM behaviors using task-aligned rewards.
  • The SFT stage rapidly establishes valid solution modes and stabilizes learning, while the RL stage enhances exploration and corrects overfitting through reward-driven updates.
  • Empirical results show improved sample efficiency, enhanced generalization, and robust performance across language, coding, and complex reasoning tasks.

A two-stage SFT→RL pipeline is a post-training regimen for LLMs and related architectures that first applies Supervised Fine-Tuning (SFT) on curated demonstrations before online or on-policy Reinforcement Learning (RL) with respect to a task-aligned reward signal. These pipelines operationalize a principle of initialization and refinement: SFT leverages existing expert data to rapidly establish valid solution modes and stabilize learning, while RL subsequently sharpens performance by incorporating reward-driven exploration, solution diversity, and distributional shift handling. This approach has become foundational in both language-only and multimodal (e.g., vision-language) model alignment for complex reasoning tasks, coding, mathematical problem-solving, and other high-value AI capabilities.

1. The Standard Two-Stage SFT→RL Framework

The canonical SFT→RL pipeline consists of two sequential stages:

1. Supervised Fine-Tuning (SFT) Stage:

A pretrained backbone model is fine-tuned on a dataset of high-quality, human- or LLM-generated prompt–response pairs, typically including chain-of-thought (CoT) reasoning traces for tasks necessitating stepwise inference. The SFT optimizes the autoregressive negative log-likelihood loss:

$$\mathcal{L}_{\rm SFT}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal D}\bigl[\log \pi_\theta(y\mid x)\bigr]$$

Here, SFT provides a strong inductive anchor for the policy, ensuring non-trivial initial solution rates even in environments with sparse rewards (Jiang et al., 14 Mar 2026, Yoshihara et al., 11 Jul 2025).
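
As a concrete illustration, the loss above amounts to a masked token-level cross-entropy over response tokens. The following minimal PyTorch sketch (using random placeholder tensors in place of real model outputs and tokenized data) shows one common way to compute it; the function name and masking convention are illustrative assumptions, not a specific codebase's API.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, loss_mask):
    """Masked autoregressive NLL: -log pi_theta(y | x) averaged over response tokens.

    logits:     (batch, seq, vocab) model outputs at positions predicting the next token
    target_ids: (batch, seq) ground-truth next tokens (inputs shifted by one)
    loss_mask:  (batch, seq) 1.0 on response tokens, 0.0 on prompt/padding tokens
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Only response tokens contribute to L_SFT; prompt tokens are conditioning context.
    return -(token_ll * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)

# Toy usage with random tensors standing in for a model forward pass
B, T, V = 2, 8, 32
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
mask = torch.ones(B, T)
mask[:, :3] = 0.0  # pretend the first 3 positions belong to the prompt
print(sft_loss(logits, targets, mask))
```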

2. Reinforcement Learning (RL) Stage:

The SFT-trained model is further adapted via RL to maximize expected reward, where the reward is typically based on explicit correctness, verifier approval, or human-preference modeling. RL updates are performed via on-policy algorithms such as PPO or Group Relative Policy Optimization (GRPO), which in practice add clipped importance weights and group-normalized advantages on top of the generic KL-regularized objective:

$$\mathcal{L}_{\rm RL}(\theta) = -\,\mathbb{E}_{\tau\sim\pi_\theta}\bigl[R(\tau)\bigr] + \beta\,\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\rm ref}\bigr)$$

The KL penalty regularizes the updated policy against the SFT initialization to prevent drift, language collapse, or reward hacking (Jiang et al., 14 Mar 2026, Liu et al., 1 Jun 2025).
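
For intuition, a per-prompt GRPO-style update can be sketched in a few lines. The snippet below is a simplified, self-contained approximation (sequence-level log-probabilities, a clipped importance-weighted surrogate, and a low-variance KL estimator toward the SFT reference); it is illustrative rather than a faithful reproduction of any cited implementation.

```python
import torch

def grpo_style_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """Clipped surrogate with group-normalized advantages and a KL penalty.

    logp_new / logp_old / logp_ref: (G,) sequence log-probs of G rollouts for one
        prompt under the current, behavior (sampling), and frozen SFT reference policies.
    rewards: (G,) scalar rewards, e.g. 1.0 if a verifier accepts the final answer.
    """
    # Group-normalized advantage: center and scale rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # PPO-style clipped importance-weighted surrogate.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    policy_loss = -surrogate.mean()

    # Low-variance estimator of KL(pi_theta || pi_ref), penalizing drift from the SFT policy.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return policy_loss + beta * kl.mean()

# Toy usage: 8 rollouts for one prompt, 3 of which the verifier marked correct.
logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.01 * torch.randn(8)
logp_ref = logp_new.detach() + 0.05 * torch.randn(8)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
grpo_style_loss(logp_new, logp_old, logp_ref, rewards).backward()
```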

The sequential pipeline can be extended with curriculum splitting (e.g., decoupling SFT on easier instances from RL on harder ones (Hu et al., 11 Mar 2026)), adaptive mixing or fallback mechanisms (Liu et al., 1 Jun 2025, Chen et al., 19 May 2025), or other explicit curriculum learning constructs.

2. Motivations and Theoretical Foundations

Empirical and analytical results consistently demonstrate that SFT and RL are complementary. SFT excels in sample-efficient acquisition of domain and format knowledge, driving rapid gains in basic solution accuracy and stabilizing early learning. However, pure SFT overfits to demonstration support and cannot incentivize off-distribution exploration or self-reflective correctness detection (Zhao et al., 4 Jan 2026, Matsutani et al., 25 Sep 2025). RL, in contrast, is necessary to concentrate probability mass on optimal behaviors, correct failure modes not covered by demonstrations, compress solution traces, or discover alternative strategies (Yoshihara et al., 11 Jul 2025, Liu et al., 16 Jun 2025). RL gradients are essential for optimizing decision components such as STOP/RESAMPLE selectors underpinning self-reflection in LLM policies (Zhao et al., 4 Jan 2026).

Recent theoretical work formalizes this complementarity through the "Gradient Attribution Property" and related analyses. SFT objectives induce highly biased, low-variance updates, but collapse reward attribution to generative tokens, failing to optimize decisional policies. RL surrogate rewards (group-normalized advantages), in contrast, provide balanced gradients that simultaneously update sampling and verification policies, undergirding the emergence of error-detection and self-correction (Zhao et al., 4 Jan 2026, Matsutani et al., 25 Sep 2025).

3. Advanced Variants and Extensions

Recent research introduces a variety of architectural and algorithmic extensions to the vanilla two-stage scheme to address known limitations:

  • Curriculum and Difficulty Decoupling:

DeReason proposes splitting the training data into SFT and RL subsets via difficulty scoring, allocating non-reasoning-intensive examples to SFT and challenging ones to RL, improving both sample efficiency and overall generalization (Hu et al., 11 Mar 2026).

  • Dynamic or Adaptive Mixing:

Frameworks such as SuperRL (Liu et al., 1 Jun 2025) and SASR (Chen et al., 19 May 2025) adaptively switch or mix between SFT and RL within the training loop, using reward feedback or gradient statistics (e.g., norm ratios) to select the objective per instance or minibatch. This stabilizes training and ensures a usable learning signal even when reward is sparse; a minimal sketch of such a fallback rule appears after this list.

  • Reward Learning in SFT:

Integrating inverse RL or preference-based reward learning into the SFT stage (RFT/IRFT (Li et al., 2024)) can alleviate demonstration distribution shift and provide a regularizing signal that sharpens the model’s ability to distinguish high- and low-quality outputs.

  • Branched Rollouts and Partial Demonstration Injection:

BREAD (Zhang et al., 20 Jun 2025) supplements standard SFT→RL with expert prefixes ("anchors") whenever the model fails to generate correct traces, curricularly reducing reward sparsity.

  • Bias-Variance Tradeoff Reconciliation:

DYPO (Zhu et al., 10 Apr 2026) and related frameworks structurally combine variance-reduced RL objectives (e.g., group alignment loss) and bias-reduced SFT (e.g., multi-teacher KL distillation) with a dynamic reward-based gating system, demonstrating provable variance control and bias minimization.

  • Cooperative (Bilevel) SFT+RL:

Bilevel schemes (BRIDGE) directly meta-train SFT to act as a “teacher” that efficiently guides RL, sidestepping catastrophic forgetting and promoting synergy through cooperative optimization (Chen et al., 8 Sep 2025).
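
As referenced under "Dynamic or Adaptive Mixing" above, one simple way to approximate reward-aware fallback behavior is shown below. The rule and helper names are illustrative assumptions rather than the actual SuperRL or SASR algorithms, which use richer reward- and gradient-based criteria.

```python
import torch

def hybrid_objective(rewards, rl_loss, sft_nll):
    """Illustrative per-prompt fallback between RL and SFT objectives.

    rewards:  (G,) verifier rewards for a group of rollouts on one prompt.
    rl_loss:  scalar RL (e.g. GRPO-style) loss computed from those rollouts.
    sft_nll:  scalar supervised NLL on an expert demonstration for the same prompt.

    If every rollout fails, group-normalized advantages carry no learning signal,
    so fall back to imitation; otherwise keep the reward-driven objective.
    """
    if torch.any(rewards > 0):
        return rl_loss
    return sft_nll

# Toy usage: sparse-reward case where all rollouts failed, so the SFT loss is used.
loss = hybrid_objective(torch.zeros(8), rl_loss=torch.tensor(0.7), sft_nll=torch.tensor(2.3))
```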

4. Performance, Sample Efficiency, and Empirical Findings

Empirical results across language, code, mathematics, STEM reasoning, and vulnerability detection demonstrate that two-stage SFT→RL pipelines:

  • Dramatically accelerate learning vs. RL from scratch, particularly in environments with sparse or hard-to-discover reward signals (e.g., OpenR1-Math, DeepScaleR) (Liu et al., 1 Jun 2025, Yoshihara et al., 11 Jul 2025).
  • Maximize both accuracy (via SFT) and token/computation efficiency (via RL): SFT pushes solution rate to the ceiling, while RL compresses reasoning traces and reduces sample redundancy (Yoshihara et al., 11 Jul 2025, Liu et al., 16 Jun 2025).
  • Improve cross-domain and cross-modal reasoning, especially in low- and moderate-capacity models and in low-data regimes; SFT’s sample efficiency reliably exceeds RL’s until larger scales, after which RL surpasses SFT in asymptotic accuracy (Yu et al., 14 Dec 2025).
  • Are sensitive to SFT data quality and curriculum: excessive SFT ("over-SFT") can degrade RL's exploration and cause performance to plateau, while insufficient SFT slows RL and increases reward hacking risk (Li et al., 15 Feb 2026).
  • Benefit from well-tuned RL sampling temperature and entropy to balance exploration and exploitation (Liu et al., 16 Jun 2025).

Quantitative highlights include:

  • SuperRL achieves 75% EM on GSM8K in ∼200 steps vs ∼500 for PPO/GRPO; on sparse Math tasks, reward mean nearly doubles vs. vanilla RL (Liu et al., 1 Jun 2025).
  • Difficulty-aware curricula (DeReason) yield +1.9% pass@1 over SFT-only and outperform both SFT→random RL and SFT-only on MMLU-Pro and GPQA-Diamond (Hu et al., 11 Mar 2026).
  • Token efficiency (output length reduction) of up to 25–30% post-RL with preserved or improved accuracy in competitive mathematics (Yoshihara et al., 11 Jul 2025).
  • In large-scale benchmarks, SFT provides >10× data efficiency vs. RL, but RL outpaces SFT as the data budget and model capacity grow (Yu et al., 14 Dec 2025).

5. Challenges, Failure Modes, and Open Controversies

Despite the success of the SFT→RL paradigm, several limitations and active research directions remain:

  • Catastrophic Forgetting:

Decoupled two-stage pipelines suffer from rapid loss of SFT-acquired behaviors at RL onset, evidenced by chain-of-thought length collapse and reward dips (Chen et al., 8 Sep 2025). Bilevel and continued SFT supervision are proposed remedies.

  • Reward Sparsity and Imitation Bias:

In hard reasoning domains, SFT-trained small models may have near-zero probability of discovering successful traces in RL, resulting in a lack of positive gradients (a failure mode of the plain SFT+GRPO recipe). Branched rollouts and expert-anchored prefixes (BREAD) mitigate this (Zhang et al., 20 Jun 2025); a schematic sketch of expert-anchored rollouts appears at the end of this section.

  • SFT Overfitting and Exploration Stagnation:

Excessive SFT training produces over-specialized policies unable to explore or self-improve under RL; hybrid or curriculum strategies address this (Li et al., 15 Feb 2026, Chen et al., 10 Jul 2025).

  • Synergy Dilemma in Multimodal or Long-CoT Architectures:

For VLMs, naively stacking SFT and RL often fails to yield additive benefits; SFT can worsen simplicity and RL may reduce reasoning depth—the "synergy dilemma." Adaptive, difficulty-aware scheduling and more nuanced data-model alignment are necessary (Chen et al., 10 Jul 2025, Yu et al., 14 Dec 2025).

  • Reward Hacking:

RL agents may learn to maximize proxy rewards with no improvement, or even regression, in genuine reasoning accuracy; careful reward model design and SFT preconditioning are critical (Yu et al., 14 Dec 2025).
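
As noted under "Reward Sparsity and Imitation Bias" above, expert-anchored (branched) rollouts can be approximated with a simple retry loop. The sketch below illustrates the general idea rather than the BREAD algorithm itself; `sample_fn` and `verify_fn` are hypothetical placeholders for a model's generation call and a task verifier.

```python
def anchored_rollouts(prompt, expert_trace, sample_fn, verify_fn,
                      n_rollouts=8, prefix_fracs=(0.0, 0.25, 0.5, 0.75)):
    """Retry rollouts from progressively longer prefixes of an expert demonstration.

    sample_fn(prompt, prefix) -> completion text; verify_fn(text) -> bool.
    Unanchored rollouts are tried first; if none verify, an expert prefix is
    injected so the RL stage still receives some positive reward signal.
    """
    for frac in prefix_fracs:
        prefix = expert_trace[: int(len(expert_trace) * frac)]
        rollouts = [prefix + sample_fn(prompt, prefix) for _ in range(n_rollouts)]
        rewards = [1.0 if verify_fn(r) else 0.0 for r in rollouts]
        if any(rewards):
            return rollouts, rewards, frac  # shortest anchor that yields signal
    return rollouts, rewards, prefix_fracs[-1]

# Toy usage with trivial stand-ins for generation and verification.
demo = "step1 step2 step3 answer=42"
rollouts, rewards, used_frac = anchored_rollouts(
    "solve the puzzle", demo,
    sample_fn=lambda prompt, prefix: " guess",   # stand-in generator
    verify_fn=lambda text: "step3" in text,      # stand-in verifier
    n_rollouts=4,
)
```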

6. Practical Guidelines and Implementation Recipes

Practical two-stage SFT→RL pipelines require careful attention to the following elements:

| Stage | Key Components | Typical Hyperparameters |
|-------|----------------|-------------------------|
| SFT | Supervised NLL on (x, y*), CoT data | LR=1e-5, batch=32–128, 2–10 epochs |
| RL | GRPO/PPO with group advantages, KL penalty | LR=1e-6, K=5–16 rollouts, KL=1e-3–0.04, β=0.01–0.1, PPO clip=0.1–0.2 |

  • Always ensure high-quality, distribution-aligned SFT data (avoid data leaks or imitative overfitting).
  • Moderate SFT—sufficiently long to stabilize but not so extended as to inhibit RL diversity—is advisable.
  • In RL, implement KL regularization and monitor both reward and held-out accuracy to detect reward hacking.
  • When possible, partition data curricularly with non-reasoning tasks for SFT and high-reasoning tasks for RL.
  • Consider adaptive or hybrid pipelines (e.g., SuperRL, SASR, DYPO) for domains with severe reward sparsity or instability (Liu et al., 1 Jun 2025, Chen et al., 19 May 2025, Zhu et al., 10 Apr 2026).
  • In small models, inject partial expert hints to bridge expressivity gaps (Zhang et al., 20 Jun 2025).
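
To make the table above concrete, the typical ranges can be collected into a small configuration object; the field names below are arbitrary illustrative placeholders rather than any framework's actual API, and the defaults simply fall inside the quoted ranges.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    lr: float = 1e-5              # supervised NLL learning rate
    batch_size: int = 64          # typically 32-128
    epochs: int = 3               # typically 2-10 passes over the CoT dataset

@dataclass
class RLConfig:
    lr: float = 1e-6              # roughly 10x smaller than the SFT learning rate
    rollouts_per_prompt: int = 8  # K = 5-16 samples per prompt for group advantages
    kl_coef: float = 0.04         # KL penalty weight toward the SFT reference policy
    clip_eps: float = 0.2         # PPO/GRPO importance-ratio clipping range

sft_cfg, rl_cfg = SFTConfig(), RLConfig()
```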

Open-source implementations are provided for several recent pipelines (e.g., AceReason-Nemotron-1.1-7B, Kaggle-AIMO2-Fast-Math-R1).

7. Outlook and Future Directions

Two-stage SFT→RL pipelines remain the dominant framework for LLM reasoning, alignment, and complex generative tasks. In parallel, the field is converging toward more unified or adaptive recipes that dynamically arbitrate between SFT and RL objectives, directly optimize curriculum allocations, and supplement reward-shaping with meta-learning and multi-objective criteria. Balancing efficiency, stability, and generalization—especially under distributional shift and reward sparsity—continues to motivate algorithmic advances. Analytical tools for credit assignment, trajectory analysis, and entropy regulation are increasingly deployed to understand and close remaining gaps between human-level and model-level reasoning (Matsutani et al., 25 Sep 2025, Zhao et al., 4 Jan 2026).

In summary, the two-stage SFT→RL pipeline is both theoretically grounded and empirically validated as an effective, extensible paradigm for high-accuracy, robust, and efficient LLM post-training. Its continued evolution is central to the progress of scalable, generalizable, and trustworthy AI (Jiang et al., 14 Mar 2026).
