Sequential SFT-then-RL Pipeline
- The sequential SFT-then-RL pipeline is a modular approach that first uses supervised fine-tuning to embed task-specific knowledge before applying reinforcement learning for adaptive improvement.
- Core components include rigorous data curation, principled checkpoint selection, and reward-driven policy updates that together provide stability and scalability across LLM applications.
- Empirical benchmarks demonstrate significant performance gains in accuracy and task alignment across models in code, reasoning, and vision-language domains.
A sequential Supervised Fine-Tuning-then-Reinforcement Learning (SFT-then-RL) pipeline is the prevailing methodology for post-training LLMs and multimodal agents to achieve advanced task alignment and reasoning capabilities. This pipeline first relies on SFT to inject task-specific knowledge, followed by RL for performance adaptation via reward-driven optimization, often with additional regularization to preserve the benefits of SFT. The paradigm is characterized by its modularity, stability, and high empirical ceilings, and has been implemented in state-of-the-art systems across code, reasoning, vision-language, and generative model domains.
1. Pipeline Structure and Workflow
A canonical sequential SFT-then-RL pipeline consists of three core stages:
- Data Curation: Automated or semi-automated selection and filtering of high-quality supervision traces from sources like curated code repositories, synthetic reasoning datasets, or procedure-distilled model outputs. Techniques such as SPICE-based filtering and rejection sampling may be employed to filter for clarity, appropriate difficulty, and coverage (Chen et al., 3 Aug 2025).
- Supervised Fine-Tuning (SFT): The model is initialized from the pretrained base and optimized via cross-entropy minimization on the curated dataset, typically for a fixed number of epochs and batch size, with checkpoint selection governed by validation loss.
- Reinforcement Learning (RL): The SFT-initialized policy $\pi_\theta$ is further refined to maximize reward through policy gradient algorithms (e.g., PPO, GRPO), optionally regularized by KL-divergence to the SFT reference: $\max_{\theta}\,\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x, y)\big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot\mid x)\big)$, where $r(x, y)$ is a verifiable reward (e.g., passing all test cases, correct answer extraction) and $\beta$ sets the regularization strength (Chen et al., 3 Aug 2025, Zhang et al., 20 Jun 2025, Yoshihara et al., 11 Jul 2025).
This pipeline supports fully automatic, scalable deployment by orchestrating data ingestion, supervised optimization, and distributed asynchronous RL—often using frameworks like Ray for rollout parallelism (Chen et al., 3 Aug 2025). The full system is documented by task-specific pseudocode and hyperparameter settings that ensure reproducibility.
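To make the three-stage workflow concrete, the following is a minimal, self-contained sketch in PyTorch that uses a toy one-step policy in place of an LLM; the synthetic data, toy reward, and hyperparameters are illustrative assumptions, and production systems add PPO/GRPO clipping, multi-token rollouts, and distributed (e.g., Ray-based) rollout workers.

```python
# Minimal sequential SFT-then-RL sketch on a toy one-step policy.
# A small linear "policy head" over a 16-way action vocabulary stands in
# for an LLM; the verifiable reward and data are synthetic.
import copy

import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 16, 32
policy = torch.nn.Linear(DIM, VOCAB)           # stand-in for an LLM policy

# Stage 1: curated supervision (toy prompts with gold actions).
x = torch.randn(300, DIM)
y = torch.randint(0, VOCAB, (300,))
tr_x, va_x, tr_y, va_y = x[:240], x[240:], y[:240], y[240:]

# Stage 2: SFT -- cross-entropy training, keeping the checkpoint with the
# minimum validation loss for hand-off to RL.
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
best_val, best_state = float("inf"), None
for _ in range(50):
    opt.zero_grad()
    F.cross_entropy(policy(tr_x), tr_y).backward()
    opt.step()
    with torch.no_grad():
        val = F.cross_entropy(policy(va_x), va_y).item()
    if val < best_val:
        best_val, best_state = val, copy.deepcopy(policy.state_dict())
policy.load_state_dict(best_state)
ref = copy.deepcopy(policy).eval()             # frozen SFT reference for the KL term

# Stage 3: RL -- REINFORCE-style updates on a verifiable 0/1 reward,
# KL-regularized toward the SFT reference (PPO/GRPO clipping omitted).
beta = 0.05
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(200):
    prompts = torch.randn(32, DIM)             # batch of fresh task instances
    dist = torch.distributions.Categorical(logits=policy(prompts))
    actions = dist.sample()
    reward = (actions % 2 == 0).float()        # toy verifiable reward
    advantage = reward - reward.mean()         # group-mean baseline
    with torch.no_grad():
        ref_dist = torch.distributions.Categorical(logits=ref(prompts))
    kl = torch.distributions.kl_divergence(dist, ref_dist)
    loss = -(advantage * dist.log_prob(actions)).mean() + beta * kl.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final mean reward:", reward.mean().item())
```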
2. Mathematical Formulation and Objectives
The SFT phase minimizes cross-entropy over high-quality demonstration trajectories, instilling prior knowledge and format regularization. For sequence models,

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{SFT}}}\sum_{t}\log \pi_\theta\big(y_t \mid x, y_{<t}\big).$$
The RL phase maximizes the expectation of a verifiable, often sparse, reward signal,

$$J_{\mathrm{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x, y)\big].$$
Policy optimization is implemented via PPO-style or group-normalized surrogates (e.g., GRPO), exploiting advantage estimators and KL penalties to ensure stable policy updates. For example, in KL-regularized PPO the per-sample reward is shaped as

$$\tilde{r}(x, y) = r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)},$$

and the advantage is estimated as $\hat{A}(x, y) = \tilde{r}(x, y) - b(x)$, with baseline $b(x)$ taken as the empirical success rate of the sampled group in binary-reward settings.
In multistage settings, the SFT loss is set to zero once RL commences, though an optional mixed objective $\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{SFT}}(\theta)$, with $\mathcal{L}_{\mathrm{RL}}$ the policy-gradient surrogate and small $\lambda$, can be used during early RL steps (Yu et al., 14 Dec 2025).
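The shaped reward and group-mean baseline above translate directly into a few lines of code. The snippet below is a minimal sketch in PyTorch; the function name, tensor shapes, and example values are illustrative assumptions rather than an excerpt from any of the cited systems.

```python
# Sketch of the KL-shaped reward and group-mean baseline defined above.
# Tensor names and shapes are illustrative; real implementations operate on
# per-token log-probabilities of full rollouts.
import torch

def shaped_advantages(rewards: torch.Tensor,
                      logp_policy: torch.Tensor,
                      logp_sft_ref: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """rewards, logp_policy, logp_sft_ref: shape (G,), one entry per rollout
    in a group sampled from the same prompt."""
    # KL-penalized reward: r~ = r - beta * log(pi_theta / pi_SFT).
    shaped = rewards - beta * (logp_policy - logp_sft_ref)
    # Subtract the group mean; for binary rewards this baseline is the
    # empirical success rate of the group (plus the mean KL correction).
    return shaped - shaped.mean()

# Example: a group of 4 rollouts for one prompt, two of which pass the verifier.
adv = shaped_advantages(
    rewards=torch.tensor([1.0, 0.0, 1.0, 0.0]),
    logp_policy=torch.tensor([-3.2, -4.1, -2.9, -5.0]),
    logp_sft_ref=torch.tensor([-3.0, -4.0, -3.1, -4.8]),
)
print(adv)  # positive for verified rollouts, negative otherwise
```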
3. Empirical Results and Scaling Laws
Empirical studies consistently demonstrate that sequential SFT-then-RL pipelines outperform pure SFT or pure RL, especially when staged according to validation loss and data scaling principles:
| Model (benchmark) | SFT Acc. (%) | SFT→RL Acc. (%) | Reference |
|---|---|---|---|
| Code LLM (SWE-Bench-Verified) | 12.7 | 17.4 | (Chen et al., 3 Aug 2025) |
| VLM (Qwen2-VL-7B) | 47.7 | 50.2 | (Yu et al., 14 Dec 2025) |
| Math LLM (AIME, 14B) | 65.2 | 66.0 | (Yoshihara et al., 11 Jul 2025) |
Key patterns include:
- SFT boosts performance by establishing a task-appropriate initialization, often producing steeper gains for lower-capacity models or under extended data regimes (Yu et al., 14 Dec 2025, Liu et al., 16 Jun 2025).
- RL applied to an SFT-warm start further improves the ceiling by exposing the policy to distributional feedback, verifiable signals, or task-specific brevity penalties (Yoshihara et al., 11 Jul 2025).
- Data scale and trajectory difficulty in SFT directly determine both the “foundation” and the “plasticity” available for RL; selecting the best SFT checkpoint (minimum validation loss) before RL maximizes the empirical ceiling (Ding et al., 12 Dec 2025).
4. Scheduling, Hyperparameters, and Best Practices
Comprehensive guidelines have been formalized to maximize final performance:
- SFT Stage: Train until validation loss saturates in the “Stable” or at most “Mild Overfitting” regime, using large, high-quality SFT datasets. Data diversity and difficulty curation act as critical multipliers (Ding et al., 12 Dec 2025), with prompt diversity contributing more strongly than response multiplicity (Liu et al., 16 Jun 2025).
- Transition Criterion: Switch to RL when SFT validation loss reaches its plateau (within 2% of its minimum), avoiding severe overfitting which collapses RL plasticity (Ding et al., 12 Dec 2025).
- RL Stage: Use small, task-dependent KL coefficients to balance reward pursuit against policy stability; select batch sizes and rollout counts proportional to task and model scale (e.g., batch = 32, G = 8–10 rollouts per prompt, RL steps ≈ 50–200) (Yoshihara et al., 11 Jul 2025, Chen et al., 3 Aug 2025). Sampling temperature during RL should be set to maintain sufficient policy entropy (measured in nats) to avoid premature convergence (Liu et al., 16 Jun 2025).
- Data Mix and Curriculum: For code and reasoning, mixing atomic “building block” tasks and composite samples ensures generalization, while staged exposure to length or difficulty-crafted samples serves as a curriculum (Chen et al., 14 Jun 2024, Zhang et al., 20 Jun 2025).
- Checkpoint Selection: Always pass the minimum validation-loss SFT model to RL. Monitor held-out accuracy during RL, employing early stopping if reward increases without accuracy gains (to avoid "deceptive rewards") (Yu et al., 14 Dec 2025); a minimal sketch of these transition and monitoring checks follows this list.
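The transition and monitoring rules above can be expressed as simple checks. The sketch below implements the plateau criterion (validation loss within 2% of its minimum) and an early-stopping monitor for deceptive rewards; the patience window and example values are illustrative assumptions, not settings from the cited papers.

```python
# Sketch of two scheduling checks from the guidelines above: (1) switch from
# SFT to RL once validation loss plateaus within 2% of its running minimum,
# and (2) stop RL early if reward keeps rising while held-out accuracy does
# not ("deceptive rewards"). Patience values here are illustrative choices.
from collections import deque

def ready_for_rl(val_losses: list[float], tolerance: float = 0.02) -> bool:
    """True once the latest SFT validation loss is within `tolerance`
    (relative) of the minimum seen so far, i.e. the plateau criterion."""
    if not val_losses:
        return False
    return val_losses[-1] <= (1.0 + tolerance) * min(val_losses)

class DeceptiveRewardStopper:
    """Early-stop RL when mean reward improves but held-out accuracy stalls."""
    def __init__(self, patience: int = 5):
        self.rewards = deque(maxlen=patience)
        self.accs = deque(maxlen=patience)
        self.patience = patience

    def should_stop(self, mean_reward: float, heldout_acc: float) -> bool:
        self.rewards.append(mean_reward)
        self.accs.append(heldout_acc)
        if len(self.rewards) < self.patience:
            return False
        reward_up = self.rewards[-1] > self.rewards[0]
        acc_flat = self.accs[-1] <= self.accs[0]
        return reward_up and acc_flat

# Example: transition after a plateau, then monitor RL evaluations.
print(ready_for_rl([1.9, 1.4, 1.21, 1.20]))  # True: within 2% of the minimum
stopper = DeceptiveRewardStopper(patience=3)
for r, a in [(0.41, 0.50), (0.47, 0.50), (0.55, 0.49)]:
    if stopper.should_stop(r, a):
        print("early stop: reward rising without held-out accuracy gains")
```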
5. Limitations, Failure Modes, and Extensions
Sequential SFT-then-RL pipelines face several documented limitations:
- Catastrophic Forgetting: Pure RL phases following SFT can rapidly “forget” behaviors acquired by SFT, especially in the absence of ongoing supervised guidance or properly tuned KL penalties (Chen et al., 8 Sep 2025).
- Expressiveness and Sparse Reward Trap: If expert traces are beyond the reach of a lower-capacity model, the SFT phase provides little usable supervision, and subsequent RL stalls because the policy never earns a positive reward. Methods such as BREAD anchor rollouts on partial expert prefixes to densify rewards and avoid this barrier (Zhang et al., 20 Jun 2025).
- Pseudo Reasoning Lock-In: In multimodal and VLM settings, heavy SFT on expert reasoning traces may trap the policy into “pseudo reasoning” forms—mimicking structure, but not achieving authentic generalization or improvement under subsequent RL (Chen et al., 10 Apr 2025).
- Tuning vs. Generality Tradeoff: Over-specialized SFT phases can impair out-of-domain robustness; interventions such as RL-from-scratch or atomic-only RL can help recover generalization for code and LLMs (Chen et al., 14 Jun 2024).
Algorithmic extensions include per-instance adaptive pipelines (SuperRL) that fall back to SFT whenever RL fails to elicit positive reward signals (Liu et al., 1 Jun 2025), stepwise adaptive blending (SASR) with dynamic SFT/RL mixing based on training dynamics (Chen et al., 19 May 2025), and bilevel cooperative SFT-RL that meta-learns the optimal degree of SFT retention through upper-level optimization (Chen et al., 8 Sep 2025).
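As a rough illustration of the per-instance adaptive idea, the sketch below falls back to a supervised loss whenever no rollout for a prompt earns a positive verifiable reward; the function names and loss composition are hypothetical and simplified relative to the published SuperRL and SASR algorithms.

```python
# Illustrative sketch of per-instance adaptive SFT/RL selection: if none of a
# prompt's rollouts earns a positive verifiable reward, fall back to a
# supervised (SFT) loss on an expert trace for that instance instead of a
# zero-signal policy-gradient step. Names and signatures are assumptions.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Instance:
    prompt: str
    expert_trace: str  # curated demonstration used by the SFT fallback

def per_instance_loss(inst: Instance,
                      rollouts: Sequence[str],
                      reward_fn: Callable[[str, str], float],
                      pg_loss: Callable[[str, Sequence[str], Sequence[float]], float],
                      sft_loss: Callable[[str, str], float]) -> float:
    """Choose the learning signal per instance based on reward sparsity."""
    rewards = [reward_fn(inst.prompt, y) for y in rollouts]
    if any(r > 0 for r in rewards):
        # At least one rollout is verified correct: use the RL signal.
        return pg_loss(inst.prompt, rollouts, rewards)
    # No positive reward: fall back to supervised imitation of the expert trace.
    return sft_loss(inst.prompt, inst.expert_trace)

# Toy usage with stub losses:
loss = per_instance_loss(
    Instance("2+2=?", "4"),
    rollouts=["5", "3"],
    reward_fn=lambda p, y: float(y == "4"),
    pg_loss=lambda p, ys, rs: -sum(rs) / len(rs),
    sft_loss=lambda p, t: 1.0,
)
print(loss)  # 1.0 -> SFT fallback, since no rollout was verified correct
```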
6. Application Domains and Empirical Benchmarks
Sequential SFT-then-RL pipelines are foundational in a wide range of domains:
- Code LLMs/SWE: Pipelines such as RepoForge use autonomous curation, SFT on multi-turn code repair trajectories, and KL-regularized RL to achieve state-of-the-art on SWE-Bench-Verified (Chen et al., 3 Aug 2025).
- Mathematical and Logical Reasoning: Frameworks for mathematical LLMs (e.g., AIME, MATH500, LightR1) deploy extended SFT and token-efficient RL (GRPO) to maximize reasoning accuracy and brevity (Yoshihara et al., 11 Jul 2025, Liu et al., 16 Jun 2025).
- Vision-LLMs (VLMs): Qwen2-VL and similar architectures use rejection-sampled expert traces for SFT and group-based policy optimization for RL, achieving improved cross-modal generalization and data efficiency (Yu et al., 14 Dec 2025).
- Autoregressive Generative Models: ReasonGen-R1 injects explicit reasoning as textual rationales before image token generation, with SFT followed by RL on visual quality rewards to robustly improve compositionality and image alignment (Zhang et al., 30 May 2025).
- Instruction-Following: Pipelines combining SFT and RL with supervised reward (RLSR) achieve superior instruction adherence and open-ended generation quality, outperforming SFT-only baselines (Wang et al., 16 Oct 2025).
- Hybrid and Adaptive Approaches: SuperRL and SASR incorporate real-time SFT-RL switching for sparse reward robustness and convergence acceleration (Liu et al., 1 Jun 2025, Chen et al., 19 May 2025).
7. Historical Evolution and Future Directions
The sequential SFT-then-RL pipeline has solidified as the community’s “default” post-training protocol for advanced LLM alignment, reasoning, and agentic behavior. Empirical studies such as those codified by the “Plasticity-Ceiling Framework” (Ding et al., 12 Dec 2025) provide explicit scaling law guidance and checkpointing rules to maximize task ceiling while preserving update plasticity, with extensive ablation validating these principles across domains and architectures.
Contemporary research explores adaptive blends, episode-anchored branching, and meta-optimization to overcome the rigidity and sample inefficiency of the naive sequential approach, leading to more data/compute-optimal and robust learning. Notably, as models and task definitions diversify, attention has shifted to the interface between SFT trace complexity, RL reward sparsity, and model capacity—highlighting the need for more nuanced, dynamical, and data-driven control of the SFT→RL transition and the integration of trajectory anchors or meta-learners to unlock further gains (Zhang et al., 20 Jun 2025, Chen et al., 8 Sep 2025).
The sequential SFT-then-RL pipeline thus serves as both a benchmark and a launchpad for new research into scalable, stable, and efficient post-training of LLM-based agents across modalities and domains.