Two-Stage SFT + GRPO: LLM Optimization Pipeline

Updated 28 November 2025

The paper introduces a two-stage method that first maximizes model accuracy via extended SFT before refining efficiency with GRPO.
It leverages a composite reward function and intra-group comparisons to stabilize policy gradients and optimize task-specific behaviors.
The pipeline demonstrates state-of-the-art results across domains like mathematical, visual, and medical reasoning, ensuring robust generalization.

A Two-Stage SFT + GRPO Training Pipeline is a systematic paradigm for optimizing LLMs and multimodal neural architectures, designed to leverage the complementary strengths of Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). This approach has been rigorously validated as a practical and effective blueprint for domains such as mathematical reasoning, scientific information extraction, visual reasoning, recommendation systems, and medical AI, establishing new state-of-the-art results across diverse tasks. The central principle is to first maximize accuracy and task-specific structure via extended SFT, then apply GRPO to optimize efficiency or additional downstream desiderata, frequently with a composite reward function. This article provides a comprehensive technical description of the two-stage pipeline, focusing on its core methodology, algorithmic underpinnings, empirical best practices, and domain-generalization properties as established in recent literature.

1. Pipeline Structure and Motivation

The Two-Stage SFT + GRPO pipeline comprises:

Stage 1: Supervised Fine-Tuning (SFT): The model is exposed to curated datasets of high-difficulty, annotated solution traces, with the objective of maximizing accuracy via next-token cross-entropy minimization. Careful control over dataset diversity, sequence length, and optimization regime (e.g., multi-epoch SFT, low learning rate, curriculum ordering) is critical for saturating performance.
Stage 2: Group Relative Policy Optimization (GRPO): The SFT-tuned model is further trained with reinforcement learning, using reward functions that encode not only correctness but also token efficiency, output length, adherence to format, and other task-specific behavioral constraints. GRPO eschews value networks, instead employing intra-group comparisons to stabilize advantage estimation and policy gradients.

This two-stage recipe exploits the fact that SFT and RL serve non-redundant roles: a prolonged SFT phase pushes accuracy to its limit, while a gentle, KL-constrained GRPO phase enables the policy to compress and rationalize behaviors—for example, by shortening mathematical solutions without loss of correctness (Yoshihara et al., 11 Jul 2025), or removing hallucinated items in recommendations (Zhu et al., 23 Oct 2025).

2. Methodological Details

2.1 Supervised Fine-Tuning

The SFT phase trains the LLM on task-specific demonstration data $(x, y_{1:T})$ , minimizing

$L_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^T \log \pi_\theta(y_t \mid y_{<t}, x)$

Key SFT implementation features include:

Datasets: Task-specific, high-quality traces (e.g., 7,900 hard math problems in (Yoshihara et al., 11 Jul 2025); 22M annotated visual data in (Jiang et al., 14 Oct 2025)).
Batching and Tokenization: Sequence packing enabled; long contexts supported (up to 24,000 tokens).
Optimization: AdamW with initial learning rate $1\times 10^{-5}$ , cosine decay; parameter-efficient LoRA or DoRA adapters for very large models or low-precision inference (Adly et al., 18 Jun 2025).
Prompt engineering: Explicit reasoning templates and system prompts (such as "Please reason step by step, and put your final answer within \texttt{\textbackslash boxed{\ldots}}").
Epoch schedule: Experiments show that extending SFT to $\sim$ 10 epochs is essential to push accuracy to its ceiling, especially on small, high-difficulty datasets (Yoshihara et al., 11 Jul 2025).

2.2 Group Relative Policy Optimization (GRPO)

The GRPO stage takes the SFT model as initialization and proceeds as follows:

Rollout Sampling: For each prompt, generate a group of $N$ candidate solution traces.
Reward Assignment: Compute a scalar reward $R(\tau)$ $R (τ)$ for each trace based on a composite of:
- Format adherence,
- Semantic correctness (regex, answer verification),
- Cosine similarity to reference embeddings (Yoshihara et al., 11 Jul 2025),
- Output length penalties or diversity/novelty (Zhu et al., 23 Oct 2025).

The canonical reward formula is:

$R(\tau) = \alpha\, r_1(\tau) + \beta\, r_2(\tau) + \gamma\, r_3(\tau)$

where each $r_i$ targets a distinct behavioral dimension.

Advantage Estimation: Compute relative advantage for each trace within the group:

$\hat{A}_i = R(\tau_i) - \bar{R}_{\mathrm{group}}$

Policy Update: Solve the clipped surrogate with KL penalty:

$L_{\mathrm{GRPO}}(\theta) = -\mathbb{E}_i\left[ \min(r_i(\theta)\,\hat{A}_i,\, \mathrm{clip}(r_i(\theta),1-\epsilon,1+\epsilon)\,\hat{A}_i) \right] + \beta_{\mathrm{KL}}\ \mathrm{KL}(\pi_{\theta_{\rm old}}\,\|\,\pi_\theta)$

with $r_i(\theta) = \pi_\theta(\tau_i)/\pi_{\theta_{\rm old}}(\tau_i)$ , and periodic $\theta_{\rm old} \leftarrow \theta$ (Yoshihara et al., 11 Jul 2025).

3. Algorithmic Pseudocode

The following pseudocode (quoted from (Yoshihara et al., 11 Jul 2025)) summarizes the canonical two-stage workflow:

θ ← base_model
for epoch in 1...10:
    for batch B in SFT data:
        L_sft ← -sum_{(x, y) in B} sum_{t=1}^{|y|} log π_θ(y_t|y_{<t}, x)
        θ ← θ - η_sft * ∇_θ L_sft

θ_old ← θ
for step in 1...50:
    sample mini-batch {x_j}
    for each x_j: generate N=8 samples τ_{j,1..8} ∼ π_θ
    compute rewards R_{j,i}
    group baseline bar_R_j ← mean_i R_{j,i}
    advantages hat_A_{j,i} ← R_{j,i} - bar_R_j
    compute GRPO loss as above
    θ ← θ - η_rl * ∇_θ L_GRPO
    periodically: θ_old ← θ

4. Empirical Evaluation and Ablations

Systematic evaluation and ablation across high-difficulty benchmarks demonstrate several key findings:

Extended SFT: Accuracy on AIME 2024/2025 rises with SFT epochs, plateauing only at epoch 10; short SFT (1–3 epochs) increases solution length and destabilizes RL (Yoshihara et al., 11 Jul 2025).
GRPO for Efficiency: GRPO phase reduces solution length by $\sim$ 20% (e.g., from 10.3k to 7.9k tokens on AIME 2024 at 14B scale) without accuracy loss; in some cases, accuracy slightly improves post-GRPO (Yoshihara et al., 11 Jul 2025).
Reward Engineering: The combination of format, cosine similarity, and length penalty yields the optimal Pareto frontier (high accuracy, low token budget); dropping components degrades efficiency or accuracy (Yoshihara et al., 11 Jul 2025).
Robustness: KL-regularization prevents collapse; group size $N=8$ suffices for stable advantage estimation.
Generalization: Recipe validated in domains beyond mathematics (see "MimicSFT + R²GRPO" for SciIE (Li et al., 28 May 2025)), recommendations (Xie et al., 24 Jun 2025, Zhu et al., 23 Oct 2025), visual reasoning (Guan et al., 15 Aug 2025, Dao et al., 20 Feb 2025), and medical reasoning (Adly et al., 18 Jun 2025).

5. Design Rationale and Best Practices

The pipeline's structure and hyperparameters are justified by quantitative and qualitative analysis:

Prolonged SFT: Necessary for models to approach maximal accuracy given limited demonstration data; short fine-tuning phases induce under-specification and RL instability (Yoshihara et al., 11 Jul 2025).
Gentle GRPO: A restrained RL phase (e.g., 50 steps with low learning rate) targets efficiency and structural refinement, rather than overhauling accuracy.
KL Penalty: Maintains proximity to SFT-initialized policy and prevents catastrophic divergence.
Reward Scaling: Length penalty parameters are calibrated such that a $1\%$ increase in output length corresponds to a fixed decrement in total reward, balancing brevity and completeness.
System Prompts: Explicit instructions such as "reason step by step" and "final answer within \boxed{}" anchor solution format and are critical for downstream verifiability.

6. Limitations, Generalizations, and Extensions

While the two-stage SFT + GRPO paradigm is robust across tasks, context-specific limitations and considerations have emerged:

Overfitting Risks: Excessive SFT epochs or small, homogeneous datasets can produce overfit policies that suppress RL gains (Kang et al., 2 Oct 2025).
Dataset Composition: Including a mix of example lengths in SFT, and emphasizing hard instances, enhances the effectiveness of subsequent RL (Kang et al., 2 Oct 2025).
Alternative Schedules: Adaptive switching (cf. SASR (Chen et al., 19 May 2025)) between SFT and GRPO guided by training signal statistics can further improve stability.
Reward Misspecification: Inadequate or ill-calibrated reward components can bias RL toward trivial or verbose outputs (Yoshihara et al., 11 Jul 2025); ablations confirm the necessity of component diversity.
Cross-Domain Applicability: The pipeline generalizes beyond text and mathematics to structured visual reasoning, scientific IE, medical reasoning, and navigation (Li et al., 28 May 2025, Adly et al., 18 Jun 2025, Guan et al., 15 Aug 2025, Zhao et al., 4 Jun 2025), often with task-specific adaptation of datasets, prompts, and rewards.

7. Conclusion and Future Directions

The Two-Stage SFT + GRPO Training Pipeline delivers a reproducible, high-performance recipe for end-to-end LLM optimization: extended SFT saturates accuracy, while KL-regularized GRPO selectively improves solution efficiency, format adherence, and other secondary metrics. This methodology has set new state-of-the-art results in the AI Mathematical Olympiad (AIMO) and diverse other tasks, and is supported by comprehensive open-source releases of code and checkpoints (Yoshihara et al., 11 Jul 2025). Continued research is likely to refine reward engineering, explore adaptive SFT–RL schedules, and expand the paradigm into increasingly complex domains—an approach supported by the robust empirical and theoretical footing documented to date.