
Rejection-sampling Fine-Tuning (RFT)

Updated 9 November 2025
  • Rejection-sampling Fine-Tuning (RFT) is an iterative process that fine-tunes models using only self-generated outputs validated by task-specific criteria.
  • The approach employs proposal sampling and supervised fine-tuning cycles to improve data diversity for applications like agentic RL, mathematical reasoning, and tool-use.
  • Empirical results show RFT boosts performance over standard fine-tuning by increasing correct-chain diversity, though it risks oversampling simpler tasks.

Rejection-sampling Fine-Tuning (RFT) is a class of iterative model distillation algorithms for LLMs in which only self-generated trajectories or completions that satisfy externally verifiable success criteria are included in each round of fine-tuning. It has become a foundational data augmentation and semi-supervised learning tool in agentic RL environments, mathematical reasoning, tool-augmented LLMs, and LLM alignment with human preferences. Variants of RFT appear under names including Self-Taught Reasoners (STaR), statistical rejection sampling optimization (RSO), and Hint-RFT. The approach relies on proposal sampling, rejection based on task-specific validity, and supervised fine-tuning cycles—all while avoiding reinforcement learning or explicit reward modeling.

1. Formal Algorithmic Structure

RFT algorithms typically consist of a multi-phase iterative loop. The method can be instantiated in the context of contextual MDPs, supervised data augmentation, preference alignment, and tool-using LLMs. At the core, the agent distribution $\pi_\theta(a \mid s)$ is repeatedly rolled out to generate candidate trajectories $\tau$ (or output sequences), and only trajectories that meet an acceptance predicate—usually a binary reward, correctness, or a combination of explicit tool use and answer validity—are retained for the fine-tuning set.

A representative structure from (Lan et al., 17 Apr 2025):

RFT($D_e$, $\pi_\theta$, $I$, $k$)

  1. $D^+ \leftarrow \{\tau_e \in D_e : R(\tau_e) = 1\}$
  2. Optimize $\theta$ by supervised fine-tuning (SFT) loss on $D^+$
  3. For $i = 1 \ldots I$:
    • For each subtask $s_0 \in C_{\mathrm{train}}$:
      • Generate $k$ rollouts $\{\tau_j \sim \pi_\theta(\cdot \mid s_0)\}$
      • $D_i \leftarrow D_i \cup \{\tau_j : R(\tau_j) = 1\}$
    • $D^+ \leftarrow D^+ \cup D_i$
    • Optimize $\theta$ by SFT loss on $D^+$
  4. Return $\pi_\theta$
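
A minimal Python sketch of this loop is given below. The helpers `sample_rollout`, `reward`, and `sft_update` are hypothetical stand-ins for the environment rollout, the acceptance predicate $R$, and one round of supervised fine-tuning; the names are illustrative and do not come from the cited papers.

```python
def rft(expert_data, policy, subtasks, iterations, k,
        sample_rollout, reward, sft_update):
    """Rejection-sampling Fine-Tuning loop (schematic).

    expert_data: iterable of expert trajectories tau_e
    policy:      the model pi_theta, updated in place by sft_update
    subtasks:    training contexts s_0 in C_train
    reward:      acceptance predicate R(tau) -> 0 or 1
    """
    # Step 1: keep only expert trajectories that satisfy R(tau) = 1.
    accepted = [tau for tau in expert_data if reward(tau) == 1]
    # Step 2: warm-start the policy with SFT on the accepted expert set.
    sft_update(policy, accepted)

    # Step 3: iterate rollout -> rejection -> SFT.
    for _ in range(iterations):
        new_samples = []
        for s0 in subtasks:
            # Generate k rollouts from the current policy for this subtask.
            rollouts = [sample_rollout(policy, s0) for _ in range(k)]
            # Rejection step: keep only trajectories the predicate accepts.
            new_samples.extend(tau for tau in rollouts if reward(tau) == 1)
        # Grow the positive set and fine-tune on everything accepted so far.
        accepted.extend(new_samples)
        sft_update(policy, accepted)

    # Step 4: return the fine-tuned policy.
    return policy
```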

In mathematical terms, let $D^t_+$ denote the set of positive, accepted samples at iteration $t$; the per-token SFT loss is typically:

$$\mathcal{L}_{\mathrm{SFT}}(\pi_\theta) = -\sum_{l=0}^{L} m_l \cdot \log \pi_\theta(t_l \mid t_{<l})$$

where $m_l$ masks non-action tokens for action-centric fine-tuning in agentic LLMs. Acceptance is governed by $R(\tau) = 1$ or analogous predicates.
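
The masked loss above can be sketched in PyTorch as follows; `logits`, `tokens`, and `action_mask` are assumed tensor names, with `action_mask` playing the role of $m_l$ (1 on action tokens, 0 elsewhere). This is an illustrative implementation, not code from the cited work.

```python
import torch.nn.functional as F

def masked_sft_loss(logits, tokens, action_mask):
    """Masked per-token SFT loss: -sum_l m_l * log pi_theta(t_l | t_<l).

    logits:      (batch, seq_len, vocab) outputs where logits[:, l] scores t_l
                 given the prefix t_<l (i.e., already shifted appropriately).
    tokens:      (batch, seq_len) target token ids t_l.
    action_mask: (batch, seq_len) float mask m_l, 1.0 on action tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick out log pi_theta(t_l | t_<l) at each position.
    token_log_probs = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    # Zero out non-action tokens, then sum the negative log-likelihoods.
    return -(action_mask * token_log_probs).sum()
```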

2. Applications and Variants

RFT has become standard in several settings:

  • Agentic LLMs in interactive environments: RFT incrementally grows the set of successfully completed trajectories for behavior cloning (Lan et al., 17 Apr 2025). The process uses contextual MDPs and sparse success signals, learning a policy solely from “solved” examples.
  • Self-Taught Reasoners / STaR: In mathematical and chain-of-thought (CoT) reasoning, RFT augments a human-annotated dataset $\mathcal{D}$ with multiple, model-generated correct and distinct chains per example, dramatically increasing diversity and improving downstream accuracy (Yuan et al., 2023, Koh et al., 22 May 2025).
  • Preference optimization: RFT is adapted as statistical rejection sampling optimization (RSO), in which candidate completions are sampled from the supervised policy and accepted with probability tied to their reward-model scores, so that the accepted set approximates the optimal KL-regularized policy (Liu et al., 2023); a minimal acceptance-step sketch follows this list.
  • Tool-using LLMs: In Hint-RFT, model outputs must invoke external tools (e.g., code execution), and only outputs with valid, successful tool use and correct answers are retained (Li et al., 6 Mar 2025).
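
As a concrete illustration of the RSO-style acceptance step, here is a hedged Python sketch: candidates drawn from the SFT policy are accepted with probability proportional to $\exp(r/\beta)$, normalized by the maximum reward in the candidate pool so the probability stays in $[0, 1]$. The helper name `rso_accept` and the batch-max normalization are assumptions for illustration, not a verbatim transcription of the RSO algorithm in (Liu et al., 2023).

```python
import math
import random

def rso_accept(candidates, rewards, beta=0.5, seed=0):
    """Reward-weighted rejection sampling over SFT-policy samples (schematic).

    candidates: completions sampled from the supervised (SFT) policy.
    rewards:    reward-model scores r(x, y), one per candidate.
    beta:       KL-regularization temperature; smaller beta filters harder.
    """
    rng = random.Random(seed)
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        # Accept with probability exp((r - r_max) / beta), which is
        # proportional to exp(r / beta) and bounded above by 1.
        if rng.random() < math.exp((r - r_max) / beta):
            accepted.append(y)
    return accepted
```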

The following table summarizes key RFT variants and their contexts:

| Variant | Application Domain | Acceptance Criterion |
|---|---|---|
| Classic RFT | Agentic MDPs, Math CoT | Binary success $R(\tau) = 1$, answer correctness |
| AdaSTaR | Reasoning benchmarks | Outcome plus diversity/curriculum |
| RSO | Preference alignment | Top samples by reward model |
| Hint-RFT | Tool-augmented LLMs | Correct answer and successful tool use |

3. Empirical Properties and Scaling Laws

Extensive experiments validate the empirical gains and limitations of RFT:

  • In agentic settings (WebShop-11k), RFT (with $k = 6$ rollouts per subtask) achieves a win rate of 53.6%, well above GPT-4’s 35.6% (Lan et al., 17 Apr 2025).
  • In reasoning benchmarks (ARC-C, ANLI, GSM8K, etc.), RFT consistently outperforms SFT, with the magnitude of improvement approximately log-linear in the number of distinct, accepted CoTs per example (Yuan et al., 2023). Each doubling of correct-path diversity yields a roughly constant additive accuracy gain, i.e., diminishing returns per additional sample (see the schematic relation after this list).
  • AdaSTaR, an adaptive sampling variant, reduces training FLOPs by 58.6% on average and achieves the highest test accuracy across six testbeds (Koh et al., 22 May 2025).
  • For mathematical reasoning, combining rejection-sampled (accepted) data across multiple SFT models further amplifies accuracy, with LLaMA-7B reaching 49.3% on GSM8K, up from 35.9% for SFT alone (Yuan et al., 2023).
  • In preference optimization, RSO achieves superior proxy reward wins and human preferences compared to SLiC and DPO, and maintains stable acceptance rates (~20–50%) at <10% additional computational overhead (Liu et al., 2023).
  • In tool-using LLMs, Hint-RFT delivers pass@1 rates up to 95.0% on AMC23, a large +12.5 point gain over basic RFT, by enforcing tool invocation in accepted data (Li et al., 6 Mar 2025).
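
The reported log-linear trend can be read schematically as follows, with $N$ the number of distinct accepted CoTs per example and $\alpha$, $\beta$ illustrative per-benchmark constants (placeholders, not values from the cited paper):

$$\mathrm{Acc}(N) \approx \alpha + \beta \log N \quad\Longrightarrow\quad \mathrm{Acc}(2N) - \mathrm{Acc}(N) \approx \beta \log 2$$

That is, each doubling of correct-path diversity buys an approximately constant additive gain.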

4. Limitations and Biases

Despite its generality, RFT presents recurring limitations:

  • Simplicity bias: Successful trajectories tend to correspond to easier subtasks, while challenging, out-of-distribution cases are rarely “solved” and thus underrepresented in the training set. For instance, in WebShop, RFT disproportionately collects successful data from simple tasks, leaving harder subtasks unlearned (Lan et al., 17 Apr 2025).
  • Expert coverage constraint: RFT cannot learn from planning or partial-credit behaviors in “failed” expert trajectories, even if those failures contain key sub-solutions or rarer skills. This can cause important capabilities (e.g., complex plans) not to be learned, as observed when roughly 65% of WebShop’s hard tasks are unsolved by the expert policy.
  • Sampling imbalance: In vanilla RFT/STaR, examples that are easy for the model are over-sampled, appearing in the fine-tuning set multiple times, while hard examples receive sparse supervised signal. Attempts to rebalance by prioritizing sampling diversity may inadvertently amplify “false positives” (flawed CoTs that still reach the correct answer), with a roughly 9% increase observed under diversity-prioritized sampling (Koh et al., 22 May 2025).
  • Intrinsic model generation limits: The improvement from RFT ultimately saturates with the diversity of correct completions the model can produce; large, overfit models often generate few truly distinct correct chains (Yuan et al., 2023).

5. Adaptive and Augmented RFT

To address the above deficiencies, recent work augments RFT with adaptive sampling, expert failure mining, and tool-use constraints:

  • AdaSTaR (Koh et al., 22 May 2025): Introduces adaptive diversity (AdaD) and curriculum (AdaC), forming a priority heap over training examples to preferentially re-sample under-trained (typically hard) cases. AdaC modulates update frequency based on model accuracy, phasing in difficult data only as the model improves (a minimal sketch of the priority-heap bookkeeping follows this list).
  • EEF (Exploring Expert Failures) (Lan et al., 17 Apr 2025): Rather than discarding failed expert trajectories, this approach simulates from intermediate states within unsuccessful attempts, identifies beneficial action subsequences, and selectively integrates them (excluding observed harmful prefixes). WebShop win rate rises from 53.6% (RFT) to 62.0% (EEF: GPT-3 & 4).
  • Hint-RFT (Li et al., 6 Mar 2025): In the tool-augmented setting, ensures the model learns to invoke external APIs or code by accepting only outputs containing verified tool use and correct results.
  • Statistical RSO (Liu et al., 2023): Samples output pairs from the estimated optimal policy (via rejection over SFT completions weighted by a reward model), resulting in provable convergence to the true MLE for the KL-regularized objective. RSO maintains stable batch-wise acceptance rates and can seamlessly replace or enhance DPO/SLiC without adding a value function or complex reward modeling pipeline.
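
A hedged Python sketch of the adaptive re-sampling idea behind AdaSTaR: keep training examples in a priority heap keyed by how many accepted chains each has contributed so far, so that under-trained (typically harder) examples are drawn first. The key choice and the `heapq` bookkeeping are illustrative assumptions, not the paper's exact scheduling rule.

```python
import heapq

def build_priority_heap(example_ids, accept_counts):
    """Order training examples so under-trained ones are re-sampled first.

    accept_counts[i] counts how many accepted (correct) chains example i has
    contributed so far; a lower count marks the example as under-trained.
    """
    heap = [(accept_counts[i], i) for i in example_ids]
    heapq.heapify(heap)
    return heap

def next_rollout_batch(heap, batch_size):
    """Pop the batch_size most under-trained examples for the next round."""
    batch = []
    while heap and len(batch) < batch_size:
        _, example_id = heapq.heappop(heap)
        batch.append(example_id)
    return batch
```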

6. Comparison to Alternative Data Augmentation and Fine-Tuning Approaches

RFT differs from several related paradigms:

  • Supervised Fine-Tuning (SFT): Standard SFT fits only to human-written (often sparse or single) demonstrations. RFT’s efficiency and gains scale with the diversity and correctness of model-generated augmentations (Yuan et al., 2023).
  • RLHF (e.g., PPO): While RL with human feedback requires online policy optimization, value functions, and careful tuning, RFT and its derivatives are fully offline, leveraging model/self-generated data and simple acceptance predicates.
  • Verifier-based/Tree-search Augmentations (e.g., CoRE, STAR): These require additional components—a trained verifier or explicit search tree—while RFT typically leverages Boolean filtering criteria and basic deduplication (e.g., by equation lists in CoT settings).
  • Self-query and self-revising: Attempts to generate new questions for rejected chains or to revise failed CoTs yield little or even negative improvement for mathematical reasoning (Yuan et al., 2023).

7. Implementation and Practical Considerations

RFT requires no model architecture changes and only standard supervised learning infrastructure. Empirically tested settings include:

  • LLM architecture: Off-the-shelf models such as LLaMA-3 8B, QwQ-32B, T5-Large/XXL, Gemma 7B, and Qwen 2.5/3.2 3B (Lan et al., 17 Apr 2025, Koh et al., 22 May 2025, Li et al., 6 Mar 2025).
  • Prompting template: For agents, interleaved with “Think[…]”, “Action[…]”, “Obs[…]”; for CoT, standard chain-of-thought or code execution format (Lan et al., 17 Apr 2025, Li et al., 6 Mar 2025).
  • Hyperparameters: Fine-tuning iterations $I = 3$–$10$; rollouts per subtask $k = 1$–$100$ (math reasoning); learning rates $5\times10^{-5}$ to $7\times10^{-6}$; batch sizes 64–128; epochs per round 3–6.
  • Hardware: DeepSpeed ZeRO-3, 32×A100, context length up to 16,384, with efficient attention (FlashAttention2 in tool-use applications) (Li et al., 6 Mar 2025).

Careful filtering, such as masking harmful action prefixes or deduplicating equation lists, is critical for leveraging nontrivial model behavior while avoiding error propagation.
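
For instance, deduplicating accepted chains-of-thought by their extracted equation lists (as in the math-reasoning setting) might look like the following sketch; the regular expression and normalization are assumptions for illustration only.

```python
import re

def equation_signature(cot_text):
    """Extract the ordered tuple of equations (e.g., '3+4=7') from a CoT string."""
    equations = re.findall(r"[\d\.\+\-\*/\(\) ]+=\s*[\d\.]+", cot_text)
    return tuple(eq.replace(" ", "") for eq in equations)

def dedup_by_equations(accepted_cots):
    """Keep one accepted chain per distinct equation list, so the fine-tuning
    set counts each reasoning path only once."""
    seen, unique = set(), []
    for cot in accepted_cots:
        signature = equation_signature(cot)
        if signature not in seen:
            seen.add(signature)
            unique.append(cot)
    return unique
```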


In summary, Rejection-sampling Fine-Tuning (RFT) is a simple yet powerful iterative fine-tuning framework for LLMs that incorporates only externally verifiable, self-generated successful data into the training loop. Its reach extends from agentic RL settings to mathematical reasoning, code/CoT tasks, human preference optimization, and tool-use induction. The main strengths of RFT are scalability, minimal dependence on additional modeling infrastructure, and empirical log-linear performance gains tied to correct plan diversity. Its main limitations are bias toward simple tasks and reliance on self-generation diversity, challenges now increasingly addressed via adaptive, curriculum-based, and failure-mining augmentations.
