Rejection Fine-Tuning (RFT)
- Rejection Fine-Tuning is a method that filters and selects training examples based on correctness criteria and safety signals.
- It employs candidate generation, automated verification, and deduplication to enhance model robustness and generalization.
- Empirical studies show RFT boosts performance metrics—e.g., GSM8K accuracy improves from 35.9% to 41.7% for lower-capacity models.
Rejection Fine-Tuning (RFT) is a suite of methods that augment and refine the supervised or reinforcement learning of large models—particularly language and reasoning models—by explicitly filtering, selecting, or synthesizing training examples according to correctness criteria, task-specific rejection signals, or alignment targets. RFT leverages rejection sampling, reward-based selection, and structured data augmentation to produce more reliable, robust, and generalizable models, with diverse applications ranging from mathematical reasoning to multi-agent simulation, adversarial robustness, and safe alignment.
1. Conceptual Foundations and Definitions
RFT is characterized by its use of an explicit rejection (or filtering) mechanism on candidate training examples during model fine-tuning. In the prototypical setting for mathematical reasoning (Yuan et al., 2023), an already supervised-fine-tuned (SFT) model generates multiple candidate reasoning paths for each input; only those that yield the correct final answer and pass external verification (e.g., programmatic evaluation) are retained. The augmented dataset is constructed as
$$\mathcal{D}_{\mathrm{RFT}} = \big\{ (q_i,\, r_i^{(j)},\, a_i) \;\big|\; j = 1, \dots, k_i \big\},$$
where $q_i$ is a question, $a_i$ is its ground-truth answer, and $r_i^{(1)}, \dots, r_i^{(k_i)}$ are the distinct, validated reasoning chains generated by the SFT model $\pi_{\mathrm{SFT}}$.
RFT differs from standard fine-tuning in that its selection process is guided by success criteria (correct final answers and/or trajectory-level correctness), and it may employ deduplication so that only genuinely distinct reasoning paths (determined, for example, by equation or action sequence identity) are used to enrich supervision.
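As a concrete illustration of deduplication by solution structure, the sketch below reduces each reasoning chain to the ordered list of equations it contains and keeps one path per signature. The equation-matching regex is a simplifying assumption about how chains are formatted, not a rule from the cited work.

```python
import re

def equation_signature(reasoning: str) -> tuple:
    """Reduce a reasoning chain to the ordered equations it contains, so paths
    that differ only in surface wording map to the same signature."""
    # Assumed convention: arithmetic steps appear as text like "12 * 3 = 36".
    equations = re.findall(r"[\d\.\s\+\-\*/\(\)]+=\s*[\d\.]+", reasoning)
    return tuple(eq.replace(" ", "") for eq in equations)

def deduplicate(paths: list[str]) -> list[str]:
    """Keep one representative reasoning path per distinct equation signature."""
    seen, distinct = set(), []
    for path in paths:
        sig = equation_signature(path)
        if sig not in seen:
            seen.add(sig)
            distinct.append(path)
    return distinct
```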
The term RFT is also used (sometimes interchangeably with "reinforcement fine-tuning") for variants that optimize against reward or behavioral alignment signals via reinforcement learning. In these cases, a learned or rule-based reward model can incentivize the rejection of out-of-distribution, unsafe, or low-confidence predictions (Xu et al., 27 Mar 2024, Song et al., 20 May 2025, Ham et al., 9 Jun 2025).
2. Methodological Workflow and Mechanisms
The central steps of RFT, as codified in mathematical reasoning (Yuan et al., 2023), can be summarized as follows (a minimal code sketch appears after the list):
- Candidate Generation: For each input $q_i$ with known answer $a_i$, a supervised model generates $k$ candidate reasoning paths (with temperature-controlled sampling for output diversity).
- Automated Verification: Each candidate $r_i^{(j)}$ is evaluated for two properties: (i) does its final answer match $a_i$, and (ii) are all intermediate calculations correct (e.g., via a Python programmatic evaluator)?
- Deduplication: Distinctness is enforced at the level of solution structure (e.g., via list of equations or action traces).
- Augmented Fine-Tuning: The union of original and additional, diverse, correct (and deduplicated) samples is used for subsequent supervised fine-tuning.
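A minimal sketch of this workflow is shown below, with the generation, answer-extraction, verification, and deduplication steps injected as callables. All of the helper interfaces here are assumptions for illustration, not APIs from the cited papers.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str, str]  # (question, ground-truth answer, reasoning path)

def build_rft_dataset(
    sft_data: Iterable[Example],
    sample_paths: Callable[[str, int, float], List[str]],  # hypothetical sampler wrapper
    extract_answer: Callable[[str], str],                   # pulls the final answer from a path
    verify_steps: Callable[[str], bool],                    # programmatic check of intermediate steps
    deduplicate: Callable[[List[str]], List[str]],          # e.g., by equation signature (see above)
    k: int = 100,
    temperature: float = 0.7,
) -> List[Example]:
    """Candidate generation -> automated verification -> deduplication -> merge."""
    sft_data = list(sft_data)
    augmented: List[Example] = list(sft_data)  # start from the original supervised triples
    for question, answer, _ in sft_data:
        # Candidate generation with temperature-controlled sampling for diversity.
        candidates = sample_paths(question, k, temperature)
        # Keep only paths whose final answer matches and whose steps verify.
        correct = [p for p in candidates
                   if extract_answer(p) == answer and verify_steps(p)]
        # Retain only structurally distinct paths.
        for path in deduplicate(correct):
            augmented.append((question, answer, path))
    return augmented  # fine-tune on `augmented` with the standard SFT objective
```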
Where rejection signals relate to safety or refusal (e.g., the model's ability to abstain from answering out-of-domain or unanswerable questions), RFT protocols may also involve training a reward model to provide higher rewards for correct refusals and truthful rejections than for hallucinated or overconfident answers (Xu et al., 27 Mar 2024, Song et al., 20 May 2025, Ham et al., 9 Jun 2025).
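Such a refusal-aware signal can be as simple as the rule-based sketch below; the specific reward values and the three-way case split are illustrative assumptions, not the reward models of the cited works.

```python
def refusal_aware_reward(answerable: bool, refused: bool, correct: bool) -> float:
    """Illustrative rule-based reward: truthful refusals on unanswerable inputs
    outrank hallucinated answers; correct answers on answerable inputs rank highest."""
    if answerable:
        if correct:
            return 1.0                      # correct, supported answer
        return -0.5 if refused else -1.0    # over-refusal is penalized less than a wrong answer
    # Unanswerable or out-of-domain question:
    return 0.5 if refused else -1.0         # reward abstention, penalize confident hallucination
```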
Metrics-Oriented and Hybrid Strategies: Advanced protocols include metric-oriented policy optimization (MPO) that directly aligns the reward with complex, possibly non-differentiable behavior metrics (e.g., simulation realism) (Pei et al., 28 Sep 2025) or hybrid approaches that blend SFT and RFT via demonstration prefixes (Huang et al., 2 Jul 2025).
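At its core, the metric-oriented idea reduces to using the (possibly non-differentiable) behavior metric as a scalar reward inside a standard policy-gradient surrogate, roughly as in the sketch below (a generic REINFORCE-with-baseline form, not the SMART-R1 implementation):

```python
import torch

def metric_oriented_loss(seq_logprobs: torch.Tensor, metric_scores: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate: each rollout's summed log-probability is scaled
    by its behavior-metric score, centered by the batch mean as a baseline."""
    advantages = (metric_scores - metric_scores.mean()).detach()  # metric itself is non-differentiable
    return -(advantages * seq_logprobs).mean()
```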
3. Empirical Findings and Performance Impact
Across mathematical reasoning and agentic benchmarks, RFT yields substantial gains, especially for lower-capacity or less performant models. Empirical observations include:
- For LLaMA-7B, SFT achieves maj1@1 accuracy of 35.9% on GSM8K, while RFT with deduplicated distinct reasoning paths improves this to 41.7%; aggregating rejection-sampled data across multiple models further boosts this to 49.3% (Yuan et al., 2023).
- On agent tasks, RFT can be iteratively applied over self-generated trajectories to enable small or medium LLMs to surpass their expert demonstrators in win rate (e.g., increasing from 35.6% to 62% on WebShop with expert-failure mining integrated into RFT (Lan et al., 17 Apr 2025)).
- In multi-agent simulation, RFT in the SMART-R1 pipeline (Pei et al., 28 Sep 2025) enables the model to optimize for the Realism Meta metric, culminating in a state-of-the-art realism score of 0.7858 on the WOSAC benchmark.
A consistent pattern is that RFT brings more pronounced improvement to weaker models and that benefit saturates as model capacity and pretraining quality increase, due to diminishing diversity in new solution/behavioral paths generated by the base model (Yuan et al., 2023).
4. Innovations, Variants, and Extensions
Several technical enhancements have been proposed to generalize and strengthen RFT:
- Reasoning-highlighted Fine-Tuning: Token-level weighting schemes, such as SHAD+RFT, emphasize reasoning tokens over boilerplate in the loss objective, giving more learning signal to sample-specific content and less to repetitive formatting, yielding improved agentic and tool-use capabilities (Ye et al., 19 Dec 2024); a weighted-loss sketch follows this list.
- Refusal- and Alignment-based Filtering: Methods such as Refusal-Feature-guided Teacher (ReFT) exploit internal representation differences—derived from safety-aligned models—to identify and filter harmful prompts, ensuring safe alignment during user-data fine-tuning (Ham et al., 9 Jun 2025).
- Process-level Rewarding: RL-based variants synthesize or extract intermediate reasoning chains or behavioral plans, using process reward models to supervise not just final outcomes but also reasoning steps, improving domain transfer when explicit process data is unavailable (Zhang et al., 22 Dec 2024).
- Prefix Sampling Hybrids: Prefix-RFT bridges SFT and RFT by anchoring exploration to a demonstration prefix followed by on-policy continuation, harmonizing stability from demonstrations with exploration benefits (Huang et al., 2 Jul 2025).
- Hint-based RFT: Hint-RFT combines strategically injected hints (to prompt tool invocations) with rejection sampling, enabling models to learn robust self-checking and tool-using behaviors (as in the START model for code-based or scientific reasoning (Li et al., 6 Mar 2025)).
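As an illustration of the token-level weighting in the first item above, a down-weighted cross-entropy can be written as follows. The binary reasoning mask and the 0.2 boilerplate weight are placeholders; the SHAD token classifier itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def weighted_token_loss(logits: torch.Tensor, targets: torch.Tensor,
                        reasoning_mask: torch.Tensor, boilerplate_weight: float = 0.2):
    """Cross-entropy where tokens flagged as reasoning get full weight and
    formatting/boilerplate tokens are down-weighted."""
    per_token = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                targets.view(-1), reduction="none")
    weights = torch.where(reasoning_mask.view(-1).bool(),
                          torch.ones_like(per_token),
                          torch.full_like(per_token, boilerplate_weight))
    return (weights * per_token).sum() / weights.sum()
```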
5. Comparative Analyses: RFT vs. SFT and Other Approaches
Empirical studies systematically compare RFT with SFT and hybrid strategies:
| Aspect | Supervised Fine-Tuning (SFT) | Rejection Fine-Tuning (RFT) | Hybrid (e.g., Prefix-RFT) |
|---|---|---|---|
| Data Augmentation | Relies on human-annotated data | Augments with diverse, self-generated paths | Blends demonstration prefix + RL |
| Resistance to Forgetting | Prone to catastrophic forgetting under sequential post-training | Preserves prior knowledge via implicit regularization | Intermediate, depending on update weighting |
| Robustness to OOD | Sensitive; may hallucinate or overfit to format | Better generalization, but can incur a hallucination tax without refusal data | Improved if hybridized with refusal tuning |
| Implementation Cost | Moderate (no RL required) | Higher, but largely automated; no reward model needed when answers are programmatically verifiable (e.g., math) | Slightly higher than RFT |
Notably, RFT is less prone to catastrophic forgetting during continual post-training, attributed to data-dependent implicit regularization: RFT updates are scaled by reward variance, suppressing large parameter shifts for uncertain or distributionally mismatched samples (Lai et al., 7 Jul 2025, Zhang et al., 30 Jun 2025). Hybrid and prefix-based schemes tend to offer further improvements by synergizing demonstration anchoring and reward-driven exploration (Huang et al., 2 Jul 2025).
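A schematic way to see this implicit regularization is through a generic policy-gradient update with a mean-reward baseline (the notation here is illustrative rather than taken from the cited papers):

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\!\left[\big(R(x,y)-\bar{R}(x)\big)\,\nabla_\theta \log \pi_\theta(y\mid x)\right], \qquad \bar{R}(x)=\mathbb{E}_{y\sim\pi_\theta}\!\left[R(x,y)\right].$$

When the sampled rewards for a prompt are nearly identical (low reward variance), the centered term $R-\bar{R}$ is close to zero and the resulting parameter update is correspondingly small; this is the sense in which distributionally mismatched or already-mastered samples induce little drift.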
6. Limitations, Open Challenges, and Trade-offs
RFT has several notable limitations:
- Hallucination Tax: Standard RFT can degrade the model’s refusal behavior on ambiguous or unanswerable questions, causing it to hallucinate plausible but unsupported answers. Introducing synthetic unanswerable data (e.g., the SUM dataset) during fine-tuning can mitigate this degradation (Song et al., 20 May 2025); a data-mixing sketch follows this list.
- Diminishing Returns: Well-trained, high-capacity models exhibit limited benefit from additional RFT cycles, as their output space is already rich in correct and diverse reasoning paths (Yuan et al., 2023).
- Hyperparameter Sensitivity: Some variants, particularly in robustness or adversarial settings, are highly sensitive to loss weighting and learning rates; solutions such as low-rank disentanglement and automated schedulers (as in AutoLoRa) alleviate optimization instability (Xu et al., 2023).
- Data Quality Dependence: In agentic or multi-step settings, performance is sensitive to the quality and coverage of positive examples; exploration of expert failures and prefix hybridization partially address the OOD and rare-subtask deficiencies (Lan et al., 17 Apr 2025, Huang et al., 2 Jul 2025).
- Safety and Alignment: When user data is potentially adversarial, as in Finetuning-as-a-Service, robust filtering and alignment distillation using the model's own refusal features is necessary to prevent harmful outputs without sacrificing task performance (Ham et al., 9 Jun 2025).
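The mitigation in the first item reduces to a data-mixing step, roughly as sketched below. The 10% mixing ratio and the refusal string are illustrative assumptions, not values prescribed by the SUM work; the data is treated generically as (prompt, target) pairs.

```python
import random

REFUSAL_TARGET = "I don't have enough information to answer this question."  # illustrative wording

def mix_in_unanswerable(rft_pairs, unanswerable_questions, ratio=0.1, seed=0):
    """Blend SUM-style unanswerable questions, each paired with an explicit refusal
    target, into a dataset of (prompt, target) fine-tuning pairs."""
    rng = random.Random(seed)
    n = min(int(ratio * len(rft_pairs)), len(unanswerable_questions))
    refusal_pairs = [(q, REFUSAL_TARGET) for q in rng.sample(unanswerable_questions, n)]
    mixed = list(rft_pairs) + refusal_pairs
    rng.shuffle(mixed)
    return mixed
```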
A plausible implication is that future advances will rely on adaptive rejection, curriculum-based augmentation, and more granular alignment with both correctness and refusal targets.
7. Applications and Broader Impact
RFT has demonstrated utility across a spectrum of domains:
- Mathematical Reasoning: RFT substantially improves accuracy on benchmarks such as GSM8K, especially in low-resource or small-model regimes (Yuan et al., 2023).
- Agentic Planning and Tool Use: By focusing learning on genuine reasoning or action tokens, RFT enables LLM agents to achieve superior multi-step task execution, including complex navigation and tool invocation (Ye et al., 19 Dec 2024, Li et al., 6 Mar 2025).
- Safe Alignment and Finetuning: RFT-informed strategies ensure robust alignment in user-customizable models for safety-critical applications by filtering and distilling safe behaviors (Ham et al., 9 Jun 2025).
- Continual and Domain-specific Adaptation: RFT supports stable addition of new skills in multimodal large models, exhibiting low forgetting and improved generalization, a key advantage in perpetual model updating scenarios (Lai et al., 7 Jul 2025, Zhang et al., 30 Jun 2025).
- Simulation and Autonomous Systems: RFT with metric-oriented reward optimization directly targets behavioral fidelity in large-scale, multi-agent traffic simulation (Pei et al., 28 Sep 2025).
The broad mechanism of rejection sampling and filtered optimization in fine-tuning thus provides a principled foundation for enhancing accuracy, generalization, safety, and adaptability of large reasoning and agent models.