
Reinforced Fine-Tuning (ReFT): Boosting LLM Reasoning

Updated 28 August 2025
  • ReFT is a post-training paradigm that integrates reinforcement learning with fine-tuning to explore diverse reasoning paths and optimize task outcomes.
  • It employs a two-stage process—starting with supervised fine-tuning followed by reinforcement learning (using PPO) that rewards only correct final answers.
  • ReFT enhances model generalization, sample efficiency, and adaptability in fields from mathematical problem-solving to scientific and multimodal reasoning.

Reinforced Fine-Tuning (ReFT) is a post-training paradigm that integrates reinforcement learning (RL) into the fine-tuning of large-scale neural models, particularly LLMs and multimodal LLMs. Developed to overcome the inherent limitations of purely supervised approaches, ReFT systematically expands the scope of model learning by actively exploring and optimizing over multiple reasoning paths, harnessing reward signals derived from ground-truth task success. ReFT has demonstrated substantial improvements in reasoning, generalization, sample efficiency, and adaptability across domains such as mathematical problem-solving, scientific reasoning, visual tasks, materials design, and more.

1. Motivation and Core Principles

Traditional Supervised Fine-Tuning (SFT) of LLMs, especially for reasoning tasks, relies heavily on static datasets annotated with Chain-of-Thought (CoT) explanations. In typical math reasoning datasets, each question is annotated with a single “canonical” CoT path, despite the existence of many valid reasoning trajectories. SFT thus suffers from limited generalization: it optimizes the likelihood of a fixed route without explicit encouragement to explore alternative but valid solutions.

ReFT addresses these structural limitations through two core mechanisms:

  • Exploration Over Reasoning Paths: Rather than confining learning to a single demonstrated path per instance, ReFT explicitly encourages the model to traverse diverse CoT reasoning routes. Rewards are assigned strictly on the correctness (or utility) of the final outcome, not the adherence to a unique annotated procedure.
  • Two-Stage Optimization: The ReFT pipeline consists of an SFT "warm-up" phase, followed by reinforcement learning, most often with policy-gradient algorithms such as Proximal Policy Optimization (PPO). The RL phase samples numerous reasoning trajectories for each problem, granting reward signals when ground-truth criteria are met, thereby leveraging the compositional and exploratory power of modern neural architectures (Luong et al., 17 Jan 2024); a high-level sketch of this schedule follows the list.
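
The overall schedule can be summarized in a few lines of pseudocode. The sketch below is illustrative only: `sft_step`, `sample_cot`, `answer_reward`, and `ppo_update` are hypothetical helpers standing in for the SFT loss, the sampling loop, the outcome reward, and the clipped PPO update detailed in Section 2.

```python
# Minimal sketch of the two-stage ReFT schedule (hypothetical helper functions).

def reft_train(model, dataset, sft_epochs=2, rl_epochs=8, samples_per_question=4):
    # Stage 1: supervised warm-up on annotated (question, CoT) pairs.
    for _ in range(sft_epochs):
        for question, cot, _answer in dataset:
            sft_step(model, question, cot)  # cross-entropy on the demonstrated CoT

    # Stage 2: reinforcement learning over self-sampled reasoning paths.
    for _ in range(rl_epochs):
        for question, _cot, answer in dataset:
            rollouts = [sample_cot(model, question) for _ in range(samples_per_question)]
            rewards = [answer_reward(r, answer) for r in rollouts]  # outcome-only reward
            ppo_update(model, question, rollouts, rewards)          # clipped policy-gradient step
    return model
```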

2. Methodological Foundation

2.1. Two-Stage Fine-Tuning Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

  • The model is fine-tuned via cross-entropy loss over paired (question, CoT) samples:

$$\mathcal{L}_{SFT}(\theta) = -\mathbb{E}_{e \sim \mathcal{D}} \left[ \sum_{t} \log \pi_\theta(a_t \mid s_t) \right]$$

where $s_t$ is the state (the question plus previously generated tokens) and $a_t$ is the token produced at step $t$.
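
As a concrete illustration, the SFT objective above is an ordinary next-token cross-entropy over the concatenated question and CoT. The sketch below assumes a Hugging Face-style causal LM whose forward pass returns `.logits`, and the common convention of masking question tokens with label `-100`; neither detail is prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Token-level cross-entropy over (question + CoT) sequences.

    `labels` mirrors `input_ids`, with question tokens set to -100 so the loss
    covers only the demonstrated reasoning tokens (an assumed convention).
    """
    logits = model(input_ids).logits                 # (batch, seq_len, vocab)
    # Shift so each position predicts the next token.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```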

Stage 2: Reinforcement Learning with PPO

  • The model samples full CoT completions; the answer is extracted and scored:

$$r(s_t, a_t, s_{t+1}) = \begin{cases} 1, & \text{if } \mathrm{EXTRACT}(s_{t+1}) \text{ equals the ground-truth answer} \\ 0.1, & \text{if a valid but incorrect answer is extracted} \\ 0, & \text{if no answer can be extracted} \end{cases}$$

  • The reward includes an explicit KL divergence penalty to regularize updates:

$$r_\text{total} = r - \beta \cdot D_{KL}(\pi_\theta \parallel \pi_{\text{init}})$$

  • Advantage estimation uses Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{\ell} (\gamma\lambda)^{\ell}\, \delta_{t+\ell}, \qquad \delta_t = -V(s_t) + r_\text{total}(s_t, a_t, s_{t+1}) + \gamma V(s_{t+1})$$

  • The final RL loss combines the clipped PPO policy objective with a value-function term for stable updates; a minimal sketch of the reward and advantage computation appears below.
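
The pieces above can be made concrete with a short sketch. The snippet implements the case-based outcome reward, the per-token KL penalty against the initial (SFT) policy, and the GAE recursion; the per-token placement of the KL term and the value `beta=0.01` are illustrative assumptions, not settings taken from the paper.

```python
def outcome_reward(extracted_answer, gold_answer):
    """Terminal reward from the case analysis above: 1 for a correct final answer,
    0.1 for a valid but incorrect answer, 0 when no answer could be extracted."""
    if extracted_answer is None:
        return 0.0
    return 1.0 if extracted_answer == gold_answer else 0.1

def kl_adjusted_rewards(terminal_reward, logp_policy, logp_init, beta=0.01):
    """Per-token rewards: the outcome reward is assigned at the final token only,
    and a per-token KL estimate (log-prob gap vs. the initial SFT policy) is
    subtracted everywhere. `beta` is an illustrative value."""
    rewards = [-beta * (lp - li) for lp, li in zip(logp_policy, logp_init)]
    rewards[-1] += terminal_reward
    return rewards

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one sampled trajectory.
    `values` must have length len(rewards) + 1 (bootstrap value for the final state)."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages
```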

2.2. Diversity and Generalization

By exploring trajectories beyond the single annotated path, ReFT allows the model to discover multiple valid inference routes. The reward design means only the correctness of the terminal answer matters, not the particular pathway taken. This "reward-on-outcome" approach produces a richer, less biased supervision signal, enhancing the model's ability to generalize to out-of-distribution or adversarial problems.

2.3. Inference-Time Strategies

Inference can be further enhanced by:

  • Majority voting over sampled completions.
  • Reward-model-based ranking and re-ranking of candidate solutions.

Such strategies leverage the diversity of the sampled solutions to select the most reliable or consensus answer (Luong et al., 17 Jan 2024); a minimal voting sketch follows.
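
Below is a minimal sketch of majority voting, assuming hypothetical `sample_cot` and `extract_answer` helpers for sampling completions and pulling out final answers:

```python
from collections import Counter

def majority_vote(model, question, n_samples=16, temperature=0.8):
    """Sample several CoT completions and return the most frequent extracted answer.
    `sample_cot` and `extract_answer` are hypothetical helpers standing in for the
    model's sampling loop and the answer-extraction rule used during training."""
    answers = []
    for _ in range(n_samples):
        completion = sample_cot(model, question, temperature=temperature)
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

Reward-model re-ranking would replace the frequency count with a learned scorer over the same pool of sampled completions.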

3. Empirical Performance and Results

ReFT has been evaluated extensively on mathematical reasoning datasets such as GSM8K, MathQA, and SVAMP, using foundation models (e.g., Galactica-6.7B and CodeLLAMA-7B). Representative results are summarized below:

| Model | Dataset | Training Regime | Accuracy Improvement |
| --- | --- | --- | --- |
| CodeLLAMA-7B | GSM8K | SFT → ReFT | +9 points |
| CodeLLAMA-7B | MathQA | SFT → ReFT | Significantly higher accuracy; numeric output prevents reward hacking |

  • ReFT outperforms SFT and both offline and online self-training competitors.
  • When augmented with inference-time majority voting or re-ranking, further gains are possible.
  • The gains arise without the need for extra or augmented training questions: both SFT and ReFT learn from an identical set of question statements.
  • Reward hacking can occur when the search space is artificially limited (e.g., multiple-choice questions); this is mitigated by rewarding only for correct, unconstrained numeric outputs.

4. Generalization, Stability, and Limitations

A defining strength of ReFT is its generalization on previously unseen problems, achieved without extra data. Sampling multiple reasoning paths during RL and regularizing with a KL penalty help maintain training stability and prevent mode collapse. However, care must be taken when applying RL to settings with small or highly constrained output spaces, as reward exploitation may occur.

Potential limitations include:

  • Sensitivity to the reward design; poorly chosen rewards or evaluation metrics can result in suboptimal training dynamics or reward hacking.
  • Increased compute during RL training due to sampling numerous completions per data point.

5. Extension to Other Reasoning and Domain Tasks

While the initial focus is on math problem-solving, the ReFT paradigm is directly extensible to any domain where multiple valid inference chains or explanations exist. Examples include:

  • Scientific reasoning (e.g., data analysis pipelines, experimental design).
  • Program synthesis and code generation (where multiple correct implementations may exist).
  • Multi-step decision processes in language or multi-modal environments (reasoning over dialog, instruction following).

In these contexts, reward functions must be carefully crafted to recognize correct (but structurally diverse) outputs, and where possible, programmatic verification or proxy reward models can be employed.
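
For program synthesis, for example, a programmatic verifier can serve directly as the outcome reward. The sketch below scores a candidate implementation by running it against a test suite; the specific harness layout is an assumption for illustration, not part of the original ReFT setup.

```python
import os
import subprocess
import tempfile

def unit_test_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Outcome-only reward for program synthesis: 1.0 if the candidate program
    passes the provided tests, 0.0 otherwise (illustrative harness)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)
```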

6. Future Research Directions

The initial demonstration of ReFT (Luong et al., 17 Jan 2024) invites a wider investigation into:

  • Improved reward shaping and automated reward model learning for complex, less verifiable domains.
  • Efficient exploration strategies to reduce the compute overhead of trajectory sampling.
  • Integrating ReFT with curriculum learning or human-in-the-loop feedback.
  • Application to multimodal and cross-modal reasoning tasks with complex, multi-path verification criteria.

Subsequent research has extended ReFT principles to visual reasoning (Liu et al., 3 Mar 2025, Tan et al., 26 Mar 2025), domain adaptation with limited data (Zhang et al., 22 Dec 2024), and efficient curriculum-based RL (Shi et al., 7 Apr 2025), all confirming its broad applicability and robustness in settings that demand both versatility and generalization.


Reinforced Fine-Tuning constitutes a robust RL-driven alternative to purely supervised post-training, enabling neural models to acquire diverse, generalizable strategies for complex reasoning problems, especially in domains with under-specified, multi-path solution spaces. Its methodologically principled design and empirical success across multiple high-profile benchmarks underscore its central role in the next generation of foundation model adaptation.
