Reinforcement Fine-Tuning (RFT)
- Reinforcement Fine-Tuning (RFT) is a method that adapts pretrained models using reinforcement learning objectives and policy gradients to maximize expected rewards.
- RFT leverages techniques such as a supervised fine-tuning initialization, KL regularization, and policy gradient methods like PPO to ensure training stability and mitigate vanishing gradients.
- Empirical studies demonstrate that RFT improves performance in reasoning, visual, and embodied tasks by effectively aligning model behaviors with human preferences and structured task requirements.
Reinforcement Fine-Tuning (RFT) is a post-pretraining adaptation methodology in which pretrained models, typically LLMs or multimodal LLMs (MLLMs), are further optimized by maximizing expected rewards using policy gradient algorithms. Unlike supervised fine-tuning (SFT), which fits the model to labeled demonstration data, RFT employs reinforcement learning (RL) objectives—often leveraging rule-based or learned reward functions—to instill desired behaviors aligned with human preferences, downstream task requirements, or performance on structured reasoning problems.
1. Fundamental Principles and Theoretical Foundations
RFT is grounded in standard RL formalism. The model operates as a parameterized policy $\pi_\theta(y \mid x)$, generating an output $y$ conditioned on an input $x$. The central training objective is to maximize the expected reward

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big],$$

where $r(x, y)$ is the reward function, which may be a learned preference model, a verifiable rule-based metric, or a hybrid thereof.
Optimization is typically performed via policy gradient methods. The most prominent algorithm in recent practice is Proximal Policy Optimization (PPO), with variants such as Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) also widely adopted. The policy gradient is estimated as

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, \nabla_\theta \log \pi_\theta(y \mid x)\, A(x, y) \,\big],$$

where $A(x, y)$ is an advantage function, often involving a baseline (such as a value function estimate) to reduce variance.
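As a concrete illustration of the PPO-style update, the following is a minimal sketch of the clipped surrogate loss over a batch of sampled outputs; the function name, tensor shapes, and default clipping value are illustrative assumptions rather than a reproduction of any cited implementation.

```python
import torch

def ppo_clipped_loss(logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate for a batch of sampled outputs.

    logprobs:     log pi_theta(y | x) under the current policy (requires grad).
    old_logprobs: log pi_theta_old(y | x) from the sampling policy (no grad).
    advantages:   A(x, y), e.g. reward minus a value-function baseline.
    """
    ratio = torch.exp(logprobs - old_logprobs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Minimizing the negative of the clipped minimum ascends the surrogate.
    return -torch.min(unclipped, clipped).mean()
```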
For pipeline stability and behavioral alignment, a KL-divergence regularization term is often added:

$$J_{\mathrm{KL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(x, y)] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big) \,\Big].$$

This regularizer discourages the updated policy from diverging too far from the pretrained (reference) model.
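A minimal sketch of the regularized objective, assuming summed sequence log-probabilities from the current policy and a frozen reference policy, and using a simple sample-based KL estimate (all names and the default coefficient are illustrative):

```python
import torch

def kl_regularized_pg_loss(logprobs: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           rewards: torch.Tensor,
                           baseline: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    """Policy-gradient loss with a KL penalty toward a frozen reference policy.

    logprobs:     log pi_theta(y | x) for sampled outputs (requires grad).
    ref_logprobs: log pi_ref(y | x) for the same outputs (no grad).
    rewards, baseline: scalar reward and baseline per sample.
    beta: KL coefficient; in practice it is often decayed over training.
    The KL term uses the sample-based estimate E[log pi_theta - log pi_ref].
    """
    advantages = (rewards - baseline).detach()   # no gradient through A
    pg_loss = -(advantages * logprobs).mean()
    kl_estimate = (logprobs - ref_logprobs.detach()).mean()
    return pg_loss + beta * kl_estimate
```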
2. Optimization Challenges: Vanishing Gradients
A central theoretical finding is the "vanishing gradients" phenomenon, established in "Vanishing Gradients in Reinforcement Finetuning of LLMs" (2310.20703). When the model's output distribution yields low reward variance across possible outputs, i.e., when the standard deviation of $r(x, y)$ under $y \sim \pi_\theta(\cdot \mid x)$ is small, the expected policy gradient diminishes, potentially stalling learning: the norm of the expected gradient at an input $x$ is upper bounded by a quantity that scales with the output length, a bound on the logit Jacobian norm, and the reward standard deviation $\mathrm{Std}_{y \sim \pi_\theta(\cdot \mid x)}[r(x, y)]$, so it vanishes as that standard deviation approaches zero. Importantly, even when the model's expected reward is suboptimal, low reward variability stalls optimization because the gradient signal nearly vanishes. This effect also applies to PPO-based learning, with a bound on the difference between the surrogate and true gradients proportional to the total variation distance between the updated and reference policies.
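In practice, the per-input reward standard deviation is therefore a cheap diagnostic for flat-gradient regions. A minimal sketch, assuming placeholder sampling and reward functions:

```python
import statistics

def reward_std_per_input(prompt, sample_fn, reward_fn, num_samples: int = 16) -> float:
    """Estimate Std_{y ~ pi_theta(.|x)}[r(x, y)] for one input by sampling.

    sample_fn(prompt) -> one output sampled from the current policy (placeholder).
    reward_fn(prompt, output) -> scalar reward (placeholder).
    Inputs with near-zero estimated standard deviation are candidates for the
    flat-gradient regime described above (e.g., route them through SFT first).
    """
    rewards = [reward_fn(prompt, sample_fn(prompt)) for _ in range(num_samples)]
    return statistics.pstdev(rewards)
```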
Empirical evidence from GRUE and controlled synthetic benchmarks confirms that this phenomenon occurs frequently, especially for tasks where the pretrained model exhibits low diversity in its predictions, and is independent of algorithms or optimizer noise.
3. RFT Methodologies and Design Patterns
Two-Stage Fine-Tuning: SFT + RFT
The prevailing RFT recipe is a two-stage pipeline:
- Supervised Fine-Tuning (SFT): The model is "warmed up" on demonstration data, typically using a cross-entropy loss on target outputs or chain-of-thought (CoT) annotations. SFT aligns the model to desired behaviors and increases reward variance for challenging inputs.
- Reinforcement Fine-Tuning (RFT): The model is then further optimized with a policy gradient objective (e.g., PPO or GRPO), sampling multiple output trajectories and updating parameters based on reward signals.
In "Vanishing Gradients in Reinforcement Finetuning of LLMs" (2310.20703), an initial SFT step—on as little as 1% of the data and with few optimization steps—was shown to substantially increase reward gains by lifting the model out of flat-gradient regions, thus mitigating vanishing gradients.
Sampling and Policy Update Variants
- Group Relative Policy Optimization (GRPO): For tasks like reasoning and visual classification, GRPO avoids a value network by comparing groups of outputs; for a group of $G$ outputs sampled for the same input, with rewards $r_1, \ldots, r_G$, normalized advantages are computed as $\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$ (see the sketch after this list).
- Masked and Fine-Grained RL Approaches: In mesh generation, Masked-DPO applies spatial masking to only update segments of the output sequence corresponding to low-quality mesh regions (2505.16761).
- Adaptive Curriculum RFT: AdaRFT dynamically tunes the difficulty of training examples by maintaining a target difficulty that is adjusted after each training step based on the model's recent rewards, focusing compute on problems best matched to the model's evolving capabilities (2504.05520); a hypothetical update of this form appears in the sketch below.
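The sketch below illustrates the group-relative advantage computation, together with a hypothetical curriculum-style difficulty update in the spirit of AdaRFT; the exact update rule, function names, and default constants are assumptions rather than reproductions of the cited papers.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each output's reward by the mean and
    standard deviation of its sampling group (no learned value network)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def update_target_difficulty(target: float, mean_reward: float,
                             goal_reward: float = 0.5, step_size: float = 0.1) -> float:
    """Hypothetical AdaRFT-style curriculum update: raise the target difficulty
    when the mean reward exceeds the goal, lower it otherwise (form assumed)."""
    return target + step_size * (mean_reward - goal_reward)
```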
4. Empirical Results and Benchmarks
RFT has been empirically validated as an effective strategy for reasoning, generalization, and domain transfer:
- Language and Reasoning: RFT (e.g., ReFT, Prefix-RFT) significantly outperforms SFT in mathematical reasoning (GSM8K, MathQA, AIME, etc.), enabling learning from multiple reasoning paths and exploration of solution space (2401.08967, 2507.01679).
- Visual and Multimodal Tasks: Visual-RFT and Reason-RFT demonstrate large gains on visual classification, object detection, and visual reasoning benchmarks, with enhancements as high as +24.3% in one-shot fine-grained classification and strong generalization under few-shot or out-of-domain settings (2503.01785, 2503.20752).
- Embodied Agents: RFTF and SEEA-R1 frameworks introduce dense temporal rewards and learned multimodal reward models, achieving state-of-the-art results on embodied manipulation and navigation tasks (e.g., CALVIN ABC-D, ALFWorld) (2505.19767, 2506.21669).
- Continual Learning: RFT inherently mitigates catastrophic forgetting when adapting to novel tasks, outperforming SFT in knowledge retention and even enhancing general reasoning benchmarks (MMMU, MMLU-Pro) (2507.05386, 2506.23508).
A summary table of representative benchmarks and RFT methods:
| Domain | RFT Method | Key Metric(s) | Gain over SFT |
|---|---|---|---|
| Math Reasoning | ReFT, Prefix-RFT | Accuracy (%), Pass@1 | +8–9 points |
| Visual (VQA, CLS, DET) | Visual-RFT, Reason-RFT | mAP, accuracy, IoU | +15–24 points |
| Embodied Agents | RFTF, SEEA-R1 | Success Length, Success Rate (%) | SOTA |
| Continual Learning | GRPO, RIF-RFT | Retention, Generalization | Strongly mitigated forgetting |
5. Task-Specific Reward Design and Regularization
Custom reward design is central to RFT:
- Rule-Based Verifiable Rewards: In Visual-RFT, rewards depend on task-specific criteria such as IoU for object detection or class label accuracy for classification (a minimal IoU-based reward sketch follows this list). In video reasoning, semantic consistency between generated reasoning and visual evidence is enforced by computing the similarity between text representations and corresponding video frames (2503.01785, 2505.12434).
- Format and Structure Compliance: For multi-step search, composite rewards target answer correctness, DAG validity, and strict output formatting, ensuring outputs are both factually correct and structurally executable (2506.08352).
- KL Regularization: Most successful RFT pipelines include a KL term with a decaying weight, acting as a regularizer to limit divergence from the reference (pretrained) policy.
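As an illustration of a rule-based verifiable reward, the sketch below scores a predicted bounding box by its IoU with the ground truth and adds a small bonus for well-formatted output; the box convention (x1, y1, x2, y2), the bonus weight, and the function names are illustrative assumptions.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, gt_box, output_is_well_formatted: bool) -> float:
    """Verifiable reward for a single detection: IoU with the ground-truth box
    plus a small format-compliance bonus (weight is illustrative)."""
    return iou(pred_box, gt_box) + (0.1 if output_is_well_formatted else 0.0)
```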
Notably, conventional heuristics such as increasing learning rates or temperature scaling do not mitigate the vanishing gradient issue; instead, techniques such as initial SFT or reward shaping are essential (2310.20703).
6. Challenges, Limitations, and Open Problems
Despite its empirical and theoretical strengths, RFT faces challenges:
- Vanishing Gradients: As established in (2310.20703), small reward variance can cause optimization to stall completely in flat-reward regions, necessitating SFT-based initialization or new algorithmic solutions with guaranteed nonzero reward variance.
- Sample and Compute Efficiency: RFT (particularly in complex domains) can be compute- and sample-intensive. Adaptive curriculum methods (AdaRFT) and instance filtering (RIF-RFT) have been proposed to mitigate inefficiencies (2504.05520, 2507.05386).
- Hallucination and Trustworthiness: RFT can degrade refusal behavior, leading models to hallucinate answers when confronted with unanswerable questions. Mixing in counterexamples (e.g., SUM data) restores proper refusal rates with modest cost to standard task performance (2505.13988).
- Task Misalignment: Overly generic or misaligned reward functions can yield suboptimal or unexpected behaviors, especially in domains where process rewards are difficult to specify.
7. Broader Impact and Future Directions
RFT is being generalized across modalities (language, vision, audio, video, embodied agents) and tasks (reasoning, search, action planning, red teaming). Frameworks such as Trinity-RFT provide modular support for on-policy/off-policy, synchronous/asynchronous, and online/offline training workflows (2505.17826).
Active research directions identified in recent literature include:
- Systematic combination of outcome and process reward paradigms (2505.18536).
- Adaptive curriculum and data-centric augmentation to improve learning speed and robustness (2504.05520, 2505.18917).
- Further advances in prompt engineering for behavior shaping (prior prompt engineering, pPE) and constructive multi-behavior fine-tuning (2505.14157).
- Deepening understanding of RFT's implicit regularization and its role in continual post-training (2507.05386).
Ongoing challenges include designing reward models that generalize to novel domains, integrating richer sensor modalities for embodied agents, improving computational efficiency at scale, and further mitigating negative side effects such as hallucinations or forgetting.
RFT has thus emerged as a theoretically grounded and empirically validated paradigm for aligning and generalizing large models, with continued methodological innovation and cross-domain expansion expected.