Reinforcement Fine-Tuning (RFT)
- Reinforcement Fine-Tuning (RFT) is a method that adapts pretrained models using reinforcement learning objectives and policy gradients to maximize expected rewards.
- RFT leverages techniques such as a supervised fine-tuning initialization, KL regularization, and policy gradient methods like PPO to ensure training stability and mitigate vanishing gradients.
- Empirical studies demonstrate that RFT improves performance in reasoning, visual, and embodied tasks by effectively aligning model behaviors with human preferences and structured task requirements.
Reinforcement Fine-Tuning (RFT) is a post-pretraining adaptation methodology in which pretrained models, typically LLMs or multimodal LLMs (MLLMs), are further optimized by maximizing expected rewards using policy gradient algorithms. Unlike supervised fine-tuning (SFT), which fits the model to labeled demonstration data, RFT employs reinforcement learning (RL) objectives—often leveraging rule-based or learned reward functions—to instill desired behaviors aligned with human preferences, downstream task requirements, or performance on structured reasoning problems.
1. Fundamental Principles and Theoretical Foundations
RFT is grounded in standard RL formalism. The model operates as a parameterized policy $\pi_\theta(y \mid x)$, generating an output $y$ conditioned on an input $x$. The central training objective is to maximize the expected reward

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big],$$

where $r(x, y)$ is the reward function, which may be a learned preference model, a verifiable rule-based metric, or a hybrid thereof.
Optimization is typically performed via policy gradient methods. The most prominent algorithm in recent practice is Proximal Policy Optimization (PPO), with variants such as Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) also widely adopted. The policy gradient is estimated as

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, \nabla_\theta \log \pi_\theta(y \mid x)\, A(x, y) \,\big],$$

where $A(x, y)$ is an advantage function, often involving a baseline (such as a value function estimate) to reduce variance.
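As a concrete illustration of the PPO-style update, the following is a minimal sketch of the clipped surrogate loss over a batch of sampled outputs; the function name, tensor shapes, and default clipping value are illustrative assumptions rather than a reproduction of any cited implementation.

```python
import torch

def ppo_clipped_loss(logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate for a batch of sampled outputs.

    logprobs:     log pi_theta(y | x) under the current policy (requires grad).
    old_logprobs: log pi_theta_old(y | x) from the sampling policy (no grad).
    advantages:   A(x, y), e.g. reward minus a value-function baseline.
    """
    ratio = torch.exp(logprobs - old_logprobs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Minimizing the negative of the clipped minimum ascends the surrogate.
    return -torch.min(unclipped, clipped).mean()
```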
For pipeline stability and behavioral alignment, a KL-divergence regularization term is often added:

$$J_{\mathrm{KL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(x, y)] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big) \,\Big].$$

This regularizer discourages the updated policy from diverging too far from the pretrained (reference) model.
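A minimal sketch of the regularized objective, assuming summed sequence log-probabilities from the current policy and a frozen reference policy, and using a simple sample-based KL estimate (all names and the default coefficient are illustrative):

```python
import torch

def kl_regularized_pg_loss(logprobs: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           rewards: torch.Tensor,
                           baseline: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    """Policy-gradient loss with a KL penalty toward a frozen reference policy.

    logprobs:     log pi_theta(y | x) for sampled outputs (requires grad).
    ref_logprobs: log pi_ref(y | x) for the same outputs (no grad).
    rewards, baseline: scalar reward and baseline per sample.
    beta: KL coefficient; in practice it is often decayed over training.
    The KL term uses the sample-based estimate E[log pi_theta - log pi_ref].
    """
    advantages = (rewards - baseline).detach()   # no gradient through A
    pg_loss = -(advantages * logprobs).mean()
    kl_estimate = (logprobs - ref_logprobs.detach()).mean()
    return pg_loss + beta * kl_estimate
```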
2. Optimization Challenges: Vanishing Gradients
A central theoretical finding is the "vanishing gradients" phenomenon, established in "Vanishing Gradients in Reinforcement Finetuning of LLMs" (2310.20703). When the model's output distribution yields low reward variance across possible outputs, i.e., when the standard deviation of $r(x, y)$ under $y \sim \pi_\theta(\cdot \mid x)$ is small, the expected policy gradient diminishes, potentially stalling learning: the norm of the expected gradient at an input $x$ is upper bounded by a quantity that scales with the output length, a bound on the logit Jacobian norm, and the reward standard deviation $\mathrm{Std}_{y \sim \pi_\theta(\cdot \mid x)}[r(x, y)]$, so it vanishes as that standard deviation approaches zero. Importantly, even when the model's expected reward is suboptimal, low reward variability stalls optimization because the gradient signal nearly vanishes. This effect also applies to PPO-based learning, with a bound on the difference between the surrogate and true gradients proportional to the total variation distance between the updated and reference policies.
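In practice, the per-input reward standard deviation is therefore a cheap diagnostic for flat-gradient regions. A minimal sketch, assuming placeholder sampling and reward functions:

```python
import statistics

def reward_std_per_input(prompt, sample_fn, reward_fn, num_samples: int = 16) -> float:
    """Estimate Std_{y ~ pi_theta(.|x)}[r(x, y)] for one input by sampling.

    sample_fn(prompt) -> one output sampled from the current policy (placeholder).
    reward_fn(prompt, output) -> scalar reward (placeholder).
    Inputs with near-zero estimated standard deviation are candidates for the
    flat-gradient regime described above (e.g., route them through SFT first).
    """
    rewards = [reward_fn(prompt, sample_fn(prompt)) for _ in range(num_samples)]
    return statistics.pstdev(rewards)
```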
Empirical evidence from GRUE and controlled synthetic benchmarks confirms that this phenomenon occurs frequently, especially for tasks where the pretrained model exhibits low diversity in its predictions, and is independent of algorithms or optimizer noise.
3. RFT Methodologies and Design Patterns
Two-Stage Fine-Tuning: SFT + RFT
The prevailing RFT recipe is a two-stage pipeline:
- Supervised Fine-Tuning (SFT): The model is "warmed up" on demonstration data, typically using a cross-entropy loss on target outputs or chain-of-thought (CoT) annotations. SFT aligns the model to desired behaviors and increases reward variance for challenging inputs.
- Reinforcement Fine-Tuning (RFT): The model is then further optimized with a policy gradient objective (e.g., PPO or GRPO), sampling multiple output trajectories and updating parameters based on reward signals.
In "Vanishing Gradients in Reinforcement Finetuning of LLMs" (2310.20703), an initial SFT step—on as little as 1% of the data and with few optimization steps—was shown to substantially increase reward gains by lifting the model out of flat-gradient regions, thus mitigating vanishing gradients.
Sampling and Policy Update Variants
- Group Relative Policy Optimization (GRPO): For tasks like reasoning and visual classification, GRPO avoids a value network by comparing groups of outputs; for a group of $G$ outputs sampled for the same input, with rewards $r_1, \ldots, r_G$, normalized advantages are computed as $\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$ (see the sketch after this list).
- Masked and Fine-Grained RL Approaches: In mesh generation, Masked-DPO applies spatial masking to only update segments of the output sequence corresponding to low-quality mesh regions (2505.16761).
- Adaptive Curriculum RFT: AdaRFT dynamically tunes the difficulty of training examples by maintaining a target difficulty that is adjusted after each training step based on the model's recent rewards, focusing compute on problems best matched to the model's evolving capabilities (2504.05520); a hypothetical update of this form appears in the sketch below.
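The sketch below illustrates the group-relative advantage computation, together with a hypothetical curriculum-style difficulty update in the spirit of AdaRFT; the exact update rule, function names, and default constants are assumptions rather than reproductions of the cited papers.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each output's reward by the mean and
    standard deviation of its sampling group (no learned value network)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def update_target_difficulty(target: float, mean_reward: float,
                             goal_reward: float = 0.5, step_size: float = 0.1) -> float:
    """Hypothetical AdaRFT-style curriculum update: raise the target difficulty
    when the mean reward exceeds the goal, lower it otherwise (form assumed)."""
    return target + step_size * (mean_reward - goal_reward)
```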
4. Empirical Results and Benchmarks
RFT has been empirically validated as an effective strategy for reasoning, generalization, and domain transfer:
- Language and Reasoning: RFT (e.g., ReFT, Prefix-RFT) significantly outperforms SFT in mathematical reasoning (GSM8K, MathQA, AIME, etc.), enabling learning from multiple reasoning paths and exploration of solution space (2401.08967, 2507.01679).
- Visual and Multimodal Tasks: Visual-RFT and Reason-RFT demonstrate large gains on visual classification, object detection, and visual reasoning benchmarks, with enhancements as high as +24.3% in one-shot fine-grained classification and strong generalization under few-shot or out-of-domain settings (2503.01785, 2503.20752).
- Embodied Agents: RFTF and SEEA-R1 frameworks introduce dense temporal rewards and learned multimodal reward models, achieving state-of-the-art results on embodied manipulation and navigation tasks (e.g., CALVIN ABC-D, ALFWorld) (2505.19767, 2506.21669).
- Continual Learning: RFT inherently mitigates catastrophic forgetting when adapting to novel tasks, outperforming SFT in knowledge retention and even enhancing general reasoning benchmarks (MMMU, MMLU-Pro) (2507.05386, 2506.23508).
A summary table of representative benchmarks and RFT methods:
| Domain | RFT Method | Key Metric(s) | Gain over SFT |
|---|---|---|---|
| Math Reasoning | ReFT, Prefix-RFT | Accuracy (%), Pass@1 | +8–9 points |
| Visual (VQA, CLS, DET) | Visual-RFT, Reason-RFT | mAP, accuracy, IoU | +15–24 points |
| Embodied Agents | RFTF, SEEA-R1 | Success Length, Success Rate (%) | SOTA |
| Continual Learning | GRPO, RIF-RFT | Retention, Generalization | Strongly mitigated forgetting |
5. Task-Specific Reward Design and Regularization
Custom reward design is central to RFT:
- Rule-Based Verifiable Rewards: In Visual-RFT, rewards depend on task-specific criteria such as IoU for object detection or class label accuracy for classification (a minimal IoU-based reward sketch follows this list). In video reasoning, semantic consistency between generated reasoning and visual evidence is enforced by computing the similarity between text representations and corresponding video frames (2503.01785, 2505.12434).
- Format and Structure Compliance: For multi-step search, composite rewards target answer correctness, DAG validity, and strict output formatting, ensuring outputs are both factually correct and structurally executable (2506.08352).
- KL Regularization: Most successful RFT pipelines include a KL term with a decaying weight, acting as a regularizer to limit divergence from the reference (pretrained) policy.
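As an illustration of a rule-based verifiable reward, the sketch below scores a predicted bounding box by its IoU with the ground truth and adds a small bonus for well-formatted output; the box convention (x1, y1, x2, y2), the bonus weight, and the function names are illustrative assumptions.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, gt_box, output_is_well_formatted: bool) -> float:
    """Verifiable reward for a single detection: IoU with the ground-truth box
    plus a small format-compliance bonus (weight is illustrative)."""
    return iou(pred_box, gt_box) + (0.1 if output_is_well_formatted else 0.0)
```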
Notably, conventional heuristics such as increasing learning rates or temperature scaling do not mitigate the vanishing gradient issue; instead, techniques such as initial SFT or reward shaping are essential (2310.20703).
6. Challenges, Limitations, and Open Problems
Despite its empirical and theoretical strengths, RFT faces challenges:
- Vanishing Gradients: As established in (2310.20703), small reward variance can cause optimization to stall completely in flat-reward regions, necessitating SFT-based initialization or new algorithmic solutions with guaranteed nonzero reward variance.
- Sample and Compute Efficiency: RFT (particularly in complex domains) can be compute- and sample-intensive. Adaptive curriculum methods (AdaRFT) and instance filtering (RIF-RFT) have been proposed to mitigate inefficiencies (2504.05520, 2507.05386).
- Hallucination and Trustworthiness: RFT can degrade refusal behavior, leading models to hallucinate answers when confronted with unanswerable questions. Mixing in counterexamples (e.g., SUM data) restores proper refusal rates with modest cost to standard task performance (2505.13988).
- Task Misalignment: Overly generic or misaligned reward functions can yield suboptimal or unexpected behaviors, especially in domains where process rewards are difficult to specify.
7. Broader Impact and Future Directions
RFT is being generalized across modalities (language, vision, audio, video, embodied agents) and tasks (reasoning, search, action planning, red teaming). Frameworks such as Trinity-RFT provide modular support for on-policy/off-policy, synchronous/asynchronous, and online/offline training workflows (2505.17826).
Active research directions identified in recent literature include:
- Systematic combination of outcome and process reward paradigms (2505.18536).
- Adaptive curriculum and data-centric augmentation to improve learning speed and robustness (2504.05520, 2505.18917).
- Further advances in prompt engineering for behavior shaping (prior prompt engineering, pPE) and constructive multi-behavior fine-tuning (2505.14157).
- Deepening understanding of RFT's implicit regularization and its role in continual post-training (2507.05386).
Ongoing challenges include designing reward models that generalize to novel domains, integrating richer sensor modalities for embodied agents, improving computational efficiency at scale, and further mitigating negative side effects such as hallucinations or forgetting.
RFT has thus emerged as a theoretically grounded and empirically validated paradigm for aligning and generalizing large models, with continued methodological innovation and cross-domain expansion expected.