Reward-Ranked Fine-Tuning (RAFT)
- RAFT is a family of fine-tuning techniques that rank candidate outputs with a reward model in order to align generative models with externally specified preferences.
- It decouples sample generation from model updates by selecting only high-reward samples for supervised fine-tuning, offering a simpler alternative to traditional RLHF.
- Empirical results demonstrate that RAFT improves sample efficiency, stability, and overall performance across tasks like LLM alignment and diffusion model generation.
Reward-Ranked Fine-Tuning (RAFT) is a family of methods for aligning generative models—particularly LLMs and diffusion models—with externally specified preferences or reward functions. RAFT algorithms select and rank candidate outputs according to a reward model (often trained from human feedback or preference data) and subsequently fine-tune the model on the most highly rewarded outputs. This approach aims to achieve high alignment and sample efficiency while remaining computationally simpler and more stable than traditional on-policy reinforcement learning from human feedback (RLHF) algorithms such as PPO.
1. Fundamental Principles of RAFT
RAFT’s central principle is to decouple sample generation from policy (model) update, in contrast to standard RLHF methods which rely on on-policy gradients and tightly coupled actor–critic training. The canonical RAFT workflow involves, for each input prompt:
- Generating candidate model outputs,
- Ranking or scoring them with a reward model,
- Selecting the highest-reward samples,
- Fine-tuning the model with supervised objectives on the selected subset.
This “best-of-K” or rejection sampling strategy forms the core of RAFT (Dong et al., 2023, Xiong et al., 15 Apr 2025). Rather than updating on both positive and negative samples (or on reward differences as in PPO), RAFT fine-tunes the model primarily on positively rewarded outcomes. The reward model itself is typically trained via pairwise preference comparisons, e.g., with a Bradley–Terry objective,
$$\mathcal{L}_{\mathrm{RM}}(\phi) = -\mathbb{E}_{(x,\,y^{+},\,y^{-})}\left[\log \sigma\big(r_\phi(x, y^{+}) - r_\phi(x, y^{-})\big)\right],$$
where $r_\phi(x, y)$ is the learned reward for response $y$ to prompt $x$.
2. RAFT Algorithmic Frameworks and Variants
2.1 Canonical Algorithm (LLMs)
The standard RAFT procedure for LLMs is:
- Sample Generation: For each prompt $x$, sample $K$ candidate completions $y_1, \dots, y_K$ from the current model $\pi_\theta$.
- Scoring and Filtering: Evaluate each $y_i$ with a reward model $r_\phi(x, y_i)$. Keep only those with the highest rewards (e.g., the top-ranked completion per prompt).
- Supervised Update: Fine-tune the model with maximum likelihood or log-likelihood loss on the filtered (positive) examples:
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{(x,\,y) \in \mathcal{D}^{+}} \log \pi_\theta(y \mid x),$$
where $\mathcal{D}^{+}$ is the set of accepted $(x, y)$ pairs.
- Iteration: Repeat the above with the updated model.
This approach discards negatively rewarded samples during optimization, focusing purely on high-quality responses (Xiong et al., 15 Apr 2025). Empirically, it achieves performance comparable to more complex RL approaches such as GRPO and PPO in early training, though policy entropy may decrease over longer runs due to the lack of negative samples.
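A minimal sketch of one such iteration is shown below. The helper callables `generate`, `reward`, and `sft_update`, as well as the values of `K` and `keep_top`, are hypothetical placeholders for a sampling routine, a trained reward model, and a supervised fine-tuning step; this is an illustration of the best-of-K loop, not a reference implementation.

```python
from typing import Callable, List, Tuple

def raft_iteration(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],            # samples K completions for a prompt
    reward: Callable[[str, str], float],                   # scores a (prompt, completion) pair
    sft_update: Callable[[List[Tuple[str, str]]], None],   # supervised fine-tuning step
    K: int = 8,
    keep_top: int = 1,
) -> None:
    """One RAFT iteration: generate, rank by reward, filter, fine-tune on the kept pairs."""
    accepted: List[Tuple[str, str]] = []
    for x in prompts:
        candidates = generate(x, K)                                        # 1. sample K candidates
        ranked = sorted(candidates, key=lambda y: reward(x, y), reverse=True)
        accepted += [(x, y) for y in ranked[:keep_top]]                    # 2-3. keep highest-reward samples
    sft_update(accepted)                                                   # 4. supervised update on positives only
```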
2.2 RAFT for Diffusion Models
For generative diffusion models, RAFT-inspired fine-tuning can be adapted using different mechanisms depending on reward differentiability:
- Ranking Loss or Reward Filtering: For non-differentiable rewards, generated samples are filtered and the model is fine-tuned on those with the highest reward (analogous to rejection sampling).
- DRaFT Integration: For differentiable rewards, one can directly backpropagate the reward gradient through the (fully or partially unrolled) sampling chain (cf. equations (1), (2), and variants DRaFT-K and DRaFT-LV) (Clark et al., 2023). Hybrid losses combining direct reward optimization and ranking loss can be used:
$$\mathcal{L} = \mathcal{L}_{\text{reward}} + \lambda\,\mathcal{L}_{\text{rank}},$$
where $\mathcal{L}_{\text{rank}}$ is a ranking loss over generated samples and $\lambda$ is a weighting coefficient.
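As a schematic illustration of such a hybrid objective (not the DRaFT implementation), the sketch below combines a direct reward term with a pairwise ranking loss that pushes the model to assign higher likelihood to higher-reward samples; the function name, the logistic pairwise form, and the weighting `lam` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logp: torch.Tensor,
                rewards: torch.Tensor,
                diff_rewards: torch.Tensor = None,
                lam: float = 0.1) -> torch.Tensor:
    """
    Schematic hybrid objective L = L_reward + lam * L_rank (illustrative sketch).
    logp:         (B,) model log-likelihoods (or surrogate ELBOs) of generated samples.
    rewards:      (B,) reward scores used only for ranking (may be non-differentiable).
    diff_rewards: optional (B,) rewards through which gradients can flow (DRaFT-style term).
    """
    # Pairwise ranking term: the model should assign higher likelihood
    # to samples that received higher reward.
    better = rewards.unsqueeze(1) > rewards.unsqueeze(0)   # (B, B): sample i beats sample j
    margin = logp.unsqueeze(1) - logp.unsqueeze(0)         # logp_i - logp_j
    rank_loss = (F.softplus(-margin)[better].mean()
                 if better.any() else logp.sum() * 0.0)    # logistic pairwise loss

    # Direct reward-maximization term (only when the reward is differentiable).
    reward_loss = -diff_rewards.mean() if diff_rewards is not None else 0.0

    return reward_loss + lam * rank_loss
```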
2.3 Curriculum and Sample Efficiency Extensions
Recent work augments RAFT with prompt-specific inference budgeting to accelerate convergence (GVM-RAFT) (Yao et al., 5 May 2025). Sample budgets per prompt are dynamically allocated according to both the acceptance rate $p_i$ and the gradient norm $G_i$, subject to a total sample budget $N$, so as to minimize the variance of the stochastic gradient estimate.
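The sketch below illustrates the general idea under an assumed proportionality rule (more samples to prompts with large gradient norms and low acceptance rates); it is not the exact variance-minimizing allocation derived in GVM-RAFT.

```python
import numpy as np

def allocate_budget(grad_norms: np.ndarray, accept_rates: np.ndarray,
                    total_budget: int, min_samples: int = 1) -> np.ndarray:
    """
    Illustrative prompt-level budget allocation in the spirit of GVM-RAFT.
    Prompts with larger gradient norms and lower acceptance rates receive more samples;
    the G_i / sqrt(p_i) weighting is an assumed heuristic, not the paper's derived rule.
    """
    scores = grad_norms / np.sqrt(np.clip(accept_rates, 1e-6, None))
    weights = scores / scores.sum()
    # Rounding means the allocations may not sum exactly to total_budget; fine for a sketch.
    budget = np.maximum(min_samples, np.floor(weights * total_budget)).astype(int)
    return budget
```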
3. Reward Modeling and Ranking Approaches
RAFT’s efficacy hinges on the quality of its reward model, which supplies the ranking or scoring signals. There are several technical variants:
- Pairwise Preference Models: Trained on human-labeled (chosen, rejected) response pairs, using Bradley–Terry or similar objectives.
- Proto-RM (Prototypical Reward Models): Utilize prototypical networks to enable robust reward estimation from small data, grouping examples by prototype vectors and leveraging weighted prototype averages for reward estimation (Zhang et al., 6 Jun 2024).
- TinyRM: Employs small, encoder-only masked LLMs (MLMs) as lightweight reward models with FLAN-style prompting and Directional LoRA adaptation. These models demonstrate strong performance in reasoning and safety, providing efficient and scalable ranking signals for RAFT with substantial resource reduction (Pan, 14 Jul 2025).
The practical reward loss typically takes the form
$$\mathcal{L}_{\mathrm{RM}}(\phi) = -\mathbb{E}_{(x,\,y^{+},\,y^{-})}\left[\log \sigma\big(r_\phi(x, y^{+}) - r_\phi(x, y^{-})\big)\right],$$
where $r_\phi$ is the reward model and $\sigma$ the sigmoid function.
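A minimal PyTorch sketch of this pairwise loss is given below; the scalar rewards in the usage example are placeholders that would in practice come from the reward model $r_\phi$.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigmoid(r(x, y+) - r(x, y-)), averaged over pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage sketch with placeholder rewards for three preference pairs.
r_pos = torch.tensor([1.2, 0.7, 0.3])
r_neg = torch.tensor([0.4, 0.9, -0.1])
loss = bradley_terry_loss(r_pos, r_neg)
```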
4. Applications and Implications
RAFT-type fine-tuning is applied across several domains:
- LLM Alignment: RAFT aligns LLM outputs with human preferences, practical ethics, or task-specific objectives, while maintaining sample efficiency and simplifying implementation relative to RLHF (Dong et al., 2023, Xiong et al., 15 Apr 2025).
- Diffusion Models (Image and Biomolecular Generation): RAFT and its extensions (e.g., iterative distillation from soft-optimal policies (Su et al., 1 Jul 2025)) enable diffusion models to optimize outputs according to reward functions, even those that are non-differentiable, with improved stability and sample efficiency relative to on-policy RL.
- Retrieval-Augmented Question Answering: Retrieval-Augmented Fine-Tuning (a distinct method also abbreviated RAFT in separate works (Zhang et al., 15 Mar 2024, Chung et al., 26 Sep 2024, Shi et al., 6 Jun 2025)) improves in-domain QA performance in resource-constrained environments and domains with scarce labeled data. It trains LLMs and small models to internalize retrieval, improving factual accuracy and robustness to noisy or distractor documents (a schematic data-construction example follows this list).
- Reasoning and Math Problem Solving: Methods such as ReFT extend RAFT with RL-based optimization of chain-of-thought outputs, notably improving generalization in math tasks by enabling models to learn from multiple reasoning paths (Luong et al., 17 Jan 2024, Yao et al., 5 May 2025).
- Electronic Design Automation (EDA) and Secure Access: RAFT can be combined with synthetic Q/A generation, secure context filtering, and retrieval fusion to produce domain-precise and privacy-aware LLM assistants (Shi et al., 6 Jun 2025).
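For the retrieval-augmented variant above, the following sketch shows one plausible way to assemble a training example that mixes an oracle document with distractors, so the model learns to ground answers in relevant context and ignore noise; the field names, prompt template, and distractor count are illustrative assumptions rather than the format used in the cited works.

```python
import random
from typing import Dict, List

def build_raft_qa_example(question: str, answer: str,
                          oracle_doc: str, distractor_docs: List[str],
                          num_distractors: int = 3) -> Dict[str, str]:
    """
    Illustrative construction of a retrieval-augmented fine-tuning example:
    the oracle document is shuffled in among distractors so the model must
    learn to answer from the relevant context while ignoring noise.
    """
    context = random.sample(distractor_docs, k=num_distractors) + [oracle_doc]
    random.shuffle(context)
    prompt = "\n\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": " " + answer}
```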
5. Comparative Analysis and Empirical Results
RAFT is systematically compared with PPO, GRPO, DPO, and other RL-based fine-tuning methods across LLM and diffusion domains:
- Performance: RAFT matches or exceeds PPO and GRPO in reward alignment and output fluency on language and vision tasks when evaluated via both automated and human metrics (Dong et al., 2023, Xiong et al., 15 Apr 2025).
- Efficiency: RAFT’s supervised updates and decoupled generation reduce memory and computation, since only a single model needs to be loaded at a time rather than separate actor, critic, reward, and reference models.
- Stability: RAFT avoids RL-specific instability due to policy divergence, especially when the supervised objective is regularized with a KL penalty term $\beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)$ against a frozen reference policy (see the sketch after this list).
- Sample Efficiency: GVM-RAFT and Proto-RM provide further benefits, requiring fewer training samples and yielding faster convergence (Zhang et al., 6 Jun 2024, Yao et al., 5 May 2025).
- Limitations: In the absence of negative sample training, RAFT’s entropy can collapse, reducing exploration and robustness over prolonged training, a factor partially mitigated in GRPO/Reinforce-Rej variants (Xiong et al., 15 Apr 2025).
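A per-token sketch of such a KL-regularized supervised objective is shown below, using log-probabilities from the current policy and a frozen reference policy; the coefficient `beta` and the simple log-ratio KL estimator are illustrative choices, not a prescribed formulation.

```python
import torch

def kl_regularized_nll(logp_model: torch.Tensor,
                       logp_ref: torch.Tensor,
                       beta: float = 0.05) -> torch.Tensor:
    """
    Illustrative KL-regularized objective on accepted responses:
    negative log-likelihood plus beta * KL(pi_theta || pi_ref),
    with the KL estimated per token as the log-probability ratio.
    logp_model / logp_ref: (B, T) log-probs of the response tokens under
    the current policy and a frozen reference policy, respectively.
    """
    nll = -logp_model.mean()
    kl_estimate = (logp_model - logp_ref).mean()   # simple per-token KL estimate
    return nll + beta * kl_estimate
```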
Empirically, RAFT-based systems achieve notable gains, such as a >30% accuracy increase on HotpotQA in multi-hop QA (Zhang et al., 15 Mar 2024), 2–4× speedups in chain-of-thought optimization (Yao et al., 5 May 2025), and superior data efficiency in human feedback reward modeling (Proto-RM, TinyRM).
6. Extensions and Future Directions
- Hybrid Losses and Differentiable Ranking: Combining analytic gradients when rewards are differentiable (DRaFT) with ranking-based losses can provide a flexible architecture interpolating between supervised reward maximization and hard best-of-K filtering (Clark et al., 2023).
- Dynamic Curriculum and Task Difficulty Scheduling: Adaptive curriculum methods such as AdaRFT can be layered onto RAFT for further improvements in reasoning performance and convergence (Shi et al., 7 Apr 2025).
- Secure and Domain-Specific Fine-Tuning: RAFT can incorporate secure context filtering, domain-adaptive curricula, or synthetic Q/A generation to address domains with constrained data or privacy requirements (Shi et al., 6 Jun 2025).
- Efficient Reward Modeling: Deployment of bidirectional MLM-based reward models (TinyRM) and prototypical networks (Proto-RM) facilitates efficient preference modeling under strict compute and data constraints (Zhang et al., 6 Jun 2024, Pan, 14 Jul 2025).
7. Summary Table: RAFT Core Variants and Characteristics
| RAFT Variant | Model Domain | Selection Mechanism | Sample Update Strategy | Key Advantages |
|---|---|---|---|---|
| Canonical RAFT | LLM, diffusion | Best-of-K ranking | Supervised (MLE on positives only) | Simplicity, stability, efficiency |
| DRaFT / DRaFT-K / DRaFT-LV | Diffusion | Differentiable reward | Backprop through sampling chain | Sample efficiency, flexible gradients |
| GVM-RAFT | LLM (CoT) | Dynamic prompt-specific budgeting | Variance-minimizing update | Faster convergence, optimized allocation |
| Proto-RM / TinyRM | Reward model | Prototypical / MLM scoring | Efficient reward estimation | Data-efficient, scalable |
| Retrieval RAFT (CRAFT) | LLM / RAG | Fine-tune on retrieved QA | Adapters (LoRA/DoRA) | Resource efficiency, QA fidelity |
| ReFT | LLM (math/reasoning) | PPO (RL) on CoT | On-policy RL on rewards | Generalization, multi-path learning |
RAFT thus embodies a class of techniques that streamline reward-based alignment for generative models, leveraging sample ranking and preference modeling for scalable, interpretable, and stable fine-tuning across diverse application domains.