Reward rAnked FineTuning (RAFT)
- RAFT is a model alignment paradigm that iteratively generates, ranks, and fine-tunes outputs using reward scores to enhance stability and efficiency.
- It employs a decoupled workflow by generating multiple responses per prompt and training only on top-ranked outputs, ensuring robust convergence across different domains.
- GVM-RAFT extends the method by dynamically allocating samples to minimize gradient variance, achieving 2–4× faster convergence and improved accuracy in reasoning tasks.
Reward rAnked FineTuning (RAFT) is a model alignment paradigm that optimizes generative models—especially large language and diffusion models—using reward-driven sample selection and supervised fine-tuning. RAFT departs from canonical reinforcement learning from human feedback (RLHF) by iteratively filtering high-reward outputs with a reward model and updating the model in a supervised manner, resulting in empirically superior stability, simplicity, and efficiency. The RAFT framework extends from conventional best-of-K data selection to advanced stochastic gradient optimization for chain-of-thought (CoT) reasoning, including gradient-variance–minimizing methods such as GVM-RAFT.
1. Conceptual Foundations and Motivations
RAFT addresses the fundamental challenges of aligning generative models—LLMs and diffusion models—with human preferences or task-specific objectives. Originating as an alternative to RLHF, which relies on on-policy rollouts and policy gradient methods (such as PPO), RAFT introduces an iterative, reward-centric data selection procedure. For each sampled prompt, RAFT generates multiple candidate outputs, ranks them using a learned reward model (pretrained, e.g., via Bradley–Terry pairwise preference modeling), and then fine-tunes the model using only top-ranked responses. Unlike RLHF, RAFT decouples data generation from model updates (off-policy), enabling larger batches and more memory-efficient training (Dong et al., 2023).
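To make the reward-model component concrete, here is a minimal sketch of a Bradley–Terry pairwise loss in PyTorch; the `reward_model` call signature and batch layout are illustrative assumptions, not the implementation from Dong et al. (2023).

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen, rejected):
    """Pairwise Bradley-Terry loss: push r(x, y_chosen) above r(x, y_rejected).

    `reward_model` is assumed to map a batch of tokenized (prompt, response)
    pairs to one scalar reward per example, shape [batch].
    """
    r_chosen = reward_model(**chosen)      # [batch]
    r_rejected = reward_model(**rejected)  # [batch]
    # Negative log-likelihood of the preference under the Bradley-Terry model:
    # P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```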
In chain-of-thought and mathematical reasoning tasks, RAFT admits a latent-variable view with an EM-style structure, treating intermediate rationales (y) as latent variables and final answers (z) as observations, which grounds the approach in principled negative log-likelihood minimization (Yao et al., 5 May 2025).
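Spelled out in this notation (a paraphrase of the standard EM bound rather than the paper's exact derivation), the objective is the marginal negative log-likelihood

$-\log p_w(z \mid x) = -\log \sum_{y} p_w(y \mid x)\, p_w(z \mid x, y) \;\le\; -\,\mathbb{E}_{y \sim q(\cdot \mid x, z)}\big[\log p_w(y, z \mid x)\big] + \mathrm{const},$

where $q$ is a (pseudo-)posterior over rationales (approximated in RAFT by rejection sampling of reward-accepted completions) and the constant collects the entropy of $q$, which does not depend on $w$.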
2. RAFT Algorithmic Structure
RAFT’s core algorithm alternates between data generation, ranking, filtering, and supervised fine-tuning. The typical workflow, sketched in code after the list below, involves:
- Prompt Sampling: Draw a batch of prompts from the data distribution.
- Candidate Generation: For each prompt, generate K candidate outputs using the current model and sampling temperature λ, encouraging output diversity.
- Reward Ranking: Score each candidate using the reward model; select the top candidate for each prompt.
- Fine-Tuning: Update model parameters using standard cross-entropy loss on the high-reward filtered set; optionally regularize with a KL penalty to limit policy drift.
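A minimal sketch of one such stage, assuming placeholder interfaces (`model.generate`, `reward_model.score`, `model.finetune`) that do not correspond to any specific codebase:

```python
def raft_stage(model, reward_model, prompts, K=8, temperature=0.7):
    """One RAFT stage: generate K candidates per prompt, keep the best-scoring
    one, then fine-tune on the filtered set with standard cross-entropy."""
    filtered = []
    for x in prompts:
        # Candidate generation with the current model (off-policy data collection)
        candidates = [model.generate(x, temperature=temperature) for _ in range(K)]
        # Reward ranking: keep only the highest-reward response for this prompt
        best = max(candidates, key=lambda y: reward_model.score(x, y))
        filtered.append((x, best))
    # Supervised fine-tuning on the high-reward subset
    model.finetune(filtered)
    return model
```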
The process can be formalized as follows:
- Given a reward model $r(x, y)$ trained with pairwise (or finer-grained) preference data, define per-sample selection as $y^{*} = \arg\max_{k \in \{1,\dots,K\}} r(x, y_k)$, i.e., keep the highest-reward candidate among the $K$ generations for each prompt $x$.
- Filtered batches $B_t$ are constructed for each RAFT stage $t$, and the fine-tuning objective is:
$L_{\rm FT}(w) = -\,\mathbb{E}_{(x,y)\sim B_t}\, \log p_w(y \mid x)$
- KL regularization, if used, penalizes drift from the initial model (a loss sketch follows this list).
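A minimal PyTorch sketch of this objective, assuming Hugging Face-style model outputs and a token-level KL approximation (both are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def raft_ft_loss(model, ref_model, input_ids, labels, kl_coef=0.0):
    """Cross-entropy on filtered (prompt, response) pairs, plus an optional
    KL penalty that discourages drift from the initial (reference) model."""
    logits = model(input_ids).logits  # [batch, seq, vocab]
    # Next-token cross-entropy; prompt tokens are masked with label -100
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    if kl_coef > 0.0:
        with torch.no_grad():
            ref_logits = ref_model(input_ids).logits
        # Token-level KL(current || reference), averaged over the batch
        kl = F.kl_div(
            F.log_softmax(ref_logits, dim=-1),  # input: log q (reference)
            F.log_softmax(logits, dim=-1),      # target: log p (current)
            log_target=True,
            reduction="batchmean",
        )
        return ce + kl_coef * kl
    return ce
```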
Variants include: top-p global filtering, alternative candidate generation strategies (e.g., beam search, nucleus sampling), and added regularizers for diversity or fluency (Dong et al., 2023).
3. Gradient-Variance–Minimizing RAFT (GVM-RAFT)
In CoT reasoning, RAFT can be cast as an EM procedure on a latent-variable model with pseudo-posterior inference via rejection sampling. RAFT’s Monte Carlo gradient estimator exhibits high variance, especially when prompt acceptance rates vary. The inefficiency stems from uniform allocation of samples per prompt, which wastes compute on easy prompts (high acceptance rate) and undersamples hard prompts (low acceptance rate), inflating the stochastic gradient variance and slowing convergence.
GVM-RAFT introduces a dynamic allocation strategy to minimize gradient variance under a fixed sample budget:
- For prompts $x_1, \dots, x_m$ with prompt-specific acceptance rates $p_i$ and gradient-norm estimates $g_i$, solve: $\min_{n_1,\dots,n_m} \sum_{i=1}^{m} \frac{g_i^2}{n_i\, p_i}$ subject to $\sum_{i=1}^{m} n_i = N$.
- The closed-form (Lagrangian) solution is $n_i \propto g_i / \sqrt{p_i}$. With a regularizer $c > 0$ to handle extremely low $p_i$, the allocation formula becomes $n_i = N \cdot \frac{g_i / \sqrt{p_i + c}}{\sum_{j} g_j / \sqrt{p_j + c}}$ (a code sketch of the allocation follows this list).
- GVM-RAFT uses a practical two-stage algorithm: first estimate prompt statistics with a pilot round, then perform rejection sampling and gradient updates according to optimal allocation.
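A minimal sketch of that allocation rule, following the reconstruction above (the variance model $g_i^2 / (n_i p_i)$ and the pilot-round estimates are assumptions for illustration, not the authors' exact implementation):

```python
import numpy as np

def allocate_samples(grad_norms, accept_rates, total_budget, reg=1e-2, min_per_prompt=1):
    """Split a fixed rejection-sampling budget across prompts so as to
    (approximately) minimize the stochastic gradient variance.

    Uses the Neyman-style rule n_i proportional to g_i / sqrt(p_i + reg),
    where g_i and p_i are per-prompt gradient-norm and acceptance-rate
    estimates from a pilot round.
    """
    g = np.asarray(grad_norms, dtype=float)
    p = np.asarray(accept_rates, dtype=float)
    weights = g / np.sqrt(p + reg)        # `reg` guards against p_i close to 0
    weights /= weights.sum()
    n = np.maximum(np.floor(weights * total_budget), min_per_prompt).astype(int)
    return n

# Example: a hard prompt (low acceptance, large gradient) gets most of the budget.
print(allocate_samples(grad_norms=[1.0, 0.2, 0.8],
                       accept_rates=[0.05, 0.9, 0.3],
                       total_budget=96))
```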
GVM-RAFT yields strictly faster convergence than vanilla RAFT by minimizing the gradient variance term at each EM iteration; this acceleration is theoretically quantifiable under mild smoothness and convexity conditions (Yao et al., 5 May 2025).
4. Application Domains and Variants
RAFT and its extensions are implemented across generative language and image modeling, as well as retrieval-based LLM reranking systems.
Language and Diffusion Models
- In LLMs, RAFT demonstrates robust reward maximization with little perplexity degradation, outperforming RLHF (PPO) in mean reward and human preference on HH-RLHF with LLaMA-7B (Dong et al., 2023).
- For diffusion models (e.g., Stable Diffusion v1.5), RAFT, using CLIP and aesthetic predictors as rewards, attains higher reward and aesthetic scores than RL alternatives (e.g., DDPO) with much lower computational cost.
Chain-of-Thought Reasoning
- GVM-RAFT leverages prompt-specific sample allocation for mathematical reasoning with significant empirical gains. On Qwen2.5-Math-1.5B/7B models, GVM-RAFT achieves up to 2–4× faster convergence and up to 3.2 percentage point accuracy improvements over RAFT++ on the Math500, Minerva Math, OlympiadBench, AIME24, and AMC23 tasks (Yao et al., 5 May 2025).
Text Reranking
- RAFT underpins the ERank reranker, employing a two-stage process: supervised fine-tuning with integer relevance scores and reinforcement learning using a listwise, rule-based reward function. The RL phase (under Group Relative Policy Optimization) injects global ranking awareness by rewarding or penalizing scored outputs based on their ordering among all candidates, resulting in robust nDCG@10 gains and low query latency (Cai et al., 30 Aug 2025).
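As a hypothetical illustration of what a listwise, rule-based reward of this kind can look like (ERank's exact formulation may differ), the sketch below scores a model's predicted relevance scores by the nDCG@k of the ordering they induce:

```python
import numpy as np

def listwise_reward(pred_scores, true_relevance, k=10):
    """Hypothetical listwise rule-based reward: nDCG@k of the candidate
    ordering induced by the model's predicted relevance scores."""
    rel = np.asarray(true_relevance, dtype=float)
    order = np.argsort(-np.asarray(pred_scores, dtype=float))[:k]
    gains = 2.0 ** rel[order] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float((gains * discounts).sum())
    ideal = np.sort(rel)[::-1][:k]
    idcg = float(((2.0 ** ideal - 1.0) / np.log2(np.arange(2, len(ideal) + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ordering earns reward 1.0; a shuffled one earns less.
print(listwise_reward(pred_scores=[3, 0, 2, 1], true_relevance=[3, 0, 2, 1]))  # -> 1.0
print(listwise_reward(pred_scores=[0, 3, 1, 2], true_relevance=[3, 0, 2, 1]))  # -> < 1.0
```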
5. Empirical Outcomes and Theoretical Properties
RAFT’s design yields several empirical advantages over conventional RLHF:
- Stability and Simplicity: No reliance on actor-critic architectures or on-policy rollouts; memory footprint is reduced.
- Efficiency: RAFT and especially GVM-RAFT enable convergence in fewer steps and with lower variance under identical computational budgets. Empirically, GVM-RAFT improves accuracy on mathematical benchmarks and accelerates convergence by 2–4× (Yao et al., 5 May 2025).
Typical language modeling experiments (SFT, PPO, RAFT) reveal:
- For LLaMA-7B, RAFT (K = 32) achieves a reward of 2.294 compared to PPO’s 2.077, with lower perplexity.
- Human and GPT-4 preference tests indicate RAFT is favored over PPO 65–69% of the time.
- In diffusion, RAFT matches or outperforms RL baselines on CLIP and aesthetic scores at a fraction (∼1/50th) of the GPU cost (Dong et al., 2023).
In reranking, ERank-4B and ERank-32B achieve state-of-the-art nDCG@10 on BRIGHT and competitive results on BEIR and TREC-DL, all while exhibiting low evaluation latency (Cai et al., 30 Aug 2025).
6. Comparative Analysis and Design Tradeoffs
RAFT’s iterative best-of-K selection approximates greedy policy improvement. The data-generation and model-update steps are decoupled, circumventing distribution shift and “alignment tax” issues endemic to SFT-only or pure RL pipelines. Regularization via KL penalties can be tuned to balance reward maximization and output fluency.
Distinctive features relative to RLHF include:
- Batch-parallelism: RAFT requires only one model at a time (no actor/reference/critic separation) and enables large off-policy batches.
- Scalability and Modularity: The core RAFT method extends via GVM-RAFT’s sample reallocation and can be plugged into various RL finetuning algorithms for similar variance reductions.
- Reward Structure Flexibility: RAFT supports programmatic, neural, or rule-based reward models, including listwise and nDCG-style objectives critical for information retrieval applications (Cai et al., 30 Aug 2025).
7. Experimental Protocols and Hyperparameterization
Key hyperparameter and procedural details are as follows (collected into a configuration sketch after the list):
- LLM Alignment: LLaMA-7B, batch size 2048, K={8,16,32}, temperature λ∈[0.7,1.0], learning rate 2e-5, 10–15 RAFT stages (Dong et al., 2023).
- GVM-RAFT for Math Reasoning: Pilot sample size N′∈{8,16,32}, total sample budget N allocated dynamically per prompt; tested on Qwen2.5-Math-{1.5B,7B}.
- Reranking: ERank with Qwen3-{4B,14B,32B}, SFT with LoRA rank=32, α=64, RL phase with group size 5, PPO-style clipping, KL coefficient 1×10⁻³, and max sequence length 1024 (Cai et al., 30 Aug 2025).
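A configuration sketch gathering the LLM-alignment settings above into one place (field names are hypothetical; only the values come from the listed settings):

```python
from dataclasses import dataclass

@dataclass
class RAFTConfig:
    """Illustrative configuration mirroring the LLM-alignment setup above."""
    model_name: str = "LLaMA-7B"
    batch_size: int = 2048       # prompts per RAFT stage
    k_candidates: int = 32       # K responses sampled per prompt
    temperature: float = 0.7     # sampling temperature (swept in [0.7, 1.0])
    learning_rate: float = 2e-5
    num_stages: int = 15         # generate/rank/fine-tune iterations (10-15 used)
    kl_coef: float = 0.0         # optional KL penalty toward the initial model
```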
RAFT’s flexibility and systematic reward-driven selection underpin its adoption in domain-specific model alignment, logical reasoning, and high-throughput reranking. Empirical validation across large-scale language, image, and retrieval benchmarks consistently demonstrates gains in reward, sample efficiency, and alignment fidelity.