RAFT Algorithm for Model Alignment
- RAFT is a framework that aligns generative models by repeatedly sampling outputs, ranking them with a reward, and fine-tuning on the highest-ranking examples.
- It decouples the sampling and learning phases, using maximum likelihood updates to achieve stability and computational efficiency compared to conventional RLHF.
- Empirical evaluations show RAFT’s effectiveness in enhancing performance and alignment in both language and diffusion models, while reducing training time and complexity.
RAFT (Reward rAnked FineTuning) is a framework for aligning generative foundation models with specific reward signals, providing a stable, scalable alternative to Reinforcement Learning from Human Feedback (RLHF) for tasks such as aligning LLMs and diffusion models with human or automated preferences. The essential idea is to repeatedly sample generations from a model, rank these candidates according to a scalar reward, select the highest-rewarding outputs, and then fine-tune the model using standard supervised learning on this filtered subset. RAFT decouples the sampling and learning phases, avoids the instability and high computational demands of RL-based methods, and offers robust guarantees for reward maximization through maximum-likelihood gradient descent on high-quality samples (Dong et al., 2023).
1. Formal Specification
Given a pre-trained conditional generator $p_{\theta}(y \mid x)$ (parameterized by $\theta$), a reward model $r(x, y)$, and a distribution over prompts $x \sim \mathcal{D}$, RAFT defines the (generally intractable) reward-maximizing objective:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim p_{\theta}(\cdot \mid x)} \big[ r(x, y) \big]$$
RAFT operationalizes this objective by an iterative procedure:
- For each iteration $t = 0, \ldots, T-1$: sample $b$ prompts from $\mathcal{D}$ and, for each prompt $x$, draw $N$ independent samples $y \sim p_{\theta^{t}}(\cdot \mid x)$.
- Rank the responses by reward $r(x, y)$, retaining only the best (or the top $K$, or those exceeding a threshold $\tau$).
- Fine-tune $\theta$ via supervised maximum likelihood on the set of selected pairs $\mathcal{B}_t$:

$$\theta^{t+1} = \arg\max_{\theta} \sum_{(x, y) \in \mathcal{B}_t} \log p_{\theta}(y \mid x)$$

If regularization is desired, a KL penalty with weight $\beta$ toward the initial model $p_{\theta^{0}}$ can be included:

$$\mathcal{L}(\theta) = -\sum_{(x, y) \in \mathcal{B}_t} \log p_{\theta}(y \mid x) + \beta \, \mathrm{KL}\!\left(p_{\theta}(\cdot \mid x) \,\|\, p_{\theta^{0}}(\cdot \mid x)\right)$$

Selection is invariant to positive affine (indeed, any strictly increasing) rescaling of $r$. RAFT thus iteratively pushes the generator distribution toward regions of higher reward under the provided metric, without the complexities of reward-model-based credit assignment or policy optimization (Dong et al., 2023).
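Because acceptance depends only on the ordering of rewards, rescaling $r$ by any positive affine transform leaves the accepted set unchanged. A minimal check (the rewards and transform constants are illustrative):

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring candidates (rank-based selection)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

rewards = [0.2, 1.7, -0.4, 0.9, 1.1]

# Positive affine rescaling: r' = a*r + c with a > 0 preserves the ranking.
rescaled = [3.0 * r + 5.0 for r in rewards]

# Same accepted set before and after rescaling.
assert top_k_indices(rewards, 2) == top_k_indices(rescaled, 2)
```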
2. Algorithmic Workflow
The complete pseudocode for RAFT is as follows:
```
Input: initial parameters θ⁰, reward model r(x, y), batch size b,
       sample count N per prompt, acceptance rule (top-1, top-K, threshold τ),
       number of iterations T

for t = 0 ... T-1:
    X = {sample b prompts x₁, ..., x_b from D}
    S = {}
    for each xᵢ in X:
        Y = {sample N responses yᵢ,ₖ ~ p_{θᵗ}(·|xᵢ)}
        scores = {r(xᵢ, yᵢ,ₖ) for k = 1..N}
        Sᵢ = select_best(Y, scores, rule=acceptance_rule)
        S.update({(xᵢ, y) : y in Sᵢ})
    θᵗ⁺¹ = gradient_descent_step(θᵗ, S)
return θᵀ
```
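The loop above can be sketched as runnable Python with a toy generator and reward. The sampler, reward, and prompt distribution here are illustrative stand-ins, and the supervised gradient step is out of scope, so the function simply returns the fine-tuning set for one iteration under a top-1 acceptance rule:

```python
import random

def raft_iteration(draw_prompt, draw_response, reward, b, N, rng):
    """One RAFT iteration with a top-1 acceptance rule: sample b prompts,
    draw N candidate responses each, keep the best response per prompt."""
    selected = []
    for _ in range(b):
        x = draw_prompt(rng)
        candidates = [draw_response(x, rng) for _ in range(N)]
        best = max(candidates, key=lambda y: reward(x, y))
        selected.append((x, best))
    return selected  # supervised fine-tuning set for this iteration

# Toy setup: prompts are integers, responses are noisy guesses of the
# prompt, and the reward prefers responses close to the prompt.
rng = random.Random(0)
draw_prompt = lambda rng: rng.randint(0, 9)
draw_response = lambda x, rng: x + rng.gauss(0.0, 2.0)
reward = lambda x, y: -abs(x - y)

batch = raft_iteration(draw_prompt, draw_response, reward, b=4, N=16, rng=rng)
```

With $N = 16$ candidates per prompt, the retained responses cluster tightly around the reward optimum even though individual samples are noisy, which is the filtering effect RAFT relies on.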
The selection procedure can perform:
- Best-of-N sampling (top-1 per prompt),
- Top-K selection,
- Thresholding (all pairs with $r(x, y) \geq \tau$).
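All three rules can share one helper. The `select_best` function below is a hypothetical implementation (the pseudocode above only names the rules):

```python
def select_best(pairs, rule="top1", K=2, tau=0.0):
    """Apply a RAFT acceptance rule to (response, reward) pairs.

    rule = "top1":      keep the single highest-reward response (best-of-N)
    rule = "topk":      keep the K highest-reward responses
    rule = "threshold": keep every response with reward >= tau
    """
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    if rule == "top1":
        return ranked[:1]
    if rule == "topk":
        return ranked[:K]
    if rule == "threshold":
        return [p for p in ranked if p[1] >= tau]
    raise ValueError(f"unknown rule: {rule}")

pairs = [("a", 0.1), ("b", 0.8), ("c", 0.5), ("d", 0.9)]
print(select_best(pairs, rule="top1"))               # [('d', 0.9)]
print(select_best(pairs, rule="topk", K=2))          # [('d', 0.9), ('b', 0.8)]
print(select_best(pairs, rule="threshold", tau=0.5)) # keeps d, b, c
```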
Fine-tuning is always via cross-entropy (negative log-likelihood) over the accepted pairs, optionally with a KL term penalizing divergence from the initial parameters.
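For a toy categorical generator, the regularized objective is just the negative log-likelihood of accepted tokens plus a scaled KL term back to the frozen initial distribution. A numeric sketch (the distributions, accepted tokens, and $\beta$ are all illustrative):

```python
import math

def nll(p, accepted):
    """Negative log-likelihood of the accepted tokens under distribution p."""
    return -sum(math.log(p[y]) for y in accepted)

def kl(p, q):
    """KL divergence KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p_init = [0.25, 0.25, 0.25, 0.25]   # frozen reference model p_{θ⁰}
p_cur  = [0.10, 0.10, 0.10, 0.70]   # current model after some updates
accepted = [3, 3, 2]                # high-reward tokens kept by selection
beta = 0.05                         # KL penalty weight

loss = nll(p_cur, accepted) + beta * kl(p_cur, p_init)
```

The KL term grows as the current model drifts from the initial one, so it trades raw reward-chasing against staying close to the original model's fluency and diversity.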
3. Hyperparameterization and Effects
The following hyperparameters define RAFT’s operational regime (Dong et al., 2023):
| Parameter | Effect/Tradeoff | Typical Values |
|---|---|---|
| $N$ (samples/prompt) | Higher $N$ yields higher-reward extremes, but compute grows linearly in $N$ while reward gains diminish | 16–64 |
| Top-$K$ / acceptance ratio | A lower acceptance ratio increases selectivity, but can reduce data diversity and overfit smaller batches | Small $K$ (often top-1) |
| Threshold $\tau$ | Sets a fixed reward bar. Too high: under-training. Too low: poor alignment. | Data/metric-dependent |
| Sampling temperature | Controls sample diversity; higher temperature increases exploration | Up to $1.0$ |
| KL weight $\beta$ | Regularizes toward the original model (preserving fluency, diversity) | Small (up to $0.1$) |
Empirically, the authors find that moderate per-prompt sample counts with top-1 selection, a moderate sampling temperature, and light KL regularization work well for both language and diffusion models.
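The diminishing return from larger $N$ is easy to see in closed form: if per-sample rewards were i.i.d. Uniform(0, 1), the expected best-of-$N$ reward would be $N/(N+1)$, so doubling $N$ beyond a few dozen buys very little. (The uniform reward assumption is purely illustrative.)

```python
def expected_best_of_n(n):
    """E[max of n i.i.d. Uniform(0,1) draws] = n / (n + 1)."""
    return n / (n + 1)

# Marginal gain shrinks rapidly: most of the benefit arrives by N ≈ 64,
# consistent with the 16–64 range used in practice.
gains = {n: expected_best_of_n(n) for n in (1, 4, 16, 64, 256)}
```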
4. Advantages over RLHF and PPO
RAFT contrasts with traditional RLHF—especially RLHF grounded in Proximal Policy Optimization (PPO)—in multiple dimensions (Dong et al., 2023):
- Stability: RAFT employs only maximum-likelihood supervised updates on filtered samples, which are inherently stable. RLHF-PPO is prone to reward-scaling dependencies, sensitivity to noise, and hyperparameter instability.
- Simplicity: No policy reference, no separate critic network, and only a single model active at each training point.
- Decoupling of Phases: Generation (sampling) and weight update (fine-tuning) are loosely coupled, enabling efficient batch pipeline execution.
- Memory/Compute: RAFT requires only the reward model and current generator during fine-tuning, avoiding multiple concurrent model/critic references.
- Reward scaling: RAFT’s accept/reject criterion is rank-based, rendering it robust to affine transformations or calibration error in the reward model; PPO and related methods are explicitly reward-scale sensitive.
These traits make RAFT more robust and practical for large-scale alignment tasks than typical RLHF protocols.
5. Empirical Evaluation
LLM Alignment
On LLaMA-7B aligned against the HH-RLHF benchmark:
- Compared against the unaligned base model, supervised fine-tuning (SFT), and PPO on mean reward and perplexity, RAFT attains the highest mean reward.
- Lexical and semantic diversity on par or superior to PPO.
- RAFT converges in hours on A40 GPUs (faster than PPO).
Diffusion Model Alignment
For Stable Diffusion 1.5, on both 256×256 CLIP-aesthetic and 512×512 text-image alignment:
- RAFT achieves a higher CLIP-aesthetic score than the DDPO RLHF baseline, with substantially less total training time than the $415$ minutes required by the RL baseline.
- RAFT produces stronger text-prompt alignment than the RL baseline.
Overall, RAFT matches or outperforms RLHF (PPO/DDPO) on reward-based alignment, preserves or improves output fluency and diversity, and offers substantially reduced hardware costs and operational complexity (Dong et al., 2023).
6. Practical Deployment and Limitations
RAFT is effective for both LLMs and diffusion models. The reward signal may be learned from human feedback or fully automated (e.g., CLIP-aesthetic scoring). The invariance to reward rescaling eases deployment with imperfectly calibrated reward models.
Primary limitations reflect the need for:
- Sufficient sample count to realize expected reward improvement.
- Careful management of acceptance ratios/top-K to avoid overfitting or degraded data diversity in small or low-entropy domains.
- A sufficiently expressive and aligned reward function; garbage-in-garbage-out applies if $r$ poorly reflects desired output quality or ethics.
RAFT is strictly limited to the domain defined by the reward model or function—potential misalignment between reward and true human preference will propagate, just as with RLHF.
References:
- "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment" (Dong et al., 2023)