
RAFT Algorithm for Model Alignment

Updated 18 January 2026
  • RAFT is a framework that aligns generative models by repeatedly sampling outputs, ranking them with a reward, and fine-tuning on the highest-ranking examples.
  • It decouples the sampling and learning phases, using maximum likelihood updates to achieve stability and computational efficiency compared to conventional RLHF.
  • Empirical evaluations show RAFT’s effectiveness in enhancing performance and alignment in both language and diffusion models, while reducing training time and complexity.

RAFT (Reward rAnked FineTuning) is a framework for aligning generative foundation models with specific reward signals, providing a stable, scalable alternative to Reinforcement Learning from Human Feedback (RLHF) for tasks such as aligning LLMs and diffusion models with human or automated preferences. The essential idea is to repeatedly sample generations from a model, rank these candidates according to a scalar reward, select the highest-rewarding outputs, and then fine-tune the model using standard supervised learning on this filtered subset. RAFT decouples the sampling and learning phases, avoids the instability and high computational demands of RL-based methods, and offers robust guarantees for reward maximization through maximum-likelihood gradient descent on high-quality samples (Dong et al., 2023).

1. Formal Specification

Given a pre-trained conditional generator p_θ(y|x) (parameterized by θ), a reward model r(x, y), and a distribution D over prompts x, RAFT defines the (generally intractable) reward-maximizing objective:

\max_\theta \, \mathbb{E}_{x \sim D} \, \mathbb{E}_{y \sim p_\theta(\cdot|x)} \, [r(x, y)]

RAFT operationalizes this objective by an iterative procedure:

  • At each iteration t, for each of b prompts xᵢ sampled from D, draw N independent samples {yᵢ,ₖ} from p_{θᵗ}(·|xᵢ).
  • Rank the responses by reward r(xᵢ, yᵢ,ₖ), retaining only the best (or the top K, or those exceeding a threshold τ).
  • Fine-tune θ via supervised maximum likelihood on the selected set S of (x, y) pairs:

L_{\mathrm{RAFT}}(\theta) = -\sum_{(x, y) \in S} \log p_\theta(y|x)
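The fine-tuning objective is plain supervised negative log-likelihood over the accepted pairs. A minimal sketch, where `log_prob_fn` is an illustrative stand-in for log p_θ(y|x) as computed by the model:

```python
def raft_loss(selected_pairs, log_prob_fn):
    """Supervised RAFT objective: negative log-likelihood summed over the
    accepted (x, y) pairs. log_prob_fn(x, y) stands in for log p_theta(y|x)."""
    return -sum(log_prob_fn(x, y) for x, y in selected_pairs)
```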

If regularization is desired, a KL penalty with weight β can be included by replacing r with the penalized reward:

\tilde{r}(x, y) = r(x, y) - \beta \log \frac{p_\theta(y|x)}{p_{\theta^0}(y|x)}
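Given log-probabilities of a response under the current and initial models, the KL-penalized reward is a one-liner (a minimal sketch; the function name is illustrative):

```python
def kl_penalized_reward(reward, logp_current, logp_initial, beta=0.1):
    """KL-regularized reward: r~(x, y) = r(x, y) - beta * log(p_theta(y|x) / p_theta0(y|x)).

    logp_current and logp_initial are log p_theta(y|x) and log p_theta0(y|x);
    setting beta = 0 recovers the raw reward.
    """
    return reward - beta * (logp_current - logp_initial)
```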

Selection is invariant to order-preserving affine rescaling of r. RAFT thus iteratively pushes the generator distribution toward regions of higher reward under the provided metric, without the complexities of reward-model-based credit assignment or policy optimization (Dong et al., 2023).

2. Algorithmic Workflow

The complete pseudocode for RAFT is as follows:

Input: initial parameters θ⁰,
       reward model r(x,y),
       batch size b,
       sample count N per prompt,
       acceptance rule (top-1, top-K, or threshold τ),
       number of iterations T

for t = 0 ... T-1:
    X = {sample b prompts x₁,...,x_b from D}
    S = {}
    for each xᵢ in X:
        Y = {sample N responses yᵢ,ₖ ~ p_{θᵗ}(·|xᵢ) for k = 1..N}
        scores = {r(xᵢ, yᵢ,ₖ) for k = 1..N}
        Yᵢ = select_best(Y, scores, rule=acceptance_rule)
        S = S ∪ {(xᵢ, y) : y ∈ Yᵢ}
    θᵗ⁺¹ = gradient_descent_step(θᵗ, S)    # supervised NLL on S
return θᵀ

The selection procedure can perform:

  • Best-of-N sampling (top-1 per prompt),
  • Top-K selection,
  • Thresholding (all pairs with r ≥ τ).

Fine-tuning is always via cross-entropy (negative log-likelihood) over the accepted pairs, optionally with a KL term penalizing divergence from the initial parameters.
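The acceptance rules above can be sketched as a small data-collection helper (illustrative Python; `sample_fn` and `reward_fn` are stand-ins for the generator and reward model, and the gradient step on the returned pairs is left to the training framework):

```python
def raft_collect(prompts, sample_fn, reward_fn, n_samples=8,
                 rule="top1", k=2, tau=0.0):
    """One RAFT data-collection pass (illustrative sketch).

    For each prompt, draw n_samples candidates via sample_fn (a stand-in
    for y ~ p_theta(.|x)), rank them by reward_fn, and keep the accepted
    (prompt, response) pairs for supervised fine-tuning.
    """
    selected = []
    for x in prompts:
        candidates = [sample_fn(x) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda y: reward_fn(x, y), reverse=True)
        if rule == "top1":            # best-of-N sampling
            keep = ranked[:1]
        elif rule == "topk":          # top-K per prompt
            keep = ranked[:k]
        else:                         # threshold: keep all with r >= tau
            keep = [y for y in ranked if reward_fn(x, y) >= tau]
        selected.extend((x, y) for y in keep)
    return selected
```

Because ranking happens per prompt, the three rules differ only in how many of the sorted candidates survive, which is why they share one selection pass.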

3. Hyperparameterization and Effects

The following hyperparameters define RAFT’s operational regime (Dong et al., 2023):

| Parameter | Effect / Tradeoff | Typical Values |
|---|---|---|
| N (samples per prompt) | Higher N yields higher-reward extremes but increases compute; gains scale like O(√(log N)) | 16–64 |
| Top-K (acceptance ratio 1/K) | Lower acceptance ratio increases selectivity but can reduce data diversity and overfit smaller batches | K = 8 to K = 32 |
| Threshold τ | Sets a fixed reward bar; too high risks under-training, too low risks poor alignment | data/metric-dependent |
| Sampling temperature λ | Controls sample diversity; λ > 1 increases exploration | λ = 0.85–1.0 |
| KL weight β | Regularizes toward the original model, preserving fluency and diversity | β = 0–0.1 |

Empirically, the authors find N ≈ 32, b ≈ 2048, top-1 selection per prompt (or acceptance ratios 1/K ≈ 1/8 to 1/32), temperature λ ∈ [0.85, 1.00], and light KL regularization optimal for both language and diffusion models.
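The diminishing returns from larger N can be checked empirically on a toy Gaussian reward, where the expected best-of-N reward grows roughly like √(2 ln N). This is a simulation sketch under that assumption, not an experiment from the paper:

```python
import random

def expected_best_of_n(n, trials=2000, seed=0):
    """Monte-Carlo estimate of E[max of n i.i.d. standard-normal rewards].

    For Gaussian rewards the maximum grows roughly like sqrt(2 ln n),
    so doubling N buys progressively smaller reward gains per sample.
    """
    rng = random.Random(seed)
    return sum(max(rng.gauss(0.0, 1.0) for _ in range(n))
               for _ in range(trials)) / trials
```

Running this for N = 1, 4, 32 shows the estimate climbing while the per-sample gain shrinks, matching the O(√(log N)) tradeoff noted in the table.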

4. Advantages over RLHF and PPO

RAFT contrasts with traditional RLHF—especially RLHF grounded in Proximal Policy Optimization (PPO)—in multiple dimensions (Dong et al., 2023):

  • Stability: RAFT employs only maximum-likelihood supervised updates on filtered samples, which are inherently stable. RLHF-PPO is prone to reward-scaling dependencies, sensitivity to noise, and hyperparameter instability.
  • Simplicity: No reference policy or separate critic network is required during optimization, and only a single generator model is active at each training step.
  • Decoupling of Phases: Generation (sampling) and weight update (fine-tuning) are loosely coupled, enabling efficient batch pipeline execution.
  • Memory/Compute: RAFT requires only the reward model and current generator during fine-tuning, avoiding multiple concurrent model/critic references.
  • Reward scaling: RAFT’s accept/reject criterion is rank-based, rendering it robust to affine transformations or calibration error in the reward model; PPO and related methods are explicitly reward-scale sensitive.

These traits make RAFT more robust and practical for large-scale alignment tasks than typical RLHF protocols.
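The reward-scale robustness above is easy to verify: rank-based selection returns the same candidates under any order-preserving affine rescaling of the reward. A toy check:

```python
def top_indices(scores, k=1):
    """Indices of the k highest-scoring candidates (rank-based acceptance)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

rewards = [0.2, 1.7, -0.3, 0.9]
rescaled = [5.0 * r + 3.0 for r in rewards]  # positive affine rescaling
assert top_indices(rewards, k=2) == top_indices(rescaled, k=2)
```

A PPO-style update, by contrast, scales its gradient with the reward magnitude itself, so the same rescaling changes the optimization trajectory.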

5. Empirical Evaluation

LLM Alignment

On LLaMA-7B aligned against the HH-RLHF benchmark:

  • Unaligned: mean reward ≈ −0.44, perplexity ≈ 4.78.
  • SFT: r ≈ 0.77, perplexity ≈ 3.78.
  • PPO: r ≈ 2.08, perplexity ≈ 4.16.
  • RAFT (N = 32, λ = 1.0): r ≈ 2.29 (best), perplexity ≈ 4.03.
  • Lexical and semantic diversity on par with or superior to PPO.
  • RAFT converges in ≈ 7 hours on 8× A40 GPUs (faster than PPO).

Diffusion Model Alignment

For Stable Diffusion 1.5, on both 256×256 CLIP-aesthetic and 512×512 text-image alignment:

  • RAFT achieves a CLIP-aesthetic score of ≈ 6.14 ± 0.49 (vs. 6.04 ± 0.49 for DDPO RLHF) in ≈ 8.4 minutes of total training time (vs. 415 minutes for the RL baseline).
  • RAFT also produces stronger text-prompt alignment.

Overall, RAFT matches or outperforms RLHF (PPO/DDPO) on reward-based alignment, preserves or improves output fluency and diversity, and offers substantially reduced hardware costs and operational complexity (Dong et al., 2023).

6. Practical Deployment and Limitations

RAFT is effective for both LLMs and diffusion models. The reward can be a learned model (e.g., trained on human feedback) or an automated metric (e.g., CLIP-aesthetic score). Invariance to order-preserving reward rescaling eases deployment with imperfectly calibrated reward models.

Primary limitations reflect the need for:

  • Sufficient sample count N to realize expected reward improvement.
  • Careful management of acceptance ratios/top-K to avoid overfitting or degraded data diversity in small or low-entropy domains.
  • A sufficiently expressive and aligned reward function; garbage-in-garbage-out applies if r(x, y) poorly reflects desired output quality or ethics.

RAFT is strictly limited to the domain defined by the reward model or function—potential misalignment between reward and true human preference will propagate, just as with RLHF.


References:

  • "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment" (Dong et al., 2023)