RAFT Algorithm for Model Alignment
- RAFT is a framework that aligns generative models by repeatedly sampling outputs, ranking them with a reward, and fine-tuning on the highest-ranking examples.
- It decouples the sampling and learning phases, using maximum likelihood updates to achieve stability and computational efficiency compared to conventional RLHF.
- Empirical evaluations show RAFT’s effectiveness in enhancing performance and alignment in both language and diffusion models, while reducing training time and complexity.
RAFT (Reward rAnked FineTuning) is a framework for aligning generative foundation models with specific reward signals, providing a stable, scalable alternative to Reinforcement Learning from Human Feedback (RLHF) for tasks such as aligning LLMs and diffusion models with human or automated preferences. The essential idea is to repeatedly sample generations from a model, rank these candidates according to a scalar reward, select the highest-rewarding outputs, and then fine-tune the model using standard supervised learning on this filtered subset. RAFT decouples the sampling and learning phases, avoids the instability and high computational demands of RL-based methods, and offers robust guarantees for reward maximization through maximum-likelihood gradient descent on high-quality samples (Dong et al., 2023).
1. Formal Specification
Given a pre-trained conditional generator $p_{\theta}(y \mid x)$ (parameterized by $\theta$), a reward model $r(x, y)$, and a distribution over prompts $x \sim \mathcal{D}$, RAFT defines the (generally intractable) reward-maximizing objective:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim p_{\theta}(\cdot \mid x)} \big[ r(x, y) \big]$$
RAFT operationalizes this objective by an iterative procedure:
- For each iteration $t = 0, \ldots, T-1$: sample $b$ prompts from $\mathcal{D}$ and, for each prompt $x$, draw $N$ independent samples $y \sim p_{\theta^{t}}(\cdot \mid x)$.
- Rank the responses by reward $r(x, y)$, retaining only the best (or the top $K$, or those exceeding a threshold $\tau$).
- Fine-tune $\theta$ via supervised maximum likelihood on the set of selected pairs $\mathcal{B}_t$:

$$\theta^{t+1} = \arg\max_{\theta} \sum_{(x, y) \in \mathcal{B}_t} \log p_{\theta}(y \mid x)$$

If regularization is desired, a KL penalty with weight $\beta$ toward the initial model $p_{\theta^{0}}$ can be included:

$$\mathcal{L}(\theta) = -\sum_{(x, y) \in \mathcal{B}_t} \log p_{\theta}(y \mid x) + \beta \, \mathrm{KL}\!\left(p_{\theta}(\cdot \mid x) \,\|\, p_{\theta^{0}}(\cdot \mid x)\right)$$

Selection is invariant to positive affine (indeed, any strictly increasing) rescaling of $r$. RAFT thus iteratively pushes the generator distribution toward regions of higher reward under the provided metric, without the complexities of reward-model-based credit assignment or policy optimization (Dong et al., 2023).
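Because acceptance depends only on the ordering of rewards, rescaling $r$ by any positive affine transform leaves the accepted set unchanged. A minimal check (the rewards and transform constants are illustrative):

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring candidates (rank-based selection)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

rewards = [0.2, 1.7, -0.4, 0.9, 1.1]

# Positive affine rescaling: r' = a*r + c with a > 0 preserves the ranking.
rescaled = [3.0 * r + 5.0 for r in rewards]

# Same accepted set before and after rescaling.
assert top_k_indices(rewards, 2) == top_k_indices(rescaled, 2)
```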
2. Algorithmic Workflow
The complete pseudocode for RAFT is as follows:
```
Input: initial parameters θ⁰, reward model r(x, y), batch size b,
       sample count N per prompt, acceptance rule (top-1, top-K, threshold τ),
       number of iterations T

for t = 0 ... T-1:
    X = {sample b prompts x₁, ..., x_b from D}
    S = {}
    for each xᵢ in X:
        Y = {sample N responses yᵢ,ₖ ~ p_{θᵗ}(·|xᵢ)}
        scores = {r(xᵢ, yᵢ,ₖ) for k = 1..N}
        Sᵢ = select_best(Y, scores, rule=acceptance_rule)
        S.update({(xᵢ, y) : y in Sᵢ})
    θᵗ⁺¹ = gradient_descent_step(θᵗ, S)
return θᵀ
```
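The loop above can be sketched as runnable Python with a toy generator and reward. The sampler, reward, and prompt distribution here are illustrative stand-ins, and the supervised gradient step is out of scope, so the function simply returns the fine-tuning set for one iteration under a top-1 acceptance rule:

```python
import random

def raft_iteration(draw_prompt, draw_response, reward, b, N, rng):
    """One RAFT iteration with a top-1 acceptance rule: sample b prompts,
    draw N candidate responses each, keep the best response per prompt."""
    selected = []
    for _ in range(b):
        x = draw_prompt(rng)
        candidates = [draw_response(x, rng) for _ in range(N)]
        best = max(candidates, key=lambda y: reward(x, y))
        selected.append((x, best))
    return selected  # supervised fine-tuning set for this iteration

# Toy setup: prompts are integers, responses are noisy guesses of the
# prompt, and the reward prefers responses close to the prompt.
rng = random.Random(0)
draw_prompt = lambda rng: rng.randint(0, 9)
draw_response = lambda x, rng: x + rng.gauss(0.0, 2.0)
reward = lambda x, y: -abs(x - y)

batch = raft_iteration(draw_prompt, draw_response, reward, b=4, N=16, rng=rng)
```

With $N = 16$ candidates per prompt, the retained responses cluster tightly around the reward optimum even though individual samples are noisy, which is the filtering effect RAFT relies on.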
The selection procedure can perform:
- Best-of-N sampling (top-1 per prompt),
- Top-K selection,
- Thresholding (all pairs with $r(x, y) \geq \tau$).
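All three rules can share one helper. The `select_best` function below is a hypothetical implementation (the pseudocode above only names the rules):

```python
def select_best(pairs, rule="top1", K=2, tau=0.0):
    """Apply a RAFT acceptance rule to (response, reward) pairs.

    rule = "top1":      keep the single highest-reward response (best-of-N)
    rule = "topk":      keep the K highest-reward responses
    rule = "threshold": keep every response with reward >= tau
    """
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    if rule == "top1":
        return ranked[:1]
    if rule == "topk":
        return ranked[:K]
    if rule == "threshold":
        return [p for p in ranked if p[1] >= tau]
    raise ValueError(f"unknown rule: {rule}")

pairs = [("a", 0.1), ("b", 0.8), ("c", 0.5), ("d", 0.9)]
print(select_best(pairs, rule="top1"))               # [('d', 0.9)]
print(select_best(pairs, rule="topk", K=2))          # [('d', 0.9), ('b', 0.8)]
print(select_best(pairs, rule="threshold", tau=0.5)) # keeps d, b, c
```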
Fine-tuning is always via cross-entropy (negative log-likelihood) over the accepted pairs, optionally with a KL term penalizing divergence from the initial parameters.
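For a toy categorical generator, the regularized objective is just the negative log-likelihood of accepted tokens plus a scaled KL term back to the frozen initial distribution. A numeric sketch (the distributions, accepted tokens, and $\beta$ are all illustrative):

```python
import math

def nll(p, accepted):
    """Negative log-likelihood of the accepted tokens under distribution p."""
    return -sum(math.log(p[y]) for y in accepted)

def kl(p, q):
    """KL divergence KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p_init = [0.25, 0.25, 0.25, 0.25]   # frozen reference model p_{θ⁰}
p_cur  = [0.10, 0.10, 0.10, 0.70]   # current model after some updates
accepted = [3, 3, 2]                # high-reward tokens kept by selection
beta = 0.05                         # KL penalty weight

loss = nll(p_cur, accepted) + beta * kl(p_cur, p_init)
```

The KL term grows as the current model drifts from the initial one, so it trades raw reward-chasing against staying close to the original model's fluency and diversity.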
3. Hyperparameterization and Effects
The following hyperparameters define RAFT’s operational regime (Dong et al., 2023):
| Parameter | Effect/Tradeoff | Typical Values |
|---|---|---|
| $N$ (samples/prompt) | Higher $N$ yields higher-reward extremes, but compute grows linearly in $N$ while reward gains diminish | 16–64 |
| Top-$K$ / acceptance ratio | A lower acceptance ratio increases selectivity, but can reduce data diversity and overfit smaller batches | Small $K$ (often top-1) |
| Threshold $\tau$ | Sets a fixed reward bar. Too high: under-training. Too low: poor alignment. | Data/metric-dependent |
| Sampling temperature | Controls sample diversity; higher temperature increases exploration | Up to $1.0$ |
| KL weight $\beta$ | Regularizes toward the original model (preserving fluency, diversity) | Small (up to $0.1$) |
Empirically, the authors find that moderate per-prompt sample counts with top-1 selection, a moderate sampling temperature, and light KL regularization work well for both language and diffusion models.
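The diminishing return from larger $N$ is easy to see in closed form: if per-sample rewards were i.i.d. Uniform(0, 1), the expected best-of-$N$ reward would be $N/(N+1)$, so doubling $N$ beyond a few dozen buys very little. (The uniform reward assumption is purely illustrative.)

```python
def expected_best_of_n(n):
    """E[max of n i.i.d. Uniform(0,1) draws] = n / (n + 1)."""
    return n / (n + 1)

# Marginal gain shrinks rapidly: most of the benefit arrives by N ≈ 64,
# consistent with the 16–64 range used in practice.
gains = {n: expected_best_of_n(n) for n in (1, 4, 16, 64, 256)}
```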
4. Advantages over RLHF and PPO
RAFT contrasts with traditional RLHF—especially RLHF grounded in Proximal Policy Optimization (PPO)—in multiple dimensions (Dong et al., 2023):
- Stability: RAFT employs only maximum-likelihood supervised updates on filtered samples, which are inherently stable. RLHF-PPO is prone to reward-scaling dependencies, sensitivity to noise, and hyperparameter instability.
- Simplicity: No policy reference, no separate critic network, and only a single model active at each training point.
- Decoupling of Phases: Generation (sampling) and weight update (fine-tuning) are loosely coupled, enabling efficient batch pipeline execution.
- Memory/Compute: RAFT requires only the reward model and current generator during fine-tuning, avoiding multiple concurrent model/critic references.
- Reward scaling: RAFT’s accept/reject criterion is rank-based, rendering it robust to affine transformations or calibration error in the reward model; PPO and related methods are explicitly reward-scale sensitive.
These traits make RAFT more robust and practical for large-scale alignment tasks than typical RLHF protocols.
5. Empirical Evaluation
LLM Alignment
On LLaMA-7B aligned against the HH-RLHF benchmark:
- Compared against the unaligned base model, supervised fine-tuning (SFT), and PPO on mean reward and perplexity, RAFT attains the highest mean reward.
- Lexical and semantic diversity on par or superior to PPO.
- RAFT converges in hours on A40 GPUs (faster than PPO).
Diffusion Model Alignment
For Stable Diffusion 1.5, on both 256×256 CLIP-aesthetic and 512×512 text-image alignment:
- RAFT achieves a higher CLIP-aesthetic score than the DDPO RLHF baseline, with substantially less total training time than the $415$ minutes required by the RL baseline.
- RAFT produces stronger text-prompt alignment than the RL baseline.
Overall, RAFT matches or outperforms RLHF (PPO/DDPO) on reward-based alignment, preserves or improves output fluency and diversity, and offers substantially reduced hardware costs and operational complexity (Dong et al., 2023).
6. Practical Deployment and Limitations
RAFT is effective for both LLMs and diffusion models. The reward signal may be learned from human feedback or fully automated (e.g., CLIP-aesthetic scoring). The invariance to reward rescaling eases deployment with imperfectly calibrated reward models.
Primary limitations reflect the need for:
- Sufficient sample count to realize expected reward improvement.
- Careful management of acceptance ratios/top-K to avoid overfitting or degraded data diversity in small or low-entropy domains.
- A sufficiently expressive and aligned reward function; garbage-in-garbage-out applies if $r$ poorly reflects desired output quality or ethics.
RAFT is strictly limited to the domain defined by the reward model or function—potential misalignment between reward and true human preference will propagate, just as with RLHF.
References:
- "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment" (Dong et al., 2023)