
Rejection Sampling & SFT in LLM Alignment

Updated 14 February 2026
  • Rejection Sampling and SFT are fundamental techniques that align large language models to human preferences using data-efficient and statistically rigorous methods.
  • Hybrid pipelines like RS-DPO, RSO, and STARS integrate rejection sampling with fine-tuning to generate high-quality preference data and improve benchmark performance.
  • Empirical analyses show that combining these methods enhances model alignment, data efficiency, and overall performance across various evaluation metrics.

Rejection sampling and supervised fine-tuning (SFT) are foundational components in the alignment and adaptation of LLMs to human preferences and downstream tasks. Their interplay underpins a range of state-of-the-art approaches, including statistical rejection sampling optimization (RSO), hybrid pipelines such as RS-DPO, and segment-level alignment strategies exemplified by STARS. This article provides a comprehensive overview of these methodologies, focusing on the technical mechanisms that connect rejection sampling with SFT, their role in modern preference optimization frameworks, and comparative empirical results across benchmark datasets.

1. Foundations of Supervised Fine-Tuning and Rejection Sampling

Supervised fine-tuning (SFT) refers to the process of adapting a pretrained LLM to high-quality data $\mathcal{D}_\mathrm{sft} = \{(x_i, y_i)\}$ by maximizing the conditional log-likelihood of reference responses:

$$\mathcal{L}_\mathrm{SFT} = -\sum_{(x,y)\in\mathcal{D}_\mathrm{sft}} \sum_{t=1}^{|y|}\log \pi(y_t\mid x, y_{<t}).$$

SFT produces a policy $\pi_\mathrm{sft}(y|x)$ that approximates the human demonstration distribution but does not directly optimize for reward functions that encode user preference or task success (Khaki et al., 2024).
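The loss above is an ordinary token-level negative log-likelihood. A minimal sketch, with hypothetical per-token probabilities standing in for a real model's outputs:

```python
import math

def sft_nll(token_probs):
    """Negative log-likelihood of one reference response, given the
    per-token probabilities pi(y_t | x, y_<t) the model assigns to it."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical 3-token response the model scores with probabilities
# 0.5, 0.25, 0.8; the loss is -ln(0.5 * 0.25 * 0.8) = ln 10.
loss = sft_nll([0.5, 0.25, 0.8])
```

In practice this is computed from logits over a vocabulary (e.g., via a cross-entropy loss) and summed over the dataset; the scalar form here just makes the per-token product structure explicit.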

Rejection sampling is a statistical procedure for biasing output distributions by selectively accepting or rejecting candidates according to a target distribution. In RLHF and preference optimization, rejection sampling typically applies to sampling from $\pi_\mathrm{sft}$ to form preference datasets where accepted/rejected samples are determined by a reward model $r(x, y)$ (Liu et al., 2023).
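The basic pattern can be sketched as follows. This is a generic illustration, not any one paper's procedure: `sample_fn` and `reward_fn` are hypothetical stand-ins for the SFT policy's sampler and the reward model, and the simple best-vs-rest pairing is one common choice.

```python
def build_preference_pairs(prompt, sample_fn, reward_fn, k=4):
    """Sample k candidates from the SFT policy (sample_fn), score each
    with the reward model r(x, y) (reward_fn), and pair the highest-reward
    response as 'chosen' against every other candidate as 'rejected'."""
    candidates = [sample_fn(prompt) for _ in range(k)]
    ranked = sorted(candidates, key=lambda y: reward_fn(prompt, y), reverse=True)
    chosen = ranked[0]
    return [(prompt, chosen, rejected) for rejected in ranked[1:]]
```

Downstream methods differ mainly in how the accept/reject decision is made: a hard top-1 rule as here, a reward-gap threshold (RS-DPO), or a probabilistic acceptance ratio (RSO, STARS).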

2. Hybrid Preference Optimization: RS-DPO

The RS-DPO (Rejection Sampling + Direct Preference Optimization) pipeline systematically integrates SFT, rejection sampling, and DPO as follows (Khaki et al., 2024):

  1. Supervised Fine-Tuning: Start from a pretrained LLM (e.g., Llama-2-7B), and apply SFT on paired prompt-response data (e.g., OASST1 dataset; 9,000 examples) to yield policy $\pi_\mathrm{sft}$ with specified hyperparameters (learning rate $2\times 10^{-5}$, batch size $64$, epochs $2$, weight decay $0.1$, sequence length $4096$). No LoRA adapters are used in this SFT phase.
  2. Synthetic Preference Bootstrapping via RS: For each prompt, sample $k=16$ responses from $\pi_\mathrm{sft}$ using multinomial decoding (top-k $50$, top-p $0.98$, temperature $1.0$, max tokens $512$). Each response is evaluated with a reward model $r(x, y)$ (pythia-6.9B-RM, trained with a Bradley–Terry loss). For each ordered pair, compute the normalized reward gap:

$$r_\mathrm{gap} = \sigma\Bigl(\frac{r(x, y_{ij}) - r(x, y_{il})}{\tau}\Bigr).$$

Pairs exceeding a threshold $\eta$ (e.g., $0.85$) are accepted as contrastive preference triples and form a synthetic set $\mathcal{D}_\mathrm{P}$.

  3. Direct Preference Optimization: Apply DPO on $\mathcal{D}_\mathrm{P}$, optimizing

$$\mathcal{L}_\mathrm{DPO} = -\sum_{(x, y_l, y_w)\in\mathcal{D}_\mathrm{P}} \log \sigma\Bigl[\beta\log\frac{\pi_\mathrm{RL}(y_w|x)}{\pi_\mathrm{sft}(y_w|x)} - \beta\log\frac{\pi_\mathrm{RL}(y_l|x)}{\pi_\mathrm{sft}(y_l|x)}\Bigr],$$

with $\beta = 0.1$ and LoRA rank $8$ (used in DPO, not SFT) for GPU efficiency.

The RS-DPO pipeline efficiently aligns LLMs by generating diverse, on-policy preference pairs and optimizing them with a stable convex surrogate loss, outperforming standalone RS, PPO, and DPO in benchmarks such as MT-Bench and AlpacaEval.
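The pair-filtering step at the heart of the pipeline (step 2) can be sketched as follows. A toy dictionary reward stands in for pythia-6.9B-RM, and the function name is illustrative:

```python
import math
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rs_dpo_pairs(prompt, responses, reward_fn, tau=1.0, eta=0.85):
    """Keep pairs whose normalized reward gap sigma((r_w - r_l) / tau)
    exceeds the threshold eta; each kept pair becomes a (prompt,
    winner, loser) preference triple for DPO."""
    triples = []
    for y_a, y_b in combinations(responses, 2):
        r_a, r_b = reward_fn(prompt, y_a), reward_fn(prompt, y_b)
        y_w, y_l = (y_a, y_b) if r_a >= r_b else (y_b, y_a)
        gap = sigmoid(abs(r_a - r_b) / tau)  # equals sigma((r_w - r_l)/tau)
        if gap > eta:
            triples.append((prompt, y_w, y_l))
    return triples
```

With $\tau = 1$ and $\eta = 0.85$, a pair is kept only when the raw reward gap exceeds about $1.73$ (since $\sigma(1.73) \approx 0.85$), which is what makes the retained triples strongly contrastive.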

3. Statistical and Decoding-Time Rejection Sampling: RSO and STARS

Statistical Rejection Sampling Optimization (RSO)

RSO generalizes the sampling of preference pairs by aiming to match the optimal policy

$$\pi^*(y|x) = \frac{1}{Z(x)}\pi_\mathrm{sft}(y|x)\exp\Bigl(\frac{1}{\beta} r_\psi(x, y)\Bigr),$$

with $r_\psi$ a learned reward model and $\beta$ a regularization strength (Liu et al., 2023). Rejection sampling is structured such that the acceptance probability is

$$a(x, y) = \exp\Bigl(\frac{1}{\beta}\bigl(r_\psi(x, y) - r_\mathrm{max}\bigr)\Bigr).$$

Accepted samples are used to construct preference datasets for downstream optimization with DPO or hinge losses.
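A minimal sketch of this acceptance rule, assuming $r_\mathrm{max}$ is taken as the best reward among the current candidates (one common choice; the function and argument names are illustrative):

```python
import math
import random

def rso_sample(candidates, reward_fn, beta=0.5, rng=random):
    """Accept candidate y with probability exp((r(y) - r_max) / beta),
    so the accepted set approximates pi_sft(y|x) * exp(r(y)/beta)
    up to normalization. Higher beta accepts more aggressively."""
    r_max = max(reward_fn(y) for y in candidates)
    accepted = []
    for y in candidates:
        a = math.exp((reward_fn(y) - r_max) / beta)  # in (0, 1]
        if rng.random() < a:
            accepted.append(y)
    return accepted
```

Note the role of $\beta$: as $\beta \to 0$ only near-maximal-reward samples survive, while large $\beta$ recovers the unmodified SFT distribution.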

RSO produces higher-quality on-policy preference data than direct use of SFT or static annotation, resulting in improved models across summarization and dialogue tasks according to multiple preference win-rate metrics and human evaluation.

Segment-level Token Alignment with Rejection Sampling (STARS)

STARS operates at inference time, segmenting generation into fixed-size blocks and applying rejection sampling blockwise based on a scalar reward model (Quamar et al., 5 Nov 2025). Each block is accepted with probability

$$\alpha_k = \min\Bigl\{1,\ \exp\Bigl(\frac{r(x, y^{(k-1)} \oplus s^{(k)}) - \tau_r(k)}{\beta}\Bigr)\Bigr\}$$

with an annealed threshold $\tau_r(k)$. STARS enables early error correction, prunes low-reward continuations, and can outperform SFT and DPO by up to $14.9$ and $4.3$ percentage points respectively in win-rate metrics, while matching strong Best-of-N baselines at much lower computational cost.
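The blockwise loop can be sketched as below. This is a simplified illustration under stated assumptions: `propose_fn`, `reward_fn`, and `tau_fn` are hypothetical stand-ins for the segment sampler, the scalar reward model on the prefix-plus-candidate, and the annealing schedule, and the forced accept after `max_tries` is one possible fallback:

```python
import math
import random

def accept_block(r_block, tau_r, beta=1.0, rng=random):
    """Accept a candidate segment with probability
    min(1, exp((r - tau_r(k)) / beta))."""
    alpha = min(1.0, math.exp((r_block - tau_r) / beta))
    return rng.random() < alpha

def stars_generate(propose_fn, reward_fn, n_blocks, tau_fn,
                   beta=1.0, max_tries=8, rng=random):
    """Blockwise generation: resample each segment until accepted,
    forcing acceptance after max_tries proposals (a simplification)."""
    out = []
    for k in range(1, n_blocks + 1):
        for _ in range(max_tries):
            seg = propose_fn(out)  # sample next segment given the prefix
            if accept_block(reward_fn(out, seg), tau_fn(k), beta, rng):
                break
        out.append(seg)
    return out
```

Because low-reward segments are rejected as soon as they are proposed, errors are pruned before the full sequence is generated, which is the source of the compute advantage over full-sequence Best-of-N reranking.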

4. Rejection Sampling in Fine-Tuning: RFT and RIFT

In Rejection Sampling Fine-Tuning (RFT), candidate responses are generated and filtered by a scalar reward, retaining only those surpassing a threshold for standard SFT (Liu et al., 14 Jan 2026). RFT is limited by data inefficiency, as negative samples are fully discarded.

Reward-Informed Fine-Tuning (RIFT) addresses this with a stabilized signed-weighted objective:

$$\mathcal{L}_\mathrm{RIFT}(\theta) = -\mathbb{E}_{(x, y)\in\mathcal{D}^+}\bigl[r(x, y) \log \pi_\theta(y|x)\bigr] - \mathbb{E}_{(x, y)\in\mathcal{D}^-}\bigl[r(x, y)\, \pi_\theta(y|x)\bigr],$$

where $\mathcal{D}^+$ and $\mathcal{D}^-$ are positive and negative reward sets, and the log-loss is replaced by a bounded surrogate for negatives. RIFT demonstrates superior data efficiency, stability, and accuracy relative to SFT, RFT, and DPO, notably raising Mean@8 and Pass@8 on mathematical reasoning datasets. RIFT's design allows generalization from both successful and failed trajectories while maintaining low VRAM requirements.
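A minimal scalar sketch of this objective, assuming samples in $\mathcal{D}^-$ carry negative rewards (so the bounded second term pushes probability mass away from failed trajectories); the representation of samples as (reward, probability) pairs is illustrative:

```python
import math

def rift_loss(pos, neg):
    """Stabilized signed-weighted objective (sketch): log-loss on
    positive-reward samples, bounded probability surrogate on
    negative-reward samples. pos/neg are lists of (reward, model_prob)."""
    l_pos = -sum(r * math.log(p) for r, p in pos) / max(len(pos), 1)
    l_neg = -sum(r * p for r, p in neg) / max(len(neg), 1)  # r < 0 here
    return l_pos + l_neg
```

Because the negative-sample term is linear in $\pi_\theta(y|x)$ rather than logarithmic, it stays bounded as the model's probability on a failed trajectory approaches zero, which is what stabilizes training relative to naively applying a signed log-loss.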

5. Empirical Comparisons and Ablation Analyses

Comparative results across multiple methods emphasize the tradeoffs between statistical rigor, data efficiency, computational cost, and final model quality.

| Method | MT-Bench (avg) | AlpacaEval (%) | IMDB Win-Rate (Δ pp vs SFT) | Mean@8 (Qwen2.5-1.5B) | Pass@8 (DeepSeek-R1) |
|---|---|---|---|---|---|
| SFT | 5.12 | 60.2 | | | |
| RS (1-step SFT) | 4.84 | 60.2 | | | |
| PPO | 5.22 | 69.2 | | | |
| DPO (human) | 5.26 | 65.3 | | | |
| RS-DPO | 5.49 | 74.17 | | | |
| STARS | | | +14.91 | | |
| RFT | | | | 25.6 | |
| RIFT | | | | 37.0 | 63.2 |

Ablation studies reveal that performance in rejection sampling-based pipelines is sensitive to RS thresholds ($\eta$), temperature parameters, and reward model robustness. For instance, RS-DPO’s performance is robust to weaker reward models, whereas PPO degrades. RIFT ablations show performance gains plateau at moderate rollout sizes, and constant negative rewards produce best results for bounded loss stability. STARS demonstrates that segment-level alignment saves compute and can outperform full-sequence reranking.

6. Advantages, Limitations, and Theoretical Considerations

Advantages of integrating rejection sampling with SFT include:

  • High data efficiency through diverse, on-policy preferences
  • Offline, stable preference optimization (DPO/hinge) with convex surrogates
  • Robustness to reward model quality and VRAM constraints in RIFT
  • No need for large quantities of static human annotation

However, limitations persist:

  • Reward model miscalibration on partial texts (a noted concern for STARS, which scores incomplete prefixes)
  • Forced-accept policies or fixed block sizes introduce biases
  • RFT discards valuable negative information, wasting compute

Theoretically, rejection sampling guarantees alignment with the desired reward-adjusted distribution, and under the Bradley–Terry model, preference learning losses admit MLE formulations. RSO unifies DPO/SLiC under a preference modeling framework parameterized by a convex surrogate $\ell$ and highlights the critical impact of where preference data are sourced (on-policy vs. off-policy).
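For reference, the Bradley–Terry model invoked above assigns the preference likelihood whose maximum-likelihood estimation these losses implement (standard form):

```latex
P(y_w \succ y_l \mid x)
  = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)
  = \frac{\exp r(x, y_w)}{\exp r(x, y_w) + \exp r(x, y_l)}.
```

Substituting the implicit reward $\beta \log \frac{\pi_\mathrm{RL}(y|x)}{\pi_\mathrm{sft}(y|x)}$ for $r(x, y)$ in this likelihood is exactly what yields the DPO loss in Section 2.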

7. Prospects and Future Directions

Recent developments suggest a reorientation of alignment practices toward more principled, data-efficient, and computationally tractable frameworks. Future work may explore:

  • Improved reward modeling, particularly for partial or prefix texts (as noted for STARS)
  • Adaptive and uncertainty-aware acceptance thresholds
  • Integration of segmentation and rejection in training-time pipelines
  • Mix-and-match sampling strategies, e.g., Metropolis–Hastings or SMC, for further alignment improvements

Overall, the interplay of rejection sampling and SFT constitutes a flexible backbone for both scalable offline tuning and dynamic, inference-time alignment in LLMs, supported by strong empirical performance and a comprehensive theoretical foundation (Khaki et al., 2024, Quamar et al., 5 Nov 2025, Liu et al., 14 Jan 2026, Liu et al., 2023).
