Rejection Sampling & SFT in LLM Alignment
- Rejection Sampling and SFT are fundamental techniques that align large language models to human preferences using data-efficient and statistically rigorous methods.
- Hybrid pipelines like RS-DPO, RSO, and STARS integrate rejection sampling with fine-tuning to generate high-quality preference data and improve benchmark performance.
- Empirical analyses show that combining these methods enhances model alignment, data efficiency, and overall performance across various evaluation metrics.
Rejection sampling and supervised fine-tuning (SFT) are foundational components in the alignment and adaptation of LLMs to human preferences and downstream tasks. Their interplay underpins a range of state-of-the-art approaches, including prominent methods such as statistical rejection sampling optimization (RSO), hybrid pipelines like RS-DPO, and segment-level alignment strategies exemplified by STARS. This article provides a comprehensive overview of these methodologies, focusing on the technical mechanisms that connect rejection sampling with SFT, their role in modern preference optimization frameworks, and comparative empirical results across benchmark datasets.
1. Foundations of Supervised Fine-Tuning and Rejection Sampling
Supervised fine-tuning (SFT) refers to the process of adapting a pretrained LLM to high-quality data by maximizing the conditional log-likelihood of reference responses, $\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log \pi_\theta(y \mid x)\right]$. SFT produces a policy $\pi_{\text{SFT}}$ that approximates the human demonstration distribution but does not directly optimize for reward functions that encode user preference or task success (Khaki et al., 2024).
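As a concrete illustration, the SFT objective reduces to an average negative log-likelihood over reference tokens under teacher forcing. A minimal sketch (the per-token probabilities here are hypothetical, not from any real model):

```python
import math

def sft_nll(token_probs):
    """Negative log-likelihood of a reference response, given the
    probability the policy assigns to each reference token
    (teacher forcing). Minimizing this maximizes log-likelihood."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities the policy assigns to three reference tokens.
loss = sft_nll([0.9, 0.7, 0.8])
```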
Rejection sampling is a statistical procedure for biasing output distributions by selectively accepting or rejecting candidates according to a target distribution. In RLHF and preference optimization, rejection sampling is typically applied to candidates drawn from the SFT policy $\pi_{\text{SFT}}$ to form preference datasets, where accepted/rejected samples are determined by a reward model (Liu et al., 2023).
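A minimal sketch of this accept/reject step, using a stand-in reward function (response length, purely illustrative) in place of a learned reward model:

```python
def toy_reward(response: str) -> float:
    # Stand-in for a learned reward model; here, longer == better (assumption).
    return float(len(response))

def filter_by_reward(candidates, threshold):
    """Accept candidates whose reward clears the threshold; reject the rest.
    Accepted/rejected sets can then seed a preference dataset."""
    accepted, rejected = [], []
    for y in candidates:
        (accepted if toy_reward(y) >= threshold else rejected).append(y)
    return accepted, rejected

accepted, rejected = filter_by_reward(
    ["ok", "a longer, more detailed answer", "short"], threshold=5.0)
```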
2. Hybrid Preference Optimization: RS-DPO
The RS-DPO (Rejection Sampling + Direct Preference Optimization) pipeline systematically integrates SFT, rejection sampling, and DPO as follows (Khaki et al., 2024):
- Supervised Fine-Tuning: Start from a pretrained LLM (e.g., Llama-2-7B) and apply SFT on paired prompt–response data (e.g., the OASST1 dataset; 9,000 examples) to yield a policy $\pi_{\text{SFT}}$, with the reported hyperparameters (batch size $64$, $2$ epochs, weight decay $0.1$, sequence length $4096$). No LoRA adapters are used in this SFT phase.
- Synthetic Preference Bootstrapping via RS: For each prompt $x$, sample $k$ responses from $\pi_{\text{SFT}}$ using multinomial decoding (top-k $50$, top-p $0.98$, temperature $1.0$, max tokens $512$). Each response is evaluated with a reward model $r$ (pythia-6.9B-RM, trained with a Bradley–Terry loss). For each ordered pair $(y_i, y_j)$, compute the normalized reward gap $\Delta_{ij} = \sigma\big(r(x, y_i) - r(x, y_j)\big)$. Pairs whose gap exceeds a threshold $\gamma$ (e.g., $\gamma = 0.85$) are accepted as contrastive preference triples $(x, y_w, y_l)$ and form a synthetic preference set $\mathcal{D}_{\text{RS}}$.
- Direct Preference Optimization: Apply DPO on the rejection-sampled preference set, optimizing $\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$, with KL-regularization strength $\beta$ and LoRA rank $8$ (in DPO, not SFT) for GPU efficiency.
The RS-DPO pipeline efficiently aligns LLMs by generating diverse, on-policy preference pairs and optimizing them with a stable convex surrogate loss, outperforming standalone RS, PPO, and DPO in benchmarks such as MT-Bench and AlpacaEval.
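The rejection-sampling step of this pipeline can be sketched as follows; the scores stand in for reward-model outputs, and the sigmoid-normalized gap mirrors the $0.85$ cutoff described above (all names are illustrative):

```python
import math
from itertools import permutations

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rs_preference_pairs(scored, gamma=0.85):
    """Form (chosen, rejected) pairs whose sigmoid-normalized reward gap
    clears gamma, mirroring the RS step of RS-DPO (sketch)."""
    pairs = []
    for (y_i, r_i), (y_j, r_j) in permutations(scored, 2):
        if sigmoid(r_i - r_j) >= gamma:
            pairs.append((y_i, y_j))  # y_i preferred over y_j
    return pairs

# Hypothetical responses with reward-model scores.
pairs = rs_preference_pairs([("A", 3.0), ("B", 0.5), ("C", 0.4)])
```

Only pairs with a decisive reward gap survive, so near-ties (e.g., B vs. C above) contribute no noisy preference labels to the DPO stage.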
3. Statistical and Decoding-Time Rejection Sampling: RSO and STARS
Statistical Rejection Sampling Optimization (RSO)
RSO generalizes the sampling of preference pairs by aiming to match the optimal KL-regularized policy $\pi^{*}(y \mid x) \propto \pi_{\text{SFT}}(y \mid x)\, \exp\!\left(r(x, y)/\beta\right)$, with a learned reward model $r$ and a regularization strength $\beta$ (Liu et al., 2023). Rejection sampling is structured such that a candidate $y \sim \pi_{\text{SFT}}$ is accepted with probability $p_{\text{accept}}(y) = \exp\!\left((r(x, y) - r_{\max})/\beta\right)$, where $r_{\max}$ is the maximum reward over the candidate pool. Accepted samples are used to construct preference datasets for downstream optimization with DPO or hinge losses.
RSO produces higher-quality on-policy preference data than direct use of SFT or static annotation, resulting in improved models across summarization and dialogue tasks according to multiple preference win-rate metrics and human evaluation.
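A minimal sketch of reward-tilted rejection sampling in the spirit of RSO, assuming the exponential acceptance rule $\exp((r - r_{\max})/\beta)$ (the helper names and filtering loop are our own illustration, not the paper's implementation):

```python
import math
import random

def rso_accept_probability(reward: float, r_max: float, beta: float) -> float:
    """Acceptance probability for sampling toward the reward-tilted target
    pi_sft(y|x) * exp(r(x, y) / beta); subtracting r_max normalizes so the
    best-scoring candidate is accepted with probability 1 (sketch)."""
    return math.exp((reward - r_max) / beta)

def rso_filter(scored, beta=1.0, rng=None):
    """Keep each (response, reward) candidate with its acceptance probability."""
    rng = rng or random.Random(0)
    r_max = max(r for _, r in scored)
    return [y for y, r in scored
            if rng.random() < rso_accept_probability(r, r_max, beta)]
```

Smaller $\beta$ concentrates acceptance on the highest-reward candidates; larger $\beta$ keeps the accepted set closer to the raw SFT distribution.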
Segment-level Token Alignment with Rejection Sampling (STARS)
STARS operates at inference time, segmenting generation into fixed-size blocks and applying rejection sampling blockwise based on a scalar reward model (Quamar et al., 5 Nov 2025). Each candidate block is accepted with a probability governed by its scalar reward relative to an annealed acceptance threshold $\tau_t$; rejected blocks are resampled. STARS enables early error correction, prunes low-reward continuations, and can outperform SFT and DPO by up to $14.9$ and $4.3$ percentage points respectively on win-rate metrics, while matching strong Best-of-N baselines at much lower computational cost.
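A decoding-time sketch of segment-level rejection in this spirit; the proposal function, reward, annealing schedule, and retry budget are all placeholder assumptions, since the paper's exact acceptance rule is not reproduced here:

```python
import random

def generate_blockwise(propose_block, reward, n_blocks,
                       tau0=0.5, anneal=0.9, max_tries=8, rng=None):
    """Sketch of segment-level rejection sampling: resample each fixed-size
    block until its reward clears an annealed threshold tau_t. After
    max_tries the last proposal is force-accepted (a known source of bias)."""
    rng = rng or random.Random(0)
    out, tau = [], tau0
    for _ in range(n_blocks):
        for _ in range(max_tries):
            block = propose_block(out, rng)
            if reward(out, block) >= tau:
                break  # accept: low-reward continuations are pruned early
        out.append(block)
        tau *= anneal  # anneal the acceptance threshold over positions
    return out
```

Because each block is vetted before the next is proposed, errors are corrected early instead of being discovered only after a full sequence is scored.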
4. Rejection Sampling in Fine-Tuning: RFT and RIFT
In Rejection Sampling Fine-Tuning (RFT), candidate responses are generated and filtered by a scalar reward, retaining only those surpassing a threshold for standard SFT (Liu et al., 14 Jan 2026). RFT is limited by data inefficiency, as negative samples are fully discarded.
Reward-Informed Fine-Tuning (RIFT) addresses this with a stabilized signed-weighted objective of the form $\mathcal{L}_{\text{RIFT}}(\theta) = -\sum_{y \in \mathcal{P}} w(y)\, \log \pi_\theta(y \mid x) + \sum_{y \in \mathcal{N}} |w(y)|\, \ell_{\text{bnd}}(y)$, where $\mathcal{P}$ and $\mathcal{N}$ are the positive- and negative-reward sets and the log-loss is replaced by a bounded surrogate $\ell_{\text{bnd}}$ for negatives, so that negative trajectories are suppressed without unbounded gradients. RIFT demonstrates superior data efficiency, stability, and accuracy relative to SFT, RFT, and DPO, notably raising Mean@8 and Pass@8 on mathematical reasoning datasets. RIFT's design allows generalization from both successful and failed trajectories while maintaining low VRAM requirements.
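A sketch of a signed-weighted loss with a bounded surrogate on negatives, in the spirit of RIFT; the clipping floor, weighting scheme, and function names are illustrative assumptions:

```python
def bounded_loglik(logp: float, floor: float = -5.0) -> float:
    """Bounded surrogate for log pi on negatives: once the log-probability
    falls below `floor`, the term (and its gradient) saturates (sketch)."""
    return max(logp, floor)

def rift_loss(positives, negatives):
    """Signed-weighted objective over positive/negative reward sets.
    Each item is (log_prob, weight); weights derive from rewards."""
    loss = 0.0
    for logp, w in positives:
        loss -= w * logp                       # maximize likelihood of positives
    for logp, w in negatives:
        loss += abs(w) * bounded_loglik(logp)  # suppress negatives, but bounded
    return loss
```

The bound is what distinguishes this from naively negating the log-loss: already-improbable negatives stop contributing gradient, which is the stability property the text attributes to RIFT.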
5. Empirical Comparisons and Ablation Analyses
Comparative results across multiple methods emphasize the tradeoffs between statistical rigor, data efficiency, computational cost, and final model quality.
| Method | MT-Bench (avg) | AlpacaEval (%) | IMDB Win-Rate (Δ pp vs SFT) | Mean@8 (Qwen2.5-1.5B) | Pass@8 (DeepSeek-R1) |
|---|---|---|---|---|---|
| SFT | 5.12 | 60.2 | — | — | — |
| RS (1-step SFT) | 4.84 | 60.2 | — | — | — |
| PPO | 5.22 | 69.2 | — | — | — |
| DPO (human) | 5.26 | 65.3 | — | — | — |
| RS-DPO | 5.49 | 74.17 | — | — | — |
| STARS | — | — | +14.91 | — | — |
| RFT | — | — | — | 25.6 | — |
| RIFT | — | — | — | 37.0 | 63.2 |
Ablation studies reveal that performance in rejection sampling-based pipelines is sensitive to the RS acceptance threshold, temperature parameters, and reward model robustness. For instance, RS-DPO's performance is robust to weaker reward models, whereas PPO degrades. RIFT ablations show performance gains plateau at moderate rollout sizes, and constant negative rewards produce the best results for bounded-loss stability. STARS demonstrates that segment-level alignment saves compute and can outperform full-sequence reranking.
6. Advantages, Limitations, and Theoretical Considerations
Advantages of integrating rejection sampling with SFT include:
- High data efficiency through diverse, on-policy preferences
- Offline, stable preference optimization (DPO/hinge) with convex surrogates
- Robustness to reward model quality and VRAM constraints in RIFT
- No need for large quantities of static human annotation
However, limitations persist:
- Reward-model miscalibration on partial texts (a noted concern for STARS)
- Forced-accept policies or fixed block sizes introduce biases
- RFT discards valuable negative information, wasting compute
Theoretically, rejection sampling guarantees alignment with the desired reward-adjusted distribution, and under Bradley–Terry, preference learning losses admit MLE formulations. RSO unifies DPO/SLiC under a preference modeling framework parameterized by a convex surrogate and highlights the critical impact of where preference data are sourced (on-policy vs. off-policy).
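The Bradley–Terry MLE formulation referenced here can be written explicitly:

```latex
% Bradley–Terry model: probability that y_w is preferred over y_l
P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)

% Maximum-likelihood fit of the reward over observed preference triples
\max_r \; \sum_{(x,\, y_w,\, y_l)} \log \sigma\big(r(x, y_w) - r(x, y_l)\big)
```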
7. Prospects and Future Directions
Recent developments suggest a reorientation of alignment practices toward more principled, data-efficient, and computationally tractable frameworks. Future work may explore:
- Improved reward modeling, particularly for partial or prefix texts (as noted for STARS)
- Adaptive and uncertainty-aware acceptance thresholds
- Integration of segmentation and rejection in training-time pipelines
- Mix-and-match sampling strategies, e.g., Metropolis–Hastings or SMC, for further alignment improvements
Overall, the interplay of rejection sampling and SFT constitutes a flexible backbone for both scalable offline tuning and dynamic, inference-time alignment in LLMs, supported by strong empirical performance and a comprehensive theoretical foundation (Khaki et al., 2024, Quamar et al., 5 Nov 2025, Liu et al., 14 Jan 2026, Liu et al., 2023).