Iterative Rejection-Sampling SFT

Updated 22 October 2025
  • Iterative rejection-sampling SFT is a method that uses statistical rejection sampling to filter high-reward outputs for improved model alignment.
  • It integrates reward modeling, inverse reinforcement learning, and iterative finetuning to overcome limitations of traditional SFT and RLHF methods.
  • Empirical results demonstrate enhanced stability, convergence, and higher human evaluation scores by prioritizing high-quality, on-policy data.

Iterative Rejection-Sampling Supervised Finetuning (SFT) is a paradigm for aligning LLMs with human preferences that extends standard supervised fine-tuning with statistical mechanisms, primarily rejection sampling against a reward model, to iteratively curate higher-quality training data. The methodology addresses core limitations of standard SFT, of preference-optimization approaches such as Direct Preference Optimization (DPO) and Sequence Likelihood Calibration (SLiC), and of online reinforcement learning from human feedback (RLHF) strategies such as Proximal Policy Optimization (PPO), improving stability, scalability, and convergence to optimal policies (Liu et al., 2023, Mukobi et al., 2023, Khaki et al., 15 Feb 2024, Ni et al., 30 May 2024, Xie et al., 20 Aug 2024, Ye et al., 14 Jan 2025, Sotiropoulos et al., 4 Jun 2025, Harada et al., 17 Jun 2025, Qin et al., 17 Jul 2025).

1. Foundational Principles and Motivation

Traditional SFT uses cross-entropy loss on curated response datasets to fit the model to human-preferred behaviors. However, SFT samples typically come from suboptimal distributions and lack an explicit mechanism for preference ranking or reward-driven selection. Preference-based approaches (DPO and SLiC) improve upon this by leveraging pairwise preferences, but are constrained to sampled distributions drawn either from mixed sources (DPO) or strictly from the SFT policy (SLiC), limiting their ability to approximate the target optimal policy $\pi^*$. RLHF with PPO attempts to optimize the LLM with reward-driven updates, but is often unstable and computationally intensive.

Iterative rejection-sampling SFT incorporates a reward estimator to score candidate outputs from the current policy and filters those candidates through a statistical acceptance criterion. Responses more likely to have been generated by the optimal policy (as indicated by high relative reward) are preferentially retained; rejected samples are discarded. The model is then finetuned on these higher-quality, “on-policy” samples, and the loop repeats, bootstrapping SFT toward improved alignment and stability (Liu et al., 2023, Ni et al., 30 May 2024).
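
The overall loop can be summarized in a short sketch. The callables `generate`, `score`, `select`, and `finetune` are hypothetical stand-ins for an LLM sampler, a reward model, the rejection-sampling filter of Section 2, and a standard SFT trainer; only the control flow is meant to reflect the description above.

```python
def iterative_rs_sft(policy, prompts, generate, score, select, finetune,
                     rounds=3, num_candidates=8):
    """Sketch of the iterative rejection-sampling SFT loop.

    Assumed interfaces (not from any specific library):
      generate(policy, x)    -> one sampled completion for prompt x
      score(x, y)            -> scalar reward from a reward model
      select(scored)         -> subset of (y, r) pairs kept by rejection
                                sampling (one realization is in Section 2)
      finetune(policy, data) -> policy after one round of cross-entropy SFT
    """
    for _ in range(rounds):
        dataset = []
        for x in prompts:
            # Sample a pool of candidates from the current policy and score them.
            candidates = [generate(policy, x) for _ in range(num_candidates)]
            scored = [(y, score(x, y)) for y in candidates]
            # Keep only the accepted, high-reward completions for this round.
            dataset += [(x, y) for y, _ in select(scored)]
        # Standard SFT on the filtered, approximately on-policy data.
        policy = finetune(policy, dataset)
    return policy
```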

2. Methodology of Statistical Rejection Sampling

The central mechanism is the use of a reward-ranking model $r(x, y)$ to filter candidate completions $y$ given a prompt $x$. Formally, given an SFT proposal distribution $\pi_{\mathrm{sft}}(y|x)$, a batch of candidates is drawn; each candidate is scored, and the acceptance probability is computed via:

$$P_\mathrm{accept}(x, y) = \exp\left(\frac{r(x, y) - r_\mathrm{max}}{\beta}\right)$$

where $r_\mathrm{max}$ is the maximum reward in the candidate pool and $\beta$ is a temperature hyperparameter controlling the trade-off between exploration and exploitation. This acceptance step is inspired by classic rejection sampling and reweights the empirical data distribution toward the unknown target optimal policy $\pi^*$, as defined by:

$$\pi_r(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{sft}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

where $Z(x)$ is the partition function for normalization (Liu et al., 2023). Iteratively performing this process with updated reward models yields a dataset that is increasingly representative of the desired alignment target.
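
As a minimal illustration of the acceptance step, the following sketch applies the formula above to a single prompt's candidate pool; the function name and the example values are illustrative, not taken from any cited implementation.

```python
import math
import random

def rejection_sample(candidates, rewards, beta=0.5):
    """Keep candidate y with probability exp((r(x, y) - r_max) / beta).

    `candidates` and `rewards` are parallel lists for a single prompt; the
    top-reward candidate is always accepted, and lower-reward candidates
    survive with exponentially decreasing probability controlled by beta.
    """
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        if random.random() < math.exp((r - r_max) / beta):
            accepted.append(y)
    return accepted

# Example: with beta = 0.5, a candidate scoring 0.7 below the pool maximum is
# kept with probability exp(-0.7 / 0.5) ~= 0.25.
print(rejection_sample(["a", "b", "c"], [1.0, 0.3, 0.9], beta=0.5))
```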

3. Loss Function Unification and Preference Modeling

Both SLiC and DPO can be expressed as models fitting the Bradley-Terry preference probabilities, but differ in their loss functions:

  • DPO employs a sigmoid-normalized (logistic) loss:

$$\mathcal{L}_\mathrm{sigmoid-norm} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_p} \log \sigma\left(\gamma \log \frac{\pi_\theta(y_w|x)}{\pi_\theta^{(\mathrm{ref})}(y_w|x)} - \gamma \log \frac{\pi_\theta(y_l|x)}{\pi_\theta^{(\mathrm{ref})}(y_l|x)}\right)$$

  • SLiC utilizes a hinge-based loss:

$$\mathcal{L}_\mathrm{hinge} = \mathbb{E} \left[\max \left(0,\, 1 - \left(\gamma \log \pi_\theta(y_w|x) - \gamma \log \pi_\theta(y_l|x)\right)\right)\right]$$

Unified frameworks of this kind allow normalized hinge losses and improved calibration to be applied to either preference-modeling formulation (Liu et al., 2023). This loss unification supports iterative rejection-sampling SFT by providing a principled objective for optimizing on the filtered dataset in every round.
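
A minimal sketch of the two losses follows, assuming per-sequence log-probabilities have already been computed (e.g., summed token log-probs under the policy and a frozen reference model); the function names and example tensors are illustrative rather than drawn from any specific DPO or SLiC codebase.

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, gamma=0.1):
    """Sigmoid-normalized (DPO-style) loss from per-sequence log-probs.

    logp_w / logp_l:         log pi_theta(y_w|x), log pi_theta(y_l|x)
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference.
    """
    margin = gamma * (logp_w - ref_logp_w) - gamma * (logp_l - ref_logp_l)
    return -F.logsigmoid(margin).mean()

def slic_hinge_loss(logp_w, logp_l, gamma=0.1):
    """Hinge (SLiC-style) loss; no reference-policy term."""
    margin = gamma * logp_w - gamma * logp_l
    return torch.clamp(1.0 - margin, min=0.0).mean()

# Usage: 1-D tensors of summed token log-probabilities for the preferred (y_w)
# and dispreferred (y_l) responses in a batch.
logp_w, logp_l = torch.tensor([-12.3]), torch.tensor([-15.1])
ref_w, ref_l = torch.tensor([-12.8]), torch.tensor([-14.9])
print(dpo_sigmoid_loss(logp_w, logp_l, ref_w, ref_l),
      slic_hinge_loss(logp_w, logp_l))
```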

4. Integrating Reward Learning and Inverse Reinforcement Learning

Recent work shows that SFT substantially benefits from integrating reward learning and inverse reinforcement learning (IRL) even in the presence of only positive human demonstrations (Li et al., 28 May 2024). In IRL-based SFT, the learning process is formulated as a bilevel optimization:

  • Upper level: maximize the likelihood of expert demonstrations.
  • Lower level: derive a policy $\pi$ that balances maximizing $r(x, y)$ against KL regularization to a reference model:

$$\max_r \min_\pi \mathbb{E}_{x,\, y \sim \pi^E,\, \tilde{y} \sim \pi} \left[ \frac{r(x, y) - r(x, \tilde{y})}{\beta} + D_\mathrm{KL}\left(\pi(\cdot|x)\,\|\,\pi_\mathrm{ref}(\cdot|x)\right) \right]$$

Contrasting human demonstrations and synthetic alternatives by reward difference enables penalization of off-distribution outputs, enhancing robustness and alignment—even when direct preference labels are unavailable. This approach has convergence guarantees and connections to self-play fine-tuning (SPIN), further improving iterative rejection-sampling SFT strategies.
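The inner objective can be rendered schematically as follows; the inputs (demonstration rewards, policy-sample rewards, and log-probabilities for a sample-based KL estimate) and the function name are assumptions made for illustration, and the alternation of reward and policy updates around this quantity is left implicit.

```python
import torch

def irl_inner_objective(r_demo, r_policy, logp_policy, logp_ref, beta=0.5):
    """Monte-Carlo estimate of the bilevel inner objective above.

    r_demo:      rewards r(x, y) of human demonstrations y ~ pi^E
    r_policy:    rewards r(x, y~) of policy samples y~ ~ pi
    logp_policy: log pi(y~|x) for the same policy samples
    logp_ref:    log pi_ref(y~|x) for the same samples
    The reward gap pushes r up on demonstrations and down on policy samples,
    while the sample-based KL term keeps pi close to the reference model.
    """
    reward_gap = (r_demo.mean() - r_policy.mean()) / beta
    kl_estimate = (logp_policy - logp_ref).mean()  # E_pi[log pi - log pi_ref]
    return reward_gap + kl_estimate
```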

5. Experimental Performance and Practical Considerations

Empirical evaluations across diverse tasks (Reddit TL;DR summarization, Anthropic HH dialogues, CNN/DailyMail, and benchmarking on AlpacaEval and MT-Bench) show that iterative rejection-sampling SFT consistently achieves superior alignment and stability compared to SFT, RLHF/PPO, DPO, and SLiC baselines. Notably, RSO variants often improve proxy reward rates, LLM-based side-by-side win rates, and human evaluation scores. These gains are attributed to the use of high-reward-filtered data and iterative bootstrapping of policy updates (Liu et al., 2023, Ni et al., 30 May 2024).

Resource requirements include:

  • Additional computation for reward filtering (often needing a superbatch of candidate completions per prompt; e.g., up to 64 candidates in some pipelines).
  • Increased wall-clock time in exchange for stability and reduced hyperparameter tuning overhead (e.g., SuperHF observed a 6× increase relative to RLHF/PPO) (Mukobi et al., 2023).
  • Implementation simplicity, especially when compared to RLHF, since the pipeline reduces reliance on unstable and intricate RL updates.

6. Extensions and Variants: Data Quality and Iterative Label Refinement

Further work extends iterative rejection-sampling SFT to crowd-sourced feedback frameworks and label refinement regimes. For example, multi-model selection and point-based reward systems have been introduced in crowd-SFT to enhance fairness and scalability, leveraging iterative candidate model selection as a form of rejection sampling (Sotiropoulos et al., 4 Jun 2025). Simultaneously, iterative label refinement (ILR) demonstrates that instead of using comparison preference feedback to directly tune models (as in RLHF/DPO), feedback should be used to iteratively update the training labels themselves, yielding robustness against unreliable supervision (Ye et al., 14 Jan 2025).

In both cases, only the “best” candidates (either model variants or data samples) are retained for subsequent fine-tuning rounds, accelerating convergence and improving alignment quality over naive SFT or unstable RLHF alternatives.
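A minimal sketch of an ILR-style label update, assuming a comparison oracle `compare` (a stand-in for possibly unreliable human or model feedback) and generic `generate`/`finetune` callables; it illustrates the idea of feedback editing the SFT labels rather than the model directly, not the exact pipeline of the cited work.

```python
def iterative_label_refinement(dataset, policy, generate, compare, finetune,
                               rounds=3):
    """Sketch of ILR-style label refinement under comparison feedback.

    dataset:  list of (prompt, label) pairs used as SFT targets
    generate: callable (policy, x) -> candidate replacement label
    compare:  callable (x, y_new, y_old) -> True if y_new is preferred
    finetune: callable (policy, dataset) -> policy after one SFT round
    """
    for _ in range(rounds):
        refined = []
        for x, y_label in dataset:
            y_model = generate(policy, x)
            # Keep whichever label the (possibly noisy) judge prefers.
            y_best = y_model if compare(x, y_model, y_label) else y_label
            refined.append((x, y_best))
        dataset = refined                   # feedback edits the labels...
        policy = finetune(policy, dataset)  # ...and SFT consumes them
    return policy
```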

7. Theoretical Insights and Further Directions

Iterative rejection-sampling SFT can be interpreted through the lens of RL theory. SFT on curated data optimizes a lower bound of the RL objective in a sparse reward setting. Importance-weighted SFT (iw-SFT) tightens this bound by introducing auxiliary reweighting distributions:

$$\mathcal{J}_\mathrm{iw-SFT}(\theta) = \mathbb{E}_{\tau \in D^+} \left[ \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)} \log p(\tau;\theta) \right]$$

where $q(\tau)$ adaptively tracks the evolving policy, and per-step clipping or temperature-scaled smoothing mitigates variance in the importance weights. Empirical performance of iw-SFT on reasoning benchmarks shows competitive or superior results compared to advanced RL-based methods, with straightforward implementation via a modified maximum-likelihood loss (Qin et al., 17 Jul 2025).
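
A sketch of the importance-weighted SFT loss with clipped per-trajectory weights, assuming trajectory-level log-probabilities under q, the reference policy, and the current model are available; the clipping threshold is an illustrative choice rather than a value from the cited paper.

```python
import torch

def iw_sft_loss(logp_theta, logq, logp_ref, clip=5.0):
    """Importance-weighted SFT loss on positively-rewarded trajectories.

    logp_theta: log p(tau; theta) for each curated trajectory (to maximize)
    logq:       log q(tau) under the auxiliary reweighting distribution
    logp_ref:   log pi_ref(tau) under the reference policy
    The weights q(tau)/pi_ref(tau) are clipped to limit variance and treated
    as constants with respect to theta (hence .detach()).
    """
    weights = torch.exp(logq - logp_ref).detach().clamp(max=clip)
    return -(weights * logp_theta).mean()
```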

Iterative rejection-sampling SFT benefits from prioritizing low-perplexity samples during selection (as perplexity is the strongest predictor of model improvement), with mid-layer network updates most highly correlated with downstream accuracy gains (Harada et al., 17 Jun 2025).
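
One simple realization of this prioritization, under the assumption of a `sequence_logprob` helper that returns the summed token log-probability under the current policy, is to rank candidates by perplexity and keep the lowest-perplexity fraction:

```python
import math

def keep_low_perplexity(samples, sequence_logprob, keep_ratio=0.5):
    """samples: list of (prompt, response, token_count) tuples.

    Perplexity = exp(-mean token log-prob) under the current policy; the
    lowest-perplexity samples are retained for the next SFT round.
    """
    def ppl(x, y, n_tokens):
        return math.exp(-sequence_logprob(x, y) / n_tokens)

    ranked = sorted(samples, key=lambda s: ppl(*s))
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```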

Ongoing research directions include:

  • Adaptive hyperparameter scheduling for reward and rejection criteria.
  • Integrating dynamic sample-level coefficients to minimize unwanted deviation while increasing training effectiveness (Xie et al., 20 Aug 2024).
  • Scaling frameworks for broader annotation, improved fairness, and efficient performance in distributed or resource-constrained settings (Sotiropoulos et al., 4 Jun 2025).
  • Refinements in label replacement and hybrid strategies that combine ILR with RL-based optimization under weak or noisy supervision (Ye et al., 14 Jan 2025).

In summary, iterative rejection-sampling supervised finetuning unifies and extends contemporary LLM alignment approaches, integrating statistically principled filtering, reward modeling, and iterative optimization to robustly, efficiently, and transparently improve model alignment with human preferences. The methodology is theoretically grounded, empirically validated, and readily extensible, marking it as a key development in practical LLM training under realistic supervision constraints.
