Rejection Sampling Fine-Tuning (RFT)
- Rejection Sampling Fine-Tuning is a family of methodologies that refine classical rejection sampling to improve sample efficiency and computational performance.
- It employs adaptive mechanisms such as data augmentation, gradient-refined proposals, and optimal bit-revelation to reduce rejection rates and enhance sampling quality.
- RFT underpins advancements in Bayesian inference, generative modeling, and self-improving language models by effectively optimizing resource usage and downstream utility.
Rejection Sampling Fine-Tuning (RFT) refers to a set of methodologies that systematically enhance the efficacy, efficiency, and applicability of rejection sampling in statistical inference, generative modeling, and large-scale machine learning. While classical rejection sampling is simple but often inefficient—especially for complex, high-dimensional, or structure-constrained distributions—RFT approaches introduce data augmentations, adaptive mechanisms, statistical optimization, algorithmic improvements, or downstream synergy with learning procedures to realize more robust and resource-efficient sampling and learning.
1. Core Principles of Rejection Sampling and Its Fine-Tuning
Rejection sampling generates samples from a target density $p$ (often known only up to normalization, $p \propto \tilde{p}$) by drawing candidates from a proposal $q$ and accepting a candidate $x$ with probability $\tilde{p}(x) / (M\, q(x))$, for a suitable envelope constant $M$ satisfying $\tilde{p}(x) \le M\, q(x)$ for all $x$ (a minimal sketch of the classical sampler appears at the end of this section). Rejection Sampling Fine-Tuning seeks to improve one or more of:
- Sample efficiency: reducing redundancies, improving acceptance probability, or leveraging bits/queries optimally (Langevin et al., 29 Sep 2025, Achdou et al., 2018, Raff et al., 2023).
- Downstream utility: augmenting data or learning proposals for inference or generative modeling (e.g., Bayesian inference for doubly intractable models (Rao et al., 2014), reinforcement learning of LLMs (Yuan et al., 2023, Ji et al., 4 Mar 2025, Lan et al., 17 Apr 2025, Koh et al., 22 May 2025, Huang et al., 2 Jul 2025)).
- Algorithmic generality and theoretical guarantees: such as minimax-optimal rates (Achdou et al., 2018), multi-level variance reduction (Warne et al., 2017), or optimal bit-revelation (Langevin et al., 29 Sep 2025).
- Statistical or preference optimization: aligning sampling or fine-tuning with targets defined by preferences, divergences, or reward signals (Sharma et al., 2019, Liu et al., 2023, Liu et al., 22 May 2025).
A unifying feature in RFT research is the explicit control and optimization of what gets retained, what gets discarded, and how the resulting information or data is used to further tune models or downstream algorithms.
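To ground the refinements above, the following is a minimal sketch of the classical sampler in Python, assuming user-supplied callables for the unnormalized target log-density, the proposal sampler and log-density, and a valid log envelope constant (all names are illustrative):

```python
import numpy as np

def rejection_sample(log_p_tilde, sample_q, log_q, log_M, n, rng=None):
    """Classical rejection sampling: draw x ~ q and accept it with probability
    p_tilde(x) / (M * q(x)), which is valid whenever p_tilde <= M * q everywhere."""
    rng = rng or np.random.default_rng()
    accepted = []
    while len(accepted) < n:
        x = sample_q(rng)
        log_accept = log_p_tilde(x) - (log_M + log_q(x))   # log acceptance probability, <= 0
        if np.log(rng.uniform()) < log_accept:
            accepted.append(x)
    return np.array(accepted)
```

Every RFT strategy below modifies some ingredient of this loop: the proposal $q$, the envelope $M$, the fate of the rejected draws, or the randomness consumed per accept/reject decision.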
2. Data Augmentation via Rejection Sampling: MCMC and Inference
In high-dimensional or doubly-intractable models, naively computing marginal likelihoods or normalization constants is often infeasible. Augmenting the observed data with the sequence of rejected proposals preceding every accepted sample, as in the method of (Rao et al., 2014), yields a tractable joint distribution:
$$p(x, y_1, \dots, y_r \mid \theta) \;=\; \left[\prod_{i=1}^{r} q(y_i \mid \theta)\left(1 - \frac{\tilde{f}(y_i \mid \theta)}{M\, q(y_i \mid \theta)}\right)\right] q(x \mid \theta)\, \frac{\tilde{f}(x \mid \theta)}{M\, q(x \mid \theta)},$$
where $y_1, \dots, y_r$ are the rejected proposals, $x$ is the accepted sample, $q$ is the proposal, $M$ is the envelope constant, and $\tilde{f}$ is the unnormalized target (a computational sketch of this augmented joint appears at the end of this section). This data augmentation:
- Eliminates the need for intractable normalization constants ($Z(\theta)$), since only the unnormalized target $\tilde{f}$ enters the augmented joint.
- Enables efficient MCMC and block-wise Gibbs updates, including for parameter regimes which are "doubly intractable".
- Empirically improves mixing rates and exploration efficiency in practical applications such as flow cytometry with truncation, Bayesian inference on matrix Langevin distributions, and nonparametric GP-modulated densities.
- Facilitates the use of gradient-based transition proposals, e.g., Hamiltonian Monte Carlo, by making the joint posterior fully differentiable.
This augmentation-based RFT can be seen as fundamentally altering the state space: it transforms an otherwise intractable marginal inference problem into a conditionally tractable (augmented) posterior, enabling conventional sampling and learning techniques to be applied effectively.
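To make the augmentation concrete, here is a minimal sketch (assuming the notation of the equation above; `log_f_tilde`, `log_q`, and `log_M` are illustrative callables and constants, not the authors' code) of the augmented log-joint that an MCMC or HMC sampler would target:

```python
import numpy as np

def augmented_log_joint(x, rejected, log_f_tilde, log_q, log_M):
    """Log density of (accepted sample x, rejected proposals y_1..y_r) under the
    rejection-sampling data augmentation: each rejected y_i contributes
    q(y_i) * (1 - f_tilde(y_i)/(M q(y_i))); the accepted x contributes f_tilde(x)/M."""
    log_p = 0.0
    for y in rejected:
        log_accept = log_f_tilde(y) - (log_M + log_q(y))      # log acceptance prob at y
        log_p += log_q(y) + np.log1p(-np.exp(log_accept))     # q(y) * (1 - accept prob)
    log_p += log_f_tilde(x) - log_M                           # q(x) * f_tilde(x) / (M q(x))
    return log_p
```

Only the unnormalized $\tilde{f}$ appears, so the intractable constant $Z(\theta)$ never has to be evaluated; and if `log_f_tilde` and `log_q` are differentiable, so is the augmented log-joint, which is what enables the gradient-based (e.g., HMC) updates mentioned above.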
3. Adaptive and Optimal Envelope Construction
Many RFT approaches focus directly on reducing rejection rates or optimizing sampling costs:
- Adaptive Envelopes: NNARS (Achdou et al., 2018) constructs a nonparametric, nearest-neighbor-based upper-bound envelope that iteratively tightens using all previously evaluated (accepted and rejected) samples. By carefully bounding the deviation of the estimate from the true density (leveraging Hölder continuity), the algorithm achieves a near minimax-optimal expected number of rejections, matching the known lower bound up to logarithmic factors, with a rate governed by the Hölder smoothness of the target and the problem dimension.
- Gradient-Refined Proposals: Instead of hand-designed proposals, gradient-based refinement of parameterized proposals (e.g., GMMs) directly minimizes the rejection rate, using a softmax-based relaxation so that the loss remains amenable to backpropagation (Raff et al., 2023). This approach yields substantially higher acceptance rates than state-of-the-art adaptive rejection sampling baselines in low dimensions, with the sole requirement that the (unnormalized) target density be differentiable (see the sketch at the end of this section).
- Bit-Efficient Sampling: (Langevin et al., 29 Sep 2025) analyzes bitwise rejection sampling for increasing densities, showing that an alternating "reveal and test" algorithm can settle the accept/reject decision, even in the worst case, after revealing a number of bits that scales near-optimally with the problem dimension, establishing strong theoretical bounds on the minimal randomness consumption needed for correct decisions.
Adaptive envelope tuning and optimal bit revelation represent dual frontiers in RFT: one reduces functional evaluation/sample complexity, the other optimizes the information-theoretic cost of the sampling process itself.
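The gradient-refined-proposal idea can be illustrated with a short sketch: a diagonal-Gaussian proposal is tuned by gradient descent on a log-sum-exp surrogate for the (log) envelope constant. This is a simplified stand-in under stated assumptions, not the ERS algorithm itself; the surrogate loss, `log_p_tilde`, and the optimizer settings are illustrative:

```python
import torch

def refine_proposal(log_p_tilde, mu, log_sigma, n_iter=500, n_pts=1024, lr=1e-2):
    """Tune a diagonal-Gaussian proposal q_theta = N(mu, diag(sigma^2)) so that a
    soft proxy for log M = log max_x p_tilde(x)/q_theta(x) shrinks, i.e. the
    acceptance rate grows.  logsumexp over sampled points replaces the hard max
    so the objective stays differentiable."""
    mu = mu.detach().clone().requires_grad_(True)
    log_sigma = log_sigma.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    for _ in range(n_iter):
        eps = torch.randn(n_pts, mu.shape[0])
        x = mu + eps * log_sigma.exp()                         # reparameterized draws from q_theta
        log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(x).sum(-1)
        ratio = log_p_tilde(x) - log_q                         # log p_tilde(x) - log q_theta(x)
        loss = torch.logsumexp(ratio, dim=0)                   # soft proxy for log M
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach(), log_sigma.exp().detach()
```

Here `log_p_tilde` is assumed to be a differentiable (torch) implementation of the unnormalized target; after refinement, the returned parameters define the proposal used by an ordinary rejection sampler.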
4. RFT in Bayesian Inference, Generative Modeling, and Learning
RFT strategies now pervade statistical learning and generative modeling:
- RDVI and RFT Hybridization: By minimizing the Rényi $\alpha$-divergence $D_\alpha(p \,\|\, q_\phi)$ between the target $p$ and the variational approximation $q_\phi$, and exploiting the observation that in the limit $\alpha \to \infty$ the divergence reduces to $\log \sup_x p(x)/q_\phi(x) = \log M^*$ (with $M^*$ the optimal rejection constant), variational proposals well suited to rejection sampling are obtained (Sharma et al., 2019). A two-stage procedure first learns $q_\phi$ via RDVI, then samples using rejection sampling with the tuned constant $M^*$, yielding posteriors that capture multi-modality and heavy tails more accurately than RDVI alone.
- Preference and Reward Optimization: In learning from preference data (human feedback, reward models), RFT frameworks such as Statistical Rejection Sampling Optimization (RSO) (Liu et al., 2023) generate preference pairs that more closely match the "optimal" policy by performing rejection sampling over candidate responses, thereby supporting more robust and scalable learning compared to Direct Preference Optimization or offline sequence likelihood approaches.
- Rejection Filtering / Particle Methods: Fusion of rejection sampling with particle filtering enables Bayesian inference in memory-constrained or online settings, with only summary statistics (mean, covariance) retained rather than all accepted samples (Wiebe et al., 2015). This enables effective online tracking and classification (e.g., in embedded systems or active learning settings).
- Generative Adversarial Models and Diffusion Models: Discriminator Rejection Sampling (Azadi et al., 2018) post-hoc filters GAN outputs using their critic's density ratio estimates, substantially enhancing fidelity and sample quality. Diffusion Rejection Sampling (Na et al., 28 May 2024) aligns reverse process transitions with true kernels using learned discriminators and rejection tests per timestep, yielding state-of-the-art FID while being computationally competitive with (and synergistic to) fast ODE-based samplers.
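As an illustration of the discriminator-based filtering idea, the following hedged sketch accepts generator outputs with probability proportional to the discriminator's density-ratio estimate, normalized by the batch maximum. It omits the shift/temperature adjustments of the published Discriminator Rejection Sampling procedure; array names are illustrative:

```python
import numpy as np

def discriminator_rejection_filter(samples, logits, rng=None):
    """Filter generator samples using discriminator logits D(x): under the usual
    GAN optimality argument, exp(D(x)) estimates p_data(x)/p_g(x), so samples are
    accepted with probability proportional to that ratio (capped by the batch max)."""
    rng = rng or np.random.default_rng()
    log_ratio = logits - logits.max()                 # log(ratio / M_hat), <= 0
    accept_prob = np.exp(log_ratio)
    keep = rng.uniform(size=len(samples)) < accept_prob
    return samples[keep]                              # samples assumed to be an ndarray
```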
5. RFT for Data Augmentation and Self-Improving LLMs
RFT underlies the "Self-Taught Reasoner" paradigm for LLMs and agentic pipelines (Yuan et al., 2023, Ji et al., 4 Mar 2025, Lan et al., 17 Apr 2025, Koh et al., 22 May 2025, Huang et al., 2 Jul 2025):
- Self-Improving Reasoning: RFT collects correct reasoning chains generated by a (supervised) model, filters them for correctness and diversity, and augments the training data for further fine-tuning. Diversity (distinct solution strategies, non-redundant equations) is explicitly promoted to enhance generalization and robustness.
- Data Balancing and Adaptive Sampling: AdaSTaR (Koh et al., 22 May 2025) addresses the inefficiency of classical RFT's random sampling (over-sampling "easy" solved examples and under-sampling hard ones) with adaptive heaps and curriculum mechanisms—leveraging performance statistics to prioritize examples, balance diversity, and align training difficulty with capability.
- Exploring Failures and Negative Trajectories: In LLM agent learning, RFT by default ignores failed expert trajectories. Complementary approaches mine such failures for recoverable states or beneficial actions (e.g., "Next" or "Back" navigation actions), adding carefully filtered segments to training and improving out-of-distribution/generalization performance (Lan et al., 17 Apr 2025).
- Hybrid Imitation-Exploration: Prefix-RFT (Huang et al., 2 Jul 2025) blends demonstration (SFT) prefixes with on-policy (RFT) completions in a unified fine-tuning framework, combining the stability of behavior cloning with the flexibility of reinforcement learning, and yielding new SOTA on mathematical reasoning.
- Prefix Self-Consistency and Efficient Tuning: Unsupervised Prefix Fine-Tuning (UPFT) (Ji et al., 4 Mar 2025) leverages the empirical observation that the initial reasoning steps are highly consistent across correct and incorrect solutions. Fine-tuning on only the prefix tokens reduces the need for full rejection sampling sweeps, dramatically lowering cost with negligible impact on accuracy.
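The collect-filter-finetune loop that these LLM-oriented variants share can be summarized in a short sketch. It is a schematic under stated assumptions, not any specific paper's pipeline; `model.sample`, `verifier`, and `dedupe` are hypothetical placeholders:

```python
def rft_collect(model, problems, verifier, k=16, max_per_problem=4):
    """Rejection-sampling data collection for reasoning models: sample k chains per
    problem, keep only verifiably correct ones (the rejection step), de-duplicate to
    preserve distinct solution strategies, and return (prompt, chain) pairs for SFT."""
    kept = []
    for prob in problems:
        candidates = [model.sample(prob.prompt, temperature=1.0) for _ in range(k)]
        correct = [c for c in candidates if verifier(prob, c)]      # reject incorrect chains
        diverse = dedupe(correct)[:max_per_problem]                 # promote strategy diversity
        kept.extend((prob.prompt, c) for c in diverse)
    return kept

def dedupe(chains):
    """Naive de-duplication by normalized text; real pipelines compare equations or strategies."""
    seen, out = set(), []
    for c in chains:
        key = " ".join(c.split()).lower()
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out
```

Adaptive variants such as AdaSTaR replace the uniform loop over `problems` with performance-aware prioritization, while failure-mining approaches additionally harvest useful segments from the trajectories this loop would discard.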
6. RFT in Structured and Topologically-Constrained Generation
Recent advances in RFT extend to structured output spaces such as 3D mesh generation, where fine-grained reinforcement fine-tuning (e.g., Mesh-RFT (Liu et al., 22 May 2025)) leverages locally-aware preference optimization. Key features include:
- Masked Direct Preference Optimization (M-DPO): Refinement at the granularity of mesh faces, with local masking functions targeting only defective or low-quality regions and objective topology-aware metrics (e.g., Boundary Edge Ratio, Topology Score) guiding the update.
- Objective Topology Metrics: Fine-grained metrics (quad ratio, angle quality, aspect ratio, adjacent consistency) quantify and reward local topological and geometric integrity, enabling spatially precise improvements that are not readily achieved with global reinforcement signals.
- Empirical Gains: Masked RL yields substantial reductions in Hausdorff Distance and significant improvements in topological regularity, outperforming global RL and baseline DPO in both objective and subjective evaluation.
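A hedged sketch of the masked preference objective described above: per-element log-probabilities (per mesh face, or per token in a serialized mesh) enter the standard DPO logistic loss only where a quality mask marks the relevant region. Tensor names, shapes, and the masking granularity are illustrative assumptions, not the Mesh-RFT implementation:

```python
import torch
import torch.nn.functional as F

def masked_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, mask_w, mask_l, beta=0.1):
    """Masked DPO-style loss: aggregate policy/reference log-probs of the preferred (w)
    and dispreferred (l) outputs only over the masked (defective / quality-relevant)
    elements, then apply the usual -log sigmoid(beta * margin) objective."""
    def masked_sum(logp, mask):
        return (logp * mask).sum(dim=-1)              # sum log-probs over masked elements
    margin_w = masked_sum(logp_w, mask_w) - masked_sum(ref_logp_w, mask_w)
    margin_l = masked_sum(logp_l, mask_l) - masked_sum(ref_logp_l, mask_l)
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()
```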
7. Theoretical and Practical Frontiers
RFT research connects theoretical optimality with practical algorithmics:
- Variance Reduction and Multilevel Methods: MLMC-ABC (Warne et al., 2017) exploits telescoping sum decompositions and level-wise couplings, accelerating Approximate Bayesian Computation while retaining i.i.d. sample properties and reducing simulation burden.
- Kernel Design in MCMC: Reducing rejection probability, via the design of explicit rejection control transition kernels, exponentially suppresses autocorrelation time in Markov chains, sometimes outperforming all traditional kernels when plotted against rejection rate (Suwa, 2022).
- Computational and Scaling Implications: Empirical studies across reasoning LLMs (Yuan et al., 2023, Koh et al., 22 May 2025), preference learning (Liu et al., 2023), and generative modeling (Na et al., 28 May 2024, Azadi et al., 2018) show RFT to be robust, often more data-parsimonious, and computationally tractable—especially when augmented with adaptive sampling, gradient refinement, and statistical calibration.
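For reference, the variance-reduction mechanism behind MLMC-ABC rests on the standard multilevel telescoping identity (a textbook decomposition rather than anything specific to the cited work):

$$\mathbb{E}[f_L] \;=\; \mathbb{E}[f_0] + \sum_{\ell=1}^{L} \mathbb{E}\left[f_\ell - f_{\ell-1}\right],$$

where $f_\ell$ denotes the quantity of interest computed under the $\ell$-th (progressively tighter) ABC tolerance; coupling consecutive levels makes each difference term cheap and low-variance to estimate, which is the source of the reduced simulation burden noted above.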
8. Limitations, Open Problems, and Future Directions
While RFT transforms the efficiency landscape for a broad class of problems, several frontiers remain:
- High-dimensional challenges: The "curse of dimensionality" persists for rejection-based methods; approaches like ERS (Raff et al., 2023), while effective in low to moderate dimensions, show performance plateauing as the dimension grows. Extensions involving better initialization, hierarchical mixture models, or learned invertible proposals may offer gains.
- Adaptive and unified loss frameworks: Theoretical understanding of combined loss strategies (e.g., the unified DPO/SLiC view (Liu et al., 2023)) is maturing, but optimal balancing and robust scheduling of imitation/exploration or statistics-based hybrid losses remains an open research agenda.
- Failure mining and negative supervision: Methodologies for safely leveraging partial or negative trajectories (as in EEF (Lan et al., 17 Apr 2025)), without introducing bias or contamination, are critical for scalable agent and reasoning learning.
- Integration with model distillation and hybrid generative frameworks: The interface of RFT with diffusion model distillation, consistency models, and more generally, the distillation–augmentation pipeline, represents an emerging frontier with significant impact on high-quality, low-cost generation (Na et al., 28 May 2024).
- Optimal bitwise implementation and randomness reduction: The fine-tuning of randomness consumption—beyond mere acceptance rates—has implications for hardware design, privacy-preserving computing, and quantum random number generation (Langevin et al., 29 Sep 2025).
Rejection Sampling Fine-Tuning encompasses a diverse and rapidly evolving set of algorithmic strategies, statistical techniques, and practical implementations, all aimed at transforming rejection sampling from a generic but blunt tool into a sophisticated paradigm for efficient, expressive, and robust inference, generation, and learning in modern statistical and machine learning systems.