Best-of-N (BoN): Optimizing Generative Outputs

Updated 18 July 2025
  • Best-of-N (BoN) is a method that produces multiple independent outputs from a model and selects the best candidate based on an external reward metric.
  • It is widely applied in areas like language model alignment, evaluation, and test-time scaling to enhance output quality and efficiency.
  • Practical challenges include computational overhead and reward-model imperfections, driving innovations such as adaptive, soft, and rejection-based selection methods.

Best-of-N (BoN) strategies are a class of inference-time methods that improve the selection of outputs from generative models—most notably LLMs—by drawing multiple independent samples and selecting the best candidate according to an external metric or reward function. This approach has become central to a range of domains, including model evaluation, alignment with human preferences, adversarial testing, efficient large-scale generation, and even structure-based molecular generation. BoN provides a simple yet powerful mechanism for test-time scaling, offering enhanced output quality through intelligent exploitation of stochastic model behavior, but it also presents unique challenges related to computational efficiency, reward-model imperfection, and potential misalignment with broader task objectives.

1. Definitional Framework and Principles

BoN refers to the procedure in which, for a given prompt or task instance, a model samples $N$ outputs (via autoregressive, diffusion, or other generative mechanisms) from a base distribution, evaluates each output with a (potentially proxy) reward model, and selects the candidate with the highest reward as the final output:

$$y^* = \arg\max_{i=1, \ldots, N} R(x, y_i)$$

where $R$ is the reward (e.g., human preference, correctness, or a proxy metric), $x$ is the input, and each $y_i$ is sampled i.i.d. from the model’s (possibly stochastic) reference policy. This mechanism is applied in a variety of settings: language modeling with human preference alignment (Gui et al., 2 Jun 2024), architecture selection (Bajgar et al., 2018), process supervision in mathematical reasoning (Zhang et al., 13 Jan 2025), structure-based drug design (Yalabadi et al., 26 Jan 2025), and adversarial safety testing (Hughes et al., 4 Dec 2024).
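As a concrete illustration, the selection loop amounts to a few lines of code. The sketch below uses hypothetical `generate` (stochastic sampler) and `reward` (proxy reward model) callables as placeholders; it is not tied to any particular library.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n i.i.d. samples from the base policy and return the one
    with the highest proxy reward (hard argmax selection)."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, y) for y in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```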

The BoN metric (sometimes notated as $\text{Boo}_n$ in the statistical evaluation context (Bajgar et al., 2018)) quantifies the expected maximal outcome drawn from the underlying output distribution. Formally, if $X_1, \ldots, X_N \sim \mathbb{P}$ (the stochastic performance of the system), then the expected best-of-$N$ metric is

$$\text{Boo}_N(\mathbb{P}) = \mathbb{E}\left[\max\{X_1, \ldots, X_N\}\right],$$

where the distribution of $X$ encapsulates the randomness from initialization, sampling, or data shuffling in the generative system.

2. Performance Guarantees and Analytical Properties

The expected performance of BoN sampling can be derived using order statistics, with the expected maximum given by

$$\mathbb{E}\left[\max\{X_1, \dots, X_N\}\right] = \int_{-\infty}^{\infty} x \cdot N f(x) F(x)^{N-1}\, dx,$$

where $f(x)$ and $F(x)$ are the PDF and CDF of the output quality distribution, respectively (Bajgar et al., 2018). For practical estimation, one can compute a discrete weighted sum over observed outcomes, assigning to each test result a weight corresponding to the probability that it would be the best out of $N$.
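One simple way to realize this weighted sum is a plug-in estimator built on the empirical CDF of the observed runs. The sketch below is illustrative (not necessarily the exact estimator of the cited paper), and the synthetic run scores stand in for real evaluation results.

```python
import numpy as np

def boo_n(samples: np.ndarray, n: int) -> float:
    """Plug-in estimate of E[max of n i.i.d. draws] from observed outcomes.

    Each sorted observation x_(i) receives the probability that the maximum
    of n draws from the empirical distribution equals it:
    (i/M)^n - ((i-1)/M)^n, where M is the number of observations.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    m = len(x)
    i = np.arange(1, m + 1)
    weights = (i / m) ** n - ((i - 1) / m) ** n
    return float(np.dot(weights, x))

# Example: 30 synthetic validation accuracies from repeated training runs.
rng = np.random.default_rng(0)
runs = rng.normal(loc=0.72, scale=0.02, size=30)
print(boo_n(runs, n=5))  # expected best-of-5 accuracy
```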

For LLM alignment, BoN sampling achieves near-optimal tradeoffs between improvement in a reward metric (e.g., win rate versus a base model) and Kullback-Leibler (KL) divergence from the base model distribution. Analytic expressions show that the win rate of BoN sampling approaches $N/(N+1)$, and its KL divergence from the reference policy is $\log(N) - (N-1)/N$ (Gui et al., 2 Jun 2024). Empirical studies confirm that, under mild regularity assumptions, BoN sampling is nearly Pareto-optimal for alignment scenarios.
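To make the tradeoff concrete, the snippet below simply tabulates these two closed-form expressions for a few values of $N$; win rates saturate toward 1 while the KL cost grows only logarithmically.

```python
import math

# BoN alignment tradeoff: win rate ~ N/(N+1), KL from the reference
# policy = log(N) - (N-1)/N (in nats), as given in the text above.
for n in (1, 2, 4, 8, 16, 64, 256):
    win_rate = n / (n + 1)
    kl = math.log(n) - (n - 1) / n
    print(f"N={n:>3}  win_rate={win_rate:.3f}  KL={kl:.3f}")
```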

Recent theoretical analyses further demonstrate that as the number of samples $N$ increases, the BoN-selected distribution approaches the optimal reward-tilted policy, subject to monotonic convergence rates and tight upper bounds on regret and KL divergence (Aminian et al., 8 Jul 2025, Verdun et al., 6 May 2025). In the limit of infinite samples, performance is dictated by the geometry of the ROC curve of the verifier or reward model, and both BoN and rejection sampling attain the same accuracy plateau, given by the slope of the ROC curve near the origin (Dorner et al., 16 Jul 2025). Soft Best-of-$N$ variants, which select among candidates with a temperature-controlled softmax rather than a hard maximum, can further mitigate reward overoptimization, especially when the reward model is imperfect (Aminian et al., 8 Jul 2025, Verdun et al., 6 May 2025).

3. Applications: Alignment, Evaluation, and Test-Time Scaling

Model Alignment and Preference Optimization

BoN sampling is broadly employed as an inference-time method for aligning LLM outputs with human or proxy preferences, bypassing the need for costly iterative fine-tuning (e.g., RLHF or DPO). It forms the foundation for more advanced distillation techniques (such as BoNBoN (Gui et al., 2 Jun 2024), vBoN (Amini et al., 8 Jul 2024), and BOND (Sessa et al., 19 Jul 2024)), which aim to amortize BoN-like benefits into a single-pass policy via distribution matching or variational objectives.

Recent empirical evaluations consistently show that BoN sampling, even without distillation, improves alignment metrics such as win rate, helpfulness, and harmlessness on benchmarks including AlpacaFarm, HH-RLHF, and UltraFeedback (Jinnai et al., 1 Apr 2024, Qiu et al., 18 Oct 2024, Sun et al., 26 Oct 2024, Chow et al., 18 Dec 2024).

Evaluation and Benchmarking

BoN metrics are foundational for reporting model performance in a way that is reproducible and robust to stochastic variation. In architecture evaluation, reporting the normalized expected best-of-$N$ performance ($\text{Boo}_n$) provides a robust, transparent alternative to single-run maxima or means, thereby mitigating the “cherry picking” of outlier results (Bajgar et al., 2018). For process reward models in mathematical reasoning, BoN evaluation quantifies the true likelihood of generating fully correct reasoning chains, but may introduce biases if not paired with step-level analysis (Zhang et al., 13 Jan 2025).

Test-Time Scaling and Data Efficiency

BoN’s principal advantage lies in leveraging inference-time compute to boost outcome quality, notably in code generation and math problem solving (e.g., pass@N metrics). Adaptive BoN strategies allocate computational resources dynamically across prompts, optimizing compute for “hard” cases while saving cost on “easy” ones (Raman et al., 17 May 2025).
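One way to picture adaptive allocation is a two-phase heuristic: a small uniform warm-up, then spending the remaining sample budget greedily on the prompts whose current best reward is lowest. The sketch below is a hypothetical heuristic in that spirit, not the specific algorithm from the cited paper; `generate`, `reward`, and the `target` threshold are placeholder assumptions.

```python
from typing import Callable, Dict, List

def adaptive_bon(prompts: List[str],
                 generate: Callable[[str], str],
                 reward: Callable[[str, str], float],
                 total_budget: int,
                 init_n: int = 2,
                 target: float = 0.8) -> Dict[str, str]:
    """Warm-up pass for every prompt, then greedy refinement of 'hard' prompts."""
    best: Dict[str, str] = {}
    best_score: Dict[str, float] = {}
    spent = 0
    for p in prompts:                          # uniform warm-up phase
        for _ in range(init_n):
            y = generate(p)
            s = reward(p, y)
            if s > best_score.get(p, float("-inf")):
                best[p], best_score[p] = y, s
        spent += init_n
    while spent < total_budget:                # greedy refinement phase
        p = min(prompts, key=lambda q: best_score[q])
        if best_score[p] >= target:            # every prompt already looks "easy"
            break
        y = generate(p)
        s = reward(p, y)
        spent += 1
        if s > best_score[p]:
            best[p], best_score[p] = y, s
    return best
```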

Innovations such as speculative rejection (Sun et al., 26 Oct 2024), TreeBoN (Qiu et al., 18 Oct 2024), and self-truncation (Wang et al., 3 Mar 2025) further reduce the computational overhead traditionally associated with BoN by enabling early rejection or pruning of unpromising candidates, or by leveraging internal model signals rather than external reward models.

4. Challenges: Computational Overhead and Reward Hacking

Computational Inefficiency

A central limitation of standard BoN sampling is its linear computational and memory scaling with $N$: each additional candidate incurs a full forward pass of the model. This overhead impacts both latency and cost, especially for large models or high-throughput scenarios (Amini et al., 8 Jul 2024, Sun et al., 26 Oct 2024). Acceleration techniques (speculative rejection, tree search, ST-BoN) address this by implementing early candidate truncation or partial-generation approaches, with documented reductions in memory and latency exceeding 90% in some cases (Wang et al., 3 Mar 2025, Sun et al., 26 Oct 2024).

Reward Model Imperfections and Overoptimization

BoN is highly sensitive to the properties of the proxy reward model. Overoptimization—sometimes called “reward hacking”—arises when candidates are selected for pathological maxima of the proxy, potentially reducing true objective performance or yielding degenerate outputs (Jinnai et al., 1 Apr 2024, Ichihara et al., 18 Feb 2025). Regularized variants such as MBR-BoN (Minimum Bayes Risk) add penalty terms (KL, Wasserstein, or length penalties) to balance between reward maximization and fidelity to the base model, effectively interpolating between BoN and more conservative selection (Jinnai et al., 1 Apr 2024, Ichihara et al., 18 Feb 2025).
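The general recipe can be viewed as re-ranking candidates under a penalized objective. The sketch below is an illustrative proximity-regularized variant, not the exact MBR-BoN objective from the cited papers; `reward` and `ref_logprob` (log-probability under the reference policy) are hypothetical callables.

```python
from typing import Callable, List

def regularized_bon(prompt: str,
                    candidates: List[str],
                    reward: Callable[[str, str], float],
                    ref_logprob: Callable[[str, str], float],
                    beta: float = 0.05) -> str:
    """Re-rank by proxy reward plus a proximity bonus toward the reference
    policy: beta = 0 recovers plain BoN, while larger beta defers to the
    base model and damps reward hacking."""
    scores = [reward(prompt, y) + beta * ref_logprob(prompt, y)
              for y in candidates]
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]
```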

Empirical and theoretical evidence shows that smoothing the selection process (as in soft BoN/SBoN) or using pairwise/tournament style reward models (Liu et al., 22 Jan 2025) can further alleviate instability and improve robustness, especially when reward models are noisy or weak.

5. Security, Adversarial Testing, and Jailbreaking

BoN methodologies play an instrumental role in adversarial red-teaming (“jailbreaking”) of large models. By sampling numerous input augmentations (e.g., shuffling, capitalization) and selecting the successful (i.e., harmful or unsafe) response among them, BoN Jailbreaking has demonstrated high attack success rates—up to 89% on leading LLMs and 78% on closed-source models given 10,000 prompt augmentations (Hughes et al., 4 Dec 2024). The attack’s power scales predictably with the number of samples according to a power-law in negative log-ASR (attack success rate), and generalizes across modalities (text, vision, audio). BoN jailbreaking can be combined with other black-box attack methods (e.g., optimized prefixes), dramatically boosting sample efficiency and ASR.

Countermeasures such as Defense Against the Dark Prompts (DATDP) operate by pre-emptively filtering prompts through dedicated evaluator LLMs, achieving over 99.8% defense rates against BoN-generated attacks by iterative prompt assessment and weighted voting (Armstrong et al., 1 Feb 2025).

6. Extensions, Variants, and Emerging Directions

Soft and Stochastic Best-of-N

Soft BoN (or SBoN) generalizes the argmax selection to a softmax over rewards, parameterized by a temperature $\lambda$:

$$\Pr(Z = i) = \frac{e^{r(X_i)/\lambda}}{\sum_{j=1}^{N} e^{r(X_j)/\lambda}}.$$

This interpolation offers robust control over the reward-KL tradeoff, provably achieves $O(1/N)$ convergence to the optimal reward-tilted policy, and mitigates reward overoptimization (Verdun et al., 6 May 2025, Aminian et al., 8 Jul 2025).
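The selection rule maps directly to a few lines of code. The minimal sketch below assumes the candidate strings and their scalar rewards have already been computed.

```python
import numpy as np

def soft_best_of_n(candidates, rewards, lam=0.5, rng=None):
    """Sample a candidate with probability proportional to exp(r_i / lambda).

    lambda -> 0 recovers hard BoN; large lambda approaches uniform sampling
    over the candidates."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(rewards, dtype=float) / lam
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```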

Adaptive and Domain-Agnostic Versions

Adaptive BoN (Raman et al., 17 May 2025) dynamically partitions inference budget based on the estimated reward distribution per prompt, thus increasing efficiency in batch deployments. Self-Truncation BoN (Wang et al., 3 Mar 2025) and speculative rejection (Sun et al., 26 Oct 2024) extend BoN to domains lacking explicit reward models by leveraging internal consistency checks (e.g., chains of embedding comparisons) or partial sequence reward proxies, enhancing generalizability and reducing cost.

Pairwise and Tournament-Style Selection

Knockout tournaments and pairwise-judge reward models are emerging alternatives in settings—such as math problem solving—where scoring models are unreliable or inconsistent. These methods evaluate candidates in pairs, often using chain-of-thought rationales, iteratively eliminating suboptimal responses (Liu et al., 22 Jan 2025).
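A minimal single-elimination variant can be sketched as follows; the `judge` comparator is a placeholder for a pairwise reward model or LLM judge, and the bracket structure is illustrative rather than taken from the cited work.

```python
from typing import Callable, List

def knockout_select(prompt: str,
                    candidates: List[str],
                    judge: Callable[[str, str, str], int]) -> str:
    """Single-elimination tournament over candidates.

    judge(prompt, a, b) is a hypothetical pairwise comparator returning
    0 if candidate a wins and 1 if candidate b wins."""
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if judge(prompt, a, b) == 0 else b)
        if len(pool) % 2 == 1:        # odd candidate gets a bye this round
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```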

Evaluation in Process Supervision

In process reward model (PRM) contexts, BoN scoring often aggregates step-level correctness to select chain-of-thought outputs. However, BoN may inflate downstream metrics by overemphasizing correct final answers or by tolerating intermediate process errors, necessitating improved evaluation paradigms—such as step-level consensus filtering and response-step joint metrics—for faithful process supervision (Zhang et al., 13 Jan 2025).

7. Theoretical Comparisons and Limitations

While BoN excels at leveraging compute for test-time scaling, its sample complexity and convergence rates can be less favorable than supervised fine-tuning (SFT) in certain regimes. For instance, when the target function is realizable and the response length $T$ is large, SFT achieves convergence that is nearly independent of $T$, whereas BoN’s rate can depend linearly or logarithmically on $T$. However, in non-realizable or noisy scenarios, BoN often enjoys greater robustness, maintaining strong performance where SFT fails due to model or data mismatch (Somerstep et al., 22 May 2025).

BoN's ultimate scaling is fundamentally limited by the properties of the verifier’s ROC curve—specifically, the slope at the origin, which dictates the best asymptotic accuracy for BoN and rejection sampling alike (Dorner et al., 16 Jul 2025). Thus, the design and empirical performance of the reward/verifier model are the binding constraints on the achievable improvement through BoN scaling.


Overall, Best-of-N sampling strategies serve as a versatile and theoretically grounded solution for leveraging additional test-time computation to enhance the practical outputs of stochastic generative systems. Their ongoing development, from regularized variants and adaptive scheduling to tournament and pairwise approaches, continues to shape both academic evaluation and real-world deployment of large-scale AI systems.
