
Best-of-N (BoN): Optimizing Generative Outputs

Updated 18 July 2025
  • Best-of-N (BoN) is a method that produces multiple independent outputs from a model and selects the best candidate based on an external reward metric.
  • It is widely applied in areas like language model alignment, evaluation, and test-time scaling to enhance output quality and efficiency.
  • Practical challenges include computational overhead and reward-model imperfections, driving innovations such as adaptive, soft, and rejection-based selection methods.

Best-of-N (BoN) strategies are a class of inference-time methods that improve the selection of outputs from generative models—most notably LLMs—by drawing multiple independent samples and selecting the best candidate according to an external metric or reward function. This approach has become central to a range of domains, including model evaluation, alignment with human preferences, adversarial testing, efficient large-scale generation, and even structure-based molecular generation. BoN provides a simple yet powerful mechanism for test-time scaling, offering enhanced output quality through intelligent exploitation of stochastic model behavior, but it also presents unique challenges related to computational efficiency, reward-model imperfection, and potential misalignment with broader task objectives.

1. Definitional Framework and Principles

BoN refers to the procedure in which, for a given prompt or task instance, a model samples $N$ outputs (via autoregressive, diffusion, or other generative mechanisms) from a base distribution, evaluates each output with a (potentially proxy) reward model, and selects the candidate with the highest reward as the final output: $y^* = \arg\max_{i=1, \ldots, N} R(x, y_i)$, where $R$ is the reward (e.g., human preference, correctness, or a proxy metric), $x$ is the input, and each $y_i$ is sampled i.i.d. from the model’s (possibly stochastic) reference policy. This mechanism is applied to a variety of settings: language modeling with human preference alignment (2406.00832), architecture selection (1807.01961), process supervision in mathematical reasoning (2501.07301), structure-based drug design (2501.15631), and adversarial safety testing (2412.03556).
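
A minimal sketch of this selection loop, assuming generic `generate` and `reward_model` callables (hypothetical placeholders rather than any particular library's API):

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # samples one candidate y_i from the reference policy
    reward_model: Callable[[str, str], float],  # proxy reward R(x, y)
    n: int = 16,
) -> str:
    """Draw n i.i.d. candidates and return the one with the highest proxy reward."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    rewards = [reward_model(prompt, y) for y in candidates]
    best_idx = max(range(n), key=lambda i: rewards[i])
    return candidates[best_idx]
```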

The BoN metric (sometimes notated as $\text{Boo}_N$ in the statistical evaluation context (1807.01961)) quantifies the expected maximal outcome drawn from the underlying output distribution. Formally, if $X_1, \ldots, X_N \sim \mathbb{P}$ (the stochastic performance of the system), then the expected best-of-$N$ metric is $\text{Boo}_N(\mathbb{P}) = \mathbb{E}[\max\{X_1, \ldots, X_N\}]$, where the distribution of $X$ encapsulates the randomness from initialization, sampling, or data shuffling in the generative system.

2. Performance Guarantees and Analytical Properties

The expected performance of BoN sampling can be derived using order statistics, with the expected maximum given by $\mathbb{E}[\max\{X_1, \dots, X_N\}] = \int_{-\infty}^{\infty} x \cdot N f(x) F(x)^{N-1}\, dx$, where $f(x)$ and $F(x)$ are the PDF and CDF of the output quality distribution, respectively (1807.01961). For practical estimation, one can compute a discrete weighted sum over observed outcomes, assigning to each test result a weight corresponding to the probability it would be the best out of $N$.
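
A plug-in version of that weighted sum, treating the observed scores as the empirical distribution; this is a sketch of the described estimator, not code from 1807.01961:

```python
def boo_n(observed_scores, n):
    """Estimate E[max of n i.i.d. draws] from M observed outcomes by weighting
    each sorted score with the probability it would be the best of n draws
    from the empirical distribution."""
    xs = sorted(observed_scores)
    m = len(xs)
    # Weight of the k-th smallest score: P(the empirical max of n draws lands there).
    return sum(((k / m) ** n - ((k - 1) / m) ** n) * x
               for k, x in enumerate(xs, start=1))

# Example: expected best-of-5 validation accuracy across 10 training runs.
runs = [0.71, 0.74, 0.69, 0.73, 0.75, 0.70, 0.72, 0.74, 0.68, 0.76]
print(boo_n(runs, n=5))
```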

For LLM alignment, BoN sampling achieves near-optimal tradeoffs between improvement in a reward metric (e.g., win rate versus a base model) and Kullback-Leibler (KL) divergence from the base model distribution. Analytic expressions show that the win rate of BoN sampling approaches $N/(N+1)$, and its KL divergence from the reference is $\log(N) - (N-1)/N$ (2406.00832). Empirical studies confirm that, under mild regularity assumptions, BoN sampling is nearly Pareto-optimal for alignment scenarios.
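
These closed-form expressions are straightforward to tabulate; the snippet below simply evaluates them for a few values of $N$:

```python
import math

# Analytic BoN reward-KL tradeoff quoted above:
#   win rate vs. the base policy -> N / (N + 1)
#   KL(BoN || base)              =  log(N) - (N - 1) / N   (nats)
for n in (1, 4, 16, 64, 256):
    win_rate = n / (n + 1)
    kl = math.log(n) - (n - 1) / n
    print(f"N={n:4d}  win rate ~ {win_rate:.3f}  KL ~ {kl:.3f}")
```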

Recent theoretical analyses further demonstrate that as the number of samples $N$ increases, the BoN-selected distribution approaches the optimal reward-tilted policy, subject to monotonic convergence rates and tight upper bounds on regret and KL divergence (2507.05913, 2505.03156). In the limit of infinite samples, the performance is dictated by the geometry of the ROC curve of the verifier or reward model, and both BoN and rejection sampling attain the same accuracy plateau, given by the slope of the ROC near the origin (2507.12399). Soft Best-of-$N$ variants, which select among candidates with a temperature-controlled softmax rather than a hard maximum, can further mitigate reward overoptimization, especially when the reward model is imperfect (2507.05913, 2505.03156).

3. Applications: Alignment, Evaluation, and Test-Time Scaling

Model Alignment and Preference Optimization

BoN sampling is broadly employed as an inference-time method for aligning LLM outputs with human or proxy preferences, bypassing the need for costly iterative fine-tuning (e.g., RLHF or DPO). It forms the foundation for more advanced distillation techniques (such as BoNBoN (2406.00832), vBoN (2407.06057), and BOND (2407.14622)), which aim to amortize BoN-like benefits into a single-pass policy via distribution matching or variational objectives.

Recent empirical evaluations consistently show that BoN sampling, even without distillation, improves alignment metrics such as win rate, helpfulness, and harmlessness across datasets like AlpacaFarm, HH-RLHF, and UltraFeedback (2404.01054, 2410.16033, 2410.20290, 2412.15287).

Evaluation and Benchmarking

BoN metrics are foundational in reporting reproducible and stochastic-invariant model performance. In architecture evaluation, reporting the normalized expected best-of-$N$ performance ($\text{Boo}_N$) provides a robust, transparent alternative to single-run maxima or means, thereby mitigating the “cherry picking” of outlier results (1807.01961). For process reward models in mathematical reasoning, BoN evaluation quantifies the true likelihood of generating fully correct reasoning chains, but may introduce biases if not paired with step-level analysis (2501.07301).

Test-Time Scaling and Data Efficiency

BoN’s principal advantage lies in leveraging inference-time compute to boost outcome quality, notably in code generation and math problem solving (e.g., pass@N metrics). Adaptive BoN strategies allocate computational resources dynamically across prompts, optimizing compute for “hard” cases while saving cost on “easy” ones (2505.12050).
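
One simple way to realize this idea, sketched below as an illustrative heuristic rather than the specific algorithm of 2505.12050, is to spend a small pilot budget per prompt and route the remaining samples toward prompts whose pilot rewards vary most:

```python
import statistics
from typing import Callable, Dict, List, Tuple

def adaptive_best_of_n(
    prompts: List[str],
    generate: Callable[[str], str],
    reward_model: Callable[[str, str], float],
    pilot: int = 4,
    extra_budget: int = 32,
) -> Dict[str, str]:
    """Pilot-then-allocate heuristic: every prompt gets `pilot` samples, then the
    remaining budget goes to the prompts whose pilot rewards spread the most
    (a crude 'hardness' proxy)."""
    pools: Dict[str, List[Tuple[float, str]]] = {}
    for p in prompts:
        cands = [generate(p) for _ in range(pilot)]
        pools[p] = [(reward_model(p, y), y) for y in cands]

    # Rank prompts by pilot reward spread; the harder half receives the extra samples.
    hardness = {p: statistics.pstdev([r for r, _ in pools[p]]) for p in prompts}
    ranked = sorted(prompts, key=lambda p: hardness[p], reverse=True)
    for i in range(extra_budget):
        p = ranked[i % max(1, len(ranked) // 2)]
        y = generate(p)
        pools[p].append((reward_model(p, y), y))

    # Standard BoN selection within each prompt's pool.
    return {p: max(pools[p])[1] for p in prompts}
```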

Innovations such as speculative rejection (2410.20290), TreeBoN (2410.16033), and self-truncation (2503.01422) further reduce the computational overhead traditionally associated with BoN by enabling early rejection or pruning of unpromising candidates, or by leveraging internal model signals rather than external reward models.

4. Challenges: Computational Overhead and Reward Hacking

Computational Inefficiency

A central limitation of standard BoN sampling is its linear computational and memory scaling with $N$—each additional candidate incurs a full forward pass of the model. This overhead impacts both latency and cost, especially for large models or high-throughput scenarios (2407.06057, 2410.20290). Acceleration techniques (speculative rejection, tree search, ST-BoN) address this by implementing early candidate truncation or partial generation approaches, with documented reductions in memory and latency exceeding 90% in some cases (2503.01422, 2410.20290).

Reward Model Imperfections and Overoptimization

BoN is highly sensitive to the properties of the proxy reward model. Overoptimization—sometimes called “reward hacking”—arises when candidates are selected for pathological maxima of the proxy, potentially reducing true objective performance or yielding degenerate outputs (2404.01054, 2502.12668). Regularized variants such as MBR-BoN (Minimum Bayes Risk) add penalty terms (KL, Wasserstein, or length penalties) to balance between reward maximization and fidelity to the base model, effectively interpolating between BoN and more conservative selection (2404.01054, 2502.12668).
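
A generic penalized-selection sketch in the spirit of these regularized variants; the concrete penalty (response length here) and the weight `beta` are illustrative assumptions, not the exact MBR-BoN objective:

```python
from typing import Callable, List

def regularized_bon(
    prompt: str,
    candidates: List[str],
    reward_model: Callable[[str, str], float],
    penalty: Callable[[str], float],   # e.g., a length or proximity penalty
    beta: float = 0.1,
) -> str:
    """Select the argmax of reward minus a penalty term, interpolating between
    plain BoN (beta = 0) and more conservative selection (large beta)."""
    scores = [reward_model(prompt, y) - beta * penalty(y) for y in candidates]
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]

def length_penalty(y: str) -> float:
    """Example penalty: discourage degenerate, overly long outputs."""
    return float(len(y.split()))
```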

Empirical and theoretical evidence shows that smoothing the selection process (as in soft BoN/SBoN) or using pairwise/tournament style reward models (2501.13007) can further alleviate instability and improve robustness, especially when reward models are noisy or weak.

5. Security, Adversarial Testing, and Jailbreaking

BoN methodologies play an instrumental role in adversarial red-teaming (“jailbreaking”) of large models. By sampling numerous input augmentations (e.g., shuffling, capitalization) and selecting the successful (i.e., harmful or unsafe) response among them, BoN Jailbreaking has demonstrated high attack success rates—up to 89% on leading LLMs and 78% on closed-source models given 10,000 prompt augmentations (2412.03556). The attack’s power scales predictably with the number of samples according to a power-law in negative log-ASR (attack success rate), and generalizes across modalities (text, vision, audio). BoN jailbreaking can be combined with other black-box attack methods (e.g., optimized prefixes), dramatically boosting sample efficiency and ASR.
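
Read literally, that scaling claim can be written as a power law in the negative log of the attack success rate; the constants $a$ and $b$ below are attack- and model-specific fits and are assumptions here, not values reported in the paper:

```latex
% Power-law scaling of BoN jailbreaking with sample count N (fitted a, b > 0)
-\log \mathrm{ASR}(N) \;\approx\; a\, N^{-b}
```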

Countermeasures such as Defense Against the Dark Prompts (DATDP) operate by pre-emptively filtering prompts through dedicated evaluator LLMs, achieving over 99.8% defense rates against BoN-generated attacks by iterative prompt assessment and weighted voting (2502.00580).

6. Extensions, Variants, and Emerging Directions

Soft and Stochastic Best-of-N

Soft BoN (or SBoN) generalizes the argmax selection to a softmax over rewards, parameterized by a temperature $\lambda$: $\Pr(Z=i) = \frac{e^{r(X_i)/\lambda}}{\sum_{j=1}^{N} e^{r(X_j)/\lambda}}$. This interpolation offers robust control over the reward-KL tradeoff, provably achieves $O(1/N)$ convergence to the optimal reward-tilted policy, and mitigates reward overoptimization (2505.03156, 2507.05913).
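
A minimal sketch of this softmax selection, assuming the candidates have already been sampled i.i.d. from the reference policy:

```python
import math
import random
from typing import Callable, List

def soft_best_of_n(
    prompt: str,
    candidates: List[str],
    reward_model: Callable[[str, str], float],
    temperature: float = 1.0,
) -> str:
    """Sample the returned candidate from a softmax over rewards instead of
    taking a hard argmax; temperature -> 0 recovers standard BoN."""
    rewards = [reward_model(prompt, y) for y in candidates]
    m = max(rewards)
    weights = [math.exp((r - m) / temperature) for r in rewards]  # numerically stable softmax
    return random.choices(candidates, weights=weights, k=1)[0]
```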

Adaptive and Domain-Agnostic Versions

Adaptive BoN (2505.12050) dynamically partitions inference budget based on the estimated reward distribution per prompt, thus increasing efficiency in batch deployments. Self-Truncation BoN (2503.01422) and speculative rejection (2410.20290) extend BoN to domains lacking explicit reward models by leveraging internal consistency checks (e.g., chains of embedding comparisons) or partial sequence reward proxies, enhancing generalizability and reducing cost.

Pairwise and Tournament-Style Selection

Knockout tournaments and pairwise-judge reward models are emerging alternatives in settings—such as math problem solving—where scoring models are unreliable or inconsistent. These methods evaluate candidates in pairs, often using chain-of-thought rationales, iteratively eliminating suboptimal responses (2501.13007).
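
A sketch of the knockout idea with the pairwise judge left abstract; in 2501.13007 the judge is itself an LLM comparator that produces chain-of-thought rationales:

```python
from typing import Callable, List

def knockout_tournament(
    prompt: str,
    candidates: List[str],
    pairwise_judge: Callable[[str, str, str], int],  # returns 0 or 1: index of the winner
) -> str:
    """Pairwise elimination: judge candidates two at a time until one remains."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Pair off candidates; an odd one out gets a bye into the next round.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if pairwise_judge(prompt, a, b) == 0 else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```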

Evaluation in Process Supervision

In process reward model (PRM) contexts, BoN scoring often aggregates step-level correctness to select chain-of-thought outputs. However, BoN may inflate downstream metrics by overemphasizing correct final answers or by tolerating intermediate process errors, necessitating improved evaluation paradigms—such as step-level consensus filtering and response-step joint metrics—for faithful process supervision (2501.07301).
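
A common convention, assumed here because the aggregation rule is not spelled out above, is to score each chain by the product (or minimum) of its step-level correctness probabilities before applying the BoN argmax:

```python
import math
from typing import Callable, List

def prm_bon_select(
    candidates: List[List[str]],                      # each candidate = a list of reasoning steps
    step_scorer: Callable[[List[str], int], float],   # P(step i is correct | preceding steps)
    aggregate: str = "product",
) -> int:
    """Aggregate step-level PRM scores per candidate and return the index of the
    best-scoring chain. The product/min aggregation is an assumed convention."""
    scores = []
    for steps in candidates:
        step_scores = [step_scorer(steps, i) for i in range(len(steps))]
        if aggregate == "product":
            scores.append(math.prod(step_scores))
        else:  # "min"
            scores.append(min(step_scores))
    return max(range(len(candidates)), key=lambda i: scores[i])
```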

7. Theoretical Comparisons and Limitations

While BoN excels in leveraging compute for test-time scaling, its sample complexity and convergence rates can be less favorable than supervised fine-tuning (SFT) in certain regimes. For instance, when the target function is realizable and the response length $T$ is large, SFT achieves nearly length-independent (in $T$) convergence, whereas BoN’s rate can depend linearly or logarithmically on $T$. However, in non-realizable or noisy scenarios, BoN often enjoys greater robustness, maintaining strong performance where SFT fails due to model or data mismatch (2505.17288).

BoN's ultimate scaling is fundamentally limited by the properties of the verifier’s ROC curve—specifically, the slope at the origin, which dictates the best asymptotic accuracy for BoN and rejection sampling alike (2507.12399). Thus, the design and empirical performance of the reward/verifier model are the binding constraints on the achievable improvement through BoN scaling.


Overall, Best-of-N sampling strategies serve as a versatile and theoretically grounded solution for leveraging additional test-time computation to enhance the practical outputs of stochastic generative systems. Their ongoing development, from regularized variants and adaptive scheduling to tournament and pairwise approaches, continues to shape both academic evaluation and real-world deployment of large-scale AI systems.
