
Best-of-K (BoK): Sampling & Selection

Updated 2 July 2026
  • Best-of-K (BoK) is a candidate selection strategy that generates K samples and selects the highest-scoring candidate based on a reward function.
  • In generative models, BoK improves decoding by balancing KL divergence with reward maximization while managing risks of overoptimization from weak proxies.
  • BoK applications span bandit optimization and VLA action selection, offering practical trade-offs between sample complexity, safety calibration, and computational efficiency.

Best-of-K (BoK) refers to an inference or learning strategy in which $K$ candidates are generated and evaluated, and the top-scoring candidate(s) are selected according to some criterion. BoK appears in a variety of settings: decoding in generative models, bandit identification, action selection in vision-language-action (VLA) agents, and statistically calibrated abstention. The methodology, properties, and theoretical implications of BoK-style sampling and selection have been analyzed in recent literature, exposing both its strengths and its structural limitations.

1. BoK in Generative Model Decoding

In the context of sequence generation and language modeling, Best-of-K has emerged as a key inference-time alignment method. For a given prompt $x$, $K$ candidate continuations $Y_1, \dots, Y_K$ are sampled i.i.d. from a reference policy $\pi_{\text{ref}}(y \mid x)$. Each candidate is scored by a proxy reward model $r(y, x)$, and the candidate with maximal score is selected: $Y_{\mathrm{BoK}} = \arg\max_{i=1,\dots,K} r(Y_i, x)$. The resulting BoK policy $\pi_{\text{BoK}}^K(y \mid x)$ places probability mass on candidates that maximize $r(y, x)$ within the $K$ draws. When the proxy $r$ matches the true reward function, BoK approximates the solution to a KL-regularized reward maximization problem (maximizing expected reward under a penalty on the KL divergence from $\pi_{\text{ref}}$) and achieves a near-optimal trade-off between KL divergence and reward (Aminian et al., 8 Jul 2025).
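The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited work: `best_of_k`, `toy_sample`, and `toy_reward` are hypothetical names, the "reference policy" is a uniform choice over three strings, and the "proxy reward" is simply string length.

```python
import random
from typing import Callable, List, Tuple


def best_of_k(
    sample: Callable[[], str],
    reward: Callable[[str], float],
    k: int,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Draw k i.i.d. candidates and return the one with the highest proxy reward."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Score every candidate once with the proxy reward.
    scored = [(y, reward(y)) for y in (sample() for _ in range(k))]
    # argmax over the k scored candidates (first maximum wins on ties).
    best, _ = max(scored, key=lambda pair: pair[1])
    return best, scored


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)
candidates = ["ok", "better answer", "the most detailed answer"]


def toy_sample() -> str:
    return rng.choice(candidates)  # stands in for sampling from pi_ref(. | x)


def toy_reward(y: str) -> float:
    return float(len(y))  # stands in for the proxy reward r(y, x)


best, scored = best_of_k(toy_sample, toy_reward, k=8)
print(best, len(scored))
```

The cost is `k` forward samples plus `k` reward evaluations per prompt, which is why `k` is usually chosen against a compute budget.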

However, with an imperfect proxy $r$, BoK is prone to overoptimization, often selecting outliers favored by the proxy that do not correspond to high true reward. A generalization, Soft Best-of-K (SBoK, also written SBoN), introduces a smoothing parameter $\beta$ so that the $i$-th candidate is selected with probability proportional to $\exp(\beta\, r(Y_i, x))$, interpolating between uniform random selection ($\beta \to 0$) and standard BoK ($\beta \to \infty$).
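A soft selection step of this kind can be sketched as a softmax over the candidate rewards. The sketch assumes an exponential weighting with a smoothing parameter called `beta` here (the symbol and the function names are illustrative choices, not taken from the cited paper): `beta = 0` gives uniform selection and large `beta` approaches hard BoK.

```python
import math
import random
from typing import List, Sequence


def soft_bok_weights(rewards: Sequence[float], beta: float) -> List[float]:
    """Selection probabilities proportional to exp(beta * reward)."""
    m = max(rewards)  # subtract the max so exp() cannot overflow
    w = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(w)
    return [x / z for x in w]


def soft_best_of_k(
    candidates: Sequence[str],
    rewards: Sequence[float],
    beta: float,
    rng: random.Random,
) -> str:
    """Pick one candidate at random according to the softmax weights."""
    probs = soft_bok_weights(rewards, beta)
    return rng.choices(list(candidates), weights=probs, k=1)[0]


rewards = [0.0, 1.0, 2.0]
print(soft_bok_weights(rewards, beta=0.0))   # uniform over the three candidates
print(soft_bok_weights(rewards, beta=50.0))  # nearly all mass on the last one
```

With a noisy proxy, a moderate `beta` keeps some probability on candidates the proxy ranks second or third, which is the mechanism by which smoothing limits overoptimization.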

2. BoK in Bandit and Combinatorial Optimization

In best-arm identification, BoK refers to identifying the $K$ out of $N$ bandit arms with highest mean reward. In the Best-of-K Bandit game, at each round a subset $S$ of size $K$ is chosen, a jointly distributed reward vector $X = (X_1, \dots, X_N)$ is drawn, and the observed reward is $\max_{i \in S} X_i$. The objective is to identify the optimal subset $S^* = \arg\max_{|S| = K} \mathbb{E}[\max_{i \in S} X_i]$ as efficiently as possible (Simchowitz et al., 2016).

For stochastic, independent-arm models, the problem admits specialized elimination algorithms with tight information-theoretic sample complexity, but worst-case lower bounds still scale with the number of $K$-subsets, $\binom{N}{K}$.
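To make the subset-level objective and its combinatorial cost concrete, the following sketch brute-forces every size-K subset and estimates its expected maximum reward by simulation. It is only a baseline illustration, not the elimination algorithm from the cited paper; `estimate_best_subset`, `draw_bernoulli`, and the arm means are hypothetical.

```python
import itertools
import math
import random
from typing import Callable, List, Tuple


def estimate_best_subset(
    draw_rewards: Callable[[random.Random], List[int]],
    n_arms: int,
    k: int,
    rounds_per_subset: int,
    rng: random.Random,
) -> Tuple[Tuple[int, ...], float]:
    """Estimate E[max_{i in S} X_i] for every K-subset S and return the best one."""
    best_subset: Tuple[int, ...] = ()
    best_mean = -math.inf
    for subset in itertools.combinations(range(n_arms), k):
        total = 0.0
        for _ in range(rounds_per_subset):
            x = draw_rewards(rng)               # fresh joint reward vector
            total += max(x[i] for i in subset)  # only the max over S is observed
        mean = total / rounds_per_subset
        if mean > best_mean:
            best_subset, best_mean = subset, mean
    return best_subset, best_mean


# Toy environment (assumption): four independent Bernoulli arms.
MEANS = [0.1, 0.2, 0.8, 0.7]


def draw_bernoulli(rng: random.Random) -> List[int]:
    return [1 if rng.random() < m else 0 for m in MEANS]


subset, value = estimate_best_subset(draw_bernoulli, 4, 2, 2000, random.Random(0))
print(math.comb(4, 2), subset, round(value, 3))
```

The outer loop visits `math.comb(n_arms, k)` subsets, so the total number of rounds grows with the binomial coefficient. This is the scaling that structured algorithms try to avoid when arms are independent.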

A related and practically prominent formulation is the "Best-K Identification" or "Best-K-Arm" problem, where the goal is to adaptively sample arms to identify the top-$K$ arms by their mean rewards with high probability. The sample complexity of the best known algorithms matches lower bounds up to logarithmic factors (Jiang et al., 2017). Practical algorithms based on LIL-type confidence bounds (e.g., lil'RandLUCB, lil'CLUCB) are reported to offer large empirical improvements.

3. BoK for Action Selection in VLA Policies

BoK is widely used in vision-language-action (VLA) policies for embodied agents and robotics, for example in RoboMonkey, SEAL, and MG-Select (Singh et al., 28 May 2026). At inference, the agent generates a base action chunk, then produces $K$ candidate samples, scores each with a verifier, and executes the candidate with the highest verifier score. This form of BoK yields empirical gains in action selection and robustness by exploiting the diversity of candidate actions (Singh et al., 28 May 2026).

4. Theoretical Properties: KL, Regret, and Optimization

KL Divergence and Overoptimization

Analysis of BoK and SBoK quantifies how the aligned policy deviates from the base policy via $\mathrm{KL}(\pi_{\text{BoK}}^K \,\|\, \pi_{\text{ref}})$. Under suitable assumptions, this divergence increases with $K$ and the smoothing parameter $\beta$, with a bound of order $\log K$ in the hard BoK limit. Notably, when the proxy reward deviates from the true reward, the KL divergence between the proxy-reward SBoK policy and the true-reward SBoK policy can be bounded in terms of a tilted-MSE error metric, revealing a trade-off between exploitation (higher $K$, $\beta$) and exploration to minimize overoptimization (Aminian et al., 8 Jul 2025).
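For a small discrete reference distribution the hard BoK policy and its KL divergence from the reference can be computed exactly, which gives a quick numerical check that the divergence grows with $K$ and stays below $\log K$. The sketch assumes distinct rewards (no ties), in which case outcome $j$ is selected with probability $F_j^K - F_{j-1}^K$, where $F$ is the reference CDF in increasing reward order. The function names and the three-outcome example are illustrative.

```python
import math
from typing import List, Sequence


def bok_distribution(
    ref_probs: Sequence[float], rewards: Sequence[float], k: int
) -> List[float]:
    """Exact hard-BoK distribution for a discrete reference with distinct rewards."""
    order = sorted(range(len(ref_probs)), key=lambda i: rewards[i])
    q = [0.0] * len(ref_probs)
    cdf_prev = 0.0
    for i in order:
        cdf = cdf_prev + ref_probs[i]
        # P(max of k draws is outcome i) = F_i^k - F_{i-1}^k
        q[i] = cdf ** k - cdf_prev ** k
        cdf_prev = cdf
    return q


def kl(q: Sequence[float], p: Sequence[float]) -> float:
    """KL(q || p) in nats, skipping zero-probability terms of q."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
for k in (1, 2, 4, 16):
    d = kl(bok_distribution(ref, rewards, k), ref)
    print(k, round(d, 4), round(math.log(k), 4))
```

In this discrete example the divergence saturates as $K$ grows, because the BoK policy converges to a point mass on the highest-reward outcome.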

Regret Gap

The expected regret of BoK or SBoK can be decomposed into terms depending on proxy error, coverage constants, and KL divergence. In perfect-proxy regimes, larger $K$ and $\beta$ always help and hard BoK is optimal. With nonzero proxy error, intermediate smoothing $\beta$ can significantly mitigate regret. Empirical results confirm that, with weak proxies, large $K$ in hard BoK leads to overoptimization, whereas SBoK with intermediate $\beta$ restores true-reward performance (Aminian et al., 8 Jul 2025).

Decoding as Optimization

Recent work unifies BoK with other heuristic decoders by phrasing decoding as a regularized optimization on the probability simplex over tokens. The BoK objective is a KL-anchored coverage utility: maximize, over token distributions on the simplex, a combination of the base model score, a KL penalty anchoring the distribution to the base distribution, and the expected probability (coverage) of hitting important tokens across $K$ samples (Ji et al., 20 Feb 2026). The optimization can be implemented via entropic mirror ascent, adding minimal generation overhead and yielding strong empirical gains (e.g., +18.6% accuracy for Qwen2.5-Math-7B at high temperature on MATH500).
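Entropic mirror ascent on the simplex amounts to a multiplicative update followed by renormalization. The sketch below applies it to a toy objective that is an assumption made for illustration, not the objective of the cited paper: a linear score term, minus `lam` times the KL divergence to the base distribution `p`, plus `mu` times the probability that at least one of `k` samples hits a set of "important" tokens. All names and parameter values are hypothetical, and `p` is assumed strictly positive.

```python
import math
from typing import List, Sequence, Set


def objective(q: Sequence[float], p: Sequence[float], scores: Sequence[float],
              important: Set[int], k: int, lam: float, mu: float) -> float:
    """Toy KL-anchored coverage utility (illustrative form only)."""
    score = sum(qi * si for qi, si in zip(q, scores))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    mass = sum(q[i] for i in important)
    coverage = 1.0 - (1.0 - mass) ** k  # P(at least one of k samples is important)
    return score - lam * kl + mu * coverage


def mirror_ascent(p: Sequence[float], scores: Sequence[float], important: Set[int],
                  k: int, lam: float = 1.0, mu: float = 2.0,
                  step: float = 0.05, iters: int = 500) -> List[float]:
    """Entropic mirror ascent: q <- q * exp(step * grad), then renormalize."""
    q = list(p)  # start at the base distribution
    for _ in range(iters):
        mass = sum(q[i] for i in important)
        cov_grad = mu * k * (1.0 - mass) ** (k - 1)
        grad = [
            scores[i]
            - lam * (math.log(q[i] / p[i]) + 1.0)
            + (cov_grad if i in important else 0.0)
            for i in range(len(q))
        ]
        gmax = max(grad)  # shifting by a constant leaves the normalized update unchanged
        w = [q[i] * math.exp(step * (grad[i] - gmax)) for i in range(len(q))]
        z = sum(w)
        q = [x / z for x in w]
    return q


p = [0.7, 0.2, 0.1]
scores = [math.log(pi) for pi in p]  # base model log-probabilities as scores
q = mirror_ascent(p, scores, important={2}, k=4)
print([round(x, 3) for x in q])
```

The multiplicative form keeps every iterate strictly inside the simplex, so no projection step is needed. In this toy setup the coverage term shifts mass toward the important token while the KL term holds the result near `p`.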

5. Statistical Calibration and Safety in BoK Pipelines

Structural flaws in standard BoK for VLA action pipelines have been identified, particularly regarding safety: naive BoK sampling can select the "best among $K$ bad options" without recognizing that all candidates are unsafe. Conformal calibration frameworks such as BOKBO provide finite-sample, distribution-free guarantees on the executed-violation rate by adding an abstention layer (Singh et al., 28 May 2026). A learned violation predictor, conditioned on semantic and proprioceptive features, supports tight (conditional) calibration. The empirical results show that abstention via BOKBO meets target violation rates while maintaining coverage and net task success across distribution shifts and multiple seeds. Mondrian (per-task) calibration further improves worst-case task-wise safety.
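An abstention layer of this general shape can be sketched as a calibrated threshold on a predicted violation score. The code below is not BOKBO. It is a simplified stand-in in the style of conformal risk control, which picks the largest threshold whose inflated empirical executed-violation rate on a calibration set stays at or below a target `alpha`. The function names, calibration data, and scores are all hypothetical.

```python
from typing import List, Optional, Sequence


def calibrate_threshold(
    scores: Sequence[float], violated: Sequence[bool], alpha: float
) -> Optional[float]:
    """Largest tau with (violations among scores <= tau + 1) / (n + 1) <= alpha.

    Returns None when no threshold qualifies, meaning "always abstain".
    """
    n = len(scores)
    pairs = sorted(zip(scores, violated))
    best_tau: Optional[float] = None
    bad = 0
    i = 0
    while i < n:
        tau = pairs[i][0]
        j = i
        # Count every calibration point tied at this score before testing tau.
        while j < n and pairs[j][0] == tau:
            bad += int(pairs[j][1])
            j += 1
        if (bad + 1) / (n + 1) <= alpha:
            best_tau = tau
        else:
            break  # the count is nondecreasing in tau, so larger tau also fails
        i = j
    return best_tau


def select_or_abstain(
    candidates: Sequence[str],
    verifier_scores: Sequence[float],
    violation_scores: Sequence[float],
    tau: Optional[float],
) -> Optional[str]:
    """Pick the best-of-K candidate, but abstain if it looks unsafe."""
    best = max(range(len(candidates)), key=lambda i: verifier_scores[i])
    if tau is None or violation_scores[best] > tau:
        return None  # abstain instead of executing the "best of K bad options"
    return candidates[best]


cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_violated = [False, False, False, False, True, False, True, True, True]
tau = calibrate_threshold(cal_scores, cal_violated, alpha=0.25)
print(tau)
print(select_or_abstain(["a", "b", "c"], [0.2, 0.9, 0.5], [0.1, 0.7, 0.3], tau))
```

The key design point is that abstention is decided on the selected candidate's predicted risk, independently of its verifier rank, so a high verifier score cannot by itself force execution.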

6. Connections to Bandit Algorithms

In multi-armed and combinatorial bandit problems, BoK arises in selecting subsets with maximal aggregate reward or in identifying the top-$K$ arms. Under general dependence, lower bounds show that (in the worst case) exhaustive search over all subsets is required. However, for independent arms or when high-order dependencies are weak, efficient algorithms can exploit structure for nearly optimal sample complexity (Simchowitz et al., 2016). In fixed-confidence best-$K$ identification, algorithms leveraging LIL-based confidence intervals achieve minimax-optimal instance-dependent sample complexity, with clear guidelines for parameter selection and extensions to combinatorial structures (Jiang et al., 2017).

| Setting | BoK Mechanism | Main Theoretical Concern |
| LLM/Decoder alignment | Max proxy reward among $K$ | Overoptimization, KL bounds, regret |
| Bandit best-$K$ | Max mean reward subset | Sample complexity, lower bounds |
| VLA action selection | Max verifier among $K$ | Safety, abstention, calibration |
| Decoding simplex | KL-anchored coverage utility | Optimization, coverage, efficiency |

7. Practical Recommendations and Limitations

Recent theoretical and empirical advances provide guidance for deploying BoK and its soft variants:

  • With high-quality proxy reward models, use hard BoK (the large-$\beta$ limit of SBoK) and scale $K$ within a KL/compute budget; reward and regret then improve with $K$ at the rates derived by Aminian et al. (8 Jul 2025).
  • With noisy or misspecified proxies, employ moderate smoothing to mitigate overoptimization; tune the smoothing parameter by cross-validation.
  • In settings with strong safety constraints, use conformal calibration layers with learned violation predictors, abstaining when needed to keep executed-violation rates within prespecified thresholds (Singh et al., 28 May 2026).
  • When compute or sample budget is limited, moderate smoothing gives most of the BoK benefit at lower KL divergence.

Empirical evaluation confirms that BoK or its mirror-ascent simplex-optimization variant yields consistent gains in multi-sample pipelines, particularly at high sampling temperature or diversity. However, success depends critically on proxy fidelity, candidate diversity, and proper calibration. Open challenges remain for robust real-world integration, transfer of abstention predictors, and algorithmic efficiency in combinatorial bandit regimes (Aminian et al., 8 Jul 2025, Ji et al., 20 Feb 2026, Singh et al., 28 May 2026, Jiang et al., 2017, Simchowitz et al., 2016).
