Bayesian Evaluation for Pass@K Optimization

Updated 4 April 2026

Bayesian evaluation is a statistical framework that applies probabilistic modeling to pass@k metrics, providing posterior estimates, uncertainty intervals, and more reliable rankings at low sample counts.
It complements training-side techniques such as leave-one-out reward transformations and annealed-k training, which balance exploration with single-shot reliability in policy optimization.
Empirical results from methods such as PKPO and TA-GRPO indicate that variance-reduced estimators and a persistent training signal stabilize learning.

Pass@k, a standard metric in code generation, mathematical reasoning, and RL with verifiable rewards (RLVR), is the probability that at least one of $k$ independently sampled responses is correct; pass@k optimization targets this quantity directly. Unlike pass@1, which measures the expected correctness of a single sample, pass@k is a set-level objective that quantifies the collective utility of a batch, capturing both exploitation and exploration. Directly optimizing for pass@k involves joint reward transformations, unbiased low-variance estimators, and often annealed-k training to maximize both hard-problem solve rates and single-shot reliability. Work in the area centers on reward-transformation derivations, policy-gradient estimator design, formal variance analysis, convergence trade-offs, and the exploration–exploitation balance.

1. Mathematical Foundation and Estimator Design

Pass@k in the binary-reward case is

$\operatorname{pass@}k(\theta) = \mathbb{P}[\exists\,i\leq k : f(x_i) = 1] = \mathbb{E}_x\left[1 - \prod_{i=1}^k (1 - f(x_i))\right]$

for $k$ i.i.d. samples $x_1, \ldots, x_k \sim p(\cdot|\theta)$ and binary reward $f(x)\in\{0,1\}$. In the continuous case, it is

$G_k(\theta) = \mathbb{E}_x\Big[ \max_{i\leq k} g(x_i) \Big]$

where $g(x)\in\mathbb{R}$. Estimating pass@k from $n\geq k$ samples, of which $c$ are correct, yields the unbiased combinatorial estimator

$\rho(n,c,k) = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$
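The combinatorial estimator can be evaluated without forming large binomial coefficients by using the identity $\binom{n-c}{k}/\binom{n}{k} = \prod_{i=n-c+1}^{n} (1 - k/i)$. The following is a minimal sketch; the function name `pass_at_k` is chosen here for illustration and does not come from the cited works.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    # C(n-c, k) / C(n, k) written as a running product of factors below 1,
    # which avoids overflow from large factorials.
    ratio = 1.0
    for i in range(n - c + 1, n + 1):
        ratio *= 1.0 - k / i
    return 1.0 - ratio


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=4, c=2, k=2))
```

With $k=1$ the estimate reduces to the empirical success rate $c/n$, which is a quick sanity check on any implementation.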

For continuous rewards, the unbiased estimator is

$\hat{G}_k = \frac{1}{\binom{n}{k}} \sum_{i=k}^{n} \binom{i-1}{k-1}\, g_{(i)}$

where $g_{(1)} \leq \cdots \leq g_{(n)}$ are the $n$ sampled rewards sorted in ascending order and $\binom{i-1}{k-1}$ counts the $k$-subsets whose maximum is $g_{(i)}$; the weights $\binom{i-1}{k-1}/\binom{n}{k}$ can be computed as ratios, admitting a numerically stable form.
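For continuous rewards, an unbiased estimate of the expected maximum over $k$ samples is the average of the maximum over all $k$-subsets of the $n$ sampled rewards. After sorting, the $i$-th smallest reward is the maximum of exactly $\binom{i-1}{k-1}$ subsets, so the average collapses to a weighted sum. The sketch below illustrates this and checks it against brute-force enumeration; the function names are illustrative assumptions, not taken from the cited works.

```python
from itertools import combinations
from math import comb


def max_at_k(rewards, k):
    """Average of max(subset) over all k-subsets, via sorted order statistics."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    ordered = sorted(rewards)
    total = comb(n, k)
    estimate = 0.0
    for i, g in enumerate(ordered, start=1):
        # The i-th smallest reward is the maximum of C(i-1, k-1) subsets
        # (zero when i < k). Integer true division stays accurate for big ints.
        estimate += (comb(i - 1, k - 1) / total) * g
    return estimate


def max_at_k_bruteforce(rewards, k):
    """Reference implementation: enumerate every k-subset explicitly."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    sample = [0.2, 0.9, 0.5, 0.1]
    print(max_at_k(sample, 2), max_at_k_bruteforce(sample, 2))
```

The sorted form costs one sort plus a linear pass, whereas the brute-force reference grows combinatorially and is only useful for testing.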

Efficient and unbiased gradient estimation in policy-gradient RL for set-level pass@k requires reward transformations that map the batch to per-sample "shaped" rewards. The leave-one-out (loo, "sloo-1") transformation provides the lowest estimator variance in practice, involving combinatorial computations on the batch (of order $O(k + n\log n)$) without unstable factorial terms (Walder et al., 21 May 2025).
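To make the idea of a batch-to-sample reward transformation concrete, the sketch below computes shaped rewards by brute force. Each sample is credited with the set-level reward (the maximum) of every $k$-subset containing it, minus a baseline equal to the reward of that subset with the sample removed. Because the baseline does not depend on the sample itself, subtracting it leaves the policy-gradient estimate unbiased. This is an illustrative assumption about how a leave-one-out baseline can be formed: it enumerates subsets explicitly and is not the efficient closed form derived in the cited work, whose exact definition and normalization may differ.

```python
from itertools import combinations
from math import comb


def shaped_rewards_loo(rewards, k):
    """Brute-force per-sample rewards with a leave-one-out baseline.

    Hypothetical helper for illustration only; cost grows as C(n, k).
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    shaped = [0.0] * n
    for subset in combinations(range(n), k):
        best = max(rewards[j] for j in subset)
        for i in subset:
            others = [rewards[j] for j in subset if j != i]
            # Baseline: set-level reward without sample i (0 if nothing is left).
            baseline = max(others) if others else 0.0
            shaped[i] += best - baseline
    total = comb(n, k)
    return [s / total for s in shaped]


if __name__ == "__main__":
    # One correct sample out of three, optimizing pass@2.
    print(shaped_rewards_loo([1, 0, 0], 2))
```

In the binary example, only the correct sample receives credit, and incorrect samples that merely share a subset with it receive none, which is the variance-reducing effect a baseline is meant to provide.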

2. Policy Optimization and Gradient Characterization

The gradient of pass@k is analytically

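$\nabla_\theta \operatorname{pass@}k(\theta) = k\,(1 - p_\theta)^{k-1}\,\nabla_\theta p_\theta$

where $p_\theta = \mathbb{P}_{x\sim p(\cdot|\theta)}[f(x) = 1] = \operatorname{pass@}1(\theta)$ is the single-sample success probability for a given prompt, so that $\operatorname{pass@}k(\theta) = 1 - (1 - p_\theta)^k$.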

Critically, for standard RLVR, the pass@k gradient is everywhere collinear with the pass@1 gradient, differing only by a scalar reweighting that up-weights low-success prompts and down-weights easy ones (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). This implies that standard per-sample REINFORCE with binary rewards cannot induce fundamentally new directions in parameter space for pass@k relative to pass@1. For this reason, high-fidelity pass@k optimization requires transformation at the batch/set level—not mere up-weighting of per-sample rewards.
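The scalar reweighting can be made concrete. Since pass@k for a single prompt equals $1 - (1-p)^k$ with $p$ the pass@1 success probability, the chain rule gives a per-prompt weight of $k(1-p)^{k-1}$ on the pass@1 gradient. The short sketch below tabulates this weight; the function name is an illustrative assumption.

```python
def pass_at_k_weight(p: float, k: int) -> float:
    """Scalar multiplying the pass@1 gradient in the pass@k gradient."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("require 0 <= p <= 1 and k >= 1")
    return k * (1.0 - p) ** (k - 1)


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        weights = [round(pass_at_k_weight(p, k), 4) for k in (1, 4, 16)]
        print(f"p={p}: weights for k=1,4,16 -> {weights}")
```

For $k=1$ the weight is 1 for every prompt. For larger $k$ it grows on low-success prompts and shrinks toward zero on prompts that are already mostly solved, which is the up-weighting and down-weighting described in the text.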

Annealing $k$ provides a practical solution to the conflict between exploration and sharp single-sample performance: training with high $k$ initially, then reducing to $k=1$, yields strong coverage and single-answer reliability (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
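An annealing schedule of this kind can be as simple as interpolating the target $k$ from a large starting value down to 1 over training. The sketch below uses a linear schedule; the linear shape, the function name, and the parameters `k_start` and `k_end` are assumptions made for illustration, since the text does not specify the schedule used in the cited works.

```python
def annealed_k(step: int, total_steps: int, k_start: int, k_end: int = 1) -> int:
    """Linearly interpolate the pass@k target from k_start down to k_end."""
    if total_steps <= 1:
        return k_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    k = round(k_start + frac * (k_end - k_start))
    return max(k_end, k)


if __name__ == "__main__":
    schedule = [annealed_k(s, total_steps=10, k_start=8) for s in range(10)]
    print(schedule)
```

The value returned at each step would be passed as the $k$ used by the reward transformation for that update; step-wise or exponential schedules are equally plausible alternatives.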

3. Exploration–Exploitation, Failure Modes, and Remedies

Pass@k optimization, especially via transformed rewards, increases entropy and exploration, raising the probability of solving hard problems. Higher $k$ targets boost the diversity of responses, as evidenced by monotonic increases in solution rates and task generalization (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025). However, naive pass@k policy gradients can degrade pass@1 performance through prompt interference: up-weighted hard prompts contribute gradients antagonistic to average single-shot accuracy, especially if those prompts are "negatively interfering" (Barakat et al., 24 Feb 2026). This gradient conflict is quantifiable: for large enough $k$, the inner product between the prompt-averaged pass@k and pass@1 gradients becomes negative, and optimizing the former provably degrades the latter.

Mitigations include objective blending (mixing pass@1 and pass@k), weight tempering, and explicit gradient surgery (for example, projecting updates to preserve pass@1). Annealed-k schedules ameliorate the conflict in practice, and exploration can be further sustained via entropy bonuses, joint-set RL objectives, and advantage-shaping regularization (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025, Yu, 20 Nov 2025).
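Two of these remedies are easy to state on flattened gradient vectors. The sketch below shows a convex blend of the pass@1 and pass@k gradients, and a projection that removes the component of the pass@k gradient opposing the pass@1 gradient whenever their inner product is negative. The projection rule is a generic PCGrad-style step swapped in for illustration; the cited works may use a different correction, and all function names and the mixing parameter `lam` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def blend(g1, gk, lam):
    """Objective blending: (1 - lam) * pass@1 gradient + lam * pass@k gradient."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(g1, gk)]


def project_to_preserve_pass1(gk, g1):
    """Remove the part of gk that points against g1 (no-op if already aligned)."""
    inner = dot(gk, g1)
    norm_sq = dot(g1, g1)
    if inner >= 0.0 or norm_sq == 0.0:
        return list(gk)
    scale = inner / norm_sq
    return [b - scale * a for a, b in zip(g1, gk)]


if __name__ == "__main__":
    g1 = [1.0, 0.0]   # toy pass@1 gradient
    gk = [-1.0, 1.0]  # toy pass@k gradient that conflicts with g1
    print("alignment:", dot(gk, g1))
    print("projected:", project_to_preserve_pass1(gk, g1))
    print("blended:  ", blend(g1, gk, lam=0.5))
```

After projection the update is orthogonal to the pass@1 gradient, so to first order it no longer decreases pass@1 while still moving along the remaining pass@k direction.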

4. Algorithmic Instantiations and Empirical Results

Several algorithmic approaches for pass@k optimization have been established:

Pass-at-k Policy Optimization (PKPO) (Walder et al., 21 May 2025): Applies a batch-level reward transformation (leave-one-out "sloo-1") to enable direct, unbiased, numerically stable optimization of pass@k for arbitrary $k \leq n$. Empirical results on MATH and ARC-AGI benchmarks demonstrate higher entropy, improved cumulative solve rates, and, under annealed $k$, no trade-off in pass@1.
SimKO (Simple Pass@k Optimization) (Peng et al., 16 Oct 2025): Asymmetrically shapes the PPO loss at the token level, positively smoothing among the top-K alternatives at high-entropy positions (exploration), and penalizing overconfident wrong top-1 tokens (mitigates mode collapse). SimKO raises both pass@k (for large K) and pass@1 across logical and math benchmarks.
Advantage-shaping and surrogate reward maximization (Thrampoulidis et al., 27 Oct 2025): Shows all existing advantage-shaping approaches for pass@k (and more generally, hard-example up-weighting) can be interpreted as maximizing regularized surrogate rewards of the empirical group success rate.
TA-GRPO (Le et al., 30 Jan 2026): Pools advantages across semantically equivalent paraphrases of a prompt, ensuring persistent training signal (reduced zero-gradient probability) and subphrase diversity, translating to strong pass@k gains on both in-domain and out-of-distribution reasoning tasks.
Inference-time Best-of-Majority (BoM) (Di et al., 3 Oct 2025): For inference, BoM achieves minimax-optimal regret for pass@k by exploiting both majority-vote frequency filtering and reward model ranking among high-support candidates. BoM guarantees monotonic improvement with increasing N (sampling budget) and K.
Variator agent exploiting LLM inconsistency (Dalal et al., 19 May 2025): At inference, samples $k$ distinct paraphrased variants, each with potentially different latent success probability. The theoretical foundation shows exponential gain in pass@k over naive repeated sampling, confirmed empirically on coding and security benchmarks.
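Among the methods listed above, the inference-time Best-of-Majority idea is simple enough to sketch: keep only answers that appear often enough among the $N$ samples, then return the $K$ highest-scoring survivors under a reward model. The version below is a rough illustration of that two-stage filter. The frequency threshold `min_freq`, the function name, and the toy reward function are assumptions made here, and the actual selection rule and guarantees in the cited work may differ.

```python
from collections import Counter


def best_of_majority(samples, reward_fn, k, min_freq=0.1):
    """Frequency-filter candidate answers, then keep the top-k by reward."""
    if not samples:
        return []
    counts = Counter(samples)
    n = len(samples)
    # Stage 1: keep answers whose empirical frequency meets the threshold.
    supported = [answer for answer, c in counts.items() if c / n >= min_freq]
    # Stage 2: rank the surviving answers with the reward model.
    supported.sort(key=reward_fn, reverse=True)
    return supported[:k]


if __name__ == "__main__":
    samples = ["a"] * 5 + ["b"] * 4 + ["c"] * 1
    toy_reward = {"a": 0.2, "b": 0.9, "c": 1.0}.get  # stand-in for a reward model
    print(best_of_majority(samples, toy_reward, k=1, min_freq=0.2))
```

In the toy run, the rare answer "c" is discarded despite its high reward score, which reflects the intent of filtering out low-support candidates on which a reward model may be overconfident.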

Table: Comparison of Core Pass@K Methods

Algorithm	Level	Exploration Handling	Empirical Finding	Reference
PKPO	RL, batch	Joint reward, loo, anneal-$k$	Boosts pass@k, unblocks learning on hard set	(Walder et al., 21 May 2025)
SimKO	RL, token	Top-K smoothing, entropy-thresh.	Improves exploration, pass@K, pass@1	(Peng et al., 16 Oct 2025)
TA-GRPO	RL, group	Multi-variant pooling	Sustains gradients, boosts pass@k	(Le et al., 30 Jan 2026)
BoM	Inference	High-frequency + reward filter	Minimax-optimal regret, monotonicity	(Di et al., 3 Oct 2025)
Variator	Inference	Paraphrastic diversity	Pass@k lift via task-variantization	(Dalal et al., 19 May 2025)

5. Theoretical Analysis, Surrogate Losses, and Variance Considerations

Pass@k is a nonlinear, set-level metric, so direct stochastic gradient estimation has high variance—especially for rare events. Unbiasedness and variance reduction motivate sophisticated estimators:

Sloo-1 and leave-one-out batch transformations retain unbiasedness while lowering empirical estimator variance (Walder et al., 21 May 2025).
Analytical derivations allow closed-form calculation of group-level and response-level advantages under grouping or bootstrap sampling (Chen et al., 14 Aug 2025).
All advantage-shaping updates for pass@k (e.g., skewed regularization, entropy regularization) can be reverse-engineered as maximizing some explicit, differentiable surrogate reward function, aligning their population gradients with user-specified exploration–exploitation trade-offs (Thrampoulidis et al., 27 Oct 2025).

Empirically, variance-reduced estimators lead to more stable learning and efficient utilization of the multi-sample budget, in both supervised ranking (Top Pass (Lyu et al., 2024)) and RLVR settings.

6. Evaluation Protocols, Bayesian Approaches, and Limitations

Evaluation of pass@k is sensitive to the sampling regime, compute budget, and pooling method. Bayesian evaluation frameworks (Hariri et al., 5 Oct 2025) demonstrate that naive pass@k can yield unstable or misleading rankings at low sample counts; Bayesian inference using the Dirichlet-multinomial model yields more informative posterior means, uncertainty intervals, and faster, more reliable rank convergence, especially for binary and rubric-based outcomes.
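For binary outcomes, the Dirichlet-multinomial model reduces to a Beta-binomial model, which makes the Bayesian view easy to illustrate. With a Beta prior on the per-sample success probability $p$ and $c$ successes in $n$ trials, the posterior is $\mathrm{Beta}(\alpha_0 + c,\ \beta_0 + n - c)$, and the posterior mean of pass@k, $1 - \mathbb{E}[(1-p)^k]$, has a closed product form. The sketch below assumes a uniform prior ($\alpha_0 = \beta_0 = 1$) and approximates a credible interval by Monte Carlo. It is a minimal illustration under those assumptions, not the estimator of the cited framework, and the function names are hypothetical.

```python
import random


def posterior_mean_pass_at_k(n, c, k, alpha0=1.0, beta0=1.0):
    """Posterior mean of 1 - (1 - p)^k under a Beta(alpha0, beta0) prior."""
    alpha = alpha0 + c
    beta = beta0 + (n - c)
    # E[(1 - p)^k] for p ~ Beta(alpha, beta) is a product of k ratios.
    miss = 1.0
    for j in range(k):
        miss *= (beta + j) / (alpha + beta + j)
    return 1.0 - miss


def credible_interval_pass_at_k(n, c, k, level=0.95, draws=20000, seed=0,
                                alpha0=1.0, beta0=1.0):
    """Monte Carlo equal-tailed credible interval for pass@k."""
    rng = random.Random(seed)
    alpha = alpha0 + c
    beta = beta0 + (n - c)
    values = sorted(
        1.0 - (1.0 - rng.betavariate(alpha, beta)) ** k for _ in range(draws)
    )
    lo_idx = int((1.0 - level) / 2.0 * (draws - 1))
    hi_idx = int((1.0 + level) / 2.0 * (draws - 1))
    return values[lo_idx], values[hi_idx]


if __name__ == "__main__":
    print(posterior_mean_pass_at_k(n=10, c=3, k=4))
    print(credible_interval_pass_at_k(n=10, c=3, k=4))
```

Unlike the plug-in estimate, the posterior mean is defined even with zero samples (it returns the prior mean), and the interval width shows directly how little a handful of trials constrains pass@k.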

Empirical pass@k gains may be confounded by factors such as test-suite incompleteness (code), paraphrase drift (variant-based methods), and tuning instability (diffusion-based ODD (Lamont et al., 5 Mar 2026)). Recent work demonstrates that pass@k optimization without proper blending or gradient-correction can degrade pass@1, which may be operationally unacceptable (Barakat et al., 24 Feb 2026). Thus, practitioners must explicitly account for trade-offs induced by objective choice, sampling allocation, and estimator variance.

7. Practical Recommendations and Outlook

Apply batch-level reward transformations (e.g., sloo-1) for unbiased, stable pass@k optimization when multi-sample sampling is available (Walder et al., 21 May 2025), especially for RLVR on code/math tasks.
Use objective annealing (high $k$ early, $k=1$ late) to jointly maximize exploration and single-shot strength.
During inference, employ Best-of-Majority strategies, variant sampling (Variator), and diversity-promoting samplers (ODD for diffusion LMs) to maximize real-world pass@k (Di et al., 3 Oct 2025, Dalal et al., 19 May 2025, Lamont et al., 5 Mar 2026).
When evaluating at low N, prefer Bayesian posterior interval-based evaluation to avoid instability (Hariri et al., 5 Oct 2025).
Monitor gradient alignment ($\langle \nabla J_k, \nabla J_1 \rangle$) to avoid sacrificing single-shot reliability for coverage (Barakat et al., 24 Feb 2026).
For rank-based code evaluation, use pass@k-focused surrogate losses (e.g., Top Pass) instead of indirect classification (Lyu et al., 2024).
Avoid naive pass@k reward insertion in RLVR without explicit diversity enhancement or reward shaping, as theoretical and empirical work shows it is insufficient for exploration and may collapse to exploitation (Yu, 20 Nov 2025).

Pass@K optimization is an active research area, unifying estimator theory, gradient analysis, RL, supervised ranking, and inference-time algorithm design. Optimal deployment requires principled objective/estimator selection, batchwise reward shaping, and careful management of the exploitation–exploration spectrum.