Pass@k: Evaluation & Insights
- Pass@k is a metric that quantifies the probability that at least one out of k independent samples produces a correct result, widely used for evaluating large language models.
- It can be computed analytically or via unbiased estimators from multiple samples, though its high variance in low-sample regimes may yield unstable results.
- Recent research extends its theoretical foundations and proposes alternatives like Cover@τ and Bayesian methods to address reliability issues and enhance model ranking.
The Pass@k metric, also known as Pass-at-k, quantifies the probability that at least one out of k independent samples from a model yields a correct solution to a given problem. It has become a standard evaluation method for LLMs, especially in coding, mathematical reasoning, and other discrete-output tasks. Pass@k serves as a bridge between pure accuracy (Pass@1) and broader exploration, but it is susceptible to misinterpretation when applied outside its intended sampling regime. Recent research has both extended its mathematical foundations and highlighted its limitations, motivating alternative metrics such as Cover@τ.
1. Formal Definition and Mathematical Properties
Given a test suite of $N$ problems, let $p_i$ denote the probability that a single sample from a model will solve problem $i$. The Pass@k of the model at budget $k$ is

$$\text{Pass@}k = \frac{1}{N}\sum_{i=1}^{N}\left(1 - (1 - p_i)^k\right).$$

For a single problem, the metric reduces to $1 - (1 - p)^k$, where $p$ is the per-sample success probability. When multiple completions ($k > 1$) are drawn independently, Pass@k reflects the chance that at least one sample yields success.
As $k \to \infty$, $\text{Pass@}k \to 1$ for any $p_i > 0$, indicating that in the large-$k$ limit the metric saturates regardless of the true difficulty or reliability of the underlying model on each problem.
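As a minimal sketch (assuming the per-problem probabilities $p_i$ are known exactly, which is rarely the case in practice), the analytic form and its saturation behavior can be computed directly:

```python
import numpy as np

def pass_at_k_analytic(p, k: int) -> float:
    """Analytic Pass@k given exact per-problem success probabilities p_i:
    Pass@k = mean_i [1 - (1 - p_i)^k]."""
    p = np.asarray(p, dtype=float)
    return float(np.mean(1.0 - (1.0 - p) ** k))

# A model with p_i = 0.02 on every problem already exceeds 85% Pass@100,
# illustrating the large-k saturation discussed above.
print(pass_at_k_analytic(np.full(50, 0.02), k=100))  # ~0.867
```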
2. Statistical Estimation and Practical Computation
In practice, Pass@k can be computed either analytically (with access to exact $p_i$) or via an unbiased estimator when $n \geq k$ samples have been drawn per problem:

$$\widehat{\text{Pass@}k} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}}\right),$$

with $c_i$ the number of correct samples among the $n$ runs for problem $i$. This estimator is used widely in code generation and reasoning benchmarks.
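A common implementation of the per-problem estimator (using the numerically stable product form of $1 - \binom{n-c}{k}/\binom{n}{k}$ rather than raw binomial coefficients; the variable names here are illustrative) is:

```python
import numpy as np

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased per-problem Pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn, c: correct samples among them, k: budget (k <= n).
    """
    if n - c < k:
        return 1.0  # every k-subset must contain at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark-level estimate: average over problems, e.g. n = 200 samples each.
correct_counts = [3, 0, 57, 12]  # number of correct samples per problem
print(np.mean([pass_at_k_unbiased(200, c, k=10) for c in correct_counts]))
```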
Notably, the high variance of $\widehat{\text{Pass@}k}$ in regimes where $k$ approaches $n$ or when $n$ is small can yield unstable or misleading results, as emphasized in (Hariri et al., 5 Oct 2025). It is common to use a large $n$ (e.g., 256 or 300) to obtain stable estimates, but this is often computationally expensive.
3. Interpretation: Breadth vs. Depth and the Crossover Phenomenon
Pass@k is often interpreted as a "breadth" metric: it rewards any problem that can be solved at least once across $k$ samples, even if the success is due to random chance rather than robust reasoning. At low $k$ (e.g., $k = 1$), Pass@k measures average per-sample success, tightly coupling to model depth and reliability. At large $k$, it increasingly reflects whether a model's probability mass on the correct answer is nonzero, regardless of its magnitude.
A key empirical observation (the "crossover phenomenon" (Dragoi et al., 9 Oct 2025)): RL-fine-tuned models typically outperform base models on Pass@k at small $k$, but are overtaken by the base model as $k$ increases and random guessing dominates. On problems with discrete answer spaces, a base model can eventually enumerate the correct response at large $k$, yielding high Pass@k that overstates its true reasoning ability.
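A toy illustration of the crossover, with made-up per-problem success probabilities chosen purely for intuition (not drawn from any of the cited papers): an RL-tuned model that is reliable on a subset of problems wins at small $k$, while a base model with diffuse, low probabilities everywhere overtakes it once $k$ is large.

```python
import numpy as np

def pass_at_k(p, k: int) -> float:
    return float(np.mean(1.0 - (1.0 - np.asarray(p)) ** k))

# Hypothetical per-problem success probabilities (illustrative only):
# the RL model is reliable on 40% of problems, the base model is weakly
# correct everywhere.
p_rl   = np.array([0.95] * 40 + [0.00] * 60)
p_base = np.array([0.30] * 40 + [0.02] * 60)

for k in (1, 4, 64, 1024):
    print(k, round(pass_at_k(p_rl, k), 3), round(pass_at_k(p_base, k), 3))
# Pass@1 strongly favors the RL model; by k = 64 the base model's nonzero
# probability mass on the remaining problems pushes its Pass@k higher.
```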
4. Cover@τ: Reliability-Thresholded Generalization
To address the reliability ambiguity in Pass@k, (Dragoi et al., 9 Oct 2025) proposes Cover@τ:

$$\text{Cover@}\tau = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[p_i \geq \tau\right].$$

This measures the fraction of problems where the model's per-sample success probability is at least $\tau$, explicitly parameterizing the reliability requirement. $\tau \to 0^{+}$ corresponds to the problems with $p_i > 0$ (ever solved), while $\tau \to 1$ captures the problems solved almost always when sampled.
Theoretical connections relate the two:

$$\text{Pass@}k = \int_{0}^{1} \text{Cover@}\tau \; k(1-\tau)^{k-1}\, d\tau.$$

Thus, Pass@k can be interpreted as a (Beta$(1, k)$-distributed) weighted average over Cover@τ, with the weight concentrating near $\tau = 0$ as $k$ increases (i.e., emphasizing breadth over depth).
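A small sketch, assuming per-problem success rates have been estimated as $\hat{p}_i = c_i / n$ from sampling; the second half numerically checks the Beta$(1, k)$ weighting identity stated above.

```python
import numpy as np

def cover_at_tau(p_hat, tau: float) -> float:
    """Cover@tau: fraction of problems whose (estimated) per-sample success
    probability is at least tau."""
    return float(np.mean(np.asarray(p_hat) >= tau))

p_hat = np.array([0.0, 0.05, 0.3, 0.9])   # illustrative estimates c_i / n
k = 8

# Midpoint-rule check of: Pass@k = integral_0^1 Cover@tau * k(1-tau)^(k-1) dtau
taus = (np.arange(20_000) + 0.5) / 20_000
weights = k * (1.0 - taus) ** (k - 1)                    # Beta(1, k) density
cover = np.array([cover_at_tau(p_hat, t) for t in taus])
lhs = float(np.mean(1.0 - (1.0 - p_hat) ** k))           # Pass@k
rhs = float(np.mean(cover * weights))                    # weighted Cover@tau
print(round(lhs, 3), round(rhs, 3))                      # should agree closely
```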
5. Applications in Policy Optimization and RLVR
Pass@k is both an evaluation metric and an objective for direct optimization in reinforcement learning with verifiable rewards (RLVR). Recent work such as Pass@K Policy Optimization (PKPO) (Walder et al., 21 May 2025) and Pass@k Training (Chen et al., 14 Aug 2025) derives unbiased, low-variance estimators for Pass@k and its policy gradient, enabling principled sample-level reward transformations. Core steps include:
- For each problem: draw $n$ completions and score their correctness.
- Compute Pass@k unbiasedly across all $k$-sized subsets (using combinatorial estimators).
- Transform per-sample rewards to optimize for the best of $k$ over all subsets (joint rather than independent utility); a rough illustrative sketch follows this list.
- Anneal $k$ (e.g., $k$ starts large to prioritize exploration, then decreases to focus on exploitation), as this empirically lifts both Pass@1 and Pass@k on challenging tasks.
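The papers above derive exact unbiased reward transformations; the snippet below is only a rough illustration of the shaping idea (a jackknife-style contribution constructed here for exposition, not the estimator from either paper): each rollout is credited with the marginal change it induces in the group's unbiased Pass@k estimate.

```python
import numpy as np
from math import comb

def pass_at_k_est(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct (requires k <= n)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def jackknife_passk_advantages(correct, k: int):
    """Illustrative per-sample shaping (NOT the exact PKPO / Pass@k Training
    transformation): each sample's advantage is the drop in the group's
    Pass@k estimate when that sample is removed, so rollouts that add
    coverage receive positive credit and redundant failures negative credit."""
    correct = np.asarray(correct, dtype=int)
    n, c = len(correct), int(correct.sum())
    full = pass_at_k_est(n, c, k)
    return np.array([full - pass_at_k_est(n - 1, c - r, k) for r in correct])

rewards = [1, 0, 0, 1, 0, 0, 0, 0]           # binary correctness of 8 rollouts
print(jackknife_passk_advantages(rewards, k=4))
# correct rollouts get ~+0.214, incorrect ones ~-0.071 for this group
```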
Empirical results demonstrate that reward transformations and advantage shaping for Pass@k—as in (Thrampoulidis et al., 27 Oct 2025), which unifies direct REINFORCE and GRPO-style approaches—yield robust improvements, particularly in hard or low-entropy settings. Moreover, algorithms such as SimKO (Peng et al., 16 Oct 2025) explicitly counteract the probability-concentration effect, improving the diversity of reasoning paths and boosting Pass@k relative to vanilla RLVR methods.
6. Limitations, Misuses, and Bayesian Alternatives
Pass@k, while intuitive, is prone to several statistical pitfalls in standard practice:
- Variance and ranking instability: For small $n$ (or $k$ close to $n$), Pass@k estimates have high variance, producing inconsistent rankings as $k$ varies or between runs (Hariri et al., 5 Oct 2025).
- No confidence intervals: Pass@k does not yield analytic uncertainty estimates, requiring computationally intensive bootstrap for CIs.
- Ranking paradoxes: Model A may win at small $k$ but lose at larger $k$, or vice versa, when $n$ is small.
A Bayesian alternative, Bayes@N, models evaluation outcomes as categorical with Dirichlet priors, enabling closed-form posterior means and credible intervals (Hariri et al., 5 Oct 2025). Under a uniform prior, Bayes@N yields rankings equivalent to average accuracy (Pass@1), but with principled uncertainty and robustness to small-sample effects. The approach generalizes seamlessly to graded/partial-credit rubrics, supports prior integration, and is recommended for stable LLM evaluation.
| Metric | Convergence | Confidence intervals | Prior integration | Graded/categorical outcomes |
|---|---|---|---|---|
| Pass@k | slow, unstable | no | no | no |
| avg@N | medium | bootstrap/approximate | no | no |
| Bayes@N | fast, stable | yes (analytic) | yes | yes |
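As a minimal sketch of the Bayesian approach for binary pass/fail outcomes (which reduces the Dirichlet-categorical model to a Beta-Bernoulli one; this is an illustration, not the reference implementation from Hariri et al.), the posterior mean and credible interval are available in closed form:

```python
from scipy.stats import beta

def bayes_binary(correct: int, n: int, a: float = 1.0, b: float = 1.0):
    """Beta-Bernoulli special case of the Dirichlet-categorical model:
    posterior mean and 95% credible interval for the per-sample success
    rate after observing `correct` successes in n trials (uniform prior
    by default)."""
    post = beta(a + correct, b + n - correct)
    return post.mean(), post.interval(0.95)

mean, (lo, hi) = bayes_binary(correct=7, n=20)
print(f"posterior mean {mean:.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```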
7. Extensions, Inference Strategies, and Ranking Optimization
Recent research extends the application of Pass@k beyond direct evaluation:
- Inference strategies: Best-of-Majority (BoM) (Di et al., 3 Oct 2025) selects the $k$ most promising responses from a larger pool of samples, provably minimizing regret relative to the best achievable Pass@k under reward model and coverage constraints.
- Direct loss optimization: Top Pass (Lyu et al., 11 Aug 2024) trains pairwise ranking models to maximize Pass@k by adjusting the margin between the hardest negatives and the best positives, with robust surrogate losses and stabilization via auxiliary classification loss.
- Advantage shaping: Policy gradient methods can be unified using surrogate reward functions and regularizers (arcsin, entropy, etc.), allowing tailored emphasis on exploration or exploitation (Thrampoulidis et al., 27 Oct 2025).
A plausible implication is that, for tasks requiring both breadth (covering rare successes) and depth (consistent reasoning), hybrid reporting (Pass@1, Pass@k with $k$ matched to real-world sampling budgets, and Cover@τ at relevant thresholds) provides the most informative assessment.
References
- "Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries" (Dragoi et al., 9 Oct 2025)
- "Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems" (Walder et al., 21 May 2025)
- "SimKO: Simple Pass@K Policy Optimization" (Peng et al., 16 Oct 2025)
- "Best-of-Majority: Minimax-Optimal Strategy for Pass@ Inference Scaling" (Di et al., 3 Oct 2025)
- "Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models" (Chen et al., 14 Aug 2025)
- "Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients" (Thrampoulidis et al., 27 Oct 2025)
- "Don't Pass: A Bayesian Framework for LLM Evaluation" (Hariri et al., 5 Oct 2025)
- "Top Pass: Improve Code Generation by Pass@k-Maximized Code Ranking" (Lyu et al., 11 Aug 2024)