
Pass@k: Evaluation & Insights

Updated 10 November 2025
  • Pass@k quantifies the probability that at least one of k independent samples produces a correct result; it is widely used for evaluating large language models.
  • It can be computed analytically or via unbiased estimators from multiple samples, though its high variance in low-sample regimes may yield unstable results.
  • Recent research extends its theoretical foundations and proposes alternatives like Cover@τ and Bayesian methods to address reliability issues and enhance model ranking.

The Pass@k metric, also known as Pass-at-k, quantifies the probability that at least one out of k independent samples from a model yields a correct solution to a given problem. It has become a standard evaluation method for LLMs, especially in coding, mathematical reasoning, and other discrete output tasks. Pass@k serves as a bridge between pure accuracy (Pass@1) and broader exploration, but is susceptible to misinterpretations when applied outside its intended sampling regime. Recent research has both extended its mathematical foundations and highlighted its limitations, motivating alternative metrics such as Cover@τ.

1. Formal Definition and Mathematical Properties

Given a test suite of $T$ problems, let $p_i \in [0,1]$ denote the probability that a single sample from a model will solve problem $i$. The Pass@k for model $M$ and budget $k$ is:

$$\mathrm{Pass@}k = \frac{1}{T} \sum_{i=1}^{T} \Pr[\,\text{≥1 success in } k \text{ trials on } i\,] = \frac{1}{T} \sum_{i=1}^{T} \Bigl[1 - (1 - p_i)^k\Bigr]$$

For a single problem, the metric reduces to $1 - (1-p)^k$, where $p$ is the per-sample success probability. When $k$ completions are drawn independently, Pass@k reflects the chance that at least one of them succeeds.

As $k \rightarrow \infty$, $\mathrm{Pass@}k \rightarrow 1$ for any $p_i > 0$, indicating that in the large-$k$ limit, the metric saturates regardless of the true difficulty or reliability of the underlying model on each problem.
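As a concrete illustration of the definition and this saturation behavior, here is a minimal Python sketch assuming the per-problem success probabilities $p_i$ are known exactly (the probabilities below are invented for illustration):

```python
# Minimal sketch of analytic Pass@k, assuming exact per-problem
# success probabilities p_i; the probabilities below are invented.
import numpy as np

def pass_at_k_analytic(p: np.ndarray, k: int) -> float:
    """Average over problems of 1 - (1 - p_i)^k."""
    return float(np.mean(1.0 - (1.0 - p) ** k))

# Saturation: even a problem solved 1% of the time is almost surely
# solved at least once when k is large.
p = np.array([0.9, 0.1, 0.01])
for k in (1, 10, 100, 1000):
    print(k, round(pass_at_k_analytic(p, k), 4))
```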

2. Statistical Estimation and Practical Computation

In practice, Pass@k can be computed either analytically (with access to exact $p_i$) or via an unbiased estimator when $n \ge k$ samples have been drawn:

$$\widehat{\mathrm{Pass@}k} = \frac{1}{T}\sum_{i=1}^{T} \left[1 - \frac{\binom{n-c_i}{k}}{\binom{n}{k}}\right]$$

where $c_i$ is the number of correct samples among $n$ runs for problem $i$. This estimator is widely used in code generation and reasoning benchmarks.

Notably, the high variance of $\widehat{\mathrm{Pass@}k}$ in regimes where $n \approx k$ or when $T$ is small can yield unstable or misleading results, as emphasized in (Hariri et al., 5 Oct 2025). It is common to use a large $n$ (e.g., 256 or 300) to obtain stable estimates, but this is often computationally expensive.
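For concreteness, here is a sketch of the estimator in Python, using the numerically stable product form of the binomial ratio rather than raw factorials (function names are illustrative):

```python
# Sketch of the unbiased Pass@k estimator; uses the identity
# C(n-c, k) / C(n, k) = prod_{i=n-c+1}^{n} (1 - k/i) for stability.
import numpy as np

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased estimate of 1 - C(n - c, k) / C(n, k) for one problem."""
    if n - c < k:
        return 1.0  # every k-subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def pass_at_k_suite(counts, n: int, k: int) -> float:
    """Average the per-problem estimates over a test suite."""
    return sum(pass_at_k_unbiased(n, c, k) for c in counts) / len(counts)

# e.g. three problems, n = 20 samples each, c_i = 15, 2, 0 correct:
print(pass_at_k_suite([15, 2, 0], n=20, k=5))
```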

3. Interpretation: Breadth vs. Depth and the Crossover Phenomenon

Pass@k is often interpreted as a "breadth" metric: it rewards any problem that can be solved at least once across $k$ samples, even if the success is due to random chance rather than robust reasoning. At low $k$ (e.g., $k=1$), Pass@k measures average per-sample success, which couples it tightly to model depth and reliability. At large $k$, it increasingly reflects whether a model's probability mass on the correct answer is nonzero, regardless of its magnitude.

A key empirical observation (the "crossover phenomenon" (Dragoi et al., 9 Oct 2025)): RL-fine-tuned models typically outperform base models on Pass@k at small $k$, but are overtaken by the base model as $k$ increases and random guessing dominates. On problems with discrete answer spaces, a base model can eventually enumerate the correct response at large $k$, yielding a high Pass@k that overstates its true reasoning ability.
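A toy simulation makes the crossover concrete. The success probabilities below are invented: an "RL" model that is reliable on half the problems and never solves the rest, versus a "base" model with small but nonzero mass everywhere:

```python
# Toy illustration of the crossover phenomenon; probabilities invented.
import numpy as np

rl_p   = np.array([0.8]  * 50 + [0.0]  * 50)  # reliable on half, never on rest
base_p = np.array([0.25] * 50 + [0.02] * 50)  # weak but nonzero everywhere

def pass_at_k(p: np.ndarray, k: int) -> float:
    return float(np.mean(1 - (1 - p) ** k))

for k in (1, 4, 16, 64, 256):
    print(f"k={k:4d}  RL={pass_at_k(rl_p, k):.3f}  base={pass_at_k(base_p, k):.3f}")
```

At $k=1$ the RL model wins (0.400 vs. 0.135), but by $k=16$ the base model has overtaken it, precisely because Pass@k at large $k$ only asks whether $p_i > 0$.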

4. Cover@τ: Reliability-Thresholded Generalization

To address the reliability ambiguity in Pass@k, (Dragoi et al., 9 Oct 2025) proposes Cover@τ:

$$\mathrm{Cover@}\tau = G(\tau) = \frac{1}{T}\sum_{i=1}^{T} \mathbf{1}\{p_i \geq \tau\}$$

This measures the fraction of problems on which the model's per-sample success probability is at least $\tau$, explicitly parameterizing the reliability requirement. $G(0^+)$ corresponds to the problems with $p_i > 0$ (ever solved), while $G(1)$ captures the problems solved almost always when sampled.

Theoretical connections relate the two:

$$\mathrm{Pass@}k = \int_{0}^{1} k(1-\tau)^{k-1}\, G(\tau)\, d\tau$$

Thus, Pass@k can be interpreted as a weighted average of Cover@τ under a $\mathrm{Beta}(1,k)$ weight, which concentrates near $\tau=0$ as $k$ increases (i.e., emphasizing breadth over depth).
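The identity can be checked numerically with synthetic $p_i$; the following sketch uses a coarse Riemann sum, so agreement is approximate:

```python
# Numerical check of Pass@k = ∫ k (1 - tau)^(k-1) G(tau) d tau,
# using synthetic per-problem success probabilities.
import numpy as np

p = np.random.default_rng(0).uniform(0, 1, size=200)
k = 8

pass_at_k = float(np.mean(1 - (1 - p) ** k))

taus = np.linspace(0.0, 1.0, 4001)
g = (p[None, :] >= taus[:, None]).mean(axis=1)  # Cover@tau curve G(tau)
dtau = taus[1] - taus[0]
integral = float(np.sum(k * (1 - taus) ** (k - 1) * g) * dtau)

print(pass_at_k, integral)  # should agree to a few decimal places
```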

5. Applications in Policy Optimization and RLVR

Pass@k is both an evaluation metric and an objective for direct optimization in reinforcement learning with verifiable rewards (RLVR). Recent work such as Pass@K Policy Optimization (PKPO) (Walder et al., 21 May 2025) and Pass@k Training (Chen et al., 14 Aug 2025) derives unbiased, low-variance estimators for Pass@k and its policy gradient, enabling principled sample-level reward transformations. The core steps are as follows (a simplified sketch appears after the list):

  • For each problem: draw $n \ge k$ completions and score their correctness.
  • Compute Pass@k unbiasedly across all $k$-sized subsets (using combinatorial estimators).
  • Transform per-sample rewards to optimize for the best of $k$ over all subsets (joint rather than independent utility).
  • Anneal $k$ (e.g., start with a large $k$ to prioritize exploration, then decrease it to focus on exploitation); this empirically lifts both Pass@1 and Pass@k on challenging tasks.
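The exact low-variance transformation is derived in the PKPO paper; as a rough, simplified illustration only, one can shape per-sample advantages as the change each sample induces in the group Pass@k estimate relative to a leave-one-out baseline:

```python
# Simplified sketch of sample-level shaping for a Pass@k objective.
# NOT the exact PKPO transformation: advantages here are the group
# Pass@k estimate minus a leave-one-out baseline.
import numpy as np

def pass_at_k_est(rewards, k: int) -> float:
    """Unbiased Pass@k estimate from binary rewards (1 = correct)."""
    n, c = len(rewards), int(sum(rewards))
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def pass_at_k_advantages(rewards, k: int):
    """Per-sample advantage: full estimate minus leave-one-out estimate."""
    full = pass_at_k_est(rewards, k)
    return [full - pass_at_k_est(rewards[:j] + rewards[j + 1:], k)
            for j in range(len(rewards))]

rewards = [1, 0, 0, 1, 0, 0, 0, 0]         # n = 8 samples, 2 correct
print(pass_at_k_advantages(rewards, k=4))  # correct samples get credit
```

Note the joint structure: a correct sample's advantage shrinks as other correct samples appear in the same group, unlike independent per-sample rewards.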

Empirical results demonstrate that reward transformations and advantage shaping for Pass@k—as in (Thrampoulidis et al., 27 Oct 2025), which unifies direct REINFORCE and GRPO-style approaches—yield robust improvements, particularly in hard or low-entropy settings. Moreover, algorithms such as SimKO (Peng et al., 16 Oct 2025) explicitly counteract the probability-concentration effect, improving the diversity of reasoning paths and boosting Pass@k relative to vanilla RLVR methods.

6. Limitations, Misuses, and Bayesian Alternatives

Pass@k, while intuitive, is prone to several statistical pitfalls in standard practice:

  • Variance and ranking instability: For small $n$ or $T$, Pass@k has high variance, producing inconsistent rankings as $k$ varies or between runs (Hariri et al., 5 Oct 2025).
  • No confidence intervals: Pass@k does not yield analytic uncertainty estimates, requiring computationally intensive bootstrapping for CIs.
  • Ranking paradoxes: Model A may win at $k=2$ but lose at $k=4$, or vice versa, when $n$ is small.

A Bayesian alternative, Bayes@N, models evaluation outcomes as categorical with Dirichlet priors, enabling closed-form posterior means and credible intervals (Hariri et al., 5 Oct 2025). Under a uniform prior, Bayes@N yields rankings equivalent to average accuracy (Pass@1), but with principled uncertainty quantification and robustness to small-sample effects. The approach generalizes to graded/partial-credit rubrics, supports prior integration, and is recommended for stable LLM evaluation.
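As a minimal sketch of the binary-outcome special case, where the Dirichlet prior reduces to a Beta prior with a closed-form posterior (the full method handles richer categorical and graded outcomes):

```python
# Binary special case of the Bayesian approach: a Beta(1, 1) uniform
# prior over a per-problem success rate gives posterior Beta(1+c, 1+n-c).
from scipy.stats import beta

def bayes_binary(n: int, c: int, level: float = 0.95):
    a, b = 1 + c, 1 + (n - c)
    mean = a / (a + b)  # = (c + 1) / (n + 2) under the uniform prior
    lo, hi = beta.ppf([(1 - level) / 2, (1 + level) / 2], a, b)
    return mean, (float(lo), float(hi))

print(bayes_binary(n=20, c=15))  # posterior mean and 95% credible interval
```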

| Metric  | Convergence    | CIs         | Prior | Categorical |
|---------|----------------|-------------|-------|-------------|
| Pass@k  | slow, unstable | no          | no    | no          |
| avg@N   | medium         | boot/approx | no    | no          |
| Bayes@N | fast, stable   | yes         | yes   | yes         |

7. Extensions, Inference Strategies, and Ranking Optimization

Recent research extends the application of Pass@k beyond direct evaluation:

  • Inference strategies: Best-of-Majority (BoM) (Di et al., 3 Oct 2025) selects the $k$ most promising responses from $N$ samples, provably minimizing regret (defined as $1-\mathrm{Pass@}k$) under reward model and coverage constraints.
  • Direct loss optimization: Top Pass (Lyu et al., 11 Aug 2024) trains pairwise ranking models to maximize Pass@k by adjusting the margin between the hardest negatives and the best positives, with robust surrogate losses and stabilization via auxiliary classification loss.
  • Advantage shaping: Policy gradient methods can be unified using surrogate reward functions and regularizers (arcsin, entropy, etc.), allowing tailored emphasis on exploration or exploitation (Thrampoulidis et al., 27 Oct 2025).

A plausible implication is that, for tasks requiring both breadth (covering rare successes) and depth (consistent reasoning), hybrid reporting—Pass@1, Pass@k with kk matched to real-world sampling budgets, and Cover@τ at relevant thresholds—provides the most informative assessment.
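Such a hybrid report is straightforward to assemble from estimated per-problem success rates; the sketch below is illustrative only (thresholds, budgets, and names are arbitrary choices):

```python
# Hedged sketch of a hybrid report: Pass@1, Pass@k at a deployment
# budget, and Cover@tau at chosen reliability thresholds.
import numpy as np

def hybrid_report(p: np.ndarray, k: int, taus=(0.1, 0.5, 0.9)) -> dict:
    report = {
        "Pass@1": float(np.mean(p)),
        f"Pass@{k}": float(np.mean(1 - (1 - p) ** k)),
    }
    for tau in taus:
        report[f"Cover@{tau}"] = float(np.mean(p >= tau))
    return report

p_hat = np.array([0.9, 0.6, 0.3, 0.05, 0.0])  # e.g. c_i / n from n samples
print(hybrid_report(p_hat, k=8))
```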
