Pass@k: Metric for LLM Success & Optimization

Updated 3 July 2026

Pass@k is defined as the probability that at least one of k independent outputs is correct, calculated using the formula 1 - (1 - p)^k.
It plays a crucial role in evaluating LLM performance in tasks like code synthesis, mathematical reasoning, and agentic tool-use by emphasizing diversity and exploration.
Variants and optimization strategies, including unbiased estimators and Bayesian frameworks, enhance Pass@k's robustness and mitigate issues like gradient vanishing and output saturation.

Pass@k is a standard metric for evaluating and optimizing the success probability of LLMs in settings where multiple independent outputs are sampled and a “verifier” can determine correctness. It quantifies the probability that at least one out of k independently drawn outputs is correct according to a task-specific verifier—such as passing all unit tests in code generation or producing the correct answer in mathematical reasoning. Pass@k forms the foundation for much of the recent progress and diagnostic analysis in verifiable model reasoning, code synthesis, and RL with verifiable rewards (RLVR), and has catalyzed both methodological innovation and theoretical scrutiny.

1. Formal Definition and Statistical Properties

Let $p$ be the per-sample correctness probability—i.e., the likelihood that a single model output is correct. Then the Pass@k probability is defined as

$\mathrm{Pass}@k = 1 - (1 - p)^k$

When k independent samples are drawn, Pass@k measures the probability that at least one is correct. For evaluation with a finite sample pool of $n \geq k$ candidates— $c$ of which are correct—the unbiased estimator is

$\widehat{\mathrm{Pass}@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$

This estimator averages over all k-subsets of the pool, which corresponds to sampling without replacement (Dalal et al., 19 May 2025, Lyu et al., 2024, Zhai et al., 16 Apr 2026). Pass@k reduces to standard accuracy at $k=1$ and approaches a capability indicator (solvability) as $k\to n$ (Zhai et al., 16 Apr 2026). It is ubiquitous in evaluating LLMs on code (APPS, HumanEval), math (MATH, AIME), and agentic tool-use tasks.
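As a minimal sketch of the estimator above, the following Python computes it for one problem and averages it over a benchmark. The function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input layout are assumptions made for illustration, not part of any cited codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate 1 - C(n - c, k) / C(n, k) for a single problem.

    n: number of sampled candidates, c: number that passed the verifier,
    k: attempt budget being evaluated (requires 1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every k-subset must contain at least one correct candidate.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimate; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical benchmark: three problems, ten samples each.
    print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=5))
```

Exact integer binomials keep the sketch short; for very large $n$ a product form of the same ratio may be preferable to avoid building huge integers.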

2. Pass@k in Training Objectives and RLVR

In RLVR, direct optimization of Pass@k has been widely explored. The canonical policy gradient for Pass@k in the RL setting is

$\nabla_\theta J_k(x; \theta) = k\, (1 - J_1(x; \theta))^{k-1}\, \nabla_\theta J_1(x; \theta)$

where $J_1$ is the single-sample success probability (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). Thus, the policy gradient for Pass@k is a per-example reweighting of the Pass@1 gradient by the scalar $k\,(1 - J_1)^{k-1}$: it amplifies the learning signal on difficult prompts, where $J_1$ is small, but provides no new search direction in parameter space.
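The reweighting follows from the chain rule applied to $1 - (1 - J_1)^k$. The short Python check below compares the closed-form factor with a finite-difference derivative; the scalar `p` stands in for $J_1(x;\theta)$, and all function names are illustrative assumptions.

```python
def pass_at_k_from_p(p: float, k: int) -> float:
    """Pass@k for k independent samples with per-sample success rate p."""
    return 1.0 - (1.0 - p) ** k


def gradient_weight(p: float, k: int) -> float:
    """Chain-rule factor d Pass@k / d p = k * (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)


def finite_difference(p: float, k: int, eps: float = 1e-6) -> float:
    """Central-difference approximation of d Pass@k / d p."""
    return (pass_at_k_from_p(p + eps, k) - pass_at_k_from_p(p - eps, k)) / (2 * eps)


if __name__ == "__main__":
    k = 8
    for p in (0.05, 0.5, 0.95):
        # The two columns should agree closely; the weight shrinks as p grows.
        print(p, gradient_weight(p, k), finite_difference(p, k))
```

The weight is largest for prompts with low success probability and decays toward zero as $J_1 \to 1$, which is the per-example reweighting described above.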

Several works introduce unbiased low-variance estimators, leave-one-out and subset baselines (Walder et al., 21 May 2025), and surrogate reward gradients (Thrampoulidis et al., 27 Oct 2025), enabling robust optimization of Pass@k for arbitrary k, not just $k=n$.

However, naive Pass@k optimization has inherent limitations:

Vanishing Signal in Exploration Regimes: When per-sample success is rare ($J_1(x; \theta) \approx 0$), policy gradients vanish precisely when exploration is most critical (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026).
Gradient Conflict with Pass@1: Optimizing Pass@k can degrade Pass@1 due to reweighting toward hard, negatively interfering prompts, as characterized by prompt-similarity kernels and agreement score statistics (Barakat et al., 24 Feb 2026).
Trade-offs and Annealing: Annealing k from large to small during training or combining Pass@1 and Pass@k advantages can balance exploration and exploitation (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
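To make the annealing idea concrete, here is a minimal sketch of a schedule that lowers k from a large starting value to 1 over training. The linear shape, the function name `annealed_k`, and the defaults `k_start=8` and `k_end=1` are assumptions for illustration; the cited works may use different schedules.

```python
def annealed_k(step: int, total_steps: int, k_start: int = 8, k_end: int = 1) -> int:
    """Linearly interpolate k from k_start down to k_end, rounded to an integer.

    Hypothetical schedule: early steps optimize a large-k objective
    (exploration), late steps approach Pass@1 (exploitation).
    """
    if total_steps <= 1:
        return k_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return max(k_end, round(k_start + frac * (k_end - k_start)))


if __name__ == "__main__":
    total = 10
    print([annealed_k(s, total) for s in range(total)])
```

The returned value would be passed to whichever Pass@k objective or estimator the training loop uses at that step.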

3. Practical Algorithms and Diversity-aware Variants

Pass@k-driven training has motivated multiple algorithmic innovations:

Method	Key Feature	Pass@k Behavior
PKPO	Unbiased pass@k estimator	Boosts all k, anneals
SimKO	Top-K asymmetric gradient	Mitigates overconcentration, higher span
GRPO	Group-normed advantages	Effective at k=1
CPPO	Strategy-level k diversity	Outperforms independent sampling at k=4
BoM (Inference)	Minimax-optimal regret	Scaling-monotonicity

Pass@k-aware algorithms frequently combine rewards for correctness and diversity, e.g., by rewarding JPlag-based code distinctness (Florian et al., 27 May 2026), or by shaping the token-level update to avoid collapse onto a single reasoning mode (SimKO) (Peng et al., 16 Oct 2025). In diffusion LMs, sample-level feature repulsion (ODD) directly improves Pass@k by penalizing redundancy during sampling (Lamont et al., 5 Mar 2026).

Additionally, “Variator” agents exploit LLM inconsistency by generating k paraphrased prompts and sampling one solution per variant, which can outperform standard repeated sampling (“Repeater”) especially on hard prompts or in large-k regimes (Dalal et al., 19 May 2025).
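One way to see why prompt variation can help is a toy calculation, sketched below under a strong assumption: paraphrasing changes the per-sample success rate from prompt to prompt but leaves its average unchanged. Under that assumption, one sample from each of k variants succeeds at least as often as k samples from a single prompt with the average rate (a consequence of the AM-GM inequality). The function names and the example probabilities are hypothetical, and real paraphrases may raise or lower the average.

```python
from math import prod


def repeater_success(p: float, k: int) -> float:
    """k independent samples from one prompt with success rate p."""
    return 1.0 - (1.0 - p) ** k


def variator_success(variant_probs) -> float:
    """One sample from each prompt variant, with per-variant success rates."""
    return 1.0 - prod(1.0 - q for q in variant_probs)


if __name__ == "__main__":
    # Hypothetical per-variant success rates with mean 0.2.
    variant_probs = [0.0, 0.1, 0.2, 0.5]
    mean_p = sum(variant_probs) / len(variant_probs)
    print("repeater:", repeater_success(mean_p, len(variant_probs)))
    print("variator:", variator_success(variant_probs))
```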

4. Diagnostics, Limitations, and Bayesian Alternatives

Pass@k serves as both an evaluation standard and a diagnostic for exploration:

The gap between Pass@k and Pass@1 indicates remaining latent diversity in the policy; a shrinking gap suggests mode collapse (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025).
For small or moderate k, Pass@k is a practical efficiency/robustness indicator, but as k increases, it approaches a degenerate regime where any nonzero model probability for the correct answer yields Pass@k ≈ 1, regardless of model skill (“degeneracy at large k”) (Dragoi et al., 9 Oct 2025).
In discrete spaces, high-k Pass@k only reflects whether the correct answer is sampled at least once, not reasoning reliability (Dragoi et al., 9 Oct 2025).
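The large-k degeneracy can be illustrated directly from $1 - (1 - p)^k$. The sketch below tabulates Pass@k and the gap to Pass@1 for a weak model; the value `p = 0.01` and the function names are assumptions chosen for illustration.

```python
def pass_at_k_from_p(p: float, k: int) -> float:
    """Pass@k for k independent samples with per-sample success rate p."""
    return 1.0 - (1.0 - p) ** k


def pass_gap(p: float, k: int) -> float:
    """Gap between Pass@k and Pass@1, used above as a diversity diagnostic."""
    return pass_at_k_from_p(p, k) - pass_at_k_from_p(p, 1)


if __name__ == "__main__":
    p = 0.01  # hypothetical model that is right once in a hundred samples
    for k in (1, 10, 100, 1000):
        print(k, pass_at_k_from_p(p, k), pass_gap(p, k))
```

Even at a 1% per-sample success rate, Pass@1000 exceeds 0.9999, so the metric at that budget says little about how reliably the model reasons.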

Recent work has proposed replacing Pass@k with Bayesian evaluation frameworks that report the posterior mean and credible intervals of the success probability under a Dirichlet prior. This approach (Bayes@N) yields more stable model rankings—especially in small-sample regimes—and provides well-calibrated uncertainty, making it preferable when leaderboard stability and statistical claims are paramount (Hariri et al., 5 Oct 2025).
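For binary correct/incorrect outcomes, a Dirichlet prior reduces to a Beta prior, which allows a small illustration of posterior-based reporting. The sketch below is a simplified stand-in and not the Bayes@N procedure of the cited work: the uniform Beta(1, 1) prior, the Monte Carlo interval, and the name `beta_posterior_summary` are all assumptions.

```python
import random


def beta_posterior_summary(n, c, alpha=1.0, beta=1.0,
                           draws=20000, level=0.95, seed=0):
    """Posterior mean and equal-tailed credible interval for the success rate.

    n trials with c successes and a Beta(alpha, beta) prior give a
    Beta(alpha + c, beta + n - c) posterior. The interval is estimated
    from sorted Monte Carlo draws, using only the standard library.
    """
    a, b = alpha + c, beta + (n - c)
    mean = a / (a + b)
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return mean, (samples[lo_idx], samples[hi_idx])


if __name__ == "__main__":
    mean, (lo, hi) = beta_posterior_summary(n=10, c=3)
    print(f"posterior mean {mean:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

With few samples the interval is wide, which makes explicit the uncertainty that a point estimate of Pass@k hides.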

5. Pass@k in Agentic and Tool-use Settings

Static Pass@k curves saturate at large k, where the curves of RL-trained and base models converge. However, in compositional and sequential tool-use environments where agents interact for up to T steps, the metric generalizes to PASS@(k,T):

$\mathrm{PASS}@(k,T) = 1 - \frac{\binom{n-c_T}{k}}{\binom{n}{k}}$

where $c_T$ is the number of correct trajectories of length at most $T$ (Zhai et al., 16 Apr 2026).
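One plausible reading of PASS@(k,T), assumed here rather than taken from the cited work, applies the finite-pool estimator from Section 1 after counting only those correct trajectories that finish within the interaction budget T. The trajectory layout `(is_correct, num_steps)` and the name `pass_at_k_T` are hypothetical.

```python
from math import comb


def pass_at_k_T(trajectories, k: int, T: int) -> float:
    """Estimate PASS@(k,T) for one task from n sampled trajectories.

    trajectories: list of (is_correct, num_steps) pairs (assumed layout).
    A trajectory counts as a success only if it is correct and used at
    most T interaction steps.
    """
    n = len(trajectories)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of trajectories")
    c_T = sum(1 for ok, steps in trajectories if ok and steps <= T)
    return 1.0 - comb(n - c_T, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical pool: two correct trajectories, one of which needs 3 steps.
    pool = [(True, 0), (True, 3), (False, 1), (False, 5)]
    for T in (0, 3):
        print(T, [pass_at_k_T(pool, k, T) for k in (1, 2)])
```

Sweeping both arguments gives the two-dimensional curve: k varies the sampling budget while T varies how much environment interaction is allowed.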

Key findings:

In pure static reasoning (T=0), RL-trained and base models’ PASS@(k,T) curves merge at large k.
In tool-use with compositional retrieval (T>0), RL expands the “capability boundary”: at large k, RL pass-curves pull above base models, signifying genuine discovery of new solution strategies as opposed to mere efficiency gains (Zhai et al., 16 Apr 2026).

The two-dimensional lens (k,T) decomposes sample efficiency and environment-interaction capability, reconciling prior pessimistic and optimistic results.

6. Best Practices and Emerging Recommendations

Use Pass@k at small to moderate k for reliability diagnostics and in user-facing scenarios where few attempts are feasible.
Avoid direct Pass@k objective optimization at large k unless accompanied by explicit diversity mechanisms or when guided by low-variance leave-one-out estimators (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025).
Consider combining Pass@k and Pass@1 objectives, annealing k during training, or using surrogate/advantage shaping for robust exploration-exploitation tradeoff (Chen et al., 14 Aug 2025, Thrampoulidis et al., 27 Oct 2025).
For code and math with discrete targets, treat Pass@k saturation with caution; supplement with Cover@τ for reliability-depth tradeoffs (Dragoi et al., 9 Oct 2025).
Consider Bayesian/posterior-based evaluation for robust and confidence-calibrated reporting (Hariri et al., 5 Oct 2025).
In agentic settings, evaluate with PASS@(k,T) to assess both exploration and compositional capability expansion (Zhai et al., 16 Apr 2026).

Pass@k remains a cornerstone of LLM evaluation and optimization methodology—its statistical formalism, limitations, and practical realization have guided the development of efficient, robust, and diversity-aware reasoning agents (Dalal et al., 19 May 2025, Yu, 20 Nov 2025, Walder et al., 21 May 2025, Peng et al., 16 Oct 2025, 2610.23049, Li et al., 26 May 2026, Florian et al., 27 May 2026).