Pass@$k$ Metric in Code Generation & RL

Updated 6 October 2025
  • Pass@$k$ is a probabilistic evaluation metric that measures whether at least one of $k$ generated outputs is correct, serving as a critical measure in tasks such as code generation and reinforcement learning.
  • It can be optimized directly through surrogate losses, ranking objectives, and reward transformations that improve the chance of producing a correct output among multiple attempts.
  • Inference strategies such as Best-of-Majority and Best-of-N select candidates efficiently by balancing exploration and exploitation in multi-sample settings.

The Pass@$k$ metric is a probabilistic evaluation measure widely adopted in code generation, reasoning, and reinforcement learning tasks to quantify the ability of a model or inference strategy to produce at least one correct solution within $k$ independently sampled outputs. Conceptually, Pass@$k$ measures the likelihood that a correct response appears within the top $k$ candidates, forming the basis for both evaluation and, increasingly, model optimization protocols in applications where users or downstream systems are permitted to select from a pool of alternatives rather than a single solution.

1. Formal Definition and Utility

Let $f(x)$ be a binary indicator for the correctness of model output $x$ (with $f(x) = 1$ if correct, $0$ otherwise), and suppose that a model generates $k$ independent outputs $x_1, \dots, x_k$ for a given input or prompt. The Pass@$k$ metric is defined as:

$$\text{Pass@}k = \mathbb{E}\left[1 - \prod_{i=1}^{k} \left(1 - f(x_i)\right)\right]$$

This expresses the expected probability that at least one of the $k$ samples is correct. In practical systems, such as LLMs deployed for code synthesis or open-ended problem solving, Pass@$k$ is particularly salient because it aligns with real-world practices where users can inspect or utilize multiple generated candidates (Walder et al., 21 May 2025).
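
In practice, this expectation is estimated from a finite sample rather than computed exactly. Below is a minimal sketch, assuming the common setting in which $n$ outputs are drawn and $c$ of them pass the correctness check; the function name is illustrative, and the combinatorial form $1 - \binom{n-c}{k}/\binom{n}{k}$ is the standard unbiased estimator of the quantity above.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k from n sampled outputs, c of them correct.

    Averages 1 - prod_i (1 - f(x_i)) over all size-k subsets of the n samples,
    which simplifies to 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Too few incorrect samples to fill a size-k subset: every subset
        # contains at least one correct output.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 samples pass the tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```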

Beyond evaluation, Pass@$k$ is increasingly used as an explicit target for optimization in reinforcement learning (RL) and ranking scenarios, directly shaping how models are trained to allocate probability mass over output spaces to maximize user- or application-centric success rates (Lyu et al., 11 Aug 2024, Chen et al., 14 Aug 2025).

2. Optimization Techniques and Algorithmic Considerations

Traditional models optimized to maximize Pass@$1$ (single-sample correctness) often underutilize the benefit of batch generation and are prone to overly conservative exploitation strategies. Modern optimization schemes incorporate Pass@$k$ directly into their objectives, either through surrogates in ranking losses (Lyu et al., 11 Aug 2024), reward transformations in RL (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025), or custom advantage functions. Key methodological developments include:

  • Direct Metric Optimization: Methods such as Top Pass reformulate the model's loss to directly reflect the Pass@$k$ objective. For a candidate set with positives $C_+$ and negatives $C_-$, ranking is cast as ensuring the highest-scoring positive is above the $k$-th-ranked negative:

$$\text{pass@}k = \mathbb{I}\left[f(Q, C_{+,1}) > f(Q, C_{-,k})\right]$$

where $f(Q, C)$ is the ranker's score (Lyu et al., 11 Aug 2024).

  • Surrogate and Analytical Loss Functions: The non-differentiability of the indicator-based Pass@$k$ loss is addressed by adopting surrogates such as the squared hinge loss, enabling gradient-based optimization (a minimal sketch of such a surrogate appears after this list). Analytical derivations in RL permit closed-form computation of advantage functions leveraging group statistics over $k$ samples, reducing variance relative to sampling- or bootstrap-based estimators (Chen et al., 14 Aug 2025).
  • Reward Transformations in RL: Pass-at-k Policy Optimization (PKPO) introduces unbiased, low-variance estimators for both binary and continuous reward settings. PKPO generalizes earlier work restricted to $k = n$ and enables annealing of $k$ during training, which empirically improves both Pass@$1$ and Pass@$k$ when training large reasoning models (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
  • Adaptive Grouping: By structuring rollouts in groups of $k$ and calculating group rewards via max operations, models are explicitly incentivized to explore a diverse output space (Chen et al., 14 Aug 2025); see the reward-transformation sketch after this list.
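
As referenced in the surrogate-loss bullet above, the ranking indicator can be relaxed into a differentiable objective. The following is a minimal sketch of one such squared-hinge surrogate, not the exact loss used in Top Pass; the tensor shapes, margin value, and function name are assumptions for illustration.

```python
import torch

def pass_at_k_hinge_loss(pos_scores: torch.Tensor,
                         neg_scores: torch.Tensor,
                         k: int,
                         margin: float = 1.0) -> torch.Tensor:
    """Squared-hinge relaxation of I[f(Q, C_{+,1}) > f(Q, C_{-,k})].

    pos_scores: ranker scores of correct candidates, shape (P,)
    neg_scores: ranker scores of incorrect candidates, shape (M,)
    The best positive is pushed above the k-th highest negative by a margin;
    the squared hinge makes the objective smooth for gradient descent.
    """
    best_pos = pos_scores.max()
    kth_neg = neg_scores.topk(min(k, neg_scores.numel())).values[-1]
    violation = torch.clamp(margin - (best_pos - kth_neg), min=0.0)
    return violation ** 2
```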
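
The reward-transformation and adaptive-grouping ideas in the last two bullets can likewise be sketched in a few lines. The code below is a deliberately simplified illustration of max-over-group credit assignment with a cross-group baseline, not PKPO's unbiased estimator; the array layout and variable names are assumptions.

```python
import numpy as np

def group_max_advantages(rewards: np.ndarray, k: int) -> np.ndarray:
    """Simplified pass@k-style credit assignment over n = g * k rollouts.

    rewards: per-rollout scalar rewards, shape (n,), with n divisible by k.
    Each group of k rollouts is scored by the max reward in the group
    ("did at least one attempt succeed?"); advantages are the group scores
    centered by their mean, then broadcast back to the k member rollouts.
    """
    groups = rewards.reshape(-1, k)        # (g, k)
    group_scores = groups.max(axis=1)      # (g,) pass@k-style group reward
    advantages = group_scores - group_scores.mean()
    return np.repeat(advantages, k)        # one value per rollout

# Example: 8 rollouts with k = 4; only the second group contains a success.
print(group_max_advantages(np.array([0., 0., 0., 0., 0., 1., 0., 0.]), k=4))
# -> [-0.5 -0.5 -0.5 -0.5  0.5  0.5  0.5  0.5]
```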

3. Inference-Time Strategies for Pass@$k$

Practical inference in Pass@$k$ regimes may involve selecting up to $k$ responses from a larger batch of $N$ candidates. Several strategies have been proposed and analyzed for this selection step, including:

| Strategy | Principle | Notes |
| --- | --- | --- |
| Majority Voting | Selects the most frequent output(s) | Constant regret; does not improve with $N$ |
| Best-of-N (BoN) | Picks the $k$ top-ranked candidates by reward | Susceptible to reward-model overoptimization |
| Best-of-Majority (BoM) | Filters by empirical frequency, then selects $k$ via reward | Minimax-optimal and scaling-monotonic; robust to reward-model errors (Di et al., 3 Oct 2025) |

BoM, in particular, achieves a regret bound of $O(\epsilon_{\text{opt}} + \sqrt{\epsilon^2_{\mathrm{RM}} C^* / k})$ (where $C^*$ is the reference policy's coverage coefficient), matching theoretical lower bounds and providing robustness as the number of generated candidates $N$ increases (Di et al., 3 Oct 2025).
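
As a rough illustration of the BoM selection step described above (not the paper's exact algorithm): answers are first filtered by empirical frequency among the $N$ samples and only then ranked by the reward model. The frequency threshold, fallback behavior, and exact-match grouping of answers below are simplifying assumptions.

```python
from collections import Counter
from typing import Callable, List

def best_of_majority(candidates: List[str],
                     reward: Callable[[str], float],
                     k: int,
                     min_freq: int = 2) -> List[str]:
    """Illustrative Best-of-Majority (BoM): pick k answers from N samples.

    1) Count how often each distinct answer occurs among the candidates.
    2) Keep only answers whose frequency reaches min_freq (placeholder rule).
    3) Return the k surviving answers with the highest reward-model scores.
    """
    counts = Counter(candidates)
    survivors = [ans for ans, c in counts.items() if c >= min_freq]
    if not survivors:
        survivors = list(counts)  # fall back to plain Best-of-N selection
    survivors.sort(key=reward, reverse=True)
    return survivors[:k]
```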

4. Empirical and Theoretical Insights

Pass@$k$-centric approaches have demonstrated strong empirical improvements across multiple domains:

  • Code Generation: Pass@$k$-maximizing rankers achieve significant gains in top-ranked prediction accuracy. For example, on CodeContests, Top Pass improved pass@$1$ by 32.9% relative to strong baselines, with similar trends on APPS, MBPP, and HumanEval (Lyu et al., 11 Aug 2024).
  • Reinforcement Learning Tasks: PKPO and analytical Pass@$k$ training have unblocked learning on problems where Pass@$1$-optimized policies stall, owing to improved exploration and better utilization of the candidate pool (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
  • Inference Scaling: Theoretical analyses prove that, under proper inference strategies (e.g., BoM), the error scales optimally as $1/\sqrt{k}$ with respect to $k$ and benefits from increases in the sampling budget $N$ (Di et al., 3 Oct 2025). BoN and majority voting do not generally exhibit this monotonic scaling.
  • Iterative Settings: In agentic or refinement-based systems, the plain Pass@$k$ metric may fail to reflect efficiency, because a high pass@$1$ might be achieved only through excessive refinements. To address this, Pass@ARC combines the pass rate with a penalty on refinement steps (a small computation sketch follows below), e.g.,

$$\text{Pass@ARC} = \text{PassRate} \times \exp\left(-0.01\,(\mathrm{ARC} - 1)^2\right)$$

(Nadimi et al., 15 Mar 2025).
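
A one-function sketch of this penalty, assuming ARC denotes the average refinement count and that the names below are illustrative:

```python
import math

def pass_at_arc(pass_rate: float, arc: float) -> float:
    """Pass@ARC: pass rate discounted by the average refinement count (ARC).

    ARC = 1 (no extra refinement cycles) incurs no penalty; additional
    refinements shrink the score via the Gaussian-shaped factor above.
    """
    return pass_rate * math.exp(-0.01 * (arc - 1) ** 2)

print(pass_at_arc(0.8, arc=1))  # 0.8, no penalty
print(pass_at_arc(0.8, arc=6))  # ~0.62, penalized for extra refinements
```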

5. Exploration, Exploitation, and Advantage Design

One of the principal findings across recent work is that Pass@$k$-based optimization can simultaneously promote exploration (diversity in sampled solutions) and exploitation (high-confidence, correct outputs). Analytical studies of the advantage function in RLVR show that the sum of optimization “strength” shifts toward harder problems as $k$ increases, biasing training towards scenarios with limited early success. Adaptive advantage modification based on problem difficulty or entropy further fine-tunes this balance, enabling models to focus on challenging instances without sacrificing accuracy on easier cases (Chen et al., 14 Aug 2025).

6. Limitations, Variants, and Practical Implications

While Pass@$k$ is robust as an evaluation metric where multiple attempts are permissible, it assumes independence of samples and may not be fully informative in settings where sample diversity or refinement steps are prominent. Extensions such as Pass@ARC address such shortcomings by penalizing inefficiency.

For practitioners, Pass@$k$-aligned methods improve user experience by maximizing the chance of encountering a usable or correct output within a small number of candidate generations, directly correlating with reduced manual examination and verification effort in applications ranging from code generation to automated system synthesis (Lyu et al., 11 Aug 2024, Nadimi et al., 15 Mar 2025).

7. Summary Table of Metric Variants and Strategies

| Metric/Strategy | Definition/Mechanism | Notable Strengths |
| --- | --- | --- |
| Pass@$k$ | Probability that at least one of $k$ outputs is correct | Aligns with user experience |
| Pass@ARC | Penalizes excess refinement cycles in the success rate | Captures efficiency of solutions |
| BoM | Filters by frequency, selects by reward | Minimax-optimal, robust scaling |
| PKPO & analytical Pass@$k$ training | Optimizes joint sample utility for Pass@$k$ | Low variance, effective for RL |

In conclusion, the Pass@$k$ metric and its variants have become central to the evaluation and optimization of systems in code generation, reasoning, and reinforcement learning, aligning both experimental and theoretical progress with the realities of user-facing performance in multi-candidate settings (Lyu et al., 11 Aug 2024, Nadimi et al., 15 Mar 2025, Walder et al., 21 May 2025, Chen et al., 14 Aug 2025, Di et al., 3 Oct 2025).
