Pass@k Experiments in Automated Problem Solving

Updated 1 October 2025
  • Pass@k experiments measure the probability that at least one out of k independently generated solutions is correct, highlighting solution diversity and reliability.
  • Methodological advances such as candidate ranking, variant generation, and pass@k-based reinforcement learning optimize the metric directly and significantly improve success rates.
  • Applications span code generation, language modeling, bandit decision making, and signal processing, wherever automated verification makes pass@k the operative evaluation criterion.

Pass@$k$ experiments evaluate the probability that, out of $k$ distinct solution attempts to a problem (typically generated independently by a randomized algorithm or model), at least one is correct. This metric has become central in the assessment of LLMs, code generation systems, and various domains where automated solution verification is available, because it quantifies the practical success rate for users allowed multiple verification attempts. Pass@$k$ is particularly informative in settings characterized by high diversity in candidate solutions and significant variance in model or algorithm accuracy across different runs or prompt variants.

1. Formal Definition and Theoretical Foundations

Let $n$ be the total number of candidate solutions generated, and $c$ the number of correct solutions among them. The pass@$k$ metric, for a given $k \leq n$, is defined as the probability that at least one of the top-$k$ solutions is correct. When all possible orderings of candidates are equally likely, the formula is:

$$\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

This captures the probability that not all of the top-$k$ samples are incorrect. The metric generalizes to arbitrary $k$; for $k=1$ it reduces to the top-1 accuracy, or pass rate.

Theoretical analyses further establish the statistical properties of pass@$k$ estimation. For example, the variance of pass@$k$ under uniform shuffling has been derived to assess the reliability of pass@$k$ estimates as $k$, $n$, and $c$ vary (Lyu et al., 11 Aug 2024).
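
When computing this estimator from sampled generations, the ratio of binomial coefficients can be evaluated as a running product rather than with large factorials. A minimal Python sketch (the example values are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n sampled solutions, c of which are correct.

    Evaluates 1 - C(n-c, k) / C(n, k) as a product so that large n does not
    overflow intermediate binomial coefficients.
    """
    if n - c < k:
        # every size-k subset must contain at least one correct solution
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative values: 20 samples, 3 correct
print(pass_at_k(20, 3, 1))   # 0.15
print(pass_at_k(20, 3, 5))   # ~0.60
```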

2. Motivations in Automated Problem Solving

Pass@$k$ originated as an evaluation tool for settings where exact correctness checking of candidate solutions is feasible at scale (e.g., code that passes unit tests (Lyu et al., 11 Aug 2024), cybersecurity flag validation, or math problems with deterministic checkers). In practice, users are rarely willing to sift through hundreds of candidate solutions; thus, pass@$k$ for small values of $k$ (e.g., 1, 5, 10) closely models the real utility of automated systems.

The concept is also linked to the notion of coverage in generative models: maximizing pass@$k$ is equivalent to maximizing the probability that at least one of a small, user-inspectable set of outputs is satisfactory.

3. Algorithmic and Methodological Advances

a) Candidate Ranking and Direct pass@$k$ Optimization

Recent methodology such as Top Pass (Lyu et al., 11 Aug 2024) shifts from naive candidate generation, which samples large numbers of solutions and relies on luck, to a neural ranking stage that directly optimizes a pass@$k$ loss during training. Pass@$k$ is formulated as a ranking objective: the system is trained so that at least one correct candidate is placed above the $k$-th incorrect candidate, using a convex surrogate loss (such as hinge-squared) to enable differentiable optimization. The approach is validated on code benchmarks (CodeContests, APPS, MBPP, HumanEval), achieving substantial improvements in pass@1, pass@3, and pass@5 over previous binary-classification-based rankers.
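
The ranking objective can be sketched as follows. This is an illustrative hinge-squared surrogate in the spirit described above, not the exact Top Pass loss; it assumes per-candidate ranker scores and binary verifier labels for a single problem:

```python
import torch

def pass_at_k_surrogate_loss(scores: torch.Tensor,
                             labels: torch.Tensor,
                             k: int,
                             margin: float = 1.0) -> torch.Tensor:
    """Hinge-squared surrogate for a pass@k-style ranking objective.

    scores: (n,) ranker scores for one problem's candidates.
    labels: (n,) 1.0 for candidates that pass verification, 0.0 otherwise.
    Penalizes the group unless some correct candidate out-scores the
    k-th highest-scoring incorrect candidate by `margin`.
    """
    pos = scores[labels > 0.5]
    neg = scores[labels <= 0.5]
    if pos.numel() == 0 or neg.numel() < k:
        return scores.new_zeros(())              # nothing to rank against
    kth_neg = torch.topk(neg, k).values[-1]      # k-th highest incorrect score
    violation = torch.clamp(margin - (pos.max() - kth_neg), min=0.0)
    return violation ** 2
```

At inference time, candidates are sorted by score and only the top-$k$ are surfaced, so the quantity of interest is pass@$k$ over that ranked shortlist.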

b) Exploiting Model Inconsistency via Variants

The “Variator” agent introduced in (Dalal et al., 19 May 2025) leverages the observed inconsistency of LLMs: models often succeed or fail depending on minor prompt or problem presentation changes. The Variator generates $k$ semantically equivalent variants for each task, submits one candidate per variant, and reports pass@$k$ on their aggregated results. Theoretical analysis shows that for “hard” problems (low base success rate), even slight variant-wise performance bumps lead to substantial pass@$k$ gains; empirically, this yields consistent improvements over the baseline approach of resampling $k$ solutions to the original problem. This approach is robust to overfitting and memorization phenomena on standard public benchmarks.
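
A structural sketch of this protocol is shown below; `rewrite_problem`, `solve`, and `verify` are hypothetical stand-ins (passed as callables) for the LLM rewriting/solving calls and the automated checker, not APIs from the paper:

```python
from typing import Callable

def pass_at_k_via_variants(problem: str,
                           k: int,
                           rewrite_problem: Callable[[str, int], str],
                           solve: Callable[[str], str],
                           verify: Callable[[str, str], bool]) -> bool:
    """One candidate per semantically equivalent variant; success if any passes.

    The first attempt uses the original problem; subsequent attempts use
    rewritten variants, and every candidate is verified against the
    original task.
    """
    for i in range(k):
        variant = problem if i == 0 else rewrite_problem(problem, i)
        candidate = solve(variant)
        if verify(problem, candidate):
            return True
    return False
```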

c) Pass@$k$-Maximizing Reinforcement Learning

Traditional RL-based training for LLMs and automated problem solvers typically employs a pass@1 reward, reinforcing only the best single candidate (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025). Recent research introduces direct pass@$k$ reward shaping and corresponding policy gradient estimators, assigning joint rewards based on the maximum reward across $k$ rollouts. This leads to “Pass@$k$ Policy Optimization” (PKPO) (Walder et al., 21 May 2025), where unbiased, low-variance estimators of gradient and reward are devised for both binary and continuous settings. Analytical solutions provide efficient computation for group-level rewards and advantage functions (Chen et al., 14 Aug 2025), reducing the variance of updates and enabling higher exploration during training.

Annealing $k$ during training (from higher values favoring exploration to $k=1$ emphasizing exploitation) is shown to simultaneously improve pass@1 and pass@$k$ on challenging reasoning and mathematical domains.
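
The sketch below illustrates these two ingredients in deliberately simplified form: a naive shared group reward equal to the maximum verifier reward across the $k$ rollouts, and a linear schedule that anneals $k$ toward 1. The actual PKPO estimators are unbiased, lower-variance per-sample transformations rather than this shared-max shortcut, and the function names here are illustrative only:

```python
import numpy as np

def shared_max_reward(rollout_rewards: np.ndarray) -> np.ndarray:
    """Naive pass@k-style shaping: every rollout in a group is credited with
    the group's maximum reward (PKPO replaces this with unbiased,
    low-variance per-sample estimators)."""
    return np.full_like(rollout_rewards, rollout_rewards.max())

def annealed_k(step: int, total_steps: int, k_max: int) -> int:
    """Linearly anneal k from k_max (exploration) toward 1 (exploitation)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(1, round(k_max * (1.0 - frac)))

# Illustrative use with hypothetical binary verifier outcomes for one problem
k = annealed_k(step=200, total_steps=1000, k_max=8)
rewards = np.random.binomial(1, 0.2, size=k).astype(float)
shaped = shared_max_reward(rewards)   # all rollouts share credit if any succeeded
```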

d) Maintaining Diversity via Problem Synthesis

A critical insight is that vanilla pass@1 optimization, or RLVR training, can rapidly reduce policy entropy (generation diversity), thereby harming pass@$k$ scores (Liang et al., 19 Aug 2025). To address this, the Self-play with Variational problem Synthesis (SvS) strategy augments training with self-generated problem variants based on successfully solved instances. By continually refreshing the training data with such variational problems, SvS sustains high policy entropy and yields large absolute gains in pass@$k$ (e.g., +18.3% and +22.8% in pass@32 on AIME24 and AIME25). This approach generalizes across model sizes and diverse benchmarks.
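
A heavily simplified structural sketch of this data-refresh loop follows; `solve`, `verify`, and `synthesize_variant` are hypothetical stand-ins (passed as callables) for the policy's own generation and the answer checker, and the real SvS pipeline additionally filters synthesized variants for solvability and answer consistency:

```python
from typing import Callable, List, Tuple

def svs_augment(problems: List[Tuple[str, str]],
                solve: Callable[[str], str],
                verify: Callable[[str, str], bool],
                synthesize_variant: Callable[[str, str], Tuple[str, str]],
                ) -> List[Tuple[str, str]]:
    """Augment a training pool with self-generated variants of solved problems.

    problems: (problem, reference_answer) pairs.
    """
    augmented = list(problems)
    for problem, answer in problems:
        solution = solve(problem)
        if verify(answer, solution):
            # The policy rewrites problems it already solves correctly, which
            # keeps the training distribution fresh and sustains entropy.
            augmented.append(synthesize_variant(problem, solution))
    return augmented
```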

4. Analytical Properties and Estimators

The pass@$k$ metric is not only a practical measure but is also amenable to tractable analytical manipulation:

  • The expectation and variance of pass@$k$ under random ordering are explicitly characterized (Lyu et al., 11 Aug 2024), enabling the design of unbiased estimators and variance-aware confidence intervals.
  • For RL and policy optimization, combinatorial estimators enable unbiased and low-variance gradient updates (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025). For $n$ generated samples with $c$ correct, the unbiased estimator for binary rewards is $\rho(n, c, k) = 1 - \binom{n-c}{k} / \binom{n}{k}$.
  • In continuous- or quality-graded reward settings, advanced combinatorial coefficients weight individual samples’ contributions to group-level maximum reward (see the sketch after this list).
  • Advantage function design, incorporating analytic computation (e.g., mean and variance of group max rewards), further refines stability and efficiency for RLVR training (Chen et al., 14 Aug 2025).
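
For the continuous-reward case mentioned above, one standard unbiased estimator of the expected maximum reward over a uniformly random size-$k$ subset of $n$ rollouts weights the $i$-th smallest reward by $\binom{i-1}{k-1}/\binom{n}{k}$; with 0/1 rewards this reduces to the pass@$k$ formula. A minimal sketch (the exact coefficient conventions in the cited papers may differ):

```python
from math import comb
import numpy as np

def expected_max_of_k(rewards: np.ndarray, k: int) -> float:
    """Unbiased estimate of E[max reward over a random k-subset of n rollouts].

    The i-th smallest of the n rewards is the subset maximum with probability
    C(i-1, k-1) / C(n, k); summing these weighted order statistics gives the
    expectation. For binary rewards this equals 1 - C(n-c, k) / C(n, k).
    """
    r = np.sort(rewards)
    n = len(r)
    return sum(comb(i - 1, k - 1) * float(r[i - 1]) for i in range(k, n + 1)) / comb(n, k)
```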

5. Applications Across Domains

Pass@$k$ is employed in a variety of domains:

  • Code Generation: Evaluating the likelihood that a correct code snippet appears among $k$ generated candidates. Direct ranking optimization, as in Top Pass (Lyu et al., 11 Aug 2024), as well as pass@$k$-aware RL, substantially affects usability in practical developer workflows.
  • Language and Reasoning Models: RL with verifiable rewards for mathematics, scientific reasoning, or open-ended generation is increasingly benchmarked and trained using pass@$k$ (Walder et al., 21 May 2025, Liang et al., 19 Aug 2025). Maintaining exploration is crucial for progress on harder tasks.
  • Bandits and Decision Making: In streaming bandit settings, the notion of pass@k translates to achieving low cumulative regret by identifying optimal or near-optimal actions from a set of attempts, tightly linking memory, sample complexity, and achievable regret (Wang, 2023).
  • Signal Processing and Alignment: Multireference alignment relies on the pass@$k$ property to evaluate how many candidate signals can be reliably disentangled from mixture data, as in invariant-feature approaches for imaging (Boumal et al., 2017).
  • Other Structured Generation: Candidate diversity and ranking, especially when candidates can be independently verified, routinely use pass@$k$ as the operative criterion of utility.

6. Impact, Open Challenges, and Future Directions

Several lines of ongoing research and open problems relate to pass@$k$:

  • Designing estimators and training algorithms that optimize pass@$k$ efficiently for large $k$ or under severe computational constraints remains an active area of development (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
  • How to best exploit or control LLM inconsistency for diversification and robustness is not fully understood; the phenomenon is likely to persist across future model generations (Dalal et al., 19 May 2025).
  • The fundamental interplay between policy entropy, exploration, and pass@$k$ in RL is not merely a matter of setting $k$, but of capturing the regularizing effects of group-based reward assignment and adaptive curriculum via methods such as SvS (Liang et al., 19 Aug 2025).
  • There is considerable interest in applying pass@$k$-based training and evaluation to domains beyond code and text, such as molecular design, theorem proving, and robust AI planning.
  • Combining pass@1 and pass@$k$ optimization, via annealing or hybrid advantage functions, presents a promising direction for simultaneously improving both top-1 and multi-candidate success in model deployment scenarios (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
  • Improving rankers and testers to reduce false positives/negatives (especially when test suites are weak or buggy implementations pass as correct) is a practical concern for any pass@$k$-based evaluation regime (Lyu et al., 11 Aug 2024).

Pass@$k$ has thus become a foundational, mathematically tractable, and operationally critical metric for practical assessment, training, and deployment of modern AI systems that trade off diversity, verification, and user-centric solution search.
