Pass@$k$ Inference: LLM Diversity Evaluation
- Pass@$k$ inference is a multi-sample evaluation metric that measures the probability of obtaining at least one correct answer from $k$ independent LLM outputs, which makes it well suited to verifiable reasoning tasks.
- It employs a Bernoulli coverage model that mathematically relates single-sample success (Pass@1) to k-sample diversity through a clear analytical framework.
- Despite its diagnostic value, using Pass@$k$ as a training objective can collapse diversity, underscoring the need for alternative, diversity-driven optimization strategies.
Pass@$k$ inference is a multi-sample evaluation metric and inference protocol central to the empirical assessment and deployment of LLMs in verifiable reasoning tasks such as mathematical problem solving, code synthesis, and formal logic. It quantifies the probability that at least one correct solution is obtained in $k$ independent samples drawn from a model, thereby serving as a practical measure of diversity-driven coverage. Despite its intuitive appeal and widespread adoption as an evaluation standard, the Pass@$k$ metric behaves differently as a diagnostic and as an objective for reinforcement learning with verifiable rewards (RLVR), and this difference shapes inference scaling and exploration strategies.
1. Formal Definition and Mathematical Properties
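For an input $x$ (e.g., a prompt), let $\pi_\theta(\cdot \mid x)$ denote an LLM’s generative policy and $r(x, y) \in \{0, 1\}$ be a binary correctness verifier. The single-sample (“Pass@$1$”) success probability is:

$$p_\theta(x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[r(x, y)\right]$$

The $k$-sample Pass@$k$ metric gives the probability that at least one sample in $k$ independent draws is correct:

$$\text{Pass@}k(x) = 1 - \left(1 - p_\theta(x)\right)^k$$

For a batch of $n$ samples ($n \ge k$) with $c$ correct, an unbiased estimator for empirical evaluation is:

$$\widehat{\text{Pass@}k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Pass@$k$ is therefore interpretable as the Bernoulli coverage probability over $k$ attempts and is analytically tractable under i.i.d. sampling.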
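The combinatorial estimator can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `pass_at_k` and the argument names `n` (samples drawn), `c` (samples judged correct), and `k` are chosen here for readability.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample.
    # It is 0 when fewer than k samples are incorrect, giving an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 20 samples, 3 of them correct.
    for k in (1, 5, 10):
        print(k, pass_at_k(n=20, c=3, k=k))
```

For `k = 1` the estimate reduces to the empirical success rate `c / n`, which is a quick sanity check on any implementation.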
2. Policy Gradient Structure and Learning Dynamics
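Write $J_k(\theta) = 1 - (1 - p_\theta(x))^k$ for the Pass@$k$ objective, where $p_\theta(x)$ is the single-sample success probability of the policy $\pi_\theta$ under the binary verifier $r(x, y)$. The gradient of $J_k(\theta)$ with respect to $\theta$ follows by the chain rule:

$$\nabla_\theta J_k(\theta) = k\left(1 - p_\theta(x)\right)^{k-1} \nabla_\theta p_\theta(x)$$

where

$$\nabla_\theta p_\theta(x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\right]$$

This decomposition shows that $\nabla_\theta J_k$ is a positive scalar reweighting $w_k(p_\theta) = k(1 - p_\theta)^{k-1}$ of the Pass@$1$ gradient:

$$\nabla_\theta J_k(\theta) = w_k\left(p_\theta(x)\right) \nabla_\theta J_1(\theta)$$

Key regimes:
- When $p_\theta \to 0$, $w_k \to k$, but sampling-based gradient estimates are vacuous (a batch rarely contains any correct sample $y$).
- When $p_\theta \to 1$, $w_k \to 0$, yielding a vanishing learning signal.

Thus, the Pass@$k$ learning signal vanishes in both the high-failure (exploration) and high-success regimes, failing to provide meaningful learning incentives at the boundaries (Yu, 20 Nov 2025).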
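The two boundary regimes can be checked numerically. The sketch below evaluates the scalar weight $k(1-p)^{k-1}$ that multiplies the Pass@1 gradient and compares it against a finite-difference slope of $1-(1-p)^k$. The function names and the probe values of `p` are illustrative choices, not taken from the cited work.

```python
def pass_at_k_weight(p: float, k: int) -> float:
    """Scalar k * (1 - p)**(k - 1) that multiplies the Pass@1 gradient."""
    return k * (1.0 - p) ** (k - 1)


def finite_difference_slope(p: float, k: int, h: float = 1e-6) -> float:
    """Central-difference estimate of d/dp [1 - (1 - p)**k], as a sanity check."""
    def coverage(q: float) -> float:
        return 1.0 - (1.0 - q) ** k

    return (coverage(p + h) - coverage(p - h)) / (2.0 * h)


if __name__ == "__main__":
    k = 8
    # Near p = 0 the weight approaches k; near p = 1 it approaches 0.
    for p in (0.001, 0.5, 0.999):
        print(p, pass_at_k_weight(p, k), finite_difference_slope(p, k))
```

Note that a large weight near $p = 0$ does not help in practice: the weight multiplies a gradient estimate that is zero whenever no sampled output is correct.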
3. Exploration Collapse and Prompt Interference Phenomena
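Over repeated training with policy gradients (e.g., RLVR), modes already discovered by the policy (those with significant probability mass) are reinforced, while undiscovered modes (with probability $\epsilon \approx 0$) are unlikely to be found in $k$ samples (the chance of sampling one at least once is $1 - (1 - \epsilon)^k \approx k\epsilon$). This "exploration collapse" leads to:

$$\text{Pass@}k - \text{Pass@}1 = (1 - p_\theta)\left[1 - (1 - p_\theta)^{k-1}\right] \to 0 \quad \text{as } p_\theta \to 0 \text{ or } p_\theta \to 1$$

where $p_\theta$ is the per-prompt single-sample success probability. Thus, as the model becomes confident on one mode, Pass@$k$ converges to Pass@$1$, eliminating multi-sample benefits.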
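A small numerical sketch makes the collapse concrete. Assuming i.i.d. sampling with per-prompt success probability `p`, it computes the multi-sample gap (coverage over `k` draws minus `p`) and the chance of hitting a rare mode of mass `eps` at least once. All names and the probe values are hypothetical.

```python
def coverage(p: float, k: int) -> float:
    """Probability of at least one success in k i.i.d. draws."""
    return 1.0 - (1.0 - p) ** k


def multi_sample_gap(p: float, k: int) -> float:
    """Benefit of k samples over one sample: coverage(p, k) - p."""
    return coverage(p, k) - p


def discovery_probability(eps: float, k: int) -> float:
    """Chance that a mode with probability eps appears in k draws."""
    return coverage(eps, k)


if __name__ == "__main__":
    k = 16
    # The gap shrinks as the policy concentrates (p close to 0 or 1).
    for p in (0.01, 0.3, 0.7, 0.99):
        print("gap", p, multi_sample_gap(p, k))
    # A rare mode stays hard to find even with several samples.
    print("discovery", discovery_probability(1e-6, k))
```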
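Additionally, direct Pass@$k$ optimization can degrade Pass@$1$ due to "prompt interference" (Barakat et al., 24 Feb 2026): Pass@$k$ reweights gradient contributions toward “hard” prompts with low $p_\theta(x)$, but these contributions can be negatively aligned with the Pass@$1$ gradient. With $J_k(\theta) = \mathbb{E}_{x}\left[1 - (1 - p_\theta(x))^k\right]$ denoting the Pass@$k$ objective averaged over prompts and $x, x'$ independent prompt draws, the expected gradient inner product

$$\left\langle \nabla_\theta J_k(\theta), \nabla_\theta J_1(\theta) \right\rangle = \mathbb{E}_{x, x'}\left[ k\left(1 - p_\theta(x)\right)^{k-1} \left\langle \nabla_\theta p_\theta(x), \nabla_\theta p_\theta(x') \right\rangle \right]$$

can be negative, so a gradient step that improves Pass@$k$ can reduce Pass@$1$ when negatively interfering prompts dominate. Empirically, strong negative alignment is observed for hard examples in mathematical reasoning benchmarks.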
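The interference effect can be reproduced on a toy problem. The sketch below uses two hypothetical prompts, each described by a success probability and a made-up two-dimensional gradient of that probability with respect to the parameters. The numbers are invented purely to show the sign flip and are not drawn from any benchmark or from the cited paper.

```python
def pass_at_k_weight(p: float, k: int) -> float:
    """Per-prompt weight k * (1 - p)**(k - 1)."""
    return k * (1.0 - p) ** (k - 1)


def objective_gradient(prompts, k: int):
    """Average of weight * grad_p over prompts; k = 1 gives the Pass@1 gradient."""
    dim = len(prompts[0][1])
    total = [0.0] * dim
    for p, grad_p in prompts:
        w = pass_at_k_weight(p, k)
        for i, g in enumerate(grad_p):
            total[i] += w * g / len(prompts)
    return total


def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))


# Hypothetical prompts: (success probability, gradient of p w.r.t. parameters).
toy_prompts = [
    (0.90, [1.0, 0.0]),   # easy prompt
    (0.05, [-0.5, 0.1]),  # hard prompt whose gradient opposes the easy one
]

grad_pass1 = objective_gradient(toy_prompts, k=1)
grad_pass8 = objective_gradient(toy_prompts, k=8)
alignment = dot(grad_pass8, grad_pass1)

if __name__ == "__main__":
    print("Pass@1 gradient:", grad_pass1)
    print("Pass@8 gradient:", grad_pass8)
    print("inner product:", alignment)
```

In this construction the hard prompt dominates the Pass@8 gradient, so the inner product with the Pass@1 gradient comes out negative: to first order, a step along the Pass@8 gradient lowers average Pass@1.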
4. Pass@$k$ as Practical Diagnostic vs. Optimization Objective
Pass@$k$ is essential as a diagnostic to gauge recoverable solution diversity at inference: a significant gap $\text{Pass@}k - \text{Pass@}1$ in held-out evaluation indicates the model has rare, correct modes accessible via sampling. However, because the Pass@$k$ gradient is collinear with the Pass@$1$ gradient and its learning signal vanishes at the boundaries, direct optimization of Pass@$k$ provides no exploration benefit over Pass@$1$ and can collapse diversity (Yu, 20 Nov 2025). Instead, RL objectives promoting entropy or coverage, explicit diversity bonuses, or advantage-shaping with surrogate rewards are preferred for exploration.
Table: Summary of the Pass@$k$ Metric’s Roles
| Role | Supported? | Mechanism |
|---|---|---|
| Diagnostic tool | Yes | Measures coverage of rare correct solutions |
| Training target | No (generally) | Gradient vanishes/extremely collinear; harms exploration |
| Exploration Aid | No | Collapses to Pass@$1$ as mode confidence rises |
| Reliability Tune | Indirect | Adjust $k$ at inference for desired coverage/reliability |
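The "Reliability Tune" row can be made concrete under the i.i.d. Bernoulli coverage model: given a per-sample success probability `p`, the smallest `k` that reaches a target coverage follows from solving $1-(1-p)^k \ge$ target. The helper below is a sketch under that assumption. Its name and interface are hypothetical, and a real deployment would use an estimated `p` with its own uncertainty.

```python
import math


def min_samples_for_coverage(p: float, target: float):
    """Smallest k with 1 - (1 - p)**k >= target, or None if unreachable."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    if p == 0.0:
        return None  # no number of samples can help
    if p == 1.0:
        return 1

    def coverage(k: int) -> float:
        return 1.0 - (1.0 - p) ** k

    # Closed-form starting point, then adjust for floating-point rounding.
    k = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))
    while k > 1 and coverage(k - 1) >= target:
        k -= 1
    while coverage(k) < target:
        k += 1
    return k


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.5):
        print(p, min_samples_for_coverage(p, target=0.9))
```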
5. Recommendations, Limitations, and Best Practices
- Training: Optimize for coverage and exploration using entropy-driven objectives or explicit diversity enhancement, not Pass@$k$ directly. Monitor Pass@$1$ and Pass@$k$ side by side on held-out datasets.
- Inference: Use Pass@$k$ to decide the number of samples ($k$) that meaningfully increases coverage. A diminishing gap between Pass@$1$ and Pass@$k$ indicates "saturation" in exploration.
- Sampling Strategies: Increase temperature, top-$k$, or top-$p$ to boost diversity, but excessive randomness will decrease Pass@$1$.
- $k$ Selection: Choose the minimal $k$ that stabilizes diversity for cost-efficient inference, balancing latency constraints; the appropriate range is task-dependent (e.g., complex math versus code synthesis).
- Interpretation: In small-discrete-output regimes, large-$k$ Pass@$k$ degenerates towards the fraction of tasks with nonzero success probability, conflating guessing with reasoning (Dragoi et al., 9 Oct 2025). Complement Pass@$k$ with coverage-versus-reliability metrics (e.g., Cover@$\tau$).
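The interpretation caveat in the last bullet can be illustrated with a hypothetical four-option multiple-choice benchmark in which half the tasks are answered reliably and half are pure guesses. The sketch assumes i.i.d. sampling and, as a labeled assumption, reads Cover@$\tau$ as the fraction of tasks whose per-sample success rate is at least $\tau$; the precise definition should be taken from Dragoi et al.

```python
def mean_pass_at_k(success_probs, k: int) -> float:
    """Benchmark-level Pass@k: mean over tasks of 1 - (1 - p)**k."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


def cover_at_tau(success_probs, tau: float) -> float:
    """Assumed reading of Cover@tau: share of tasks with success rate >= tau."""
    return sum(p >= tau for p in success_probs) / len(success_probs)


# Hypothetical benchmark: 50 reliably solved tasks, 50 tasks answered by guessing
# among four options (success probability 0.25 each).
probs = [0.95] * 50 + [0.25] * 50

if __name__ == "__main__":
    for k in (1, 8, 64):
        print("Pass@%d" % k, mean_pass_at_k(probs, k))
    print("Cover@0.5", cover_at_tau(probs, 0.5))
```

At large `k` the guessed tasks count as almost fully solved, so benchmark-level Pass@$k$ approaches the fraction of tasks with nonzero success probability, while the threshold-based metric still separates the two groups.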
6. Empirical and Theoretical Implications
Empirical evaluations confirm that Pass@$k$ is highly sensitive to the sampling protocol and to the underlying policy concentration. On real tasks, reinforced over-concentration (e.g., via RLVR) increases Pass@$1$ at the expense of diversity and Pass@$k$. The trade-off is robust across architectures and datasets, and the theoretical gradient structure predicts it consistently (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). Best practice is therefore to use Pass@$k$ for inference-time exploration auditing and to optimize with objectives that reward spread or coverage explicitly.
7. Extensions and Future Directions
Recent research proposes more reliable evaluation protocols such as Bayesian posterior estimation (Hariri et al., 5 Oct 2025), breadth-depth metrics (Cover@$\tau$) (Dragoi et al., 9 Oct 2025), and scaling laws for Pass@$k$ as a function of cost, task hardness, and number of attempts (Levi, 2024, Kazdan et al., 6 Oct 2025). These alternatives expose the limitations of Pass@$k$-centric evaluation for fine-grained model comparison, especially in high-variance, low-coverage, or discrete-output domains. Open directions include adaptive sampling strategies, variance reduction in Pass@$k$ estimators, and metrics that jointly capture diversity and reliability.
In summary, Pass@$k$ inference is a key multi-sample success metric characterizing coverage in verifiable LLM tasks, with robust analytic properties and well-understood limitations. While indispensable as a diagnostic of latent diversity, it is ill-suited as a direct RL objective and susceptible to collapse when used naively for optimization. The consensus in the cited work is to deploy Pass@$k$ as a principled, inference-time measurement and to pair it with exploration-motivated objectives and complementary reliability metrics in modern LLM evaluation pipelines (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026, Dragoi et al., 9 Oct 2025, Hariri et al., 5 Oct 2025).