Pass@$k$ Inference: LLM Diversity Evaluation

Updated 6 April 2026
  • Pass@$k$ inference is a multi-sample evaluation metric that measures the probability of obtaining at least one correct answer from $k$ independent LLM outputs; it is a standard measure for verifiable reasoning tasks.
  • It employs a Bernoulli coverage model that mathematically relates single-sample success (Pass@1) to k-sample diversity through a clear analytical framework.
  • Despite its diagnostic value, using Pass@$k$ as a training objective can collapse diversity, underscoring the need for alternative, diversity-driven optimization strategies.

Pass@$k$ inference is a multi-sample evaluation metric and inference protocol central to the empirical assessment and deployment of LLMs in verifiable reasoning tasks such as mathematical problem solving, code synthesis, and formal logic. It quantifies the probability that at least one correct solution is obtained in $k$ independent samples drawn from a model, thereby serving as a practical measure of diversity-driven coverage. Despite its intuitive appeal and widespread adoption as an evaluation standard, the Pass@$k$ metric exhibits nuanced properties as both a diagnostic and an objective for reinforcement learning with verifiable rewards (RLVR), critically shaping inference scaling and exploration strategies.

1. Formal Definition and Mathematical Properties

For an input $x$ (e.g., a prompt), let $\pi_\theta(y \mid x)$ denote an LLM's generative policy and $V(x, y)\in\{0,1\}$ be a binary correctness verifier. The single-sample ("Pass@$1$") success probability is:

$$J_1(x;\theta) = \Pr_{y\sim\pi_\theta(\cdot \mid x)} \left[ V(x, y) = 1 \right] = \sum_{y} \pi_\theta(y \mid x)\, V(x, y).$$

The $k$-sample Pass@$k$ metric gives the probability that at least one sample in $k$ independent draws is correct:

$$J_k(x;\theta) = 1 - \left(1 - J_1(x;\theta)\right)^k.$$

For a batch of $n$ samples ($n \ge k$) with $c$ correct, an unbiased estimator for empirical evaluation is:

$$\widehat{J}_k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}.$$

Pass@$k$ is therefore interpretable as the Bernoulli coverage probability over $k$ attempts and is analytically tractable under i.i.d. sampling.
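As a minimal sketch of the two quantities defined above, the following Python computes the closed-form Pass@$k$ from a known single-sample success probability and the combinatorial unbiased estimate from sample counts. The function names and the argument names `n` (samples drawn) and `c` (correct samples) are illustrative assumptions rather than the API of any particular evaluation harness.

```python
from math import comb


def pass_at_k_from_p(p: float, k: int) -> float:
    """Closed form under i.i.d. sampling: 1 - (1 - p)^k, where p is Pass@1."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples, 3 of them verified correct.
    for k in (1, 2, 5):
        print(k, pass_at_k_estimate(10, 3, k))
```

With $k = 1$ the estimator reduces to the empirical success rate $c/n$, which is a quick sanity check on any implementation.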

2. Policy Gradient Structure and Learning Dynamics

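The gradient of the Pass@$k$ probability $J_k(x;\theta) = 1 - \left(1 - J_1(x;\theta)\right)^k$ with respect to $\theta$ follows by the chain rule:

$$\nabla_\theta J_k(x;\theta) = k \left(1 - J_1(x;\theta)\right)^{k-1} \nabla_\theta J_1(x;\theta),$$

where

$$\nabla_\theta J_1(x;\theta) = \mathbb{E}_{y\sim\pi_\theta(\cdot \mid x)}\left[ V(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right].$$

This decomposition shows that $\nabla_\theta J_k$ is a positive scalar reweighting $w_k(x;\theta) = k\left(1 - J_1(x;\theta)\right)^{k-1}$ of the Pass@$1$ gradient:

$$\nabla_\theta J_k(x;\theta) = w_k(x;\theta)\, \nabla_\theta J_1(x;\theta).$$

Key regimes: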

  • When $J_1 \to 0$, the reweighting factor $k(1 - J_1)^{k-1} \to k$, but sampling-based gradient estimates are vacuous (correct samples $y$ are rarely drawn).
  • When $J_1 \to 1$, the reweighting factor $k(1 - J_1)^{k-1} \to 0$, yielding a vanishing learning signal.

Thus, Pass@$k$ gradients vanish in both the high-failure (exploration) and high-success regimes, failing to provide meaningful learning incentives at the boundaries (Yu, 20 Nov 2025).
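The two regimes can be checked numerically. The sketch below assumes the i.i.d. closed form $1 - (1 - J_1)^k$ for Pass@$k$, so that its derivative with respect to $J_1$ is the scalar factor $k(1 - J_1)^{k-1}$. The function names and the batch size `m` are hypothetical and chosen only for illustration.

```python
def pass_k(j1: float, k: int) -> float:
    """Pass@k as a function of the single-sample success probability J1."""
    return 1.0 - (1.0 - j1) ** k


def pass_k_weight(j1: float, k: int) -> float:
    """Scalar factor k * (1 - J1)^(k-1) that multiplies the Pass@1 gradient."""
    return k * (1.0 - j1) ** (k - 1)


def finite_difference(j1: float, k: int, h: float = 1e-6) -> float:
    """Central-difference check of d Pass@k / d J1."""
    return (pass_k(j1 + h, k) - pass_k(j1 - h, k)) / (2.0 * h)


def prob_no_correct(j1: float, m: int) -> float:
    """Chance that a batch of m samples contains no correct sample at all."""
    return (1.0 - j1) ** m


if __name__ == "__main__":
    k, m = 8, 16  # assumed values for illustration
    for j1 in (0.001, 0.1, 0.5, 0.9, 0.999):
        print(f"J1={j1:<6} weight={pass_k_weight(j1, k):.6f} "
              f"P(no correct in {m})={prob_no_correct(j1, m):.4f}")
```

Near $J_1 = 0$ the weight is close to $k$ but almost every batch contains no correct sample, so a Monte Carlo estimate of the gradient is usually zero; near $J_1 = 1$ the weight itself vanishes.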

3. Exploration Collapse and Prompt Interference Phenomena

Over repeated training with policy gradients (e.g., RLVR), modes already discovered by the policy (those with significant probability mass) are reinforced, while undiscovered modes (with probability $\epsilon \ll 1$) are unlikely to be found in $k$ samples (the chance of sampling one at least once is $1 - (1-\epsilon)^k \approx k\epsilon$). This "exploration collapse" leads to:

$$J_k(x;\theta) - J_1(x;\theta) \to 0.$$

Thus, as the model becomes confident on one mode, Pass@$k$ converges to Pass@$1$, eliminating multi-sample benefits.
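A short numerical sketch makes both effects concrete, assuming i.i.d. sampling: how unlikely a low-mass mode is to appear in $k$ draws, and how the gap between Pass@$k$ and Pass@$1$ shrinks as the per-prompt success probability concentrates near 1 (it also shrinks near 0). The function names and the values of `eps` and `k` are illustrative assumptions.

```python
def discovery_probability(eps: float, k: int) -> float:
    """Chance that a mode with probability mass eps appears at least once in k draws."""
    return 1.0 - (1.0 - eps) ** k


def multi_sample_gap(j1: float, k: int) -> float:
    """Pass@k minus Pass@1 for a prompt with single-sample success probability j1."""
    return (1.0 - (1.0 - j1) ** k) - j1


if __name__ == "__main__":
    k = 8  # assumed sampling budget
    for eps in (1e-2, 1e-4, 1e-6):
        print(f"eps={eps:g}: P(discover in {k} samples)={discovery_probability(eps, k):.2e}")
    for j1 in (0.5, 0.9, 0.99, 0.999):
        print(f"J1={j1}: Pass@{k} - Pass@1 = {multi_sample_gap(j1, k):.4f}")
```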

Additionally, direct Pass@$k$ optimization can degrade Pass@$1$ due to "prompt interference" (Barakat et al., 24 Feb 2026): Pass@$k$ reweights gradient contributions toward "hard" prompts with low $J_1(x;\theta)$, but the gradients of these prompts can be negatively aligned with the Pass@$1$ gradient. Writing $J_k(x;\theta)$ for the Pass@$k$ probability on prompt $x$, the expected gradient inner product

$$\left\langle \nabla_\theta\, \mathbb{E}_x\!\left[J_k(x;\theta)\right],\ \nabla_\theta\, \mathbb{E}_x\!\left[J_1(x;\theta)\right] \right\rangle$$

can be negative, in which case a gradient step that improves Pass@$k$ reduces Pass@$1$ to first order; this happens when negatively interfering prompts dominate. Empirically, strong negative alignment is observed for hard examples in mathematical reasoning benchmarks.
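The following toy example illustrates how the inner product can become negative. It assumes the per-prompt identity that the Pass@$k$ gradient equals $k(1 - J_1)^{k-1}$ times the Pass@$1$ gradient, and it uses two made-up prompts with hand-picked two-dimensional gradient vectors; none of these numbers come from the cited work.

```python
def weight(j1: float, k: int) -> float:
    """Per-prompt factor k * (1 - J1)^(k-1); equals 1 when k == 1."""
    return k * (1.0 - j1) ** (k - 1)


def mean_gradient(prompts, k: int):
    """Average per-prompt Pass@k gradient; k=1 recovers the Pass@1 gradient."""
    dim = len(prompts[0][1])
    total = [0.0] * dim
    for j1, grad in prompts:
        w = weight(j1, k)
        for i, g in enumerate(grad):
            total[i] += w * g
    return [t / len(prompts) for t in total]


def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))


# Hypothetical prompts: (J1, gradient of J1 w.r.t. a 2-D parameter vector).
prompts = [
    (0.90, [1.0, 0.0]),    # easy prompt, dominates the Pass@1 gradient
    (0.05, [-0.5, 0.2]),   # hard prompt, pulls in a conflicting direction
]

g1 = mean_gradient(prompts, k=1)
g8 = mean_gradient(prompts, k=8)
alignment = dot(g8, g1)

if __name__ == "__main__":
    print("Pass@1 gradient:", g1)
    print("Pass@8 gradient:", g8)
    print("inner product:", alignment)
```

Because the Pass@8 weights almost entirely discount the easy prompt, the averaged Pass@8 gradient follows the hard prompt, and its inner product with the Pass@1 gradient is negative in this constructed case.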

4. Pass@$k$ as Practical Diagnostic vs. Optimization Objective

Pass@$k$ is essential as a diagnostic for gauging the solution diversity that sampling can uncover at inference: a significant gap between Pass@$k$ and Pass@$1$ in held-out evaluation indicates that the model has rare, correct modes accessible via sampling. However, because the Pass@$k$ gradient is collinear with the Pass@$1$ gradient and its learning signal vanishes at the boundaries, direct optimization of Pass@$k$ provides no exploration benefit over Pass@$1$ and can collapse diversity (Yu, 20 Nov 2025). Instead, RL objectives promoting entropy or coverage, explicit diversity bonuses, or advantage-shaping with surrogate rewards are preferred for exploration.
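One way to compute this diagnostic gap on a held-out set is sketched below. It assumes each problem contributes a pair `(n, c)` of samples drawn and samples verified correct, and it evaluates the standard unbiased estimator in product form, which avoids large binomial coefficients. The counts in the example are invented for illustration.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate, 1 - C(n-c, k)/C(n, k), computed as a running product."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    miss = 1.0
    for i in range(n - c + 1, n + 1):
        miss *= 1.0 - k / i
    return 1.0 - miss


def coverage_gap(counts, k: int):
    """Return (mean Pass@1, mean Pass@k, gap) over per-problem (n, c) counts."""
    pass1 = sum(pass_at_k(n, c, 1) for n, c in counts) / len(counts)
    passk = sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
    return pass1, passk, passk - pass1


if __name__ == "__main__":
    # Hypothetical held-out results: 20 samples per problem.
    counts = [(20, 18), (20, 1), (20, 0), (20, 2)]
    p1, pk, gap = coverage_gap(counts, k=10)
    print(f"Pass@1={p1:.3f} Pass@10={pk:.3f} gap={gap:.3f}")
```

A large gap driven by problems such as `(20, 1)` signals rare correct modes; a problem such as `(20, 0)` contributes nothing to either metric.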

Table: Summary of the Pass@$k$ Metric's Roles

| Role | Supported? | Mechanism |
|---|---|---|
| Diagnostic tool | Yes | Measures coverage of rare correct solutions |
| Training target | No (generally) | Gradient vanishes/extremely collinear; harms exploration |
| Exploration aid | No | Collapses to Pass@$1$ as mode confidence rises |
| Reliability tuning | Indirect | Adjust $k$ at inference for desired coverage/reliability |

5. Recommendations, Limitations, and Best Practices

  • Training: Optimize for coverage/exploration using entropy-driven objectives or explicit diversity enhancement, not Pass@$k$ directly. Monitor Pass@$1$ and Pass@$k$ side by side on held-out datasets.
  • Inference: Use Pass@$k$ to decide the number of samples ($k$) that meaningfully increases coverage. A diminishing gap between Pass@$k$ and Pass@$1$ indicates "saturation" in exploration.
  • Sampling Strategies: Increase temperature, top-$k$, or top-$p$ to boost diversity, but note that excessive randomness will decrease Pass@$1$.
  • $k$ Selection: Choose the minimal $k$ that stabilizes diversity for cost-efficient inference, with the range chosen per task type (e.g., complex math versus code synthesis) and balanced against latency constraints.
  • Interpretation: In small-discrete-output regimes, large-$k$ Pass@$k$ degenerates towards the fraction of tasks with nonzero success probability, conflating guessing with reasoning (Dragoi et al., 9 Oct 2025). Complement Pass@$k$ with coverage-versus-reliability metrics (e.g., Cover@$\tau$).
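Two of the points above lend themselves to a quick calculation. The sketch below assumes i.i.d. sampling with a known per-task success probability: it finds the smallest $k$ that reaches a target coverage, and it shows how dataset-level Pass@$k$ approaches the fraction of tasks with nonzero success probability as $k$ grows. The function names, the target value, and the per-task probabilities are hypothetical.

```python
import math


def min_k_for_coverage(p: float, target: float) -> int:
    """Smallest k such that 1 - (1 - p)^k >= target."""
    if not 0.0 < p <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("require 0 < p <= 1 and 0 < target < 1")
    if p == 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))


def dataset_pass_at_k(success_probs, k: int) -> float:
    """Mean of 1 - (1 - p)^k over tasks with per-task success probabilities."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


if __name__ == "__main__":
    print("k for 90% coverage at p=0.1:", min_k_for_coverage(0.1, 0.9))
    # Hypothetical tasks: two unsolvable, one lucky-guess task, one solid task.
    probs = [0.0, 0.01, 0.6, 0.0]
    for k in (1, 10, 100, 1000):
        print(k, round(dataset_pass_at_k(probs, k), 4))
```

In the example, the task with success probability 0.01 counts almost fully at $k = 1000$, which is the degeneracy the interpretation bullet warns about.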

6. Empirical and Theoretical Implications

Empirical evaluations confirm that Pass@$k$ is highly sensitive to sampling protocol and underlying policy concentration. On real tasks, reinforced over-concentration (e.g., via RLVR) increases Pass@$1$ at the expense of diversity and Pass@$k$. The trade-off is robust to architectures and datasets, and the theoretical gradient structure predicts this consistently (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). Best practice is therefore to use Pass@$k$ for inference-time exploration auditing and to optimize with objectives that reward spread or coverage explicitly.
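The sensitivity to policy concentration can be illustrated with a toy model. The sketch assumes a categorical distribution over a handful of complete answers, with temperature applied to fixed logits; it is not a model of token-level LLM decoding, and the logits and correctness flags are invented.

```python
import math


def softmax(logits, temperature: float):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def pass_metrics(logits, correct, temperature: float, k: int):
    """Return (Pass@1, Pass@k) for a categorical policy over whole answers."""
    probs = softmax(logits, temperature)
    j1 = sum(p for p, ok in zip(probs, correct) if ok)
    return j1, 1.0 - (1.0 - j1) ** k


if __name__ == "__main__":
    logits = [3.0, 0.0]  # hypothetical: answer 0 is the dominant mode
    for name, correct in (("dominant mode correct", [True, False]),
                          ("minority mode correct", [False, True])):
        for t in (0.5, 1.0, 2.0):
            j1, jk = pass_metrics(logits, correct, t, k=8)
            print(f"{name}, T={t}: Pass@1={j1:.4f} Pass@8={jk:.4f}")
```

When the dominant mode is correct, sharpening the distribution raises Pass@$1$; when only the minority mode is correct, the same sharpening drives both Pass@$1$ and Pass@$8$ toward zero.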

7. Extensions and Future Directions

Recent research proposes more reliable evaluation protocols such as Bayesian posterior estimation (Hariri et al., 5 Oct 2025), breadth-depth metrics (Cover@$\tau$) (Dragoi et al., 9 Oct 2025), and scaling laws for Pass@$k$ as a function of cost, task hardness, and number of attempts (Levi, 2024, Kazdan et al., 6 Oct 2025). These alternatives expose the limitations of Pass@$k$-centric evaluation for fine-grained model comparison, especially in high-variance, low-coverage, or discrete-output domains. Open directions include adaptive sampling strategies, variance reduction in Pass@$k$ estimators, and metrics that jointly capture diversity and reliability.
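To give a flavor of what a Bayesian treatment of success rates can look like, the sketch below uses a generic Beta-Binomial model with a uniform Beta(1, 1) prior. This is an assumption made for illustration and is not the protocol of the cited works. Under a Beta($a$, $b$) posterior on the per-problem success probability $p$, the posterior mean of $1 - (1 - p)^k$ has the closed form $1 - \prod_{i=0}^{k-1} \frac{b + i}{a + b + i}$.

```python
def posterior_params(n: int, c: int, prior_a: float = 1.0, prior_b: float = 1.0):
    """Beta posterior parameters after observing c correct out of n samples."""
    return prior_a + c, prior_b + (n - c)


def posterior_mean_pass1(n: int, c: int) -> float:
    """Posterior mean of the single-sample success probability."""
    a, b = posterior_params(n, c)
    return a / (a + b)


def posterior_mean_pass_k(n: int, c: int, k: int) -> float:
    """Posterior mean of 1 - (1 - p)^k under the Beta posterior."""
    a, b = posterior_params(n, c)
    miss = 1.0
    for i in range(k):
        miss *= (b + i) / (a + b + i)  # E[(1 - p)^k] built up one factor at a time
    return 1.0 - miss


if __name__ == "__main__":
    # Hypothetical problem: 10 samples, 0 correct. The point estimate is 0,
    # but the posterior still assigns some chance of success in k attempts.
    for k in (1, 5, 20):
        print(k, round(posterior_mean_pass_k(10, 0, k), 4))
```

Unlike the plug-in estimate, the posterior mean stays strictly between 0 and 1 for finite $n$, which can matter when comparing models on small sample budgets.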


In summary, Pass@$k$ inference is a key multi-sample success metric characterizing coverage in verifiable LLM tasks, with robust analytic properties and well-understood limitations. While indispensable as a diagnostic of latent diversity, it is ill-suited as a direct RL objective and susceptible to collapse when used naively for optimization. Research consensus is to deploy Pass@$k$ as a principled, inference-time measurement and to pair it with exploration-motivated objectives and complementary reliability metrics in modern LLM evaluation pipelines (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026, Dragoi et al., 9 Oct 2025, Hariri et al., 5 Oct 2025).
