
Pass@1 Performance in RLVR Systems

Updated 20 April 2026
  • Pass@1 Performance is a metric that measures the probability that a model produces a correct solution on its first attempt in verifiable settings.
  • It is the central training objective in RLVR, where optimization rapidly increases one-shot accuracy while often reducing exploration diversity.
  • Empirical studies reveal a trade-off between optimizing for Pass@1 and multi-sample metrics (Pass@k), highlighting the need for balanced hybrid training methods.

Pass@1 Performance

Pass@1 is a foundational metric in the evaluation of stochastic policies for LLMs and other generative systems under verifiable settings, where the success of a single sampled output is measured. It quantifies the probability that a policy produces a correct solution on the first attempt and serves as a principal objective for reinforcement learning with verifiable rewards (RLVR) in code generation, mathematical reasoning, and logic tasks. Recent advances both clarify and question its operational importance, statistical characteristics, and trade-offs with multi-sample inference metrics such as Pass@k.

1. Formal Definition and Estimation of Pass@1

Let $p_\theta(x)$ denote the probability that a model parameterized by $\theta$ produces a correct solution given prompt $x$. For a dataset of $T$ problems, Pass@1 is defined as:

$$\mathrm{Pass@1} = \frac{1}{T} \sum_{i=1}^T p_\theta(x_i)$$

where $p_\theta(x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(x,y)]$ and $r(x,y) \in \{0,1\}$ is a deterministic reward indicating correctness (Dragoi et al., 9 Oct 2025, Barakat et al., 24 Feb 2026, Peng et al., 16 Oct 2025). In empirical evaluation, with $n$ samples and $c$ correct completions for each prompt,

$$\mathrm{Pass@1} = \mathbb{E}_x\left[\frac{c}{n}\right]$$

where the expectation is taken across the test set (Peng et al., 16 Oct 2025, Zi et al., 5 Aug 2025).
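
As a concrete illustration of this estimator, the following minimal Python sketch (an illustrative helper, not code from any of the cited papers) computes empirical Pass@1 from per-prompt correct counts.

```python
import numpy as np

def pass_at_1(correct_counts, n):
    """Empirical Pass@1: mean fraction of correct completions per prompt.

    correct_counts[i] is the number of correct samples (out of n) for prompt i.
    """
    correct_counts = np.asarray(correct_counts, dtype=float)
    return float(np.mean(correct_counts / n))

# Example: 3 prompts, 8 samples each, with 8, 2, and 0 correct completions.
print(pass_at_1([8, 2, 0], n=8))  # ≈ 0.4167
```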

2. Operational and Statistical Properties

Breadth, Depth, and Reliability

Pass@1 captures average one-shot success but collapses all problem-specific probabilities into a single average, masking the distribution of "hard" versus "easy" tasks (Dragoi et al., 9 Oct 2025). For $k \geq 1$, Pass@k reflects the probability that at least one of $k$ independent draws is correct:

$$\mathrm{Pass@}k = \mathbb{E}_x\left[1 - (1 - p_\theta(x))^k\right]$$

As $k$ increases, Pass@k converges to the fraction of tasks with $p_\theta(x) > 0$, which may reward guessing or random exploration.

A key limitation of Pass@1, as identified by "Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries" (Dragoi et al., 9 Oct 2025), is that it cannot distinguish between consistent reliability across tasks and sporadic performance concentrated on a subset. Two models can have identical Pass@1 but radically different reliability profiles. To address this, the Cover@$\tau$ metric is introduced:

$$\mathrm{Cover@}\tau = \frac{1}{T} \sum_{i=1}^T \mathbf{1}\left[p_\theta(x_i) \geq \tau\right]$$

Pass@1 is the area under the Cover@$\tau$ curve as $\tau$ varies from 0 to 1, making it a coarse summary rather than a true measure of reliable reasoning.
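
To make the breadth–depth distinction concrete, here is a minimal NumPy sketch (illustrative only; the helper names follow the definitions above rather than any released code) contrasting two models with equal Pass@1 but very different Cover@$\tau$ profiles.

```python
import numpy as np

def cover_at_tau(p, tau):
    """Cover@tau: fraction of tasks whose per-task success probability p_i >= tau."""
    return float(np.mean(np.asarray(p, dtype=float) >= tau))

def pass_at_k(p, k):
    """Pass@k under known per-task success probabilities: E_x[1 - (1 - p)^k]."""
    p = np.asarray(p, dtype=float)
    return float(np.mean(1.0 - (1.0 - p) ** k))

# Two models with identical Pass@1 but different reliability profiles.
p_reliable = np.array([0.5, 0.5, 0.5, 0.5])   # moderate success on every task
p_sporadic = np.array([1.0, 1.0, 0.0, 0.0])   # perfect on half, hopeless on half
print(pass_at_k(p_reliable, 1), pass_at_k(p_sporadic, 1))            # both 0.5
print(cover_at_tau(p_reliable, 0.9), cover_at_tau(p_sporadic, 0.9))  # 0.0 vs 0.5

# Pass@1 equals the area under the Cover@tau curve for tau in [0, 1].
taus = np.linspace(0.0, 1.0, 1001)
print(np.mean([cover_at_tau(p_reliable, t) for t in taus]))  # ≈ 0.5
```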

3. Pass@1 in RLVR Training and Policy Optimization

Standard Pass@1-Optimizing Training

Within RLVR, Pass@1 is optimized by policy gradient methods using the single-sample reward. The classic update is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\left[\left(r(x,y) - b\right)\, \nabla_\theta \log \pi_\theta(y \mid x)\right]$$

where $b$ is a baseline for variance reduction (Chen et al., 14 Aug 2025). This drives the model toward exploitation, increasing the log-probability of correct responses and concentrating mass on high-likelihood tokens or completions (Peng et al., 16 Oct 2025).
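
A minimal PyTorch-style sketch of this update (a generic REINFORCE-with-baseline step using the batch-mean reward as the baseline, not the exact GRPO implementation of the cited papers) might look as follows.

```python
import torch

def pass_at_1_pg_loss(logprobs, rewards):
    """Single-sample policy-gradient loss for Pass@1.

    logprobs: (B,) summed log-probabilities log pi_theta(y|x) of each sampled completion.
    rewards:  (B,) binary verifier rewards r(x, y) in {0, 1}.
    """
    baseline = rewards.mean()                    # variance-reduction baseline b
    advantages = (rewards - baseline).detach()   # do not backprop through the baseline
    # Minimizing this loss performs gradient ascent on E[(r - b) * log pi].
    return -(advantages * logprobs).mean()

# Toy usage with dummy values (in practice logprobs come from the model's forward pass).
logprobs = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0])
pass_at_1_pg_loss(logprobs, rewards).backward()
```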

Empirically, standard Pass@1 training can produce rapid gains in one-shot accuracy: for Qwen2.5-7B on the Enigmata validation set, Pass@1 increased from 4.8% (base) to 12.9% after GRPO (Chen et al., 14 Aug 2025). However, this comes at the expense of decreased output entropy and limited exploration, risking convergence to suboptimal local modes (Chen et al., 14 Aug 2025).

Exploration-Exploitation Trade-Off

Training solely for Pass@1 systematically reduces exploration, visible as a steady decline in policy entropy and limited diversity among incorrect samples (Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025). This leads to a mode-collapse phenomenon: once high-confidence solutions are found for some tasks, new strategies for harder tasks are rarely explored.

Nonetheless, introducing sufficient exploration, via entropy regularization or blending with Pass@k objectives, can help overcome local optima and ultimately yield higher final Pass@1 (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).

4. Pass@1 versus Pass@k: Optimization Trade-Offs and Gradient Interference

Gradient Reweighting and Prompt Interference

Several studies show a nontrivial trade-off between optimizing for Pass@k ($k > 1$) and Pass@1. "Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training" (Barakat et al., 24 Feb 2026) provides a detailed analysis: the Pass@k policy gradient upweights "hard" prompts with low $p_\theta(x)$, since for these prompts, the gain in Pass@k is maximized by increasing the chance of even a single correct response. When these prompts are "negatively interfering" (i.e., their gradients conflict with the global Pass@1 objective), Pass@k optimization can cause the Pass@1 gradient to flip direction, resulting in a drop in Pass@1 even as Pass@k increases.

Differentiating $\mathrm{Pass@}k = \mathbb{E}_x[1 - (1 - p_\theta(x))^k]$ gives the per-prompt reweighting:

$$\nabla_\theta \mathrm{Pass@}k = \mathbb{E}_x\left[k\,(1 - p_\theta(x))^{k-1}\, \nabla_\theta\, p_\theta(x)\right]$$

so Pass@k gradients are dominated by prompts with low success probability. In experiments with large LLMs (e.g., DeepSeek-R1 Distill series on MATH), a single step of Pass@k PG raised Pass@k by ≈ +0.12 while lowering Pass@1 by ≈ –0.02 (Barakat et al., 24 Feb 2026). Empirically, the hard-prompt weight $k\,(1 - p_\theta(x))^{k-1}$ can outscale easy-prompt contributions by many orders of magnitude.
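
A short sketch of this weight (computed directly from the derivative above; $k = 16$ is an arbitrary illustrative choice) shows how hard prompts dominate the Pass@k gradient.

```python
def pass_at_k_weight(p, k):
    """Weight placed on a prompt's Pass@1 gradient by the Pass@k objective:
    d/dp [1 - (1 - p)^k] = k * (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)

k = 16
for p in (0.01, 0.5, 0.99):  # hard, medium, and easy prompt
    print(f"p = {p:.2f}  weight = {pass_at_k_weight(p, k):.3e}")
# The hard prompt (p = 0.01) outweighs the easy one (p = 0.99) by many orders of magnitude.
```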

Mitigation Strategies

Mitigations discussed include hybrid objectives that interpolate between $k = 1$ (Pass@1) and $k > 1$ (Pass@k); gradient surgery to trim negatively interfering components using an inter-prompt gradient-alignment kernel; tunable reweighting to avoid over-amplifying low-$p_\theta(x)$ prompts; and curriculum or balanced sampling to prevent hard subpopulations from dominating (Barakat et al., 24 Feb 2026).

Conversely, "Pass@K Policy Optimization" (Walder et al., 21 May 2025) demonstrates that properly annealing xx9—starting with large TT0 to foster exploration, then reducing to TT1 for exploitation—yields simultaneous improvements in both Pass@1 and Pass@k, especially on hard tasks where pure Pass@1 optimization stalls.

5. Empirical Results

Code and Math Reasoning Tasks

On code generation benchmarks (HumanEval, ParEval), pass@1 is computed as the fraction of problems solved on the first sampled completion. The "PartialOrderEval" framework explores how pass@1 scales with prompt detail (Zi et al., 5 Aug 2025), revealing substantial pass@1 gains from more specific prompts—e.g., Qwen2.5-Coder-14B on HumanEval: pass@1 increases from 0.28 (minimal prompt) to 0.86 (detailed prompt), plateauing or slightly declining for excessive verbosity. Larger LLMs require less prompt detail to achieve a given pass@1, but niche or challenging domains (ParEval-OpenMP) continue to benefit across a wider range of prompt elaborations.

On mathematical reasoning tasks, multiple RLVR algorithms demonstrate significant Pass@1 jumps following post-training: in Reasoning Gym, Pass@1 rises from ≈50% (base) to ≈59% (GRPO), and up to ≈58.6% for KL-Cov (Dragoi et al., 9 Oct 2025). On OMEGA OOD, KL-Cov achieves 28.34% Pass@1 compared to 8.34% for the base model.

RLVR Optimization Effects

Optimization under Pass@1 reward reliably increases the correct-answer rate at the cost of reduced exploration (Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025). Pass@k-based training and hybrid advantage shaping (e.g., interpolating between Pass@1 and Pass@k, or targeting harder examples through hardness weighting) further accelerate Pass@1 convergence and can avoid early stagnation (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).

However, naive Pass@k policy gradients may depress Pass@1, especially when dominated by hard, negatively interfering prompts (Barakat et al., 24 Feb 2026).

6. Methodological Nuances: Estimation, Variance, and Reward Transformations

Low-Variance Estimators

Pass@1 admits simple unbiased estimators (empirical fraction correct per prompt). For Pass@k, unbiased low-variance estimators are more complex and are based on combinatorial counts of positive samples in the batch (Walder et al., 21 May 2025, Peng et al., 16 Oct 2025). Leave-one-out (loo) baselining and "loo-1" variants reduce variance in policy gradients, enabling efficient RLVR implementations with drop-in reward transformation functions (e.g., sloo_minus_one(g, k)) (Walder et al., 21 May 2025). These estimators generalize to continuous reward settings as well.
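
For reference, the standard unbiased combinatorial estimator of Pass@k from $n$ samples with $c$ correct is sketched below (illustrative only; this is not the sloo_minus_one reward transformation itself).

```python
from math import comb

def unbiased_pass_at_k(n, c, k):
    """Unbiased estimate of Pass@k from n samples of which c are correct:
    1 - C(n - c, k) / C(n, k). For k = 1 this reduces to c / n (Pass@1)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per prompt, 3 of them correct.
print(unbiased_pass_at_k(16, 3, 1))  # 0.1875 = 3/16 (Pass@1)
print(unbiased_pass_at_k(16, 3, 8))  # 0.9: chance at least one of 8 draws is correct
```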

Analytical Advantage-Shaping

Closed-form expressions for policy advantages under Pass@1 and Pass@k provide actionable structures for adaptive advantage design:

$$A_i^{\mathrm{Pass@1}} = r_i - \bar{r}$$

with $\bar{r}$ the batch mean reward (Chen et al., 14 Aug 2025). Blending these with Pass@k-style advantages or using online signals such as rollout entropy allows for adaptive training regimes.
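
As an illustration, here is a minimal sketch of the Pass@1 advantage above, plus an optional interpolation toward a Pass@k-style advantage (the blending coefficient alpha and the precomputed pass_at_k_adv input are hypothetical, not the exact shaping of Chen et al.):

```python
import numpy as np

def pass_at_1_advantages(rewards):
    """Closed-form Pass@1 advantages for a group of rollouts on the same prompt:
    A_i = r_i - mean(r), using the batch mean reward as the baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def blended_advantages(rewards, pass_at_k_adv, alpha=0.5):
    """Hypothetical blend of Pass@1 and precomputed Pass@k-style advantages;
    alpha = 1 recovers pure Pass@1 shaping."""
    return alpha * pass_at_1_advantages(rewards) + (1.0 - alpha) * np.asarray(pass_at_k_adv)

# Example: 4 rollouts for one prompt, one of them correct.
print(pass_at_1_advantages([1.0, 0.0, 0.0, 0.0]))  # [ 0.75 -0.25 -0.25 -0.25]
```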

7. Practical Implications, Benchmarks, and Limitations

Pass@1 remains a baseline operational constraint in many deployments due to latency, cost, verifier coverage, and fallback requirements (Barakat et al., 24 Feb 2026, Peng et al., 16 Oct 2025). Its simplicity, statistical interpretability, and direct correspondence to one-shot correctness underpin its ubiquity. However, as highlighted in "Beyond Pass@k" (Dragoi et al., 9 Oct 2025), Pass@1 neither reveals consistent high-reliability performance nor discriminates between shallow exploration and deep, reliable reasoning. High Pass@1 may mask brittle performance or local optima.

Cover@$\tau$ and similar breadth–depth metrics provide essential context to avoid misleading extrapolations from Pass@1 scores alone. Moreover, optimizing only Pass@1 risks policy collapse, whereas well-controlled multi-sample (Pass@k) training with reward transformation or annealing schedules can simultaneously boost diversity and single-sample accuracy, provided that gradient conflict (prompt interference) is managed (Walder et al., 21 May 2025, Barakat et al., 24 Feb 2026).


Summary Table: Pass@1 in Recent RLVR Algorithms and Benchmarks

| Study / Algorithm | Pass@1 Empirical Gains | Notable Observations |
|---|---|---|
| GRPO, RLVR (math/coding tasks) | +10% to +170% after training | Policy collapse, over-concentration (Peng et al., 16 Oct 2025) |
| PKPO (anneal k: 8→1) | +15–20% over Pass@1 baseline; no early plateau | High k for exploration, low k for exploitation (Walder et al., 21 May 2025) |
| SimKO | +0.5–1.7 pp gain over GRPO | Slight Pass@1 increase, large Pass@k boost (Peng et al., 16 Oct 2025) |
| Pure Pass@k PG (uncontrolled) | Pass@k ↑, Pass@1 can ↓ | Prompt interference, gradient conflict (Barakat et al., 24 Feb 2026) |
| Prompt specificity (codegen) | e.g. 0.28→0.86 (HumanEval) by adding detail | Larger models need less detail (Zi et al., 5 Aug 2025) |

Pass@1 offers a crucial yet incomplete lens for evaluating and tuning LLMs in verifiable reasoning settings. Its operational value, statistical properties, and optimization behavior are now understood as interdependent with exploration, sample diversity, and broader reliability metrics. Current research characterizes both its strengths and its systematic limitations: holistic evaluation and training increasingly require moving beyond Pass@1 in isolation.
