
Pass@1 Performance in RLVR Systems

Updated 20 April 2026
  • Pass@1 Performance is a metric that measures the probability that a model produces a correct solution on its first attempt in verifiable settings.
  • It is the central training objective in RLVR, where optimization rapidly increases one-shot accuracy while often reducing exploration diversity.
  • Empirical studies reveal a trade-off between optimizing for Pass@1 and multi-sample metrics (Pass@k), highlighting the need for balanced hybrid training methods.

Pass@1 Performance

Pass@1 is a foundational metric in the evaluation of stochastic policies for LLMs and other generative systems under verifiable settings, where the success of a single sampled output is measured. It quantifies the probability that a policy produces a correct solution on the first attempt and serves as a principal objective for reinforcement learning with verifiable rewards (RLVR) in code generation, mathematical reasoning, and logic tasks. Recent advances both clarify and question its operational importance, statistical characteristics, and trade-offs with multi-sample inference metrics such as Pass@k.

1. Formal Definition and Estimation of Pass@1

Let $p_\theta(x)$ denote the probability that a model parameterized by $\theta$ produces a correct solution given prompt $x$. For a dataset of $T$ problems, Pass@1 is defined as:

$$\mathrm{Pass@1} = \frac{1}{T} \sum_{i=1}^T p_\theta(x_i)$$

where $p_\theta(x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(x,y)]$ and $r(x,y) \in \{0,1\}$ is a deterministic reward indicating correctness (Dragoi et al., 9 Oct 2025, Barakat et al., 24 Feb 2026, Peng et al., 16 Oct 2025). In empirical evaluation, with $n$ samples and $c$ correct completions for each prompt,

$$\mathrm{Pass@1} = \mathbb{E}_x\left[\frac{c}{n}\right]$$

where the expectation is taken across the test set (Peng et al., 16 Oct 2025, Zi et al., 5 Aug 2025).
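
As a concrete illustration of this estimator, the following minimal Python sketch (an illustrative helper, not code from any of the cited papers) computes empirical Pass@1 from per-prompt correct counts.

```python
import numpy as np

def pass_at_1(correct_counts, n):
    """Empirical Pass@1: mean fraction of correct completions per prompt.

    correct_counts[i] is the number of correct samples (out of n) for prompt i.
    """
    correct_counts = np.asarray(correct_counts, dtype=float)
    return float(np.mean(correct_counts / n))

# Example: 3 prompts, 8 samples each, with 8, 2, and 0 correct completions.
print(pass_at_1([8, 2, 0], n=8))  # ≈ 0.4167
```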

2. Operational and Statistical Properties

Breadth, Depth, and Reliability

Pass@1 captures average one-shot success but collapses all problem-specific probabilities into a single average, masking the distribution of "hard" versus "easy" tasks (Dragoi et al., 9 Oct 2025). For $k \geq 1$, Pass@k reflects the probability that at least one of $k$ independent draws is correct:

$$\mathrm{Pass@}k = \mathbb{E}_x\left[1 - (1 - p_\theta(x))^k\right]$$

As $k$ increases, Pass@k converges to the fraction of tasks with $p_\theta(x) > 0$, which may reward guessing or random exploration.

A key limitation of Pass@1, as identified by "Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries" (Dragoi et al., 9 Oct 2025), is that it cannot distinguish between consistent reliability across tasks and sporadic performance concentrated on a subset. Two models can have identical Pass@1 but radically different reliability profiles. To address this, the Cover@$\tau$ metric is introduced:

$$\mathrm{Cover@}\tau = \frac{1}{T} \sum_{i=1}^T \mathbf{1}\left[p_\theta(x_i) \geq \tau\right]$$

Pass@1 is the area under the Cover@$\tau$ curve as $\tau$ varies from 0 to 1, making it a coarse summary rather than a true measure of reliable reasoning.
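
To make the breadth–depth distinction concrete, here is a minimal NumPy sketch (illustrative only; the helper names follow the definitions above rather than any released code) contrasting two models with equal Pass@1 but very different Cover@$\tau$ profiles.

```python
import numpy as np

def cover_at_tau(p, tau):
    """Cover@tau: fraction of tasks whose per-task success probability p_i >= tau."""
    return float(np.mean(np.asarray(p, dtype=float) >= tau))

def pass_at_k(p, k):
    """Pass@k under known per-task success probabilities: E_x[1 - (1 - p)^k]."""
    p = np.asarray(p, dtype=float)
    return float(np.mean(1.0 - (1.0 - p) ** k))

# Two models with identical Pass@1 but different reliability profiles.
p_reliable = np.array([0.5, 0.5, 0.5, 0.5])   # moderate success on every task
p_sporadic = np.array([1.0, 1.0, 0.0, 0.0])   # perfect on half, hopeless on half
print(pass_at_k(p_reliable, 1), pass_at_k(p_sporadic, 1))            # both 0.5
print(cover_at_tau(p_reliable, 0.9), cover_at_tau(p_sporadic, 0.9))  # 0.0 vs 0.5

# Pass@1 equals the area under the Cover@tau curve for tau in [0, 1].
taus = np.linspace(0.0, 1.0, 1001)
print(np.mean([cover_at_tau(p_reliable, t) for t in taus]))  # ≈ 0.5
```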

3. Pass@1 in RLVR Training and Policy Optimization

Standard Pass@1-Optimizing Training

Within RLVR, Pass@1 is optimized by policy gradient methods using the single-sample reward. The classic update is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\left[\left(r(x,y) - b\right)\, \nabla_\theta \log \pi_\theta(y \mid x)\right]$$

where $b$ is a baseline for variance reduction (Chen et al., 14 Aug 2025). This drives the model toward exploitation, increasing the log-probability of correct responses and concentrating mass on high-likelihood tokens or completions (Peng et al., 16 Oct 2025).
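
A minimal PyTorch-style sketch of this update (a generic REINFORCE-with-baseline step using the batch-mean reward as the baseline, not the exact GRPO implementation of the cited papers) might look as follows.

```python
import torch

def pass_at_1_pg_loss(logprobs, rewards):
    """Single-sample policy-gradient loss for Pass@1.

    logprobs: (B,) summed log-probabilities log pi_theta(y|x) of each sampled completion.
    rewards:  (B,) binary verifier rewards r(x, y) in {0, 1}.
    """
    baseline = rewards.mean()                    # variance-reduction baseline b
    advantages = (rewards - baseline).detach()   # do not backprop through the baseline
    # Minimizing this loss performs gradient ascent on E[(r - b) * log pi].
    return -(advantages * logprobs).mean()

# Toy usage with dummy values (in practice logprobs come from the model's forward pass).
logprobs = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0])
pass_at_1_pg_loss(logprobs, rewards).backward()
```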

Empirically, standard Pass@1 training can produce rapid gains in one-shot accuracy: for Qwen2.5-7B on the Enigmata validation set, Pass@1 increased from 4.8% (base) to 12.9% after GRPO (Chen et al., 14 Aug 2025). However, this comes at the expense of decreased output entropy and limited exploration, risking convergence to suboptimal local modes (Chen et al., 14 Aug 2025).

Exploration-Exploitation Trade-Off

Training solely for Pass@1 systematically reduces exploration, visible as a steady decline in policy entropy and limited diversity among incorrect samples (Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025). This leads to a mode-collapse phenomenon: once high-confidence solutions are found for some tasks, new strategies for harder tasks are rarely explored.

Nonetheless, introducing sufficient exploration, via entropy regularization or blending with Pass@k objectives, can help overcome local optima and ultimately yield higher final Pass@1 (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).

4. Pass@1 versus Pass@k: Optimization Trade-Offs and Gradient Interference

Gradient Reweighting and Prompt Interference

Several studies show a nontrivial trade-off between optimizing for Pass@k ($k > 1$) and Pass@1. "Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training" (Barakat et al., 24 Feb 2026) provides a detailed analysis: the Pass@k policy gradient upweights "hard" prompts with low $p_\theta(x)$, since for these prompts, the gain in Pass@k is maximized by increasing the chance of even a single correct response. When these prompts are "negatively interfering" (i.e., their gradients conflict with the global Pass@1 objective), Pass@k optimization can cause the Pass@1 gradient to flip direction, resulting in a drop in Pass@1 even as Pass@k increases.

Differentiating $\mathrm{Pass@}k = \mathbb{E}_x[1 - (1 - p_\theta(x))^k]$ gives the per-prompt reweighting:

$$\nabla_\theta \mathrm{Pass@}k = \mathbb{E}_x\left[k\,(1 - p_\theta(x))^{k-1}\, \nabla_\theta\, p_\theta(x)\right]$$

so Pass@k gradients are dominated by prompts with low success probability. In experiments with large LLMs (e.g., DeepSeek-R1 Distill series on MATH), a single step of Pass@k PG raised Pass@k by ≈ +0.12 while lowering Pass@1 by ≈ –0.02 (Barakat et al., 24 Feb 2026). Empirically, the hard-prompt weight $k\,(1 - p_\theta(x))^{k-1}$ can outscale easy-prompt contributions by many orders of magnitude.
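
A short sketch of this weight (computed directly from the derivative above; $k = 16$ is an arbitrary illustrative choice) shows how hard prompts dominate the Pass@k gradient.

```python
def pass_at_k_weight(p, k):
    """Weight placed on a prompt's Pass@1 gradient by the Pass@k objective:
    d/dp [1 - (1 - p)^k] = k * (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)

k = 16
for p in (0.01, 0.5, 0.99):  # hard, medium, and easy prompt
    print(f"p = {p:.2f}  weight = {pass_at_k_weight(p, k):.3e}")
# The hard prompt (p = 0.01) outweighs the easy one (p = 0.99) by many orders of magnitude.
```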

Mitigation Strategies

Mitigations discussed include hybrid objectives that interpolate between $k = 1$ (Pass@1) and $k > 1$ (Pass@k); gradient surgery to trim negatively interfering components using an inter-prompt gradient-alignment kernel; tunable reweighting to avoid over-amplifying low-$p_\theta(x)$ prompts; and curriculum or balanced sampling to prevent hard subpopulations from dominating (Barakat et al., 24 Feb 2026).

Conversely, "Pass@K Policy Optimization" (Walder et al., 21 May 2025) demonstrates that properly annealing xx9—starting with large TT0 to foster exploration, then reducing to TT1 for exploitation—yields simultaneous improvements in both Pass@1 and Pass@k, especially on hard tasks where pure Pass@1 optimization stalls.

5. Empirical Results

Code and Math Reasoning Tasks

On code generation benchmarks (HumanEval, ParEval), pass@1 is computed as the fraction of problems solved on the first sampled completion. The "PartialOrderEval" framework explores how pass@1 scales with prompt detail (Zi et al., 5 Aug 2025), revealing substantial pass@1 gains from more specific prompts—e.g., Qwen2.5-Coder-14B on HumanEval: pass@1 increases from 0.28 (minimal prompt) to 0.86 (detailed prompt), plateauing or slightly declining for excessive verbosity. Larger LLMs require less prompt detail to achieve a given pass@1, but niche or challenging domains (ParEval-OpenMP) continue to benefit across a wider range of prompt elaborations.

On mathematical reasoning tasks, multiple RLVR algorithms demonstrate significant Pass@1 jumps following post-training: in Reasoning Gym, Pass@1 rises from ≈50% (base) to ≈59% (GRPO), and up to ≈58.6% for KL-Cov (Dragoi et al., 9 Oct 2025). On OMEGA OOD, KL-Cov achieves 28.34% Pass@1 compared to 8.34% for the base model.

RLVR Optimization Effects

Optimization under Pass@1 reward reliably increases the correct-answer rate at the cost of reduced exploration (Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025). Pass@k-based training and hybrid advantage shaping (e.g., interpolating between Pass@1 and Pass@k, or targeting harder examples through hardness weighting) further accelerate Pass@1 convergence and can avoid early stagnation (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).

However, naive Pass@k policy gradients may depress Pass@1, especially when dominated by hard, negatively interfering prompts (Barakat et al., 24 Feb 2026).

6. Methodological Nuances: Estimation, Variance, and Reward Transformations

Low-Variance Estimators

Pass@1 admits simple unbiased estimators (empirical fraction correct per prompt). For Pass@k, unbiased low-variance estimators are more complex and are based on combinatorial counts of positive samples in the batch (Walder et al., 21 May 2025, Peng et al., 16 Oct 2025). Leave-one-out (loo) baselining and "loo-1" variants reduce variance in policy gradients, enabling efficient RLVR implementations with drop-in reward transformation functions (e.g., sloo_minus_one(g, k)) (Walder et al., 21 May 2025). These estimators generalize to continuous reward settings as well.
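
For reference, the standard unbiased combinatorial estimator of Pass@k from $n$ samples with $c$ correct is sketched below (illustrative only; this is not the sloo_minus_one reward transformation itself).

```python
from math import comb

def unbiased_pass_at_k(n, c, k):
    """Unbiased estimate of Pass@k from n samples of which c are correct:
    1 - C(n - c, k) / C(n, k). For k = 1 this reduces to c / n (Pass@1)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per prompt, 3 of them correct.
print(unbiased_pass_at_k(16, 3, 1))  # 0.1875 = 3/16 (Pass@1)
print(unbiased_pass_at_k(16, 3, 8))  # 0.9: chance at least one of 8 draws is correct
```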

Analytical Advantage-Shaping

Closed-form expressions for policy advantages under Pass@1 and Pass@k provide actionable structures for adaptive advantage design:

$$A_i^{\mathrm{Pass@1}} = r_i - \bar{r}$$

with $\bar{r}$ the batch mean reward (Chen et al., 14 Aug 2025). Blending these with Pass@k-style advantages or using online signals such as rollout entropy allows for adaptive training regimes.
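
As an illustration, here is a minimal sketch of the Pass@1 advantage above, plus an optional interpolation toward a Pass@k-style advantage (the blending coefficient alpha and the precomputed pass_at_k_adv input are hypothetical, not the exact shaping of Chen et al.):

```python
import numpy as np

def pass_at_1_advantages(rewards):
    """Closed-form Pass@1 advantages for a group of rollouts on the same prompt:
    A_i = r_i - mean(r), using the batch mean reward as the baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def blended_advantages(rewards, pass_at_k_adv, alpha=0.5):
    """Hypothetical blend of Pass@1 and precomputed Pass@k-style advantages;
    alpha = 1 recovers pure Pass@1 shaping."""
    return alpha * pass_at_1_advantages(rewards) + (1.0 - alpha) * np.asarray(pass_at_k_adv)

# Example: 4 rollouts for one prompt, one of them correct.
print(pass_at_1_advantages([1.0, 0.0, 0.0, 0.0]))  # [ 0.75 -0.25 -0.25 -0.25]
```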

7. Practical Implications, Benchmarks, and Limitations

Pass@1 remains a baseline operational constraint in many deployments due to latency, cost, verifier coverage, and fallback requirements (Barakat et al., 24 Feb 2026, Peng et al., 16 Oct 2025). Its simplicity, statistical interpretability, and direct correspondence to one-shot correctness underpin its ubiquity. However, as highlighted in "Beyond Pass@k" (Dragoi et al., 9 Oct 2025), Pass@1 neither reveals consistent high-reliability performance nor discriminates between shallow exploration and deep, reliable reasoning. High Pass@1 may mask brittle performance or local optima.

Cover@$\tau$ and similar breadth–depth metrics provide essential context to avoid misleading extrapolations from Pass@1 scores alone. Moreover, optimizing only Pass@1 risks policy collapse, whereas well-controlled multi-sample (Pass@k) training with reward transformation or annealing schedules can simultaneously boost diversity and single-sample accuracy, provided that gradient conflict (prompt interference) is managed (Walder et al., 21 May 2025, Barakat et al., 24 Feb 2026).


Summary Table: Pass@1 in Recent RLVR Algorithms and Benchmarks

| Study / Algorithm | Pass@1 Empirical Gains | Notable Observations |
|---|---|---|
| GRPO, RLVR (math/coding tasks) | +10% to +170% after training | Policy collapse, over-concentration (Peng et al., 16 Oct 2025) |
| PKPO (anneal k: 8→1) | +15–20% over Pass@1 baseline; no early plateau | High k for exploration, low k for exploitation (Walder et al., 21 May 2025) |
| SimKO | +0.5–1.7 pp gain over GRPO | Slight Pass@1 increase, large Pass@k boost (Peng et al., 16 Oct 2025) |
| Pure Pass@k PG (uncontrolled) | Pass@k ↑, Pass@1 can ↓ | Prompt interference, gradient conflict (Barakat et al., 24 Feb 2026) |
| Prompt specificity (codegen) | e.g. 0.28→0.86 (HumanEval) by adding detail | Larger models need less detail (Zi et al., 5 Aug 2025) |

Pass@1 offers a crucial yet incomplete lens for evaluating and tuning LLMs in verifiable reasoning settings. Its operational value, statistical properties, and optimization behavior are now understood as interdependent with exploration, sample diversity, and broader reliability metrics. Current research characterizes both its strengths and its systematic limitations: holistic evaluation and training increasingly require moving beyond Pass@1 in isolation.
