Pass@k Optimization-Induced Pass@1 Degradation
- The paper demonstrates that optimizing pass@k in RLVR shifts gradient emphasis towards low-success instances, leading to pass@1 degradation.
- It reveals that vanishing learning signals and prompt interference cause exploration collapse and reduced single-shot accuracy.
- Mitigation strategies such as entropy regularization and multi-objective training help balance exploration with high single-trial performance.
Pass@k optimization-induced pass@1 degradation is the empirically and theoretically documented phenomenon that arises when reinforcement learning with verifiable rewards (RLVR) directly optimizes the pass@k metric (the probability that at least one of k independent samples from a policy is correct) rather than the conventional pass@1 (single-sample) success rate. While intuitively appealing as a way to encourage exploration, policy-gradient ascent on pass@k can, under typical conditions, reduce or stall single-trial accuracy (pass@1), especially on tasks featuring discrete answer spaces, hard/easy prompt interference, or policy entropy collapse. This trade-off has been formulated rigorously in recent literature, with studies elucidating the underlying gradient dynamics, empirical manifestations in LLMs, and strategies for mitigation (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025, Chen et al., 14 Aug 2025, Barakat et al., 24 Feb 2026, Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025, Liang et al., 19 Aug 2025).
1. Theoretical Foundations: Pass@k and Its Policy Gradient
Given a prompt $x$ and an RL policy $\pi_\theta$, pass@1 is $p(x) = \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}[r(x,y)]$, the probability that a single sample is correct. Pass@k generalizes this to $1 - (1 - p(x))^k$, the probability that at least one of $k$ independent samples is correct (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). The pass@k policy gradient is given by

$$\nabla_\theta \bigl[ 1 - (1 - p(x))^k \bigr] = k\,(1 - p(x))^{k-1}\,\nabla_\theta p(x)$$

(Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026, Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025). This implies that optimizing pass@k is a nonnegative, per-instance reweighting of the pass@1 gradient, with scalar factor $w_k(p) = k(1-p)^{k-1}$. As such, the direction of each per-prompt gradient does not change, but the allocation of gradient magnitude shifts toward instances with low single-sample success (i.e., hard prompts).
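The reweighting can be checked numerically in a few lines (a minimal sketch; the helper names `passk` and `passk_weight` are illustrative, not from the cited works):

```python
def passk(p: float, k: int) -> float:
    """Pass@k for per-sample success probability p: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

def passk_weight(p: float, k: int) -> float:
    """Scalar reweighting factor d/dp [1 - (1 - p)^k] = k (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)

# The pass@k gradient is the pass@1 gradient scaled by a nonnegative factor,
# so only the per-prompt magnitude changes, never the per-prompt direction.
for p in (0.05, 0.5, 0.95):
    print(p, passk_weight(p, k=8))  # weight concentrates on low-success prompts
```

At k=1 the weight is identically 1, recovering plain pass@1 optimization.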
2. Mechanisms and Regimes of Degradation
Vanishing Learning Signal and Exploration Collapse
Two principal pathologies emerge when optimizing pass@k (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025):
- Low-success regime: For instances where $p(x) \approx 0$, the weight $k(1-p(x))^{k-1} \approx k$ is large, but the empirical REINFORCE gradient is virtually always zero because correct samples are rarely observed. Thus, even with high reweighting, the signal vanishes.
- High-success regime: As $p(x) \to 1$, $k(1-p(x))^{k-1} \to 0$, leading to near-zero gradients for problems the model already solves well. The policy ceases to fine-tune confident outputs, which impedes further improvement in pass@1.
When the policy collapses onto a single solution mode (exploration collapse), both pass@1 and pass@k saturate, and the gap shrinks to zero. In practice, this often means the model over-commits to the earliest discovered mode, potentially missing better optima and locking out improved solutions (Yu, 20 Nov 2025).
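The two regimes can be made concrete with toy numbers (an illustrative sketch assuming plain REINFORCE with 0/1 rewards, where a batch containing no correct sample yields a zero gradient estimate; batch size and probabilities below are arbitrary):

```python
def passk_weight(p: float, k: int) -> float:
    """Pass@k reweighting factor k (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)

def p_nonzero_signal(p: float, n: int) -> float:
    """Probability that a batch of n samples contains at least one correct
    completion, i.e. that the 0/1-reward REINFORCE estimate is nonzero."""
    return 1.0 - (1.0 - p) ** n

p_hard, p_easy = 1e-3, 0.999
# Hard prompt: weight near k, but almost no batch ever produces a signal.
print(passk_weight(p_hard, 16), p_nonzero_signal(p_hard, 8))
# Easy prompt: weight collapses to ~0 (and with a mean baseline, the
# all-correct batch has zero advantage anyway).
print(passk_weight(p_easy, 16), p_nonzero_signal(p_easy, 8))
```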
Prompt Interference and Gradient Conflict
Recent work introduces the concept of "prompt interference" to formalize how optimizing pass@k may actively decrease pass@1 (Barakat et al., 24 Feb 2026). The key object is the prompt-interference kernel $K(x, x') = \langle \nabla_\theta p(x), \nabla_\theta p(x') \rangle$, the inner product between per-prompt pass@1 gradients. Prompts whose aggregate interference with the rest of the distribution is negative are termed negatively interfering: improving performance on such prompts can directly reduce the average pass@1. Pass@k policy gradients disproportionately emphasize these hard, negatively interfering samples, and if their aggregated covariance is sufficiently large and negative, the overall step in parameter space provably rotates away from the pass@1 ascent direction, decreasing single-shot accuracy even as pass@k increases (Propositions 3.1 and 3.4 in (Barakat et al., 24 Feb 2026)).
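The rotation effect can be reproduced with two toy per-prompt gradients (hypothetical numbers, chosen so the hard prompt interferes negatively; the kernel here is simply the gradient inner product):

```python
import numpy as np

# Toy pass@1 gradients for two prompts:
g_easy = np.array([2.0, 0.0])    # p(easy) = 0.90
g_hard = np.array([-1.5, 0.5])   # p(hard) = 0.05, negatively interfering

def passk_weight(p: float, k: int) -> float:
    return k * (1.0 - p) ** (k - 1)

grad_pass1 = g_easy + g_hard
grad_passk = passk_weight(0.90, 16) * g_easy + passk_weight(0.05, 16) * g_hard

# The hard prompt's gradient opposes the averaged pass@1 direction ...
print(g_hard @ grad_pass1)      # negative: negatively interfering
# ... and the pass@k weighting amplifies it enough that the update as a
# whole points away from the pass@1 ascent direction.
print(grad_passk @ grad_pass1)  # negative inner product
```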
3. Empirical Manifestations in RLVR and LLM Fine-tuning
Empirical studies confirm the theoretical findings across mathematical reasoning, code synthesis, and traversal tasks (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025, Chen et al., 14 Aug 2025, Walder et al., 21 May 2025, Barakat et al., 24 Feb 2026):
- Crossover Phenomenon: RLVR models fine-tuned for pass@k often yield higher performance at small k (e.g., pass@1, pass@4), but for large k, pretrained base models surpass RLVR models in pass@k, and RLVR models may demonstrate lower pass@1 on held-out tasks (Dragoi et al., 9 Oct 2025).
- Entropy Collapse: RLVR geared toward pass@1 drives policy entropy downward, leading to limited sample diversity and stalling of pass@k at large k (Liang et al., 19 Aug 2025).
- Gradient Allocation Bias: Optimization on pass@k with large k specifically targets increasing the coverage (breadth) of low-success prompts (lifting $p(x)$ from near 0 to near 0.1), sacrificing the refinement (depth) of tasks with $p(x)$ near 1 and thereby negatively impacting pass@1 (Dragoi et al., 9 Oct 2025, Chen et al., 14 Aug 2025).
- Negative Covariance Confirmation: Quantitative MATH dataset experiments demonstrate that the inner product between pass@k and pass@1 gradients can become negative when pass@k's prompt weighting aligns with negatively interfering prompts (Barakat et al., 24 Feb 2026).
4. Analytical and Algorithmic Explanations
Gradient and Advantage Structures
Pass@k policy gradients, both in REINFORCE and in advantage-shaping schemes (GRPO, RLOO), can be interpreted as mixtures of per-sample gradients, with sharp reweighting toward hard examples and vanishing signals for easy ones (Thrampoulidis et al., 27 Oct 2025, Chen et al., 14 Aug 2025, Walder et al., 21 May 2025). For instance, the sum of absolute advantages peaks at intermediate accuracy under pass@k and rapidly decays in the high-accuracy regime, sharply down-weighting further improvement in high-confidence cases. Direct pass@k training therefore "kills" the gradient on easy examples, entrenching mediocre pass@1 performance (Chen et al., 14 Aug 2025, Thrampoulidis et al., 27 Oct 2025).
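A toy model makes the peak-and-decay shape visible (an assumed simplification, not the exact GRPO/RLOO expressions: with n binary rewards and c correct, the sum of absolute mean-centered rewards is 2c(n−c)/n, which is then scaled by the pass@k factor):

```python
def abs_advantage_mass(p: float, n: int = 16) -> float:
    """Sum of |r_i - mean(r)| for n binary rewards with fraction p correct:
    2 c (n - c) / n with c = p n; peaks at p = 0.5."""
    c = p * n
    return 2.0 * c * (n - c) / n

def passk_shaped_mass(p: float, k: int, n: int = 16) -> float:
    """Same mass scaled by the pass@k factor k (1 - p)^(k - 1): the peak
    shifts toward hard prompts and the high-accuracy regime goes to ~0."""
    return k * (1.0 - p) ** (k - 1) * abs_advantage_mass(p, n)

for p in (0.1, 0.3, 0.5, 0.9):
    print(p, round(abs_advantage_mass(p), 3), round(passk_shaped_mass(p, k=8), 3))
```

Under this model the shaped mass is proportional to p(1−p)^k, peaking near p = 1/(k+1).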
Impact of Discrete Answer Spaces
In domains with small, discrete answer spaces (e.g., numeric math), pass@k optimization at large k is especially prone to misleading the practitioner: any nonzero probability mass on the correct answer ('random guessing') is, by combinatorial accumulation, sufficient for high pass@k even if $p(x)$ is tiny. This degeneracy results in models that appear highly capable at large sample budgets yet have unreliable or even degraded single-shot accuracy (Dragoi et al., 9 Oct 2025, Yu, 20 Nov 2025).
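A two-line sanity check of this degeneracy, assuming uniform guessing over a 10-way answer space:

```python
def passk(p: float, k: int) -> float:
    """Pass@k for per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k

# Uniform guessing over 10 possible answers:
p_guess = 1 / 10
print(passk(p_guess, 1))   # 0.1: unreliable single-shot accuracy
print(passk(p_guess, 64))  # near 1: looks "capable" at a large sample budget
```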
5. Mitigation Strategies and Advanced Techniques
To counteract pass@k-induced pass@1 degradation, a suite of strategies has been proposed and validated:
- Explicit Exploration Bonuses: Incorporate entropy regularization or maximum-coverage rewards to maintain exploration and avoid premature mode collapse (Yu, 20 Nov 2025, Liang et al., 19 Aug 2025).
- Multi-objective Training: Jointly optimize convex combinations of pass@1 and pass@k or use risk-sensitive objectives to balance breadth and depth (Barakat et al., 24 Feb 2026, Chen et al., 14 Aug 2025, Walder et al., 21 May 2025).
- Advantage Shaping and Blending: Apply combination advantage functions, blend per-example gradients by current accuracy or entropy, or use smoothed surrogate rewards to preserve useful learning signals on both hard and easy prompts (Thrampoulidis et al., 27 Oct 2025, Chen et al., 14 Aug 2025).
- Tempered Reweighting: Replace k in the weighting function with a tunable γ or anneal k during training, starting from k>1 for exploration and reducing to k=1 for exploitation (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025, Yu, 20 Nov 2025).
- Gradient Surgery and Prompt Curation: Project out negatively interfering gradient directions, or train on subsets of prompts with more conservative objectives (Barakat et al., 24 Feb 2026).
- Online Data Diversification (SvS): Use self-play and variational problem synthesis to continually synthesize challenging variants, thereby promoting entropy and sustaining both pass@k and pass@1 (Liang et al., 19 Aug 2025).
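The tempered/annealed variant can be sketched as follows (the linear schedule below is an assumption for illustration, not a prescription from the cited works):

```python
def tempered_weight(p: float, gamma: float) -> float:
    """Pass@k-style reweighting with k replaced by a tunable gamma:
    gamma (1 - p)^(gamma - 1).  gamma = 1 recovers the pass@1 gradient."""
    return gamma * (1.0 - p) ** (gamma - 1.0)

def anneal_gamma(step: int, total_steps: int, k: int = 8) -> float:
    """Linearly anneal gamma from k (exploration) down to 1 (exploitation)."""
    frac = min(step / total_steps, 1.0)
    return k + (1.0 - k) * frac

for step in (0, 500, 1000):
    g = anneal_gamma(step, 1000)
    print(step, g, tempered_weight(0.2, g))  # weight flattens toward 1.0
```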
6. Diagnostic and Evaluation Best Practices
A consensus emerges that pass@k, although a valuable diagnostic of latent coverage, is unsuitable as a sole training objective (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025). For evaluation and reporting:
- Always supplement pass@k with single-trial metrics (pass@1) and coverage curves (Cover@τ at moderate-to-high τ) to diagnose reliability and genuine reasoning ability (Dragoi et al., 9 Oct 2025).
- Be cautious in interpreting high pass@k at large k, especially in finite-support or small-answer-space tasks, as this may reflect random guessing.
- Employ coverage-area-under-the-curve metrics (e.g., AvgAUC+_cover) for aggregate, pairwise-aware model ranking (Dragoi et al., 9 Oct 2025).
Table: Summary of Key Mitigation Techniques
| Technique | Principle | Effect on Pass@1/Pass@k |
|---|---|---|
| Entropy regularization | Promotes exploration | Preserves both |
| Multi-objective blending | Trades off coverage/accuracy | Balances |
| Annealing k | Early exploration, final exploit | Boosts both if scheduled |
| Advantage shaping | Smooths reward/gradient decay | Avoids vanishing signal |
| Online data synthesis (SvS) | Sustains problem and solution diversity | Extends both boundaries |
7. Outlook and Open Challenges
The optimization-induced pass@1 degradation when training directly on pass@k reflects a broader challenge in reinforcement learning for reasoning: aligning exploration incentives with policy reliability. Analytical findings establish the inevitability of vanishing gradients and counterproductive gradient directions in naive pass@k policy optimization, especially in the presence of prompt interference and low-entropy policies (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). Modern RLVR approaches appear to benefit from strategically decoupling metric evaluation (diagnostics at multiple k and reliability thresholds) from training objectives, blending exploration-promoting and accuracy-consolidating methodologies, and continually expanding the frontier of both solution depth and breadth (Liang et al., 19 Aug 2025, Walder et al., 21 May 2025, Dragoi et al., 9 Oct 2025).
Active research directions include more precise characterization of gradient conflicts across task distributions, development of robust surrogate rewards and dynamic curricula, and systematic benchmarking of adaptive schedules or entropy-aware advantage shaping schemes in large-scale LLM post-training.