
Pass@k Optimization-Induced Pass@1 Degradation

Updated 28 February 2026
  • The paper demonstrates that optimizing pass@k in RLVR shifts gradient emphasis towards low-success instances, leading to pass@1 degradation.
  • It reveals that vanishing learning signals and prompt interference cause exploration collapse and reduced single-shot accuracy.
  • Mitigation strategies such as entropy regularization and multi-objective training help balance exploration with high single-trial performance.

Pass@k optimization-induced Pass@1 degradation refers to the empirically and theoretically validated phenomenon that arises when reinforcement learning with verifiable rewards (RLVR) directly optimizes the pass@k metric—defined as the probability that at least one of k independent samples from a policy is correct—rather than the more conventional pass@1 (single-sample) success rate. While intuitively appealing for encouraging exploration, policy-gradient ascent on pass@k can, under typical conditions, reduce or stall the single-trial accuracy (pass@1), especially for tasks featuring discrete answer spaces, hard/easy prompt interference, or policy entropy collapse. This trade-off has been rigorously formulated in recent literature, with studies elucidating the underlying gradient dynamics, empirical manifestations on LLMs, and strategies for mitigation (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025, Chen et al., 14 Aug 2025, Barakat et al., 24 Feb 2026, Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025, Liang et al., 19 Aug 2025).

1. Theoretical Foundations: Pass@k and Its Policy Gradient

Given a prompt $x$ and an RL policy $\pi_\theta(y \mid x)$, pass@1 is $J_1(x;\theta) = \mathbb{P}_{y \sim \pi_\theta}[V(x,y)=1]$. Pass@k generalizes this to $J_k(x;\theta) = 1 - (1 - J_1(x;\theta))^k$, the probability that at least one of $k$ independent samples is correct (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). The policy gradient of pass@k is given by

$$\nabla_\theta J_k(x;\theta) = k\,(1 - J_1(x;\theta))^{k-1}\,\nabla_\theta J_1(x;\theta)$$

(Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026, Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025). This implies that optimizing pass@k is a nonnegative, per-instance reweighting of the pass@1 gradient, with the scalar factor $\alpha_k(x;\theta) = k(1 - J_1(x;\theta))^{k-1}$. As such, the per-prompt direction of optimization does not change, but the allocation of gradient magnitude shifts toward instances with low single-sample success (i.e., hard prompts).
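This chain-rule identity can be checked numerically with a one-parameter toy policy (the sigmoid parametrization below is an illustrative assumption, not taken from the cited papers):

```python
import math

def pass1(theta):
    # Toy scalar policy: single-sample success probability p = sigmoid(theta).
    return 1.0 / (1.0 + math.exp(-theta))

def passk(theta, k):
    # pass@k = 1 - (1 - p)^k for k independent samples.
    return 1.0 - (1.0 - pass1(theta)) ** k

def grad(f, theta, eps=1e-6):
    # Central finite difference.
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta, k = -1.0, 8
p = pass1(theta)
alpha_k = k * (1.0 - p) ** (k - 1)   # per-instance reweighting factor
g1 = grad(pass1, theta)
gk = grad(lambda t: passk(t, k), theta)

# The pass@k gradient is the pass@1 gradient scaled by alpha_k.
assert abs(gk - alpha_k * g1) < 1e-6
```

Because $\alpha_k \ge 0$, only the magnitude of each prompt's gradient is rescaled, which is the basis of the reweighting analysis above.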

2. Mechanisms and Regimes of Degradation

Vanishing Learning Signal and Exploration Collapse

Two principal pathologies emerge when optimizing pass@k (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025):

  • Low-success regime: For instances where $J_1(x;\theta) \approx 0$, $\alpha_k$ is large but the empirical REINFORCE gradient is virtually always zero because correct samples are rarely observed. Thus, even with high reweighting, the signal vanishes.
  • High-success regime: As $J_1(x;\theta) \to 1$, $\alpha_k \to 0$, leading to near-zero gradients for problems the model already solves well. The policy ceases to fine-tune confident outputs, which impedes further improvement in pass@1.
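The two regimes can be quantified in a few lines (the success probabilities 0.005 and 0.99 are illustrative choices):

```python
k, n = 8, 8  # k in the objective, n rollouts per gradient estimate

def alpha(p, k):
    # Reweighting factor k * (1 - p)^(k - 1) from the pass@k chain rule.
    return k * (1.0 - p) ** (k - 1)

def p_nonzero_signal(p, n):
    # Probability that a batch of n rollouts contains at least one success,
    # i.e. that the empirical REINFORCE gradient is nonzero at all.
    return 1.0 - (1.0 - p) ** n

hard, easy = 0.005, 0.99
# Hard prompt: large weight, but roughly 96% of batches carry zero signal.
assert alpha(hard, k) > 7.5 and p_nonzero_signal(hard, n) < 0.04
# Easy prompt: signal is abundant, but the weight has all but vanished.
assert alpha(easy, k) < 1e-12 and p_nonzero_signal(easy, n) > 0.999
```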

When the policy collapses onto a single solution mode (exploration collapse), both pass@1 and pass@k saturate, and the gap $\Delta(k) = \text{pass@}k - \text{pass@}1$ shrinks to zero. In practice, this often means the model over-commits to the earliest discovered mode, potentially missing better optima and locking out improved solutions (Yu, 20 Nov 2025).
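A small Monte Carlo sketch illustrates how the gap closes once sampling collapses onto a single mode (the policies here are stand-in Bernoulli samplers, not LLMs):

```python
import random
random.seed(0)

def pass_at_k_mc(sample, k, trials=20000):
    # Monte Carlo pass@k: fraction of trials in which any of k samples is correct.
    return sum(any(sample() for _ in range(k)) for _ in range(trials)) / trials

def collapsed():
    # Entropy-collapsed policy: always emits the same (here, wrong) answer.
    return False

def diverse():
    # Diverse policy: each sample is an independent attempt with p = 0.3.
    return random.random() < 0.3

k = 8
gap_collapsed = pass_at_k_mc(collapsed, k) - pass_at_k_mc(collapsed, 1)
gap_diverse = pass_at_k_mc(diverse, k) - pass_at_k_mc(diverse, 1)
assert gap_collapsed == 0.0   # the gap is exactly zero after collapse
assert gap_diverse > 0.5      # i.i.d. diversity keeps the gap large
```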

Prompt Interference and Gradient Conflict

Recent work introduces the concept of "prompt interference" to formalize how optimizing pass@k may actively decrease pass@1 (Barakat et al., 24 Feb 2026). The key object is the prompt-interference kernel $\kappa_\theta(x,x') = \langle \nabla p_\theta(x), \nabla p_\theta(x') \rangle$. Prompts for which $a_\theta(x) := \langle \nabla J_1(x;\theta), \nabla J_1(\theta) \rangle < 0$ are termed negatively interfering: improving performance on such prompts can directly reduce the average pass@1. Pass@k policy gradients disproportionately emphasize these hard, negatively interfering samples, and if their aggregated covariance is sufficiently large and negative, the overall step in parameter space provably rotates away from the pass@1 direction, decreasing single-shot accuracy even as pass@k increases (Propositions 3.1 and 3.4 in (Barakat et al., 24 Feb 2026)).
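A two-prompt toy in a two-dimensional parameter space (the gradient vectors and success rates below are invented for illustration) makes the rotation mechanism concrete:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alpha(p, k):
    # pass@k reweighting factor for a prompt with pass@1 = p.
    return k * (1.0 - p) ** (k - 1)

# Hypothetical per-prompt pass@1 gradients in a 2-D parameter space.
g_easy, g_hard = (1.0, 0.0), (-0.6, 0.3)
p_easy, p_hard, k = 0.9, 0.05, 8

g_avg = tuple((a + b) / 2 for a, b in zip(g_easy, g_hard))
assert dot(g_hard, g_avg) < 0          # the hard prompt negatively interferes

# A uniform (pass@1) step still agrees with the average pass@1 direction...
step1 = tuple(a + b for a, b in zip(g_easy, g_hard))
assert dot(step1, g_avg) > 0

# ...but pass@k reweighting concentrates mass on the interfering prompt,
# rotating the step away from the pass@1 ascent direction.
stepk = tuple(alpha(p_easy, k) * a + alpha(p_hard, k) * b
              for a, b in zip(g_easy, g_hard))
assert dot(stepk, g_avg) < 0
```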

3. Empirical Manifestations in RLVR and LLM Fine-tuning

Empirical studies confirm the theoretical findings across mathematical reasoning, code synthesis, and traversal tasks (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025, Chen et al., 14 Aug 2025, Walder et al., 21 May 2025, Barakat et al., 24 Feb 2026):

  • Crossover Phenomenon: RLVR models fine-tuned for pass@k often yield higher performance at small k (e.g., pass@1, pass@4), but for large k, pretrained base models surpass RLVR models in pass@k, and RLVR models may demonstrate lower pass@1 on held-out tasks (Dragoi et al., 9 Oct 2025).
  • Entropy Collapse: RLVR geared toward pass@1 drives policy entropy downward, leading to limited sample diversity and stalling of pass@k at large k (Liang et al., 19 Aug 2025).
  • Gradient Allocation Bias: Optimization on pass@k with large k specifically targets increasing the coverage (breadth) of low-success prompts (lifting $p_i$ from near 0 to near 0.1), sacrificing the refinement (depth) of tasks with $p_i$ near 1, negatively impacting pass@1 (Dragoi et al., 9 Oct 2025, Chen et al., 14 Aug 2025).
  • Negative Covariance Confirmation: Quantitative MATH dataset experiments demonstrate that the inner product between pass@k and pass@1 gradients can become negative when pass@k's prompt weighting aligns with negatively interfering prompts (Barakat et al., 24 Feb 2026).
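The crossover phenomenon can be reproduced with a deliberately simplified population model: an entropy-collapsed "RLVR" model that solves half the problems deterministically versus a "base" model attempting every problem i.i.d. with p = 0.15 (both numbers are illustrative assumptions):

```python
def pass_at_k_rlvr(k):
    # Entropy-collapsed model: half the problems are solved with certainty,
    # the rest are failed with certainty, so extra samples never help.
    return 0.5

def pass_at_k_base(k, p=0.15):
    # Diverse base model: i.i.d. sampling keeps paying off as k grows.
    return 1.0 - (1.0 - p) ** k

assert pass_at_k_rlvr(1) > pass_at_k_base(1)      # RLVR wins at pass@1
assert pass_at_k_rlvr(4) > pass_at_k_base(4)      # ...and at small k
assert pass_at_k_base(16) > pass_at_k_rlvr(16)    # base overtakes at large k
```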

4. Analytical and Algorithmic Explanations

Gradient and Advantage Structures

Pass@k policy gradients, both in REINFORCE and advantage-shaping schemes (GRPO, RLOO), can be interpreted as mixtures of per-sample gradients, with sharp reweighting toward hard examples and vanishing signals for easy ones (Thrampoulidis et al., 27 Oct 2025, Chen et al., 14 Aug 2025, Walder et al., 21 May 2025). For instance, the sum-of-absolute-advantage $\eta(N_\text{pos})$ peaks at intermediate accuracy under pass@k and rapidly decays in the high-accuracy regime, sharply down-weighting further improvements in high-confidence cases. Direct pass@k training therefore "kills" the gradient on easy examples, entrenching mediocrity in pass@1 (Chen et al., 14 Aug 2025, Thrampoulidis et al., 27 Oct 2025).
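A toy proxy for the sum-of-absolute-advantage curve (group-mean-baseline advantages scaled by the pass@k factor; this mimics the qualitative shape described above, not any paper's exact shaping formula) shows the same peak-and-decay pattern:

```python
def eta(n_pos, n=8, k=4):
    # Group of n rollouts with n_pos correct; p is the empirical accuracy.
    p = n_pos / n
    # Sum of |group-mean-baseline advantages|: correct samples contribute
    # 1 - p, incorrect samples contribute p; the pass@k factor scales the group.
    mass = n_pos * (1 - p) + (n - n_pos) * p
    return k * (1 - p) ** (k - 1) * mass

masses = [eta(m) for m in range(9)]
assert masses.index(max(masses)) == 2   # peaks at intermediate accuracy
assert masses[8] == 0.0                 # fully solved groups get no gradient
assert masses[7] < 0.1 * masses[2]      # near-solved groups are sharply muted
```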

Impact of Discrete Answer Spaces

In domains with small, discrete answer spaces (e.g., numeric math), pass@k optimization at large k is especially prone to misleading the practitioner: any nonzero probability mass ("random guessing") is, by combinatorial accumulation, sufficient for high pass@k, even if $p_i$ is tiny. This degeneracy results in models that appear highly capable at large sample budgets, yet have unreliable or even degraded single-shot accuracy (Dragoi et al., 9 Oct 2025, Yu, 20 Nov 2025).
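For instance, with a ten-way discrete answer space, pure guessing already saturates pass@k at modest sample budgets:

```python
# Uniform guessing over a finite answer space of size M: even a model with
# no reasoning ability achieves near-perfect pass@k once k is large.
M = 10                       # e.g. a single-digit numeric answer
p = 1.0 / M                  # pass@1 of pure guessing
pass_at_64 = 1.0 - (1.0 - p) ** 64

assert p == 0.1              # single-shot accuracy stays poor
assert pass_at_64 > 0.998    # yet pass@64 looks nearly solved
```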

5. Mitigation Strategies and Advanced Techniques

To counteract pass@k-induced pass@1 degradation, a suite of strategies has been proposed and validated; the table of key mitigation techniques below summarizes them.

6. Diagnostic and Evaluation Best Practices

A consensus emerges that pass@k, although a valuable diagnostic of latent coverage, is unsuitable as a sole training objective (Yu, 20 Nov 2025, Dragoi et al., 9 Oct 2025). For evaluation and reporting:

  • Always supplement pass@k with single-trial metrics (pass@1) and coverage curves (Cover@τ at moderate-to-high τ) to diagnose reliability and genuine reasoning ability (Dragoi et al., 9 Oct 2025).
  • Be cautious in interpreting high pass@k at large k, especially in finite-support or small-answer-space tasks, as this may reflect random guessing.
  • Employ coverage-area-under-the-curve metrics (e.g., AvgAUC+_cover) for aggregate, pairwise-aware model ranking (Dragoi et al., 9 Oct 2025).
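For reporting, the standard unbiased combinatorial estimator of pass@k from n samples with c correct can be computed as follows:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate from n samples with c correct:
    # 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Supplementing pass@k with pass@1 exposes guessing-inflated scores:
assert abs(pass_at_k(100, 10, 1) - 0.10) < 1e-12
assert pass_at_k(100, 10, 10) > 0.65
assert pass_at_k(100, 100, 5) == 1.0
```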

Table: Summary of Key Mitigation Techniques

| Technique | Principle | Effect on Pass@1 / Pass@k |
|---|---|---|
| Entropy regularization | Promotes exploration | Preserves both |
| Multi-objective blending | Trades off coverage/accuracy | Balances both |
| Annealing k | Early exploration, final exploitation | Boosts both if scheduled |
| Advantage shaping | Smooths reward/gradient decay | Avoids vanishing signal |
| Online data synthesis (SvS) | Sustains problem and solution diversity | Extends both boundaries |
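As one concrete instance of the "Annealing k" row, a schedule might linearly shrink k over training (the linear shape and endpoints below are hypothetical, not taken from the cited works):

```python
def annealed_k(step, total_steps, k_max=16, k_min=1):
    # Linear anneal: large k early for exploration, k = 1 late to
    # consolidate single-shot (pass@1) accuracy.
    frac = min(step / total_steps, 1.0)
    return max(k_min, round(k_max - frac * (k_max - k_min)))

assert annealed_k(0, 1000) == 16      # explore with a large sampling budget
assert annealed_k(1000, 1000) == 1    # finish on the pass@1 objective
```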

7. Outlook and Open Challenges

The optimization-induced pass@1 degradation when training directly on pass@k reflects a broader challenge in reinforcement learning for reasoning: aligning exploration incentives with policy reliability. Analytical findings establish the inevitability of vanishing gradients and counterproductive gradient directions in naive pass@k policy optimization, especially in the presence of prompt interference and low-entropy policies (Yu, 20 Nov 2025, Barakat et al., 24 Feb 2026). Modern RLVR approaches appear to benefit from strategically decoupling metric evaluation (diagnostics at multiple k and reliability thresholds) from training objectives, blending exploration-promoting and accuracy-consolidating methodologies, and continually expanding the frontier of both solution depth and breadth (Liang et al., 19 Aug 2025, Walder et al., 21 May 2025, Dragoi et al., 9 Oct 2025).

Active research directions include more precise characterization of gradient conflicts across task distributions, development of robust surrogate rewards and dynamic curricula, and systematic benchmarking of adaptive schedules or entropy-aware advantage shaping schemes in large-scale LLM post-training.
