Pass@$k$ Policy Optimization
- Pass@$k$ Policy Optimization is a reinforcement learning approach that optimizes the probability of achieving at least one verifiable correct solution from $k$ independent samples.
- It employs policy gradient estimators that scale the standard REINFORCE updates, balancing exploration and exploitation through set-level reward transformations.
- Enhancements such as advantage shaping, entropy regularization, and adaptive $k$-annealing improve sample diversity and mitigate exploration collapse in complex reasoning tasks.
Pass@$k$ Policy Optimization (PKPO) is a class of reinforcement learning methods in which the optimization objective is the pass@$k$ metric: the probability that, among $k$ independent samples, at least one yields a verifiable correct solution. In the context of LLMs and Reinforcement Learning with Verifiable Rewards (RLVR), PKPO has emerged from the recognition that optimizing for single-sample (pass@$1$) accuracy systematically underutilizes the capacity for diverse, exploratory sampling. PKPO frameworks therefore directly optimize the utility of sets of samples (i.e., over mini-batch groupings), with the goal of improving the ability of LLMs to uncover solutions to complex or difficult reasoning tasks.
1. Formal Definition of the Pass@$k$ Objective
Let $\pi_\theta$ denote an autoregressive policy with parameters $\theta$ producing trajectory $y$ from prompt $x$. A deterministic verifier $r(x, y) \in \{0, 1\}$ indicates correctness. The per-sample (pass@$1$) success probability is
$$p(x) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right].$$
The pass@$k$ objective for input $x$ is the probability that at least one out of $k$ i.i.d. samples is correct:
$$\mathrm{pass@}k(x) \;=\; 1 - \bigl(1 - p(x)\bigr)^{k}.$$
Averaging over the prompt distribution $\mathcal{D}$ yields the global objective,
$$J_k(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[ 1 - \bigl(1 - p(x)\bigr)^{k} \right].$$
This metric naturally extends to continuous rewards $r(x, y) \in [0, 1]$, with the set-level reward taken as the best reward attained among the $k$ samples,
$$J_k(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y_1, \dots, y_k \sim \pi_\theta(\cdot \mid x)}\!\left[ \max_{1 \le i \le k} r(x, y_i) \right].$$
These definitions establish pass@$k$ as a set-level performance criterion, distinguishing it from the sample-level focus of traditional RLVR (Yu, 20 Nov 2025, Walder et al., 21 May 2025).
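As a brief numerical illustration of these definitions (the function name and numbers below are illustrative, not drawn from the cited works), the exact expression $1 - (1 - p)^k$ can be evaluated directly for a hypothetical prompt:

```python
def pass_at_k(p: float, k: int) -> float:
    """Exact pass@k for a prompt whose per-sample success probability is p."""
    return 1.0 - (1.0 - p) ** k

# A policy that solves a prompt only 20% of the time per sample still produces
# at least one verified solution within k = 8 attempts about 83% of the time.
print(pass_at_k(0.20, 8))   # ~0.832
print(pass_at_k(0.20, 1))   # 0.2 (pass@1 recovers the per-sample accuracy)
```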
2. Policy Gradient Estimators for Pass@$k$
The PKPO gradient is derived by applying the chain rule to the pass@$k$ functional. Letting $g_k(p) = 1 - (1 - p)^{k}$,
$$\nabla_\theta\, \mathrm{pass@}k(x) \;=\; g_k'\bigl(p(x)\bigr)\, \nabla_\theta p(x) \;=\; k\bigl(1 - p(x)\bigr)^{k-1}\, \nabla_\theta p(x).$$
Here, $\nabla_\theta p(x)$ is the standard REINFORCE term,
$$\nabla_\theta p(x) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right].$$
Thus, $\nabla_\theta\, \mathrm{pass@}k(x)$ is a per-example reweighting of the pass@$1$ gradient by a scaling factor $k\bigl(1 - p(x)\bigr)^{k-1}$. The global pass@$k$ policy gradient estimator therefore becomes
$$\nabla_\theta J_k(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[ k\bigl(1 - p(x)\bigr)^{k-1}\, \nabla_\theta p(x) \right].$$
For binary rewards, low-variance unbiased estimators, such as $1 - \binom{n - c}{k} / \binom{n}{k}$ (with $c$ successes in $n$ rollouts, $k \le n$), are used to reconstruct pass@$k$ and its gradient efficiently; analogous combinatorial constructs exist for continuous rewards (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025). Leave-one-out and "minus-one" baselines further reduce estimator variance, yielding robust policy updates in large-batch RL (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025).
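A minimal sketch of how these pieces combine in practice is shown below. The combinatorial expression $1 - \binom{n-c}{k} / \binom{n}{k}$ is the standard unbiased pass@$k$ estimate from $n$ rollouts with $c$ verified successes; `passk_advantages` is a simple plug-in weighting (the chain-rule factor applied to a leave-one-out-baselined REINFORCE term), not the exact low-variance estimator of the cited works.

```python
from math import comb

import torch


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n i.i.d. rollouts with c verified successes (requires k <= n)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def passk_advantages(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """Plug-in per-rollout weights for a pass@k-flavored REINFORCE update.

    rewards: shape (n,), binary verifier outcomes for n >= 2 rollouts of one prompt.
    Each rollout i is baselined with the leave-one-out mean reward (an estimate
    of p(x) excluding rollout i) and rescaled by the chain-rule factor
    k * (1 - p)^(k - 1) from the derivation above.
    """
    rewards = rewards.float()
    n = rewards.numel()
    p_loo = (rewards.sum() - rewards) / (n - 1)           # leave-one-out estimate of p(x)
    scale = k * (1.0 - p_loo).clamp(min=0.0) ** (k - 1)   # pass@k chain-rule factor
    return scale * (rewards - p_loo)


# Usage in a standard policy-gradient loss (logprob_sums: per-rollout sum of log-probs):
# loss = -(passk_advantages(rewards, k=8).detach() * logprob_sums).mean()
```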
3. Reward Transformations, Advantage Shaping, and Surrogate Objectives
Recent analyses have unified two major approaches for optimizing pass@$k$: (1) direct REINFORCE-style policy gradients and (2) advantage-shaping methods that modify advantage functions in GRPO- or PPO-like schemes. Both are equivalent to policy gradient ascent on surrogate reward functionals of the form $\mathbb{E}_{x}\!\left[ h\bigl(p(x)\bigr) \right]$ for a suitable transformation $h$, with possible additive regularizers (e.g., entropy penalization) (Thrampoulidis et al., 27 Oct 2025).
For example, shaping the per-sample advantage by the factor $k\bigl(1 - p(x)\bigr)^{k-1}$ yields exactly the population policy gradient for
$$\mathbb{E}_{x}\!\left[ 1 - \bigl(1 - p(x)\bigr)^{k} \right],$$
which regularizes for exploration by down-weighting prompts the policy already solves reliably. This unification yields a general recipe:
- Given any strictly increasing $h$, shaping the advantage with the weight $h'\bigl(p(x)\bigr)$ performs policy gradient ascent on $\mathbb{E}_{x}\!\left[ h\bigl(p(x)\bigr) \right]$.
- Empirical approximations replace $p(x)$ by the batch mean reward $\hat{p}(x)$, use RLOO-style scaling, and normalize by the empirical standard deviation (a minimal sketch of this recipe appears below). Practically, direct and advantage-shaping PKPO are algorithmically interchangeable in deep RLVR (Thrampoulidis et al., 27 Oct 2025).
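The sketch below makes the recipe concrete (the function names are hypothetical, and the GRPO-style standard-deviation normalization is an assumption): group-relative advantages are scaled by the derivative $h'$ of an arbitrary surrogate $h$, so that choosing $h'(p) = k(1-p)^{k-1}$ recovers the pass@$k$ surrogate and $h'(p) = 1$ recovers pass@$1$.

```python
import torch


def shaped_group_advantages(rewards: torch.Tensor, h_prime, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages approximating gradient ascent on E_x[h(p(x))].

    rewards: shape (n,), binary verifier outcomes for one prompt's rollouts (n >= 2).
    h_prime: callable returning h'(p) for the chosen surrogate transformation h.
    The batch mean stands in for p(x); dividing by the empirical standard
    deviation mirrors GRPO-style normalization (an assumption in this sketch).
    """
    rewards = rewards.float()
    p_hat = rewards.mean()
    std = rewards.std().clamp(min=eps)
    return h_prime(p_hat) * (rewards - p_hat) / std


# pass@k surrogate:  h(p) = 1 - (1 - p)^k,  so  h'(p) = k * (1 - p)^(k - 1)
k = 8
passk_shaping = lambda p: k * (1.0 - p) ** (k - 1)
pass1_shaping = lambda p: 1.0   # recovers the standard pass@1 group advantage
```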
4. Exploration, Exploitation, and the Pathology of Exploration Collapse
Pass@$k$ optimization inherently amplifies correct-sample gradients only when $p(x)$ is not near $0$ or $1$, yielding a strong effect in intermediate regimes. Specifically:
- In the low-success regime ($p(x) \to 0$), the scaling factor $k\bigl(1 - p(x)\bigr)^{k-1} \to k$, but the gradient evaporates because nearly all samples are failures.
- In the high-success regime ($p(x) \to 1$), $k\bigl(1 - p(x)\bigr)^{k-1} \to 0$, rendering gradients vanishingly small even if alternative modes remain unexplored.
Consequently, PKPO provides little signal precisely when exploration is most needed, both initially (to escape local minima) and after early success (when only a single mode is dominant). As RL updates reinforce discovered solutions, probability mass concentrates, and the gap between pass@$k$ and pass@$1$ converges to $0$. This "exploration collapse" effect is formalized by the identity
$$\nabla_\theta\, \mathrm{pass@}k(x) \;=\; k\bigl(1 - p(x)\bigr)^{k-1}\, \nabla_\theta\, \mathrm{pass@}1(x),$$
so pass@$k$ gradients are collinear with pass@$1$ gradients and cannot independently drive mode discovery (Yu, 20 Nov 2025).
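The collapse can be illustrated with a few lines of arithmetic (the values below are illustrative, not taken from the cited work): the chain-rule factor $k(1-p)^{k-1}$ dies off as $p(x) \to 1$, while at very small $p(x)$ the factor is large but almost no rollouts are correct, so the REINFORCE term itself carries almost no signal.

```python
# Pass@k signal at the extremes of p(x): the scaling factor vanishes as p -> 1,
# and informative (correct) rollouts become vanishingly rare as p -> 0.
k, n_rollouts = 8, 16
for p in (0.001, 0.05, 0.30, 0.70, 0.95, 0.999):
    scale = k * (1.0 - p) ** (k - 1)
    print(f"p={p:<6} scale factor={scale:7.3f}  expected correct rollouts={n_rollouts * p:5.2f}")
```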
5. Algorithmic Enhancements: Set-Wise, Asymmetric, and Entropy-Regularized Methods
PKPO methods have been extended by introducing novel reward transformations, adaptive $k$-annealing, group-level advantage computation, and explicit entropy regularization:
- Set-wise reward transformations: PKPO can be cast as standard policy gradient RL with per-batch reward vectors transformed via combinatorial estimators (binary or continuous) (Walder et al., 21 May 2025).
- Advantage closure: Closed-form group-advantage expressions for pass@$k$ allow efficient per-sample computation and enable hand-crafted advantage shapes to accelerate learning or adapt exploration/exploitation tradeoffs (Chen et al., 14 Aug 2025).
- SimKO: SimKO mitigates probability over-concentration by asymmetrically smoothing correct-token gradients (boosting the likelihoods of the top-$K$ candidate tokens) and penalizing the rank-1 token's probability mass for incorrect responses. Only high-entropy tokens are affected (e.g., the top 20% by entropy, with a small $K$, e.g., $4$), and the method is integrated into a modified GRPO loss (Peng et al., 16 Oct 2025).
- Entropy-regularized and set-level objectives: Entropy or reward-level penalties added to the surrogate objective directly enforce mode coverage, while set-level PKPO strategies further relax the independence assumption across samples (Thrampoulidis et al., 27 Oct 2025, Walder et al., 21 May 2025).
These enhancements can directly alleviate exploration collapse and facilitate broad solution coverage in RLVR settings (Walder et al., 21 May 2025, Peng et al., 16 Oct 2025).
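As one concrete and entirely hypothetical instance of the adaptive $k$-annealing mentioned above, the schedule below starts with a large sample-set size to reward diverse exploration and anneals toward pass@$1$-style sharpening; the linear form and endpoint values are assumptions, not the schedule of any cited method.

```python
def annealed_k(step: int, total_steps: int, k_start: int = 16, k_end: int = 1) -> int:
    """Linearly anneal the pass@k sample-set size k over training (hypothetical schedule)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return max(k_end, round(k_start + frac * (k_end - k_start)))


# Example: k = 16 early in training, 8 at the midpoint, 1 at the end.
print([annealed_k(s, 100) for s in (0, 50, 100)])  # [16, 8, 1]
```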
6. Empirical Performance and Limitations
Empirical studies reveal that PKPO variants:
- Outperform standard pass@$1$ RL in harder, multi-modal reasoning domains (e.g., MATH, ARC-AGI-1), with improvements scaling in $k$ and especially with $k$-annealing schedules (Walder et al., 21 May 2025).
- Maintain higher effective policy entropy, increase answer diversity, and adaptively balance exploration/exploitation (Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025).
- Achieve robust, unbiased gradient estimates with low variance due to the set-level reward transformations and leave-one-out baselines (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025).
- SimKO and similar methods consistently boost pass@$k$, while moderate smoothing strengths and entropy thresholds avoid pass@$1$ performance degradation. However, excessive smoothing or application to all tokens can harm pass@$1$, and benefits plateau as $K$ increases beyond moderate values (Peng et al., 16 Oct 2025).
These gains underscore the potential of PKPO, yet they also highlight a universal limitation: optimizing pass@$k$ in isolation cannot fundamentally overcome exploration collapse or promote truly diverse solution finding in the late training regime. Synergistic objectives and adaptive regularization are necessary for robust, mode-covering reasoning (Yu, 20 Nov 2025).
7. Interpretations, Misconceptions, and Prospective Directions
Critical analyses have clarified several misconceptions:
- Direct optimization of pass@$k$ does not introduce new gradient directions beyond pass@$1$; it merely rescales per-example updates (Yu, 20 Nov 2025).
- Pass@$k$ fails to provide a learning signal when models are either too weak (no correct samples) or too strong (single-mode dominance); thus, it is not inherently a remedy for RLVR's exploration-exploitation imbalance (Yu, 20 Nov 2025).
- Instead, pass@$k$ is most effective as an inference-time diagnostic of latent diversity, not as a training objective per se (Yu, 20 Nov 2025).
- Promising approaches incorporate explicit exploration incentives—entropy regularization, count/novelty bonuses, asymmetric token updates, or set-level surrogate objectives (Thrampoulidis et al., 27 Oct 2025, Walder et al., 21 May 2025, Peng et al., 16 Oct 2025).
A plausible implication is that future PKPO development will focus on compositional surrogate reward design, adaptive advantage shaping, and hybrid training schedules that integrate both success-maximizing and diversity-inducing signals. This suggests a shift away from naive pass@$k$ maximization toward principled, unified objectives informed by RLVR's underlying mode-seeking and exploration constraints.