SimKO: Simple Pass@K Optimization
- The paper introduces an asymmetric gradient redistribution mechanism that mitigates probability over-concentration in large language models.
- It employs top-K smoothing for verified-correct responses and amplified top-1 penalties for verified-incorrect ones to foster diverse candidate generation.
- Empirical results across benchmarks show that SimKO improves both pass@K and pass@1 metrics by flattening token probability distributions.
Simple Pass@K Optimization (SimKO) is a reinforcement learning methodology developed to counter the probability over-concentration effect in LLMs trained with verifiable rewards. SimKO provides a targeted approach for improving the pass@K metric, the probability that at least one of K sampled candidates is correct, by asymmetrically redistributing gradient updates at critical token positions. This promotes exploration during sequence generation, leading to more diverse candidate solutions and improved performance under pass@K. The method is motivated by empirical findings that standard RLVR approaches, which heavily favor exploitation, drive LLM policies toward highly concentrated probability mass on the top-1 candidate, suppressing alternative reasoning paths and reducing the utility of sampling multiple candidates per prompt (Peng et al., 16 Oct 2025).
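For concreteness, pass@K is usually estimated with the standard unbiased combinatorial estimator over n sampled completions per prompt, of which c are verified correct. The short sketch below is not part of SimKO itself; the function name and values are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # Computed as a numerically stable product instead of raw binomials.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 256 samples per prompt, 12 verified correct
print(pass_at_k(n=256, c=12, k=1))    # 12/256, i.e. c/n for k = 1
print(pass_at_k(n=256, c=12, k=256))  # 1.0: some sample in the set is correct
```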
1. Motivation and Probability Concentration in RLVR
Reinforcement learning with verifiable rewards (RLVR) has enabled notable breakthroughs in LLM reasoning but is characterized by a systematic bias toward exploitation. Analysis of token-level distributions in RLVR-trained models shows a consistent effect: the top-1 candidate's probability increases throughout training while sub-top candidates are suppressed. This over-concentration correlates negatively with pass@K: stronger concentration yields poorer diversity and a lower probability that any of the K samples is correct. Conventional methods such as Group Relative Policy Optimization (GRPO) maximize pass@1 while diminishing collective utility over sets of samples, limiting exploration and learning progress on complex examples (Peng et al., 16 Oct 2025, Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
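The over-concentration effect described above can be diagnosed directly from next-token logits by tracking the top-1 probability mass and the per-token entropy. The following PyTorch sketch is illustrative only; the function name and shapes are assumptions, not from the paper.

```python
import torch
import torch.nn.functional as F

def concentration_stats(logits: torch.Tensor):
    """Per-token top-1 probability and entropy from next-token logits.

    logits: (seq_len, vocab_size) for one sampled response.
    Returns (top1_prob, entropy), each of shape (seq_len,).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    top1_prob = probs.max(dim=-1).values        # mass on the argmax candidate
    entropy = -(probs * log_probs).sum(dim=-1)  # policy entropy per position
    return top1_prob, entropy

# Toy usage: random logits stand in for a model's outputs
top1, H = concentration_stats(torch.randn(128, 32_000))
print(top1.mean().item(), H.mean().item())
```

Rising average top-1 probability together with falling entropy over training is the signature of the concentration that correlates with degraded pass@K.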
2. Asymmetric Gradient Redistribution Mechanism
SimKO introduces an asymmetric update mechanism at the token level to mitigate concentration and promote exploration. For verified-correct responses, SimKO distributes positive gradient updates over the top-K candidates (ranked by model probability) at high-entropy tokens, effectively implementing top-K label smoothing. Rather than reinforcing only the single (argmax) sampled token, the probability mass is redistributed according to

$$
\hat{r}^{+}_{t} \;=\; (1-\gamma)\,\frac{\pi_\theta(y_t \mid x, y_{<t})}{\mathrm{sg}\!\left[\pi_\theta(y_t \mid x, y_{<t})\right]} \;+\; \frac{\gamma}{K}\sum_{k=1}^{K}\frac{\pi_\theta(v_k \mid x, y_{<t})}{\mathrm{sg}\!\left[\pi_\theta(v_k \mid x, y_{<t})\right]},
$$

where $\gamma$ is a smoothing parameter, $k$ indexes the top-K candidate tokens $v_k$, and $\mathrm{sg}[\cdot]$ is a stop-gradient operator. For verified-incorrect responses, SimKO strongly penalizes the top-1 candidate by amplifying its update with a factor $\lambda > 1$, while the penalties on non-top tokens remain relatively mild. Both operations are applied only at "forking" tokens (identified via an entropy threshold $\tau$), where the model is genuinely uncertain and multiple continuation paths are plausible (Peng et al., 16 Oct 2025).
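A minimal per-token sketch of this asymmetric rule is given below, assuming a pseudo-ratio with `detach()` as the stop-gradient. The function name, default hyperparameter values, and interface are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def simko_token_ratio(logits, token_id, advantage, gamma=0.1, K=4, lam=1.5, tau=1.0):
    """Asymmetric SimKO-style pseudo-ratio for a single token position (sketch).

    logits: (vocab,) current-policy logits; token_id: sampled token (int);
    advantage: scalar verifier-based advantage whose sign selects the branch.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum()

    # pi / sg[pi]: value 1, but its gradient equals the gradient of log pi.
    def ratio(idx):
        return probs[idx] / probs[idx].detach()

    if entropy <= tau:                      # not a forking token: standard update
        return ratio(token_id)
    if advantage > 0:                       # correct: spread credit over top-K candidates
        topk = probs.topk(K).indices
        return (1 - gamma) * ratio(token_id) + (gamma / K) * sum(ratio(i) for i in topk)
    if token_id == probs.argmax().item():   # incorrect: amplify penalty on the top-1 token
        return lam * ratio(token_id)
    return ratio(token_id)                  # incorrect, non-top token: mild penalty
```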
3. Formalization and Dynamics
Let $H_{i,t}$ denote the policy entropy at token position $t$ of sample $i$, and let $r_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \mathrm{sg}[\pi_\theta(y_{i,t} \mid x, y_{i,<t})]$ be the pseudo likelihood ratio of the sampled token. SimKO replaces the original likelihood ratio from GRPO with the entropy- and advantage-conditioned term

$$
\hat{r}_{i,t} \;=\;
\begin{cases}
(1-\gamma)\, r_{i,t} + \dfrac{\gamma}{K}\displaystyle\sum_{k=1}^{K} r^{(k)}_{i,t}, & A_i > 0 \ \text{and}\ H_{i,t} > \tau,\\[1.5ex]
\lambda\, r_{i,t}, & A_i < 0,\ H_{i,t} > \tau,\ \text{and}\ y_{i,t}\ \text{is the top-1 candidate},\\[0.5ex]
r_{i,t}, & \text{otherwise},
\end{cases}
$$

where $r^{(k)}_{i,t}$ is the analogous pseudo-ratio of the rank-$k$ candidate and $A_i$ is the advantage signal (positive for verified-correct responses, negative for incorrect ones). Top-K gradients are thus smoothed for positives; for negatives, the factor $\lambda > 1$ applies an amplified penalty to the top-1 token.
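Vectorizing this rule over a full sampled response and weighting by the advantage yields a GRPO-style surrogate loss. The sketch below is again illustrative: tensor shapes, default values, and the `detach()`-based stop-gradient are assumptions rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def simko_sequence_loss(logits, token_ids, advantage, gamma=0.1, K=4, lam=1.5, tau=1.0):
    """Surrogate loss for one sampled response under the gating above (sketch).

    logits: (T, vocab) current-policy logits; token_ids: (T,) long tensor of
    sampled tokens; advantage: scalar group-relative advantage A_i.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)     # (T,) per-token entropy H_{i,t}
    forking = entropy > tau                        # SimKO acts only at these positions

    pseudo = probs / probs.detach()                # (T, vocab) pi / sg[pi]
    chosen = pseudo.gather(1, token_ids[:, None]).squeeze(1)

    if advantage > 0:                              # correct: top-K smoothing
        topk = probs.topk(K, dim=-1).indices       # (T, K) candidate indices
        smoothed = (1 - gamma) * chosen + (gamma / K) * pseudo.gather(1, topk).sum(-1)
        ratio = torch.where(forking, smoothed, chosen)
    else:                                          # incorrect: amplify top-1 penalty
        is_top1 = token_ids == probs.argmax(dim=-1)
        ratio = torch.where(forking & is_top1, lam * chosen, chosen)

    # Policy-gradient surrogate: maximize the advantage-weighted (pseudo-)ratios.
    return -(advantage * ratio).mean()

# Toy usage with random tensors standing in for model outputs
T, V = 64, 32_000
loss = simko_sequence_loss(torch.randn(T, V, requires_grad=True),
                           torch.randint(V, (T,)), advantage=1.0)
loss.backward()
```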
To monitor the distribution's evolution, the model's average log-probability of each of the top-K candidates is tracked:

$$
\bar{L}_k \;=\; \frac{1}{T}\sum_{t=1}^{T} \log \pi_\theta\!\left(v_k^{(t)} \mid x, y_{<t}\right), \qquad k = 1, \dots, K,
$$

where $v_k^{(t)}$ is the rank-$k$ candidate at position $t$. SimKO prevents $\bar{L}_1$ from collapsing toward maximal confidence and increases $\bar{L}_2, \dots, \bar{L}_K$, confirming improved diversity.
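A small helper of the following kind can track $\bar{L}_k$ during training; the function name and shapes are illustrative assumptions, not the paper's logging code.

```python
import torch
import torch.nn.functional as F

def topk_logprob_means(logits: torch.Tensor, K: int = 4) -> torch.Tensor:
    """Average log-probability of the rank-1..rank-K candidates over all positions.

    logits: (T, vocab) next-token logits for one response (or flattened batch).
    Returns a (K,) tensor whose entry k-1 is the mean log-prob of the rank-k candidate.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    topk_logp = log_probs.topk(K, dim=-1).values  # (T, K), sorted by rank
    return topk_logp.mean(dim=0)                  # average over token positions

# Toy usage: a collapsed policy would show L_1 near 0 and very negative L_2..L_K
print(topk_logprob_means(torch.randn(256, 32_000), K=4))
```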
4. Empirical Performance Across Benchmarks
SimKO has been evaluated on mathematical reasoning datasets (MATH500, AMC, AIME, Minerva, Olympiad), logical benchmarks (Synlogic-easy, BBH), and several model backbones. Results consistently show increased pass@K for a wide range of K (including K as high as 256), with simultaneous improvement in pass@1. In ablation studies on models such as Qwen2.5-Math-7B, SimKO produced a 4.4% absolute increase in pass@256, outperforming the GRPO baseline. Histograms of token-level probabilities reveal that SimKO flattens the distribution: probability mass is shifted from the top-1 candidate to other viable alternatives, with high-entropy "forking" tokens showing non-negligible probability among several candidates. This directly correlates with higher empirical pass@K (Peng et al., 16 Oct 2025).
5. Impact on Exploration and RLVR Training Dynamics
By introducing token-level top-K smoothing for correct responses and targeted penalization of the dominant candidate in errors, SimKO fundamentally shifts RLVR training from pure exploitation to a mixed regime that rewards exploration. Quantitative analysis links policy entropy and distributional diversity to improved exploration: higher policy entropy after SimKO training indicates a richer set of plausible continuations and a higher likelihood that at least one of the K samples is correct. As a result, SimKO enables LLMs to escape local optima and investigate alternative reasoning paths that would otherwise be suppressed by pass@1-centric training (Peng et al., 16 Oct 2025, Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
6. Limitations, Hyperparameter Dependencies, and Future Directions
The efficacy of SimKO depends on several hyperparameters: the smoothing weight $\gamma$, the entropy threshold $\tau$ for identifying "informative" tokens, the number of smoothed candidates $K$, and the penalty factor $\lambda$. Selecting and tuning these parameters is task- and model-dependent; misconfiguration may undermine exploration or dilute exploitation. Because SimKO applies its modifications only when token entropy exceeds $\tau$, critical tokens below the threshold may be overlooked. Implementation complexity increases modestly compared to classic RLVR, owing to the additional computation of top-K probabilities and the gradient manipulations.
Suggested future directions include adaptive hyperparameter schedules, broader application to domains beyond mathematical and logical reasoning, refined selection of critical token positions, and integration with other exploration-oriented techniques such as off-policy data augmentation or reward transformation. The principle of jointly optimizing over sets of samples (as in pass@K policy optimization (Walder et al., 21 May 2025)) and group-advantage design (Chen et al., 14 Aug 2025) align with SimKO and may yield further improvements if combined (Peng et al., 16 Oct 2025).
7. Theoretical and Algorithmic Connections
SimKO connects to a broader theme: reframing model learning objectives from pass@1 to pass@K maximization, emphasizing set-wise reward signals rather than per-sample rewards. This is conceptually compatible with approaches in k-subset sampling and gradient estimator design (Ahmed et al., 2022), reward transformations for pass@K policy optimization (Walder et al., 21 May 2025), and code ranking via pass@K-maximized losses (Lyu et al., 11 Aug 2024). Analyzing token-level learning dynamics, especially entropy at forking decisions, appears to extend naturally to other policy optimization frameworks that exploit candidate diversity. A plausible implication is that future RLVR approaches will increasingly use such distribution-level manipulations to balance exploration and exploitation, calibrated to evaluation metrics of practical significance.
SimKO constitutes a principled modification to RLVR training. By leveraging asymmetric, entropy-aware redistribution of policy gradients at critical sequence points, it mitigates the over-concentration problem and raises pass@K (and pass@1) metrics across diverse tasks and model backbones. Its operational simplicity, coupled with empirical effectiveness, suggests its broad utility for advancing exploration in large-scale generative models.