SimKO: Simple Pass@K Optimization
- SimKO is a reinforcement learning policy optimization technique that mitigates the exploration–exploitation imbalance in fine-tuning large language models.
- It employs an entropy-sensitive asymmetric objective that boosts multiple plausible correct tokens while penalizing overconfident incorrect predictions.
- Empirical evaluations show SimKO improves pass@K accuracy across various benchmarks without sacrificing pass@1 performance, making it a low-cost addition to RLVR pipelines.
SimKO (Simple Pass@K Optimization) is a policy optimization method developed to address the exploration–exploitation imbalance in reinforcement learning with verifiable rewards (RLVR) for LLMs. While RLVR methods such as Group Relative Policy Optimization (GRPO) have improved single-sample accuracy (pass@1), they systematically degrade diversity-aware accuracy (pass@K for K > 1). SimKO targets this over-concentration effect by employing an asymmetric objective: boosting the probability of multiple plausible candidates in correct trajectories while applying stronger penalties to highly probable but incorrect predictions, thus widening the explored solution space without sacrificing single-answer precision (Peng et al., 16 Oct 2025).
1. Over-Concentration in RLVR and Motivation
On-policy RLVR algorithms, such as GRPO, operate by generating multiple samples per question and up-weighting responses validated as correct, while suppressing those deemed incorrect. This dynamic drives the policy distribution toward near-determinism at each decoding step, as repeated positive reinforcement is concentrated on already dominant tokens, resulting in substantial gains in pass@1 but a marked decline in pass@K (K > 1). The key observation is that standard RLVR leads to token-level probability distributions where the top-1 candidate increasingly monopolizes probability mass, while probabilities for alternative tokens collapse, severely restricting exploration and deteriorating the odds that any of the top-K samples is correct (Peng et al., 16 Oct 2025).
Quantitatively, the effect is measured via the mean log-probability of the k-th ranked candidate token at each decoding step. Empirically, after GRPO training, the rank-1 value increases (converging toward 0, i.e., near-certain top-1 probability), while the values for ranks k ≥ 2 plummet, confirming extreme concentration on the single most likely token.
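The rank-k statistic above can be estimated directly from a batch of next-token logits; a minimal NumPy sketch (function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def mean_rank_logprob(logits, k):
    """Mean log-probability of the k-th ranked token (1-indexed)
    across a batch of decoding positions. logits: (T, V)."""
    m = logits.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    logp = logits - lse                   # numerically stable log-softmax
    ranked = -np.sort(-logp, axis=-1)     # sort each position descending
    return ranked[:, k - 1].mean()
```

A GRPO-trained policy would show the rank-1 value near 0 and sharply negative values for k ≥ 2.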
2. SimKO: Asymmetric Ratio Modification for Policy Optimization
SimKO modifies the policy ratio used in PPO/GRPO-style RL objectives. It introduces an entropy-sensitive, asymmetric adjustment applied only at high-entropy “forking” tokens (positions where the policy is non-deterministic and multiple continuations are viable). For positions with entropy exceeding a threshold τ, SimKO distinguishes between positive and negative trajectories based on the per-token advantage A:
- Correct tokens (A > 0): The importance ratio is smoothed across the top-K candidates via a convex combination controlled by α, with a stop-gradient (detach) term keeping the estimator unbiased. This shifts probability mass up for all plausible candidates, reducing token-level sparsity.
- Incorrect tokens (A < 0): For top-1 tokens, the penalty applied to the importance ratio is amplified by a factor λ > 1. This discourages overcommitment to incorrect high-probability choices, while avoiding excessive penalization of lower-ranked alternatives.
All other tokens utilize the standard PPO/GRPO ratio.
3. Implementation and Algorithmic Components
SimKO extends standard PPO/GRPO algorithms with minimal architectural modification, requiring only per-token ratio replacement at “forking” positions and basic bookkeeping for entropy and top-K candidate indices. The principal hyperparameters include:
- K: Number of top candidates to smooth (recommended 3–5)
- α: Smoothing strength for correct trajectories (e.g., 0.01)
- λ: Penalty factor for overconfident incorrect choices (e.g., 1.1)
- τ: Entropy threshold for “forking” (e.g., 80th percentile of entropy over token positions)
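The entropy gate τ can be set per batch from the empirical entropy distribution; a minimal NumPy sketch (treating τ as a batch-level percentile, which is an assumption for illustration):

```python
import numpy as np

def fork_mask(probs, percentile=80.0):
    """Boolean mask of high-entropy 'forking' positions.

    probs: (T, V) next-token distributions for T decoding positions.
    tau is taken as the given percentile of per-position entropy.
    """
    ent = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)
    tau = np.percentile(ent, percentile)
    return ent > tau, tau
```

Only positions flagged by this mask receive the asymmetric ratio adjustment; all others fall through to the standard PPO/GRPO update.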
Pseudocode for the update is as follows:
```
for each batch of (s, a, old_logp, A):
    logp  = pi_theta.log_prob(a | s)
    ratio = exp(logp - old_logp)
    ent   = -sum_v pi_theta(v | s) * log pi_theta(v | s)
    w     = (ent > tau)                          # fork tokens only

    # per-candidate importance ratios for the top-K tokens
    topk_logp     = topK_log_probs(pi_theta(. | s), K)
    old_topk_logp = topK_log_probs(pi_old(. | s), K)   # same K indices
    r_k           = exp(topk_logp - old_topk_logp)

    # value-preserving smoothing: each term r_k * sg[ratio / r_k] equals
    # ratio in value, so the estimator stays unbiased while gradients
    # are spread across the top-K candidates
    topk_ratio = sum_k(r_k * detach(ratio / r_k))

    pos_mask = (A > 0) & w
    ratio[pos_mask] = (1 - alpha) * ratio[pos_mask] + (alpha / K) * topk_ratio[pos_mask]

    neg_mask = (A < 0) & is_top1(a) & w
    ratio[neg_mask] *= lambda_

    pg_loss = -mean(ratio * A)
    # ... add clipping and KL penalty ...
    optimize(pg_loss)
```
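The ratio surgery in section 2 can be exercised on toy numbers; a minimal NumPy sketch of the forward pass (values are hypothetical, and the gradient detach reduces to a plain division since no autograd is involved here):

```python
import numpy as np

def simko_ratio(ratio, adv, ent, is_top1, topk_ratios,
                alpha=0.01, lam=1.1, tau=1.0):
    """Forward-pass SimKO ratio adjustment (no autograd; detach is a no-op).

    ratio:       (T,) PPO importance ratios
    adv:         (T,) per-token advantages
    ent:         (T,) per-position policy entropies
    is_top1:     (T,) whether the sampled token is the argmax token
    topk_ratios: (T, K) per-candidate importance ratios
    """
    r = ratio.copy()
    fork = ent > tau
    K = topk_ratios.shape[1]
    # positive smoothing: value-preserving, so r is unchanged in value here
    smooth = (topk_ratios * (ratio[:, None] / topk_ratios)).sum(axis=1)
    pos = (adv > 0) & fork
    r[pos] = (1 - alpha) * r[pos] + (alpha / K) * smooth[pos]
    # amplified penalty for overconfident incorrect top-1 tokens
    neg = (adv < 0) & is_top1 & fork
    r[neg] *= lam
    return r
```

Because the detach term is a plain division in this forward-only sketch, the positive branch leaves ratio values numerically unchanged; in training, only the gradient flow differs.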
4. Empirical Evaluation
Experiments assess SimKO on a range of backbone models and benchmark datasets. Backbones include Qwen2.5-Math-7B, Qwen2.5-7B, and Llama3.2-3B-Instruct; evaluation spans math tasks (MATH-500, Minerva, OlympiadBench, AMC, AIME, GSM8K) and logic tasks (Synlogic-easy, Big-Bench Hard).
Key metrics are unbiased pass@1 and pass@K up to K = 256. Representative results are summarized as follows:
| Model/Dataset | Base Model | GRPO | SimKO | SimKO Gain (vs GRPO) |
|---|---|---|---|---|
| Qwen2.5-Math-7B (p@1/256) | 25.8/76.4 | 41.7/76.1 | 43.4/80.5 | +1.7/+4.4 |
| Qwen2.5-7B (p@1/256) | 26.6/76.4 | 38.4/72.3 | 38.9/74.3 | +0.5/+2.0 |
| Llama3.2-3B (p@1/256) | 14.2/68.9 | 23.3/69.5 | 24.0/70.8 | +0.7/+1.3 |
| Qwen2.5-7B Synlogic (p@128) | — | 49.4 | 55.0 | +5.6 |
| Qwen2.5-7B BBH (p@128) | — | 88.2 | 92.0 | +3.8 |
Ablations indicate that setting α = 0 and λ = 1 (i.e., reverting to GRPO) eliminates SimKO's pass@K gains, while optimal performance is reached for α = 0.01, λ = 1.1, and τ set to the 80th percentile of entropy. Gains for pass@K are consistent across choices of K, peaking within the recommended 3–5 range.
Component ablations reveal that omitting the positive smoothing term reduces pass@256 by approximately 0.6%, and omitting the negative penalty term reduces pass@256 by approximately 1.6%.
5. Mechanistic Insights and Guiding Principles
SimKO's primary contributions stem from its asymmetric, entropy-aware smoothing and penalization strategy:
- Asymmetric updates: Positive smoothing across top-K tokens in correct trajectories counteracts the spike in single-token probabilities, preserving alternative valid candidates. Negative penalization applied only to overconfident incorrect top-1 tokens prevents "squeezing" probability mass into other regions of the distribution, which could further entrench suboptimal predictions.
- Entropy-based selection: Applying SimKO exclusively at high-entropy “forking” tokens concentrates exploration where multiple continuations are syntactically and semantically viable, preventing disruption of structured or grammar-level tokens.
- Integration and Efficiency: The method incurs negligible additional computational cost, requiring only modification of per-token ratios in existing PPO/GRPO pipelines.
Recommended default hyperparameters are K = 3–5 (reflecting the concentration of probability mass in typical LLM output heads), α ≈ 0.01 for smoothing, λ slightly above 1 (up to 1.1) for penalization, and τ at the 70–90th percentile of observed token entropy.
6. Broader Implications and Practical Recommendations
SimKO directly addresses the exploration limitations endemic to RLVR-style finetuning for LLMs, reliably improving pass@K while maintaining or improving pass@1. The method is model-agnostic and applies to any on-policy RLVR framework, including but not limited to GRPO and PPO, and can be instantiated on diverse model architectures and domains without significant tuning. A plausible implication is that such entropy-sensitive, top-K-based smoothing methods may generalize to other domains requiring controlled exploration, particularly those where both diversity and precision of LLM outputs are necessary (Peng et al., 16 Oct 2025).