SimKO: Simple Pass@K Optimization
- SimKO is a reinforcement learning policy optimization technique that mitigates the exploration–exploitation imbalance in fine-tuning large language models.
- It employs an entropy-sensitive asymmetric objective that boosts multiple plausible correct tokens while penalizing overconfident incorrect predictions.
- Empirical evaluations show SimKO improves pass@K accuracy across various benchmarks without sacrificing pass@1 performance, making it a low-cost addition to RLVR pipelines.
SimKO (Simple Pass@K Optimization) is a policy optimization method developed to address the exploration–exploitation imbalance in reinforcement learning with verifiable rewards (RLVR) for LLMs. While RLVR methods such as Group Relative Policy Optimization (GRPO) have improved single-sample accuracy (pass@1), they systematically degrade diversity-aware accuracy (pass@K for K > 1). SimKO targets this over-concentration effect by employing an asymmetric objective: boosting the probability of multiple plausible candidates in correct trajectories while applying stronger penalties to highly probable but incorrect predictions, thus widening the explored solution space without sacrificing single-answer precision (Peng et al., 16 Oct 2025).
1. Over-Concentration in RLVR and Motivation
On-policy RLVR algorithms, such as GRPO, operate by generating multiple samples per question and up-weighting responses validated as correct, while suppressing those deemed incorrect. This dynamic drives the policy distribution toward near-determinism at each decoding step, as repeated positive reinforcement is concentrated on already dominant tokens, resulting in substantial gains in pass@1 but a marked decline in pass@K (K > 1). The key observation is that standard RLVR leads to token-level probability distributions where the top-1 candidate increasingly monopolizes probability mass, while probabilities for alternative tokens collapse, severely restricting exploration and deteriorating the odds that any of the top-K samples is correct (Peng et al., 16 Oct 2025).
Quantitatively, the effect is measured via the mean log-probability of the k-th ranked candidate token at each decoding step. Empirically, after GRPO training, the rank-1 value increases (converging toward 0, i.e., near-certain top-1 probability), while the values for ranks k ≥ 2 plummet, confirming extreme concentration on the single most likely token.
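The rank-k statistic above can be estimated directly from a batch of next-token logits; a minimal NumPy sketch (function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def mean_rank_logprob(logits, k):
    """Mean log-probability of the k-th ranked token (1-indexed)
    across a batch of decoding positions. logits: (T, V)."""
    m = logits.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    logp = logits - lse                   # numerically stable log-softmax
    ranked = -np.sort(-logp, axis=-1)     # sort each position descending
    return ranked[:, k - 1].mean()
```

A GRPO-trained policy would show the rank-1 value near 0 and sharply negative values for k ≥ 2.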
2. SimKO: Asymmetric Ratio Modification for Policy Optimization
SimKO modifies the policy ratio used in PPO/GRPO-style RL objectives. It introduces an entropy-sensitive, asymmetric adjustment applied only at high-entropy “forking” tokens (positions where the policy is non-deterministic and multiple continuations are viable). For positions with entropy exceeding a threshold τ, SimKO distinguishes between positive and negative trajectories based on the per-token advantage A:
- Correct tokens (A > 0): The importance ratio is smoothed across the top-K candidates via a convex combination controlled by α, with a stop-gradient (detach) term keeping the estimator unbiased. This shifts probability mass up for all plausible candidates, reducing token-level sparsity.
- Incorrect tokens (A < 0): For top-1 tokens, the penalty applied to the importance ratio is amplified by a factor λ > 1. This discourages overcommitment to incorrect high-probability choices, while avoiding excessive penalization of lower-ranked alternatives.
All other tokens utilize the standard PPO/GRPO ratio.
3. Implementation and Algorithmic Components
SimKO extends standard PPO/GRPO algorithms with minimal architectural modification, requiring only per-token ratio replacement at “forking” positions and basic bookkeeping for entropy and top-K candidate indices. The principal hyperparameters include:
- K: Number of top candidates to smooth (recommended 3–5)
- α: Smoothing strength for correct trajectories (e.g., 0.01)
- λ: Penalty factor for overconfident incorrect choices (e.g., 1.1)
- τ: Entropy threshold for “forking” (e.g., 80th percentile of entropy over token positions)
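The entropy gate τ can be set per batch from the empirical entropy distribution; a minimal NumPy sketch (treating τ as a batch-level percentile, which is an assumption for illustration):

```python
import numpy as np

def fork_mask(probs, percentile=80.0):
    """Boolean mask of high-entropy 'forking' positions.

    probs: (T, V) next-token distributions for T decoding positions.
    tau is taken as the given percentile of per-position entropy.
    """
    ent = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)
    tau = np.percentile(ent, percentile)
    return ent > tau, tau
```

Only positions flagged by this mask receive the asymmetric ratio adjustment; all others fall through to the standard PPO/GRPO update.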
Pseudocode for the update is as follows:
```
for each batch of (s, a, old_logp, A):
    logp  = pi_theta.log_prob(a | s)
    ratio = exp(logp - old_logp)
    ent   = -sum_v pi_theta(v | s) * log pi_theta(v | s)
    w     = (ent > tau)                          # fork tokens only

    # per-candidate importance ratios for the top-K tokens
    topk_logp     = topK_log_probs(pi_theta(. | s), K)
    old_topk_logp = topK_log_probs(pi_old(. | s), K)   # same K indices
    r_k           = exp(topk_logp - old_topk_logp)

    # value-preserving smoothing: each term r_k * sg[ratio / r_k] equals
    # ratio in value, so the estimator stays unbiased while gradients
    # are spread across the top-K candidates
    topk_ratio = sum_k(r_k * detach(ratio / r_k))

    pos_mask = (A > 0) & w
    ratio[pos_mask] = (1 - alpha) * ratio[pos_mask] + (alpha / K) * topk_ratio[pos_mask]

    neg_mask = (A < 0) & is_top1(a) & w
    ratio[neg_mask] *= lambda_

    pg_loss = -mean(ratio * A)
    # ... add clipping and KL penalty ...
    optimize(pg_loss)
```
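The ratio surgery in section 2 can be exercised on toy numbers; a minimal NumPy sketch of the forward pass (values are hypothetical, and the gradient detach reduces to a plain division since no autograd is involved here):

```python
import numpy as np

def simko_ratio(ratio, adv, ent, is_top1, topk_ratios,
                alpha=0.01, lam=1.1, tau=1.0):
    """Forward-pass SimKO ratio adjustment (no autograd; detach is a no-op).

    ratio:       (T,) PPO importance ratios
    adv:         (T,) per-token advantages
    ent:         (T,) per-position policy entropies
    is_top1:     (T,) whether the sampled token is the argmax token
    topk_ratios: (T, K) per-candidate importance ratios
    """
    r = ratio.copy()
    fork = ent > tau
    K = topk_ratios.shape[1]
    # positive smoothing: value-preserving, so r is unchanged in value here
    smooth = (topk_ratios * (ratio[:, None] / topk_ratios)).sum(axis=1)
    pos = (adv > 0) & fork
    r[pos] = (1 - alpha) * r[pos] + (alpha / K) * smooth[pos]
    # amplified penalty for overconfident incorrect top-1 tokens
    neg = (adv < 0) & is_top1 & fork
    r[neg] *= lam
    return r
```

Because the detach term is a plain division in this forward-only sketch, the positive branch leaves ratio values numerically unchanged; in training, only the gradient flow differs.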
4. Empirical Evaluation
Experiments assess SimKO on a range of backbone models and benchmark datasets. Backbones include Qwen2.5-Math-7B, Qwen2.5-7B, and Llama3.2-3B-Instruct; evaluation spans math tasks (MATH-500, Minerva, OlympiadBench, AMC, AIME, GSM8K) and logic tasks (Synlogic-easy, Big-Bench Hard).
Key metrics are unbiased pass@1 and pass@K up to K = 256. Representative results are summarized as follows:
| Model/Dataset | Base Model | GRPO | SimKO | SimKO Gain (vs GRPO) |
|---|---|---|---|---|
| Qwen2.5-Math-7B (p@1/256) | 25.8/76.4 | 41.7/76.1 | 43.4/80.5 | +1.7/+4.4 |
| Qwen2.5-7B (p@1/256) | 26.6/76.4 | 38.4/72.3 | 38.9/74.3 | +0.5/+2.0 |
| Llama3.2-3B (p@1/256) | 14.2/68.9 | 23.3/69.5 | 24.0/70.8 | +0.7/+1.3 |
| Qwen2.5-7B Synlogic (p@128) | — | 49.4 | 55.0 | +5.6 |
| Qwen2.5-7B BBH (p@128) | — | 88.2 | 92.0 | +3.8 |
Ablations indicate that setting α = 0 and λ = 1 (i.e., reverting to GRPO) eliminates SimKO's pass@K gains, while optimal performance is reached for α = 0.01, λ = 1.1, and τ set to the 80th percentile of entropy. Gains for pass@K are consistent across choices of K, peaking within the recommended 3–5 range.
Component ablations reveal that omitting the positive smoothing term reduces pass@256 by approximately 0.6%, and omitting the negative penalty term reduces pass@256 by approximately 1.6%.
5. Mechanistic Insights and Guiding Principles
SimKO's primary contributions stem from its asymmetric, entropy-aware smoothing and penalization strategy:
- Asymmetric updates: Positive smoothing across top-K tokens in correct trajectories counteracts the spike in single-token probabilities, preserving alternative valid candidates. Negative penalization applied only to overconfident incorrect top-1 tokens prevents "squeezing" probability mass into other regions of the distribution, which could further entrench suboptimal predictions.
- Entropy-based selection: Applying SimKO exclusively at high-entropy “forking” tokens concentrates exploration where multiple continuations are syntactically and semantically viable, preventing disruption of structured or grammar-level tokens.
- Integration and Efficiency: The method incurs negligible additional computational cost, requiring only modification of per-token ratios in existing PPO/GRPO pipelines.
Recommended default hyperparameters are K = 3–5 (reflecting the concentration of probability mass in typical LLM output heads), α ≈ 0.01 for smoothing, λ slightly above 1 (up to 1.1) for penalization, and τ at the 70–90th percentile of observed token entropy.
6. Broader Implications and Practical Recommendations
SimKO directly addresses the exploration limitations endemic to RLVR-style finetuning for LLMs, reliably improving pass@K while maintaining or improving pass@1. The method is model-agnostic and applies to any on-policy RLVR framework, including but not limited to GRPO and PPO, and can be instantiated on diverse model architectures and domains without significant tuning. A plausible implication is that such entropy-sensitive, top-K-based smoothing methods may generalize to other domains requiring controlled exploration, particularly those where both diversity and precision of LLM outputs are necessary (Peng et al., 16 Oct 2025).