Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening (2506.02355v2)

Published 3 Jun 2025 in cs.LG

Abstract: Reinforcement learning is emerging as a primary driver for improving LLM reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve LLM reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release

Summary

  • The paper introduces an unlikeliness reward that counteracts GRPO’s rank bias by penalizing high-probability solutions and emphasizing rarer correct proofs.
  • It employs a modified reward function and adjusted PPO epochs to improve multi-sample performance and enhance sample diversity in formal theorem proving.
  • The approach yields competitive pass@N scores, demonstrating practical benefits for training LLMs in reasoning tasks.

This paper investigates the limitations of Group Relative Policy Optimization (GRPO), a common reinforcement learning (RL) algorithm used for training LLMs in reasoning tasks, particularly formal theorem proving. The authors identify a "rank bias" in GRPO, where the algorithm preferentially reinforces already high-probability solutions, neglecting rarer but correct ones. This leads to "distribution sharpening": the model becomes better at solving problems it could already solve (improving pass@N for small N) but fails to discover new solutions or improve performance when many samples are allowed (hurting pass@N for large N). This is a significant drawback in domains like formal theorem proving, where generating and verifying multiple candidates is a standard practice.

To address this, the paper introduces the "unlikeliness reward," a modification to the reward function that penalizes high-probability correct solutions, thereby relatively up-weighting rarer correct solutions. The modified reward $r_i$ for a sample $y_i$ in a group of $G$ samples is:

$r_i = R(x, y_i)\left(1 - \beta_{\text{rank}} \frac{G - \text{rank}(y_i)}{G} \right)$

where $R(x, y_i)$ is the original binary reward (1 if $y_i$ proves theorem $x$, 0 otherwise), $\text{rank}(y_i)$ is the rank of $y_i$ under the current policy (rank 0 for the highest-probability sample), and $\beta_{\text{rank}}$ is a hyperparameter controlling the penalty strength (set to 0.25 in experiments). This encourages the model to explore and reinforce less likely, yet valid, proof strategies.
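
To make the scaling concrete, here is the penalty factor worked out under the reported settings ($G = 32$, $\beta_{\text{rank}} = 0.25$); the two ranks shown are just the extremes:

$\text{rank}(y_i) = 0 \text{ (most probable)}: \ 1 - 0.25 \cdot \tfrac{32 - 0}{32} = 0.75, \qquad \text{rank}(y_i) = 31 \text{ (least probable)}: \ 1 - 0.25 \cdot \tfrac{32 - 31}{32} \approx 0.992$

so a correct proof the model already ranks highest keeps only 75% of its reward, while the rarest correct proof keeps nearly all of it.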

Additionally, the authors find that increasing the number of PPO epochs (optimization steps per batch) can also mitigate rank bias. When multiple gradient steps are taken, high-probability solutions might quickly saturate the PPO clipping objective, allowing subsequent steps to focus on lower-probability solutions that are still within the clipping bounds. However, this comes at the cost of increased training time.
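
The following minimal PyTorch sketch (illustrative values, not the paper's code) shows this saturation effect: once a positive-advantage sample's importance ratio exceeds the clip threshold $1 + \epsilon$, the clipped surrogate stops contributing gradient for that sample, so later epochs on the same batch act mainly on samples that have not yet left the clip range.

import torch

eps = 0.2                                  # PPO clip range (illustrative)
old_logp = torch.tensor([-0.5, -6.0])      # hypothetical: a frequent and a rare correct proof
adv = torch.tensor([1.0, 1.0])             # both verified correct -> positive advantage

def clipped_surrogate_grad(delta):
    # Gradient of the clipped surrogate w.r.t. the current log-probs, after the
    # log-probs have already moved up by `delta` relative to the old policy.
    logp = (old_logp + delta).clone().requires_grad_(True)
    ratio = torch.exp(logp - old_logp)
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    surrogate.sum().backward()
    return logp.grad

print(clipped_surrogate_grad(torch.tensor([0.0, 0.0])))   # first step: both samples receive gradient
print(clipped_surrogate_grad(torch.tensor([0.3, 0.05])))  # later step: the frequent sample is clipped (ratio > 1.2), gradient 0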

Implementation and Experimental Setup

  • Task: Formal theorem proving in Lean, using a verifier as the reward function.
  • Base Model: DeepSeek-Prover-V1.5-SFT.
  • Dataset: A 10K subset of the Lean Workbook dataset (2406.03847) combined with 244 problems from miniF2F-valid (2109.00110) for training (9.6K problems) and validation (200 problems). Final large-scale experiments use an 11K-theorem subset from (2502.07640).
  • GRPO Implementation: Built on the verl framework (2409.19256), with a Python wrapper for the Lean REPL from (2408.08152).
  • Hyperparameters (GRPO-Default):
    • Learning rate: 1e-6 (reduced from original 5e-6 for stability)
    • KL loss coefficient ($\beta_\mathrm{KL}$): 0.02
    • Samples per problem (G): 32
    • PPO epochs (K): 1
  • Hyperparameters (GRPO-Unlikeliness variants):
    • $\beta_\mathrm{KL}$: 0.10 (increased to preserve diversity)
    • $\beta_\mathrm{rank}$: 0.25
    • PPO epochs (K): 1 or 2
  • Evaluation Metric: pass@N, the probability that at least one of N independently sampled proof attempts succeeds.

    $\text{pass@}N(x; \pi_\theta) = \mathbbm{1}\left\{ \max_{1 \leq j \leq N} R(x, y_j) = 1 \right\}$

  • Efficient Updates: A buffer of recent samples with non-zero advantage is maintained, and model updates are performed only when the buffer reaches the target batch size. This is similar to Dynamic Sampling (2503.14476).
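
As a rough illustration of this buffering scheme, here is a hypothetical sketch; the helper names (`sample_group`, `compute_group_advantages`, `policy_update`) and the batch size are placeholders, not the released implementation.

# Hypothetical sketch of the buffered update loop described above.
buffer = []                    # (sample, advantage) pairs with non-zero advantage
TARGET_BATCH_SIZE = 256        # illustrative value

for problem in problem_stream:                          # online pass over the training theorems
    samples = sample_group(problem, G=32)               # G rollouts from the current policy
    advantages = compute_group_advantages(samples)      # GRPO-style normalization within the group
    buffer += [(s, a) for s, a in zip(samples, advantages) if a != 0]
    if len(buffer) >= TARGET_BATCH_SIZE:
        batch, buffer = buffer[:TARGET_BATCH_SIZE], buffer[TARGET_BATCH_SIZE:]
        policy_update(batch)                            # one (or K) PPO epochs on this batch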

Key Findings and Results

  1. GRPO's Rank Bias: Standard GRPO (GRPO-Default) improves pass@N for small N (e.g., N=1 to N=16) but underperforms the base model for large N (e.g., N > 64). Analysis of "uplift rate" (the probability that GRPO increases a solution's likelihood) shows a strong positive correlation with the initial probability rank of the solution: high-probability solutions are much more likely to be uplifted than low-probability ones (Figure 4).
  2. Unlikeliness Reward Mitigates Bias:
    • Improved pass@N for large N: GRPO with unlikeliness reward (GRPO-Unlikeliness-1 and GRPO-Unlikeliness-2) significantly improves pass@N across a wide range of N, especially for larger N values, outperforming GRPO-Default (Figure 6).
    • Reversed Uplift Pattern: The unlikeliness reward reverses the rank bias, leading to higher uplift rates for lower-probability correct solutions (Figure 7).
    • Increased Sample Diversity: Models trained with unlikeliness reward show higher unique proof generation during training, indicating greater sample diversity (Figure 8).
  3. Effect of PPO Epochs:
    • Increasing PPO epochs (GRPO-Epochs-2, GRPO-Epochs-3) also improves pass@N for large N and increases the uplift rate for low-probability solutions, though the rank bias pattern remains (Figures 6, 7).
    • It also increases sample diversity (Figure 8).
    • Trade-off: More PPO epochs significantly increase training time. For example, policy update time per batch increases from ~70s (1 epoch) to ~140s (2 epochs) and ~210s (3 epochs).
  4. KL Penalty: Increasing the KL loss coefficient (e.g., from 0.02 to 0.10) helps prevent the deterioration of pass@N at large N by preserving the base model's distribution but doesn't substantially improve it on its own without addressing rank bias (Appendix C).
  5. Cumulative Accuracy: All GRPO variants, especially GRPO-Unlikeliness-2, solved more training problems during a single epoch than a static base model, indicating effective online learning and generalization within the epoch (Table 2).
  6. Competitive Performance: The best variant, GRPO-Unlikeliness-2 (K=2, $\beta_\mathrm{KL}$ = 0.10, $\beta_\mathrm{rank}$ = 0.25), trained on a larger dataset, achieves competitive performance with DeepSeek-Prover-V1.5-RL (2408.08152) on the miniF2F-test benchmark (Table 3). For example, on $\mathcal{D}_\text{val}$:
    • V1.5-SFT: pass@128 = 83.1%
    • V1.5-RL: pass@128 = 87.5%
    • Ours (GRPO-Unlikeliness-2): pass@128 = 88.8%

Practical Implications for Implementation

  • Addressing Distribution Sharpening: When using PPO-style algorithms like GRPO for tasks where multi-sample performance (pass@N for large N) is crucial, practitioners should be aware of the potential for distribution sharpening and rank bias.
  • Unlikeliness Reward as a Simple Fix: The unlikeliness reward offers a straightforward and effective method to counteract this bias. It requires minimal changes to the existing GRPO framework:

    1. During reward calculation, rank the $G$ samples generated for a problem based on their probabilities under $\pi_{\theta_\text{old}}$.
    2. Apply the multiplicative penalty to the rewards of correct samples based on their rank.
      # Unlikeliness reward for one problem's group of G samples.
      # rank 0 is the highest-probability sample under the old policy, so the most
      # probable correct proof gets the largest penalty (reward scaled by 1 - beta_rank)
      # while the least probable correct proof is scaled by only 1 - beta_rank / G.

      def unlikeliness_rewards(probabilities, rewards, beta_rank=0.25):
          """probabilities: pi_old(y_i | x) for each of the G samples.
          rewards: original binary verifier rewards R(x, y_i).
          Returns r_i = R(x, y_i) * (1 - beta_rank * (G - rank(y_i)) / G)."""
          G = len(probabilities)

          # Rank samples by probability under pi_old (rank 0 = highest probability).
          sorted_indices = sorted(range(G), key=lambda i: probabilities[i], reverse=True)
          ranks = [0] * G
          for rank_val, original_idx in enumerate(sorted_indices):
              ranks[original_idx] = rank_val

          modified_rewards = []
          for i in range(G):
              if rewards[i] > 0:  # the penalty only applies to correct samples
                  factor = 1.0 - beta_rank * (G - ranks[i]) / G
                  modified_rewards.append(rewards[i] * factor)
              else:
                  modified_rewards.append(0.0)  # incorrect samples keep zero reward
          return modified_rewards

      # Advantages are then computed from these modified rewards with the usual
      # GRPO group normalization (subtract the group mean, divide by the group std).
  • Tuning PPO Epochs: While increasing PPO epochs can also help, it significantly increases training duration. The unlikeliness reward is presented as a more direct and efficient solution.

  • KL Regularization: A higher KL penalty (e.g., $\beta_\mathrm{KL} = 0.1$) is beneficial for maintaining diversity and preventing pass@N degradation, complementing the unlikeliness reward (a sketch of the regularized objective follows this list).
  • Open Source Contribution: The paper provides an open pipeline for training formal theorem provers with RL that achieves competitive results.
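
For reference, here is a sketch of the kind of objective these pieces plug into, in the spirit of the standard GRPO loss (clipped PPO surrogate plus a KL penalty toward the reference/base policy); the exact per-token form used in the paper may differ:

$J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\right)\right] - \beta_\mathrm{KL}\, D_\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_\mathrm{ref}\right), \quad \rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\text{old}}(y_i \mid x)}$

Here $A_i$ is the group-normalized advantage computed from the (modified) rewards; raising $\beta_\mathrm{KL}$ from 0.02 to 0.10 strengthens the pull toward the base model's distribution, which is what preserves pass@N at large N.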

The paper concludes that by directly addressing GRPO's rank bias with the unlikeliness reward, it's possible to substantially improve multi-sample performance and sample diversity in RL-trained LLMs for tasks like formal theorem proving.