- The paper introduces an unlikeliness reward that counteracts GRPO's rank bias by penalizing high-probability correct solutions and up-weighting rarer correct proofs.
- It combines this modified reward with an increased number of PPO epochs per batch to improve multi-sample (pass@N) performance and sample diversity in formal theorem proving.
- The approach yields competitive pass@N scores, demonstrating practical benefits for training LLMs in reasoning tasks.
This paper investigates the limitations of Group Relative Policy Optimization (GRPO), a common reinforcement learning (RL) algorithm used for training LLMs in reasoning tasks, particularly formal theorem proving. The authors identify a "rank bias" in GRPO, where the algorithm preferentially reinforces already high-probability solutions, neglecting rarer but correct ones. This leads to "distribution sharpening": the model becomes better at solving problems it could already solve (improving pass@N for small N) but fails to discover new solutions or improve performance when many samples are allowed (hurting pass@N for large N). This is a significant drawback in domains like formal theorem proving, where generating and verifying multiple candidates is a standard practice.
To address this, the paper introduces the "unlikeliness reward," a modification to the reward function that penalizes high-probability correct solutions, thereby up-weighting rarer correct solutions relative to common ones. The modified reward $r_i$ for a sample $y_i$ in a group of $G$ samples is:
$$r_i = R(x, y_i)\left(1 - \beta_{\text{rank}} \cdot \frac{G - \text{rank}(y_i)}{G}\right)$$
where $R(x, y_i)$ is the original binary reward (1 if $y_i$ proves theorem $x$, 0 otherwise), $\text{rank}(y_i)$ is the rank of $y_i$ under the current policy (rank 0 for the highest-probability sample), and $\beta_{\text{rank}}$ is a hyperparameter controlling the penalty strength (set to 0.25 in the experiments). This encourages the model to explore and reinforce less likely, yet valid, proof strategies.
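As a concrete illustration (not the authors' code), the following minimal NumPy sketch applies this modification, assuming per-sample binary rewards and sequence log-probabilities under the current policy are already available; the function name is ours.

```python
import numpy as np

def unlikeliness_rewards(binary_rewards, logprobs, beta_rank=0.25):
    """Apply the unlikeliness penalty to a group of G sampled solutions.

    binary_rewards: shape (G,), 1.0 if the sample is a verified proof, else 0.0.
    logprobs:       shape (G,), total log-probability of each sample under the current policy.
    beta_rank:      penalty strength (0.25 in the paper's experiments).
    """
    binary_rewards = np.asarray(binary_rewards, dtype=float)
    G = len(binary_rewards)
    # rank 0 = most probable sample under the current policy
    order = np.argsort(-np.asarray(logprobs))
    rank = np.empty(G, dtype=int)
    rank[order] = np.arange(G)
    # Highest-probability samples are penalized the most: rank 0 gets the factor
    # (1 - beta_rank), while the least probable sample gets a factor close to 1.
    return binary_rewards * (1.0 - beta_rank * (G - rank) / G)
```

The modified rewards are then group-normalized into advantages as in standard GRPO.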
Additionally, the authors find that increasing the number of PPO epochs (optimization steps per batch) can also mitigate rank bias. When multiple gradient steps are taken, high-probability solutions might quickly saturate the PPO clipping objective, allowing subsequent steps to focus on lower-probability solutions that are still within the clipping bounds. However, this comes at the cost of increased training time.
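To make the clipping argument concrete, the sketch below shows the standard PPO clipped surrogate (generic PPO code, not taken from the paper); `logprobs_new` carries gradients while `logprobs_old` is fixed at sampling time.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss.

    For a sample with positive advantage, once its probability ratio exceeds
    1 + clip_eps the clipped term takes over and the sample stops contributing
    gradient. Over multiple epochs on the same batch, high-probability solutions
    tend to hit this ceiling first, so later epochs mostly act on the remaining
    lower-probability correct solutions -- the effect described above.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```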
Implementation and Experimental Setup
Key Findings and Results
- GRPO's Rank Bias: Standard GRPO (GRPO-Default) improves pass@N for small N (e.g., N=1 to N=16) but underperforms the base model for large N (e.g., N > 64). Analysis of the "uplift rate" (the probability that a GRPO update increases a solution's likelihood) shows a strong positive correlation with the solution's initial probability rank: high-probability solutions are much more likely to be uplifted than low-probability ones (Figure 4); a sketch of this diagnostic appears after this list.
- Unlikeliness Reward Mitigates Bias:
  - Improved pass@N for large N: GRPO with the unlikeliness reward (GRPO-Unlikeliness-1 and GRPO-Unlikeliness-2) significantly improves pass@N across a wide range of N, especially for larger N, outperforming GRPO-Default (Figure 6).
  - Reversed Uplift Pattern: The unlikeliness reward reverses the rank bias, leading to higher uplift rates for lower-probability correct solutions (Figure 7).
  - Increased Sample Diversity: Models trained with the unlikeliness reward generate more unique proofs during training, indicating greater sample diversity (Figure 8).
- Effect of PPO Epochs:
  - Increasing PPO epochs (GRPO-Epochs-2, GRPO-Epochs-3) also improves pass@N for large N and increases the uplift rate for low-probability solutions, though the rank bias pattern remains (Figures 6, 7).
  - It also increases sample diversity (Figure 8).
  - Trade-off: More PPO epochs significantly increase training time. For example, policy update time per batch increases from ~70s (1 epoch) to ~140s (2 epochs) and ~210s (3 epochs).
- KL Penalty: Increasing the KL loss coefficient (e.g., from 0.02 to 0.10) helps prevent the deterioration of pass@N at large N by preserving the base model's distribution but doesn't substantially improve it on its own without addressing rank bias (Appendix C).
- Cumulative Accuracy: All GRPO variants, especially GRPO-Unlikeliness-2, solved more training problems during a single epoch than a static base model, indicating effective online learning and generalization within the epoch (Table 2).
- Competitive Performance: The best variant, GRPO-Unlikeliness-2 ($K=2$, $\beta_{\text{KL}}=0.10$, $\beta_{\text{rank}}=0.25$), trained on a larger dataset, achieves performance competitive with DeepSeek-Prover-V1.5-RL (arXiv:2408.08152) on the miniF2F-test benchmark (Table 3). For example, on $D_{\text{val}}$:
  - V1.5-SFT: pass@128 = 83.1%
  - V1.5-RL: pass@128 = 87.5%
  - Ours (GRPO-Unlikeliness-2): pass@128 = 88.8%
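For reproducing the rank-vs-uplift analysis referenced above (Figures 4 and 7), here is a minimal sketch of the diagnostic, assuming per-sample log-probabilities are recorded before and after each policy update; the exact bucketing the authors use may differ.

```python
from collections import defaultdict
import numpy as np

def uplift_rate_by_rank(groups):
    """Aggregate uplift rate per probability rank over many training groups.

    groups: iterable of (logprobs_before, logprobs_after) pairs, one pair per group,
    each restricted to the correct samples; rank 0 = most probable before the update.
    Returns {rank: fraction of samples at that rank whose log-probability increased}.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for before, after in groups:
        before, after = np.asarray(before), np.asarray(after)
        order = np.argsort(-before)
        ranks = np.empty(len(before), dtype=int)
        ranks[order] = np.arange(len(before))
        for r, up in zip(ranks, after > before):
            totals[int(r)] += 1
            hits[int(r)] += int(up)
    return {r: hits[r] / totals[r] for r in sorted(totals)}
```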
Practical Implications for Implementation
The paper concludes that by directly addressing GRPO's rank bias with the unlikeliness reward, it's possible to substantially improve multi-sample performance and sample diversity in RL-trained LLMs for tasks like formal theorem proving.
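Putting the pieces together, here is a hedged sketch of where the unlikeliness reward, the extra PPO epochs, and the KL coefficient slot into a GRPO-style update, using the hyperparameters reported for GRPO-Unlikeliness-2; `policy`, `verifier`, and their methods are illustrative placeholders (not an API from the paper), and the group size `G` is an arbitrary example value.

```python
import numpy as np

BETA_RANK = 0.25   # unlikeliness penalty strength
BETA_KL = 0.10     # KL loss coefficient toward the base model
PPO_EPOCHS = 2     # gradient passes per sampled batch (K)

def grpo_advantages(rewards):
    """Standard GRPO advantage: normalize rewards within the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def train_step(policy, verifier, theorem, G=32):
    """One GRPO-style step with the unlikeliness reward applied before normalization.

    `policy.sample`, `policy.logprob`, `verifier.check`, and `policy.ppo_update`
    stand in for a real training stack.
    """
    samples = [policy.sample(theorem) for _ in range(G)]
    binary = np.array([float(verifier.check(theorem, y)) for y in samples])
    logps = np.array([policy.logprob(theorem, y) for y in samples])

    # Unlikeliness reward: down-weight correct proofs the policy already favors.
    order = np.argsort(-logps)
    rank = np.empty(G, dtype=int)
    rank[order] = np.arange(G)
    rewards = binary * (1.0 - BETA_RANK * (G - rank) / G)

    adv = grpo_advantages(rewards)
    for _ in range(PPO_EPOCHS):  # multiple epochs further mitigate rank bias
        policy.ppo_update(theorem, samples, adv, kl_coef=BETA_KL)
```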