SEED-GRPO: Semantic Entropy in Policy Optimization
- SEED-GRPO is a reinforcement learning algorithm that integrates semantic entropy into policy optimization for fine-tuning large language models.
- It dynamically adjusts policy updates based on measured uncertainty, leading to enhanced generalization, training stability, and improved reasoning accuracy.
- Empirical results show significant gains on mathematical reasoning benchmarks, with a 7B-parameter model outperforming larger baselines.
SEED-GRPO (Semantic Entropy EnhanceD Group Relative Policy Optimization) is a reinforcement learning algorithm designed to fine-tune LLMs with uncertainty-aware updates. By explicitly measuring semantic entropy, the diversity of meaning across a model's sampled outputs for each prompt, SEED-GRPO dynamically adjusts the magnitude of policy updates, leading to improved generalization, training stability, and accuracy on difficult reasoning benchmarks.
1. Foundation: GRPO and its Limitations
Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning method for LLMs in which policy updates rely on group-normalized rewards. For a group of $G$ sampled outputs $\{o_1, \dots, o_G\}$ from a prompt $q$, the GRPO advantage is calculated via shift-and-scale normalization:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\}) + \epsilon},$$

where $r_i$ is the reward for output $o_i$ (e.g., correctness), $\mathrm{mean}(\cdot)$ is the group mean, and $\epsilon$ ensures numerical stability. The gradient update incorporates importance-sampling ratios relative to an old policy $\pi_{\theta_{\mathrm{old}}}$ and KL-penalty regularization toward a reference policy $\pi_{\mathrm{ref}}$.
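As a concrete illustration, here is a minimal NumPy sketch of the group-normalized advantage computation (the function name and reward convention are illustrative, not taken from the paper):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Shift-and-scale normalize one prompt's group of rewards into advantages.

    rewards: shape (G,), one scalar reward per sampled output for the same prompt.
    eps:     small constant for numerical stability when all rewards coincide.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 rollouts for one prompt, reward 1.0 iff the final answer is correct.
print(grpo_advantages(np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])))
```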
Vanilla GRPO applies equal update magnitude to all prompts, disregarding how confident the model is about the question. This approach may overfit to noisy supervision on ambiguous queries or under-exploit clear signals on well-understood ones.
2. Semantic Entropy: Quantifying Model Uncertainty
SEED-GRPO introduces semantic entropy (SE) to measure how diverse model outputs are in terms of meaning rather than surface token variation. For a prompt $q$, $G$ rollouts $\{o_1, \dots, o_G\}$ are sampled, and responses are grouped into semantic clusters $\{c_1, \dots, c_K\}$, where each cluster corresponds to a unique answer meaning. The semantic entropy is approximated as

$$\mathrm{SE}(q) \approx -\frac{1}{K} \sum_{k=1}^{K} \log p(c_k \mid q),$$

with

$$p(c_k \mid q) = \sum_{o \in c_k} \pi(o \mid q),$$

where $\pi$ is the sampling policy. Higher $\mathrm{SE}(q)$ implies greater uncertainty (outputs spread across different meanings); lower $\mathrm{SE}(q)$ indicates model confidence.
Unlike token-level Shannon entropy, semantic entropy is robust to paraphrasing, focusing on genuine answer diversity.
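The sketch below illustrates one way to compute this estimate, assuming responses are clustered by exact match of their extracted final answers (the paper's clustering operates at the level of meaning, e.g., via NLI or a judge model; the helper names and the softmax normalization over samples are our own simplifications):

```python
import math
from collections import defaultdict

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def semantic_entropy(final_answers, logprobs=None):
    """Estimate semantic entropy from G sampled responses to one prompt.

    final_answers: list of G extracted final answers; responses with the same
        answer string are treated as one semantic cluster (a stand-in for
        meaning-level clustering).
    logprobs: optional list of G sequence log-probabilities under the sampling
        policy, renormalized over the sampled set; if omitted, each response
        gets uniform weight 1/G.
    """
    G = len(final_answers)
    weights = [1.0 / G] * G if logprobs is None else softmax(logprobs)

    # p(c_k | q): total probability mass of the responses in each cluster.
    cluster_mass = defaultdict(float)
    for ans, w in zip(final_answers, weights):
        cluster_mass[ans] += w

    # Average negative log cluster probability over the K observed clusters.
    K = len(cluster_mass)
    return -sum(math.log(p) for p in cluster_mass.values()) / K

# Example: 3 of 4 rollouts agree on "42" -> entropy below the maximum log(4) ~ 1.39.
print(semantic_entropy(["42", "42", "41", "42"]))
```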
3. Uncertainty-Aware Policy Optimization
SEED-GRPO uses semantic entropy to modulate the advantage for each sampled response. For each rollout $o_i$, the advantage is scaled according to the prompt's uncertainty:

$$\tilde{A}_i = f\big(\mathrm{SE}(q);\ \alpha,\ \mathrm{SE}_{\max}\big) \cdot \hat{A}_i,$$

where:
- $\alpha$ is a sensitivity hyperparameter,
- $\mathrm{SE}_{\max}$ is the maximal entropy attainable with $G$ samples,
- $f$ is a nonnegative scaling function (linear, exponential, or focal) that shrinks as $\mathrm{SE}(q)$ grows.
This adaptive scaling ensures conservative updates when the prompt produces high entropy (uncertainty) and standard updates when entropy is low (confidence). In practice, $f$ is often chosen as a linear function for robust performance. The resulting policy-gradient objective for each sample is:

$$\mathcal{J}_i(\theta) = \min\!\big(\rho_i\,\tilde{A}_i,\ \mathrm{clip}(\rho_i,\,1-\epsilon_{\mathrm{clip}},\,1+\epsilon_{\mathrm{clip}})\,\tilde{A}_i\big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

with $\rho_i = \dfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}$.
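A minimal sketch of the uncertainty-aware scaling follows, with linear, exponential, and focal forms written as plausible instantiations consistent with the description above (they all down-weight advantages as normalized entropy grows); they are not necessarily the paper's exact parameterizations:

```python
import numpy as np

def seed_grpo_advantages(rewards, sem_entropy, se_max, alpha=1.0,
                         mode="linear", eps=1e-8):
    """Scale group-normalized advantages by a prompt-level uncertainty factor.

    rewards:     (G,) rewards for the G rollouts of one prompt.
    sem_entropy: semantic entropy SE(q) estimated from the same rollouts.
    se_max:      maximal entropy attainable with G samples (used as normalizer).
    alpha:       sensitivity hyperparameter.
    mode:        "linear", "exponential", or "focal" -- illustrative variants.
    """
    rewards = np.asarray(rewards, dtype=float)
    base = (rewards - rewards.mean()) / (rewards.std() + eps)  # vanilla GRPO

    u = min(max(sem_entropy / se_max, 0.0), 1.0)  # normalized uncertainty in [0, 1]
    if mode == "linear":
        scale = max(1.0 - alpha * u, 0.0)
    elif mode == "exponential":
        scale = float(np.exp(-alpha * u))
    else:  # focal
        scale = (1.0 - u) ** alpha

    return scale * base  # conservative updates on uncertain prompts

# Usage: one prompt, 4 rollouts, SE(q) estimated as above, SE_max = log(4).
adv = seed_grpo_advantages([1.0, 0.0, 1.0, 1.0],
                           sem_entropy=0.84, se_max=np.log(4), alpha=1.0)
```

With the linear form and $\alpha = 1$, a fully confident prompt ($\mathrm{SE}(q) = 0$) retains its vanilla GRPO advantages, while a maximally uncertain prompt ($\mathrm{SE}(q) = \mathrm{SE}_{\max}$) contributes no gradient.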
4. Empirical Performance and Benchmark Results
Experiments on five mathematical reasoning datasets (AIME24, AMC, MATH, Minerva, OlympiadBench) demonstrate substantial improvements using SEED-GRPO:
- MATH: 83.4% Pass@1 accuracy
- AIME24: 56.7%
- AMC: 68.7%
- Minerva: 34.2%
- OlympiadBench: 48.0%
Ablation studies confirm that performance is sensitive to the choice of scaling function $f$ and to $\alpha$; linear scaling is found most robust. Increasing the group size $G$ (the number of sampled rollouts per prompt) improves semantic entropy estimation and downstream performance, especially for challenging tasks. Notably, a 7B-parameter model trained with SEED-GRPO can outperform prior 32B baselines, indicating efficient utilization of the training signal.
5. Implications and Extensions
By incorporating a calibration signal from semantic entropy, SEED-GRPO mitigates overfitting to noisy rewards and yields more stable generalization on ambiguous prompts. The method acts as an implicit curriculum, allowing aggressive learning on confident prompts while remaining cautious on uncertain ones. Its applicability extends beyond mathematical reasoning; potential domains include:
- Multimodal question answering
- Code synthesis
- Open-ended natural language inference
SEED-GRPO’s dynamic uncertainty-aware updates align with emerging needs for curriculum RL, robust model calibration, and interpretability in LLMs.
6. Limitations and Future Directions
Next steps proposed in the paper include:
- Incorporating Intermediate Reasoning: Extending clustering from final answers to entire reasoning paths for finer uncertainty assessment.
- External Semantic Models: Using additional models (e.g., GPT-4o, Gemini, RoBERTa, SBERT) to improve semantic clustering quality.
- Test-Time Adaptivity: Adjusting the number of model rollouts or fallback search heuristics depending on runtime semantic entropy.
- Alternate Scaling Functions: Exploring exponential, focal, or adaptive variants of the scaling function $f$.
- Domain Generalization: Extension to domains with less rigid answer structure, such as free-form dialogue or multimodal reasoning.
These directions aim to enhance granular uncertainty modeling, improve generalization, and further stabilize RL-based LLM optimization.
7. Related Methodologies and Comparison
SEED-GRPO is situated within a family of policy-optimization methods for LLMs (PPO, GRPO, TIC GRPO) that differ in how they estimate advantages: PPO relies on an explicit value baseline (critic); GRPO is critic-free and uses group-relative advantages with shift/scale normalization; TIC GRPO employs trajectory-level probability ratios for unbiased gradient estimates. SEED-GRPO is unique in its explicit, response-driven uncertainty calibration and its demonstrated empirical superiority in Pass@1 accuracy on mathematical reasoning benchmarks.
In summary, SEED-GRPO represents a principled approach for integrating semantic uncertainty into LLM fine-tuning, yielding enhanced accuracy, stability, and interpretability in settings where diverse outputs reflect underlying epistemic boundaries (Chen et al., 18 May 2025).