SEED-GRPO: Semantic Entropy in Policy Optimization
- SEED-GRPO is a reinforcement learning algorithm that integrates semantic entropy into policy optimization for fine-tuning large language models.
- It dynamically adjusts policy updates based on measured uncertainty, leading to enhanced generalization, training stability, and improved reasoning accuracy.
- Empirical results show significant gains on mathematical reasoning benchmarks, with a 7B-parameter model outperforming larger baselines.
SEED-GRPO (Semantic Entropy EnhanceD Group Relative Policy Optimization) is a reinforcement learning algorithm designed to fine-tune LLMs with uncertainty-aware updates. By explicitly measuring semantic entropy, the diversity of meaning across a model's sampled outputs for each prompt, SEED-GRPO dynamically adjusts the magnitude of policy updates, leading to improved generalization, training stability, and accuracy on difficult reasoning benchmarks.
1. Foundation: GRPO and its Limitations
Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning method for LLMs in which policy updates rely on group-normalized rewards. For a group of $G$ sampled outputs $\{o_1, \dots, o_G\}$ from a prompt $q$, the GRPO advantage is calculated via shift-and-scale normalization:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\}) + \epsilon},$$

where $r_i$ is the reward for output $o_i$ (e.g., correctness), $\mathrm{mean}(\cdot)$ is the group mean, and $\epsilon$ ensures numerical stability. The gradient update incorporates importance-sampling ratios relative to an old policy $\pi_{\theta_{\mathrm{old}}}$ and KL-penalty regularization toward a reference policy $\pi_{\mathrm{ref}}$.
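As a concrete illustration, here is a minimal NumPy sketch of the group-normalized advantage computation (the function name and reward convention are illustrative, not taken from the paper):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Shift-and-scale normalize one prompt's group of rewards into advantages.

    rewards: shape (G,), one scalar reward per sampled output for the same prompt.
    eps:     small constant for numerical stability when all rewards coincide.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 rollouts for one prompt, reward 1.0 iff the final answer is correct.
print(grpo_advantages(np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])))
```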
Vanilla GRPO applies equal update magnitude to all prompts, disregarding how confident the model is about the question. This approach may overfit to noisy supervision on ambiguous queries or under-exploit clear signals on well-understood ones.
2. Semantic Entropy: Quantifying Model Uncertainty
SEED-GRPO introduces semantic entropy (SE) to measure how diverse model outputs are in terms of meaning rather than surface token variation. For a prompt $q$, $G$ rollouts $\{o_1, \dots, o_G\}$ are sampled, and responses are grouped into semantic clusters $\{c_1, \dots, c_K\}$, where each cluster corresponds to a unique answer meaning. The semantic entropy is approximated as

$$\mathrm{SE}(q) \approx -\frac{1}{K} \sum_{k=1}^{K} \log p(c_k \mid q),$$

with

$$p(c_k \mid q) = \sum_{o \in c_k} \pi(o \mid q),$$

where $\pi$ is the sampling policy. Higher $\mathrm{SE}(q)$ implies greater uncertainty (outputs spread across different meanings); lower $\mathrm{SE}(q)$ indicates model confidence.
Unlike token-level Shannon entropy, semantic entropy is robust to paraphrasing, focusing on genuine answer diversity.
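The sketch below illustrates one way to compute this estimate, assuming responses are clustered by exact match of their extracted final answers (the paper's clustering operates at the level of meaning, e.g., via NLI or a judge model; the helper names and the softmax normalization over samples are our own simplifications):

```python
import math
from collections import defaultdict

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def semantic_entropy(final_answers, logprobs=None):
    """Estimate semantic entropy from G sampled responses to one prompt.

    final_answers: list of G extracted final answers; responses with the same
        answer string are treated as one semantic cluster (a stand-in for
        meaning-level clustering).
    logprobs: optional list of G sequence log-probabilities under the sampling
        policy, renormalized over the sampled set; if omitted, each response
        gets uniform weight 1/G.
    """
    G = len(final_answers)
    weights = [1.0 / G] * G if logprobs is None else softmax(logprobs)

    # p(c_k | q): total probability mass of the responses in each cluster.
    cluster_mass = defaultdict(float)
    for ans, w in zip(final_answers, weights):
        cluster_mass[ans] += w

    # Average negative log cluster probability over the K observed clusters.
    K = len(cluster_mass)
    return -sum(math.log(p) for p in cluster_mass.values()) / K

# Example: 3 of 4 rollouts agree on "42" -> entropy below the maximum log(4) ~ 1.39.
print(semantic_entropy(["42", "42", "41", "42"]))
```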
3. Uncertainty-Aware Policy Optimization
SEED-GRPO uses semantic entropy to modulate the advantage for each sampled response. For each rollout $o_i$, the advantage is scaled according to the prompt's uncertainty:

$$\tilde{A}_i = f\big(\mathrm{SE}(q);\ \alpha,\ \mathrm{SE}_{\max}\big) \cdot \hat{A}_i,$$

where:
- $\alpha$ is a sensitivity hyperparameter,
- $\mathrm{SE}_{\max}$ is the maximal entropy attainable with $G$ samples,
- $f$ is a nonnegative scaling function (linear, exponential, or focal) that shrinks as $\mathrm{SE}(q)$ grows.
This adaptive scaling ensures conservative updates when the prompt produces high entropy (uncertainty) and standard updates when entropy is low (confidence). In practice, $f$ is often chosen as a linear function for robust performance. The resulting policy-gradient objective for each sample is:

$$\mathcal{J}_i(\theta) = \min\!\big(\rho_i\,\tilde{A}_i,\ \mathrm{clip}(\rho_i,\,1-\epsilon_{\mathrm{clip}},\,1+\epsilon_{\mathrm{clip}})\,\tilde{A}_i\big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

with $\rho_i = \dfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}$.
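A minimal sketch of the uncertainty-aware scaling follows, with linear, exponential, and focal forms written as plausible instantiations consistent with the description above (they all down-weight advantages as normalized entropy grows); they are not necessarily the paper's exact parameterizations:

```python
import numpy as np

def seed_grpo_advantages(rewards, sem_entropy, se_max, alpha=1.0,
                         mode="linear", eps=1e-8):
    """Scale group-normalized advantages by a prompt-level uncertainty factor.

    rewards:     (G,) rewards for the G rollouts of one prompt.
    sem_entropy: semantic entropy SE(q) estimated from the same rollouts.
    se_max:      maximal entropy attainable with G samples (used as normalizer).
    alpha:       sensitivity hyperparameter.
    mode:        "linear", "exponential", or "focal" -- illustrative variants.
    """
    rewards = np.asarray(rewards, dtype=float)
    base = (rewards - rewards.mean()) / (rewards.std() + eps)  # vanilla GRPO

    u = min(max(sem_entropy / se_max, 0.0), 1.0)  # normalized uncertainty in [0, 1]
    if mode == "linear":
        scale = max(1.0 - alpha * u, 0.0)
    elif mode == "exponential":
        scale = float(np.exp(-alpha * u))
    else:  # focal
        scale = (1.0 - u) ** alpha

    return scale * base  # conservative updates on uncertain prompts

# Usage: one prompt, 4 rollouts, SE(q) estimated as above, SE_max = log(4).
adv = seed_grpo_advantages([1.0, 0.0, 1.0, 1.0],
                           sem_entropy=0.84, se_max=np.log(4), alpha=1.0)
```

With the linear form and $\alpha = 1$, a fully confident prompt ($\mathrm{SE}(q) = 0$) retains its vanilla GRPO advantages, while a maximally uncertain prompt ($\mathrm{SE}(q) = \mathrm{SE}_{\max}$) contributes no gradient.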
4. Empirical Performance and Benchmark Results
Experiments on five mathematical reasoning datasets (AIME24, AMC, MATH, Minerva, OlympiadBench) demonstrate substantial improvements using SEED-GRPO:
- MATH: 83.4% Pass@1 accuracy
- AIME24: 56.7%
- AMC: 68.7%
- Minerva: 34.2%
- OlympiadBench: 48.0%
Ablation studies confirm that performance is sensitive to the choice of scaling function $f$ and to $\alpha$; linear scaling is found most robust. Increasing the group size $G$ (the number of sampled rollouts per prompt) improves semantic entropy estimation and downstream performance, especially for challenging tasks. Notably, a 7B-parameter model trained with SEED-GRPO can outperform prior 32B baselines, indicating efficient utilization of the training signal.
5. Implications and Extensions
By incorporating a calibration signal from semantic entropy, SEED-GRPO mitigates overfitting to noisy rewards and yields more stable generalization on ambiguous prompts. The method acts as an implicit curriculum, allowing aggressive learning on confident prompts while remaining cautious on uncertain ones. Its applicability extends beyond mathematical reasoning; potential domains include:
- Multimodal question answering
- Code synthesis
- Open-ended natural language inference
SEED-GRPO’s dynamic uncertainty-aware updates align with emerging needs for curriculum RL, robust model calibration, and interpretability in LLMs.
6. Limitations and Future Directions
Next steps proposed in the paper include:
- Incorporating Intermediate Reasoning: Extending clustering from final answers to entire reasoning paths for finer uncertainty assessment.
- External Semantic Models: Using additional models (e.g., GPT-4o, Gemini, RoBERTa, SBERT) to improve semantic clustering quality.
- Test-Time Adaptivity: Adjusting the number of model rollouts or fallback search heuristics depending on runtime semantic entropy.
- Alternate Scaling Functions: Exploring exponential, focal, or adaptive variants of the scaling function $f$.
- Domain Generalization: Extension to domains with less rigid answer structure, such as free-form dialogue or multimodal reasoning.
These directions aim to enhance granular uncertainty modeling, improve generalization, and further stabilize RL-based LLM optimization.
7. Related Methodologies and Comparison
SEED-GRPO is situated within a family of policy-optimization methods for LLMs (PPO, GRPO, TIC GRPO) that differ in how they estimate advantages: PPO relies on an explicit value baseline (critic); GRPO is critic-free and uses group-relative advantages with shift/scale normalization; TIC GRPO employs trajectory-level probability ratios for unbiased gradient estimates. SEED-GRPO is unique in its explicit, response-driven uncertainty calibration and its demonstrated empirical superiority in Pass@1 accuracy on mathematical reasoning benchmarks.
In summary, SEED-GRPO represents a principled approach for integrating semantic uncertainty into LLM fine-tuning, yielding enhanced accuracy, stability, and interpretability in settings where diverse outputs reflect underlying epistemic boundaries (Chen et al., 18 May 2025).