SEED-GRPO: Semantic Entropy in Policy Optimization

Updated 24 September 2025
  • SEED-GRPO is a reinforcement learning algorithm that integrates semantic entropy into policy optimization for fine-tuning large language models.
  • It dynamically adjusts policy updates based on measured uncertainty, leading to enhanced generalization, training stability, and improved reasoning accuracy.
  • Empirical results show significant gains on mathematical reasoning benchmarks, with a 7B-parameter model outperforming larger baselines.

SEED-GRPO (Semantic Entropy EnhanceD Group Relative Policy Optimization) is a reinforcement learning algorithm designed to fine-tune LLMs with uncertainty-aware updates. By explicitly measuring semantic entropy—the diversity of meaning in a model’s outputs for each prompt—SEED-GRPO dynamically adjusts the magnitude of policy updates, leading to improved generalization, training stability, and accuracy on difficult reasoning benchmarks.

1. Foundation: GRPO and its Limitations

Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning method for LLMs where policy updates rely on group-normalized rewards. For a set of $G$ sampled outputs $\{o_1, \ldots, o_G\}$ from a prompt $q$, the GRPO advantage is calculated via shift-and-scale normalization:

$$A_i = \frac{r_i - \mu_G}{\sqrt{\frac{1}{G}\sum_{j=1}^{G} (r_j - \mu_G)^2} + \epsilon}, \qquad \mu_G = \frac{1}{G}\sum_{j=1}^{G} r_j$$

where $r_i$ is the reward (e.g., correctness), $\mu_G$ is the group mean, and $\epsilon$ ensures numerical stability. The gradient update incorporates importance sampling ratios relative to an old policy and KL-penalty regularization toward a reference policy.

Vanilla GRPO applies equal update magnitude to all prompts, disregarding how confident the model is about the question. This approach may overfit to noisy supervision on ambiguous queries or under-exploit clear signals on well-understood ones.
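
To make the group normalization concrete, the following minimal Python sketch computes the advantages for a single prompt's rollouts. The function name, the use of PyTorch, and the binary correctness reward are illustrative assumptions, not details taken from the paper's implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shift-and-scale normalize the G scalar rewards sampled for one prompt."""
    mu = rewards.mean()                          # group mean mu_G
    sigma = ((rewards - mu) ** 2).mean().sqrt()  # group standard deviation
    return (rewards - mu) / (sigma + eps)        # A_i = (r_i - mu_G) / (sigma_G + eps)

# Example: G = 8 rollouts, three of which were judged correct (reward 1.0).
rewards = torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.])
print(grpo_advantages(rewards))
```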

2. Semantic Entropy: Quantifying Model Uncertainty

SEED-GRPO introduces semantic entropy (SE) to measure how diverse model outputs are in terms of meaning rather than surface token variation. For a prompt $q$, $G$ rollouts are sampled, and the responses $\{o_1, \ldots, o_G\}$ are grouped into $K$ semantic clusters $\{C_1, \ldots, C_K\}$, where each cluster corresponds to a unique answer meaning. The semantic entropy is approximated as

$$SE(q) \approx -\frac{1}{K} \sum_{k=1}^{K} \log p(C_k \mid q)$$

with

$$p(C_k \mid q) = \sum_{o_i \in C_k} \pi_{\theta_{\text{old}}}(o_i \mid q)$$

where $\pi_{\theta_{\text{old}}}$ is the sampling policy. Higher $SE(q)$ implies greater uncertainty (outputs spread across different meanings); lower $SE(q)$ indicates model confidence.

Unlike token-level Shannon entropy, semantic entropy is robust to paraphrasing, focusing on genuine answer diversity.
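
A minimal sketch of this estimate follows. Exact-match clustering of canonicalized final answers, the equal-weight default when sequence probabilities are unavailable, and the renormalization over clusters are simplifying assumptions (the paper may cluster responses with an entailment or equivalence check); all names are illustrative.

```python
import math
from collections import defaultdict

def semantic_entropy(answers, seq_probs=None):
    """Estimate SE(q) from G sampled responses for one prompt.

    answers: list of G canonicalized final answers; identical entries are
        treated as members of the same semantic cluster C_k.
    seq_probs: optional list of G sequence probabilities under the sampling
        policy pi_old. If omitted, every sample gets weight 1/G, so p(C_k|q)
        reduces to the cluster's empirical frequency.
    """
    G = len(answers)
    weights = [1.0 / G] * G if seq_probs is None else list(seq_probs)
    cluster_mass = defaultdict(float)
    for ans, w in zip(answers, weights):
        cluster_mass[ans] += w                              # p(C_k|q): sum over members
    total = sum(cluster_mass.values())
    probs = [m / total for m in cluster_mass.values()]      # renormalize over clusters
    return -sum(math.log(p) for p in probs) / len(probs)    # -(1/K) * sum log p(C_k|q)

# Example: 6 rollouts collapse into two distinct answer meanings.
print(semantic_entropy(["42", "42", "42", "42", "7", "7"]))  # ~0.75 nats
```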

3. Uncertainty-Aware Policy Optimization

SEED-GRPO uses semantic entropy to modulate the advantage for each sampled response. For each rollout, the advantage $A_i$ is scaled according to the prompt’s uncertainty:

$$\hat{A}_i = A_i \cdot f\!\left( \alpha \, \frac{SE(q)}{SE_{\max}} \right)$$

where:

  • $\alpha$ is a sensitivity hyperparameter,
  • $SE_{\max} = \log G$ is the maximal entropy with $G$ samples,
  • $f(\cdot)$ is a nonnegative scaling function (linear, exponential, or focal).

This adaptive scaling ensures conservative updates when the prompt produces high entropy (uncertainty) and standard updates when entropy is low (confidence). In practice, $f$ is often chosen as a linear function for robust performance. The resulting policy gradient update for each sample is:

$$\nabla_\theta L_i(\theta) = \nabla_\theta \log \pi_\theta(o_i \mid q) \cdot \text{ratio}_i(\theta) \cdot \hat{A}_i$$

with $\text{ratio}_i(\theta) = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$.
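
The sketch below wires the previous pieces together for a single prompt. The particular linear choice $f(x) = \max(0,\, 1 - x)$, the omission of PPO-style clipping and the KL penalty, and all function names are assumptions made for illustration; they are not taken from the paper's released code.

```python
import math
import torch

def seed_grpo_advantages(advantages: torch.Tensor, sem_entropy: float,
                         group_size: int, alpha: float = 1.0) -> torch.Tensor:
    """Modulate GRPO advantages with a linear f of normalized semantic entropy."""
    se_max = math.log(group_size)                          # SE_max = log G
    scale = max(0.0, 1.0 - alpha * sem_entropy / se_max)   # shrink update when uncertain
    return advantages * scale

def surrogate_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   scaled_adv: torch.Tensor) -> torch.Tensor:
    """Per-sample importance-weighted objective (clipping and KL penalty omitted)."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_theta_old
    return -(ratio * scaled_adv).mean()                    # gradient matches the update above
```

In a full training loop, `logp_new` and `logp_old` would be the summed token log-probabilities of each sampled response under the current policy $\pi_\theta$ and the sampling policy $\pi_{\theta_{\text{old}}}$, and the scaled advantages would come from the `grpo_advantages` and `semantic_entropy` sketches above.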

4. Empirical Performance and Benchmark Results

Experiments on five mathematical reasoning benchmarks (AIME24, AMC, MATH, Minerva, OlympiadBench) demonstrate substantial improvements with SEED-GRPO; all figures below are Pass@1 accuracy:

  • MATH: 83.4%
  • AIME24: 56.7%
  • AMC: 68.7%
  • Minerva: 34.2%
  • OlympiadBench: 48.0%

Ablation studies identify the choice of scaling function and the hyperparameter $\alpha$ as sensitive settings, with linear scaling found to be the most robust. Increasing the group size $G$ (the number of sampled rollouts per prompt) improves the semantic entropy estimate and downstream performance, especially on challenging tasks. Notably, a 7B-parameter model trained with SEED-GRPO can outperform prior 32B baselines, indicating efficient use of the training signal.

5. Implications and Extensions

By incorporating a calibration signal from semantic entropy, SEED-GRPO mitigates overfitting to noisy rewards and stabilizes generalization on ambiguous prompts. This method serves as an implicit curriculum, allowing aggressive learning on confident prompts while being cautious on uncertain ones. Its applicability generalizes beyond math reasoning—potential domains include:

  • Multimodal question answering
  • Code synthesis
  • Open-ended natural language inference

SEED-GRPO’s dynamic uncertainty-aware updates align with emerging needs for curriculum RL, robust model calibration, and interpretability in LLMs.

6. Limitations and Future Directions

Next steps proposed in the paper include:

  • Incorporating Intermediate Reasoning: Extending clustering from final answers to entire reasoning paths for finer uncertainty assessment.
  • External Semantic Models: Using additional models (e.g., GPT-4o, Gemini, RoBERTa, SBERT) to improve semantic clustering quality.
  • Test-Time Adaptivity: Adjusting the number of model rollouts or fallback search heuristics depending on runtime semantic entropy.
  • Alternate Scaling Functions: Exploring exponential, focal, or adaptive variants for scaling $A_i$.
  • Domain Generalization: Extension to domains with less rigid answer structure, such as free-form dialogue or multimodal reasoning.

These directions aim to enhance granular uncertainty modeling, improve generalization, and further stabilize RL-based LLM optimization.

SEED-GRPO is situated within a family of related policy optimization methods (GRPO, PPO, TIC GRPO) that differ in how advantages are constructed and normalized. GRPO relies on critic-free, group-relative advantages; PPO on an explicit learned value baseline; TIC GRPO on trajectory-level probability ratios for unbiased gradient estimates. SEED-GRPO is distinguished by its explicit, response-driven uncertainty calibration and its strong Pass@1 results on mathematical reasoning benchmarks.

In summary, SEED-GRPO represents a principled approach for integrating semantic uncertainty into LLM fine-tuning, yielding enhanced accuracy, stability, and interpretability in settings where diverse outputs reflect underlying epistemic boundaries (Chen et al., 18 May 2025).
