
Stratified GRPO: Fair Credit in RL

Updated 8 October 2025
  • The paper introduces Stratified GRPO, a method that partitions trajectories by search call counts to compute locally normalized advantages, eliminating cross-stratum bias.
  • It employs Stratified Advantage Normalization (SAN) to achieve unbiased learning signals with zero mean and unit variance within each homogeneous subgroup.
  • Empirical evaluations on multi-hop benchmarks show improvements of up to 14.5 points in reward metrics, affirming its effectiveness for LLM search agents.

Stratified Group Relative Policy Optimization (Stratified GRPO) is an advanced reinforcement learning methodology designed to address challenges in credit assignment arising from structural heterogeneity in agent trajectories, particularly in LLM search agents that interact with external tools. Standard policy gradient methods use a single global baseline, which introduces cross-stratum bias when evaluating structurally diverse trajectories. Stratified GRPO overcomes this by partitioning trajectories into homogeneous strata and computing normalized advantages within each stratum, thereby ensuring fair comparison and robust reward signals.

1. Motivation and Problem Statement

Stratified GRPO was developed in response to the limitations of global normalization in reinforcement learning for search-based LLM agents, whose trajectories vary significantly in the number, placement, and outcome of search engine calls. The structural heterogeneity produces distributions of rewards that are not directly comparable. In standard GRPO or policy gradient approaches, all trajectories in a batch are evaluated against a single global baseline (mean and variance), which results in cross-stratum bias—misleading credit assignment when heterogeneous strategies or search counts are present. This bias can distort learning signals, impeding exploration and effective multi-hop search strategies (Zhu et al., 7 Oct 2025).

2. Stratified Advantage Normalization (SAN)

The central component of Stratified GRPO is Stratified Advantage Normalization (SAN), which partitions trajectories into strata based on a structural property, most commonly the number of search calls. Within each stratum $k$ for a prompt $x$, SAN calculates the empirical mean $\mu_k(x)$ and standard deviation $\sigma_k(x)$ of the rewards over all trajectories $B_k(x)$ in that group:

$$\mu_k(x) = \frac{1}{n_k} \sum_{\tau_i \in B_k(x)} R(\tau_i), \qquad \sigma_k(x) = \sqrt{\frac{1}{n_k} \sum_{\tau_i \in B_k(x)} \big(R(\tau_i) - \mu_k(x)\big)^2}$$

The SAN advantage for trajectory $\tau_i$ in stratum $k$ is

$$A_{\mathrm{SAN}}(\tau_i) = \frac{R(\tau_i) - \mu_k(x)}{\sigma_k(x) + \epsilon}$$

where $\epsilon$ ensures numerical stability. This local normalization yields conditionally unbiased, unit-variance estimates inside each stratum. Crucially, by restricting comparisons to homogeneous peers, SAN entirely removes the systematic offset (the cross-stratum bias term) that arises when a global baseline is used, as seen in the decomposition

$$A_G(\tau_i) = \big(R(\tau_i) - \mu_k(x)\big) + \big(\mu_k(x) - \bar{R}_{\mathrm{global}}\big)$$

where the second term is the cross-stratum bias.
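
To make the SAN computation concrete, the following is a minimal NumPy sketch (illustrative names, not code from the paper) that groups one prompt's sampled trajectories by search call count and normalizes rewards within each stratum:

```python
import numpy as np

def stratified_advantages(rewards, num_search_calls, eps=1e-8):
    """Compute SAN advantages for one prompt's group of trajectories.

    rewards:          shape (N,), scalar reward per trajectory
    num_search_calls: shape (N,), search call count per trajectory
                      (the stratification key)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    num_search_calls = np.asarray(num_search_calls)
    advantages = np.empty_like(rewards)

    # Partition trajectories into strata by search call count and
    # normalize rewards locally within each stratum.
    for k in np.unique(num_search_calls):
        mask = num_search_calls == k
        mu_k = rewards[mask].mean()
        sigma_k = rewards[mask].std()  # population std, matching the formula above
        advantages[mask] = (rewards[mask] - mu_k) / (sigma_k + eps)
    return advantages

# Example: six trajectories for one prompt, stratified into search counts {1, 2}.
rewards = [0.2, 0.5, 0.4, 0.9, 0.7, 0.8]
calls   = [1,   1,   1,   2,   2,   2]
print(stratified_advantages(rewards, calls))  # zero mean, ~unit variance per stratum
```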

3. Theoretical Properties

SAN offers strong theoretical guarantees:

  • Unbiasedness and unit variance within strata: For any stratum $k$, $\mathbb{E}[A_{\mathrm{SAN}}(\tau) \mid k, x] = 0$ and $\mathrm{Var}(A_{\mathrm{SAN}}(\tau) \mid k, x) = 1$.
  • Global preservation: In the large-sample regime, aggregating over all strata recovers the global normalization statistics; both SAN and global normalization have mean zero and unit variance across the batch. However, the global baseline is susceptible to bias from stratum offsets and reward scale disparities, which SAN consistently eliminates.
  • Bias elimination: The paper rigorously proves that cross-stratum bias vanishes inside each stratum, resulting in a cleaner, scale-stable learning signal (Zhu et al., 7 Oct 2025); a short derivation sketch follows this list.
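
The bias-elimination claim can be read off the decomposition in Section 2; the following is a brief sketch using the definitions above, not a verbatim excerpt from the paper. Conditional on stratum $k$ and prompt $x$, the globally baselined advantage satisfies

$$\mathbb{E}[A_G(\tau) \mid k, x] = \mu_k(x) - \bar{R}_{\mathrm{global}},$$

which is nonzero whenever the stratum mean differs from the global mean, whereas the stratified advantage is centered within its own stratum:

$$\mathbb{E}[A_{\mathrm{SAN}}(\tau) \mid k, x] = \frac{\mathbb{E}[R(\tau) \mid k, x] - \mu_k(x)}{\sigma_k(x) + \epsilon} = 0.$$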

4. Practical Implementation and Stability

There are practical considerations when implementing SAN, especially in finite-sample regimes. Small strata can exhibit noisy estimates of mean and variance due to limited sample size. To mitigate this, the approach uses a blended advantage

$$A_{\mathrm{blend}}(\tau) = \alpha \cdot A_{\mathrm{SAN}}(\tau) + (1 - \alpha) \cdot A_{\mathrm{GN}}(\tau)$$

where $A_{\mathrm{GN}}$ is the globally normalized advantage and $\alpha \in [0,1]$ controls the trade-off. With $\alpha$ near 1, the method retains the benefits of local normalization while borrowing the stability of the global estimator when needed. This blending ensures robust policy updates even when stratum sizes are small.
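
As a sketch of the blending step, reusing the illustrative `stratified_advantages` helper from Section 2 (the default `alpha` here is arbitrary, not the paper's setting):

```python
import numpy as np

def blended_advantages(rewards, num_search_calls, alpha=0.9, eps=1e-8):
    """Blend stratified (SAN) and globally normalized advantages."""
    rewards = np.asarray(rewards, dtype=np.float64)
    a_san = stratified_advantages(rewards, num_search_calls, eps)
    a_gn = (rewards - rewards.mean()) / (rewards.std() + eps)  # global normalization
    return alpha * a_san + (1.0 - alpha) * a_gn
```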

5. Empirical Evaluation and Performance

Extensive experiments evaluate Stratified GRPO across single-hop and multi-hop question-answering benchmarks (Natural Questions, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle) (Zhu et al., 7 Oct 2025). Results consistently show that Stratified GRPO:

  • Outperforms standard GRPO by up to 11.3 points in mean training reward and policy effectiveness.
  • Maintains a more stable reward curve throughout training, with discernible improvements in exploration efficiency.
  • Especially excels in multi-hop tasks, achieving gains up to 14.5 points over baseline methods.
  • Facilitates the discovery of effective search strategies, such as convergence to a stable average number of search calls, owing to the improved purity of the learning signal.

6. Broader Implications and Future Directions

Stratified GRPO generalizes to any scenario in reinforcement learning where structural trajectory heterogeneity induces distributional mismatch. The theoretical and empirical results indicate that explicit stratification—partitioning data into homogeneous subgroups and normalizing within each—constitutes a principled remedy for learning bias and credit assignment failure. The success of SAN suggests further investigations into stratification schemes based on more complex structure (e.g., modality, action space mixture) and adaptive partitioning. The blending strategy offers a blueprint for variance reduction in real-world RL implementations under limited data. A plausible implication is that stratification may also prove valuable in other domains, such as multi-agent systems and mixed discrete-continuous RL, where structural diversity is prevalent.

7. Summary

Stratified GRPO (as developed in (Zhu et al., 7 Oct 2025)) is a reinforcement learning framework for structurally heterogeneous agent trajectories. By partitioning trajectories into homogeneous strata—most often by search count—and normalizing rewards and advantages locally, it eliminates cross-stratum bias and achieves unbiased, unit-variance advantage estimates. Linear blending with the global estimator increases practical stability under finite-sample conditions. Empirical evaluations confirm substantial improvements over standard GRPO in both reward attainment and training stability. These results establish stratification as a foundational technique for robust credit assignment and exploration in RL for tool-augmented LLM agents, with broader application potential throughout reinforcement learning.
