Stratified GRPO: Fair Credit in RL
- The paper introduces Stratified GRPO, a method that partitions trajectories by search call counts to compute locally normalized advantages, eliminating cross-stratum bias.
- It employs Stratified Advantage Normalization (SAN) to achieve unbiased learning signals with zero mean and unit variance within each homogeneous subgroup.
- Empirical evaluations on multi-hop benchmarks show improvements of up to 14.5 points in reward metrics, affirming its effectiveness for LLM search agents.
Stratified Group Relative Policy Optimization (Stratified GRPO) is an advanced reinforcement learning methodology designed to address challenges in credit assignment arising from structural heterogeneity in agent trajectories, particularly in LLM search agents that interact with external tools. Standard policy gradient methods use a single global baseline, which introduces cross-stratum bias when evaluating structurally diverse trajectories. Stratified GRPO overcomes this by partitioning trajectories into homogeneous strata and computing normalized advantages within each stratum, thereby ensuring fair comparison and robust reward signals.
1. Motivation and Problem Statement
Stratified GRPO was developed in response to the limitations of global normalization in reinforcement learning for search-based LLM agents, whose trajectories vary significantly in the number, placement, and outcome of search engine calls. This structural heterogeneity produces reward distributions that are not directly comparable. In standard GRPO or policy gradient approaches, all trajectories in a batch are evaluated against a single global baseline (mean and variance), which results in cross-stratum bias: misleading credit assignment whenever trajectories with heterogeneous strategies or search counts are mixed. This bias can distort learning signals, impeding exploration and the discovery of effective multi-hop search strategies (Zhu et al., 7 Oct 2025).
2. Stratified Advantage Normalization (SAN)
The central component of Stratified GRPO is Stratified Advantage Normalization (SAN), which partitions trajectories into strata based on a structural property—most commonly the number of search calls. Within each stratum $S_k$ for a prompt $x$, SAN calculates the empirical mean $\mu_k$ and standard deviation $\sigma_k$ of the rewards over all trajectories in that group:

$$\mu_k = \frac{1}{|S_k|} \sum_{\tau \in S_k} R(\tau), \qquad \sigma_k = \sqrt{\frac{1}{|S_k|} \sum_{\tau \in S_k} \bigl(R(\tau) - \mu_k\bigr)^2}.$$

The SAN advantage for a trajectory $\tau$ in stratum $S_k$ is

$$A^{\mathrm{SAN}}(\tau) = \frac{R(\tau) - \mu_k}{\sigma_k + \epsilon},$$

where $\epsilon > 0$ ensures numerical stability. This local normalization yields conditionally unbiased, unit-variance estimates inside each stratum. Crucially, by restricting comparisons to homogeneous peers, SAN entirely removes the systematic offset (the cross-stratum bias term) present when using a global baseline with batch statistics $\mu$ and $\sigma$:

$$A^{\mathrm{global}}(\tau) = \frac{R(\tau) - \mu}{\sigma + \epsilon} = \frac{R(\tau) - \mu_k}{\sigma + \epsilon} + \underbrace{\frac{\mu_k - \mu}{\sigma + \epsilon}}_{\text{cross-stratum offset}},$$

where the second term is shared by every trajectory in stratum $S_k$ regardless of its quality; SAN eliminates it by normalizing with the stratum's own statistics.
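For concreteness, here is a minimal NumPy sketch of the SAN computation described above. The function name `stratified_advantages`, its array-based interface, and the toy inputs are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def stratified_advantages(rewards, search_counts, eps=1e-8):
    """Stratified Advantage Normalization (SAN) sketch: normalize each
    trajectory's reward against the mean/std of its own stratum, where a
    stratum groups trajectories with the same number of search calls."""
    rewards = np.asarray(rewards, dtype=np.float64)
    advantages = np.empty_like(rewards)

    # Group trajectory indices by stratum key (number of search calls).
    strata = defaultdict(list)
    for i, k in enumerate(search_counts):
        strata[k].append(i)

    # Normalize within each stratum only, so every trajectory is compared
    # to structurally homogeneous peers rather than to the whole batch.
    for idx in strata.values():
        r = rewards[idx]
        advantages[idx] = (r - r.mean()) / (r.std() + eps)
    return advantages

# Toy usage: six trajectories, two strata (0 vs. 1 search calls).
adv = stratified_advantages(
    rewards=[0.1, 0.3, 0.2, 0.9, 0.7, 0.8],
    search_counts=[0, 0, 0, 1, 1, 1],
)
print(adv)  # each group of three is centred on its own stratum mean
```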
3. Theoretical Properties
SAN offers strong theoretical guarantees:
- Unbiasedness and unit variance within strata: For any stratum $S_k$, $\mathbb{E}\bigl[A^{\mathrm{SAN}}(\tau) \mid \tau \in S_k\bigr] = 0$ and $\mathrm{Var}\bigl[A^{\mathrm{SAN}}(\tau) \mid \tau \in S_k\bigr] = 1$ (up to the $\epsilon$ correction).
- Global preservation: In the large-sample regime, aggregating over all strata recovers the global normalization statistics; both SAN and global normalization have mean zero and unit variance across the batch. However, the global baseline is susceptible to bias from stratum offsets and reward scale disparities, which SAN consistently eliminates.
- Bias elimination: The paper rigorously proves that cross-stratum bias vanishes inside each stratum, resulting in a cleaner, scale-stable learning signal (Zhu et al., 7 Oct 2025). A small numerical check of these properties is sketched after this list.
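As a quick sanity check of these properties, the snippet below uses synthetic rewards (not data from the paper) for two strata with different reward scales. Global normalization gives each stratum a nonzero mean advantage, the systematic cross-stratum offset, whereas per-stratum normalization restores zero mean and unit variance inside each stratum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic strata with different reward scales (illustrative only),
# e.g. trajectories with zero search calls vs. one search call.
r0 = rng.normal(0.2, 0.05, size=256)   # low-reward, low-variance stratum
r1 = rng.normal(0.7, 0.30, size=256)   # high-reward, high-variance stratum
batch = np.concatenate([r0, r1])

# Global normalization: a single baseline for the whole batch.
g = (batch - batch.mean()) / (batch.std() + 1e-8)
print(g[:256].mean(), g[256:].mean())   # opposite-sign offsets: cross-stratum bias

# Stratified normalization: each stratum uses its own statistics.
for r in (r0, r1):
    a = (r - r.mean()) / (r.std() + 1e-8)
    print(a.mean(), a.std())            # approximately 0 and 1 within each stratum
```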
4. Practical Implementation and Stability
There are practical considerations when implementing SAN, especially in finite-sample regimes. Small strata can exhibit noisy estimates of mean and variance due to limited sample size. To mitigate this, the approach uses a blended advantage

$$A^{\mathrm{blend}}(\tau) = \alpha\, A^{\mathrm{SAN}}(\tau) + (1 - \alpha)\, A^{\mathrm{global}}(\tau),$$

where $A^{\mathrm{global}}(\tau)$ is the globally normalized advantage and $\alpha \in [0, 1]$ controls the trade-off. With $\alpha$ near 1, the method retains the benefits of local normalization while borrowing stability from the global estimator when needed. This blending ensures robust policy updates even when stratum sizes are small.
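A minimal sketch of this blending, assuming a simple linear combination with a fixed coefficient (the default `alpha=0.9` below is an illustrative choice, not the paper's reported setting):

```python
import numpy as np

def blended_advantages(rewards, search_counts, alpha=0.9, eps=1e-8):
    """Blend stratified (SAN) and globally normalized advantages.
    alpha near 1 favors the per-stratum signal; the global term lends
    stability when some strata contain only a handful of trajectories."""
    rewards = np.asarray(rewards, dtype=np.float64)
    keys = np.asarray(search_counts)

    # Globally normalized advantage over the whole batch.
    a_global = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Stratified (SAN) advantage, normalized within each stratum.
    a_san = np.empty_like(rewards)
    for k in np.unique(keys):
        mask = keys == k
        r = rewards[mask]
        a_san[mask] = (r - r.mean()) / (r.std() + eps)

    return alpha * a_san + (1.0 - alpha) * a_global
```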
5. Empirical Evaluation and Performance
Extensive experiments evaluate Stratified GRPO across single-hop and multi-hop question-answering benchmarks (Natural Questions, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle) (Zhu et al., 7 Oct 2025). Results consistently show that Stratified GRPO:
- Outperforms standard GRPO by up to 11.3 points in mean training reward and policy effectiveness.
- Maintains a more stable reward curve throughout training, with discernible improvements in exploration efficiency.
- Especially excels in multi-hop tasks, achieving gains up to 14.5 points over baseline methods.
- Facilitates the discovery of effective search strategies, such as converging to an optimal average search call count, owing to the cleaner learning signal.
6. Broader Implications and Future Directions
Stratified GRPO generalizes to any scenario in reinforcement learning where structural trajectory heterogeneity induces distributional mismatch. The theoretical and empirical results indicate that explicit stratification—partitioning data into homogeneous subgroups and normalizing within each—constitutes a principled remedy for learning bias and credit assignment failure. The success of SAN suggests further investigations into stratification schemes based on more complex structure (e.g., modality, action space mixture) and adaptive partitioning. The blending strategy offers a blueprint for variance reduction in real-world RL implementations under limited data. A plausible implication is that stratification may also prove valuable in other domains, such as multi-agent systems and mixed discrete-continuous RL, where structural diversity is prevalent.
7. Summary
Stratified GRPO (as developed in (Zhu et al., 7 Oct 2025)) is a reinforcement learning framework for structurally heterogeneous agent trajectories. By partitioning trajectories into homogeneous strata—most often by search count—and normalizing rewards and advantages locally, it eliminates cross-stratum bias and achieves unbiased, unit-variance advantage estimates. Linear blending with the global estimator increases practical stability under finite-sample conditions. Empirical evaluations confirm substantial improvements over standard GRPO in both reward attainment and training stability. These results establish stratification as a foundational technique for robust credit assignment and exploration in RL for tool-augmented LLM agents, with broader application potential throughout reinforcement learning.