
GiGPO: Group-in-Group Policy Optimization

Updated 10 February 2026
  • GiGPO is a reinforcement learning algorithm that employs a two-level advantage estimation architecture to optimize credit assignment.
  • It leverages group-based statistics to compute both macro and micro relative advantages without relying on auxiliary critic models.
  • Empirical evaluations on benchmarks like ALFWorld and WebShop demonstrate significant improvements in success rates over GRPO.

Group-in-Group Policy Optimization (GiGPO) is a reinforcement learning (RL) algorithm designed to address the challenges of credit assignment in long-horizon LLM agent environments. GiGPO introduces a hierarchical mechanism for estimating advantages at both the episodic and step levels, leveraging group-based statistics to yield fine-grained credit signals without requiring critics or auxiliary models. The approach preserves the advantages of prior group-based RL methods—namely, critic-free design, low memory overhead, and stable convergence—while overcoming their limitations in environments with sparse and delayed rewards, as commonly encountered in complex LLM agent tasks (Feng et al., 16 May 2025).

1. Background and Motivation

Long-horizon agent environments, such as ALFWorld (up to 50 steps, 20K tokens per episode) and WebShop (complex web browsing, multi-turn, 1.1 M products), challenge traditional RL algorithms due to severe reward sparsity and delayed feedback—often only a single success or failure signal per episode. In these settings, standard group-based RL algorithms such as RLOO and GRPO compute trajectory-level returns within groups and estimate a single relative advantage per trajectory by normalizing each return against its peers. While effective and stable for single-turn tasks, these methods collapse all intermediate actions into a single scalar and fail to provide step-wise credit assignment, impeding optimization in settings where agent-environment interactions span many steps and optimal behaviors may hinge on individual action quality at specific junctures. GiGPO was developed to address this limitation by enabling fine-grained (per-step) credit signals without compromising group RL's critic-free, memory-efficient properties.

2. Two-Level Advantage Estimation Architecture

GiGPO operationalizes a hierarchical advantage estimation process encompassing two nested levels:

2.1 Episode-Level (Macro) Relative Advantage

At the highest level, GiGPO samples $N$ complete trajectories $\{\tau_i\}_{i=1}^N$ under the same task $x$ and identical initial state $s_1^{(1)} = \dots = s_1^{(N)}$. Each trajectory's return is defined as $R(\tau_i) = \sum_{t=1}^{T} r_t^{(i)}$ (or $R(\tau_i) \in \{0, 1\}$ in fully terminal-reward tasks). These are grouped:

$$G^E = \{ (\tau_i, R(\tau_i)) \}_{i=1}^N$$

The episode-level ("macro") relative advantage for trajectory $\tau_i$ is:

$$A^E(\tau_i) = \frac{R(\tau_i) - \mathrm{mean}\{R(\tau_j)\}}{F_{\mathrm{norm}}(\{R(\tau_j)\})}$$

where $F_{\mathrm{norm}}$ is typically the group's standard deviation, or can be set to $1$ for the unbiased Leave-One-Out variant.
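As a concrete illustration, the macro advantage depends only on the group's returns. The following Python sketch (the function name and the zero-variance guard are our own choices, not from the paper) implements both normalization variants:

```python
import statistics

def macro_advantages(returns, norm="std"):
    """Episode-level (macro) relative advantages for a group of N returns.

    norm="std" divides by the group's (population) standard deviation;
    norm="none" corresponds to F_norm = 1, the unbiased variant.
    """
    mean = sum(returns) / len(returns)
    if norm == "std":
        # Guard against zero-variance groups (all returns identical).
        denom = statistics.pstdev(returns) or 1.0
    else:
        denom = 1.0
    return [(r - mean) / denom for r in returns]
```

For a group of binary returns `[1.0, 0.0, 1.0, 0.0]`, successful trajectories receive a positive macro advantage and failed ones a symmetric negative advantage.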

2.2 Step-Level (Micro) Relative Advantage via Anchor-State Grouping

Within the same group, anchor-state grouping is performed by hashing all states encountered across $\{\tau_i\}$ to create a set $\mathcal{U} = \{\tilde s_1, \dots, \tilde s_U\}$ of unique anchor states. For each anchor state $\tilde s$, all occurrences $(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})$ with $s_t^{(i)} = \tilde s$ are collected:

$$G^S(\tilde s) = \{ (a_t^{(i)}, r_t^{(i)}) \mid s_t^{(i)} = \tilde s \}$$

Long-term effects are captured via discounted returns $R_t^{(i)} = \sum_{k=t}^{T} \gamma^{k-t} r_k^{(i)}$, yielding updated groups:

$$G^S(\tilde s) = \{ (a_t^{(i)}, R_t^{(i)}) \mid s_t^{(i)} = \tilde s \}$$

The micro relative advantage is:

$$A^S(a_t^{(i)}) = \frac{R_t^{(i)} - \mathrm{mean}\{R_t^{(j)}\}}{F_{\mathrm{norm}}(\{R_t^{(j)}\})}$$

where the mean and normalization are taken over all $(a_t^{(j)}, R_t^{(j)}) \in G^S(\tilde s)$.
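The discounted-return computation and anchor-state grouping above can be sketched in a few lines of Python. Representing trajectories as lists of `(state, action, reward)` tuples and grouping by Python's built-in `hash` are illustrative assumptions (the paper hashes full textual agent states):

```python
from collections import defaultdict
import statistics

def micro_advantages(trajectories, gamma=0.95, norm="none"):
    """Step-level (micro) relative advantages via anchor-state grouping.

    `trajectories` is a list of episodes, each a list of hashable
    (state, action, reward) tuples. Returns a dict mapping
    (trajectory_index, step_index) -> micro advantage.
    """
    # 1. Discounted return-to-go R_t for every step, computed backwards.
    rtg = {}
    for i, traj in enumerate(trajectories):
        running = 0.0
        for t in reversed(range(len(traj))):
            running = traj[t][2] + gamma * running
            rtg[(i, t)] = running
    # 2. Group step indices by anchor state (hash of the state).
    groups = defaultdict(list)
    for i, traj in enumerate(trajectories):
        for t, (state, _action, _reward) in enumerate(traj):
            groups[hash(state)].append((i, t))
    # 3. Normalize return-to-go within each anchor-state group.
    adv = {}
    for members in groups.values():
        vals = [rtg[m] for m in members]
        mean = sum(vals) / len(vals)
        if norm == "std":
            denom = statistics.pstdev(vals) or 1.0
        else:
            denom = 1.0  # F_norm = 1, unbiased variant
        for m in members:
            adv[m] = (rtg[m] - mean) / denom
    return adv
```

Note that a state visited only once forms a singleton group with micro advantage zero, which is what produces the graceful degradation to GRPO discussed in Section 6.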

2.3 Combined Per-Step Advantage

The final advantage used in the PPO-style policy update is:

$$A(a_t^{(i)}) = A^E(\tau_i) + \omega \, A^S(a_t^{(i)})$$

with $\omega \geq 0$ balancing global and local credit (typically $\omega = 1$).

3. Algorithmic Procedure and Complexity

The GiGPO workflow is as follows:

  1. Rollout: Sample $N$ trajectories on a shared task from an identical initial state using the current policy, accumulating experience tuples $(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})$.
  2. Episode-group statistics: Compute per-trajectory returns and groupwise macro relative advantages.
  3. Anchor-state grouping: Build a hash map over all encountered states for per-state grouping; compute discounted returns and micro relative advantages for actions from matched states.
  4. Policy update: For each $(i, t)$, form the advantage $A(a_t^{(i)})$ and optimize the policy using a clipped surrogate loss with a KL penalty.

All computations beyond trajectory storage run in $O(NT)$ time (hash grouping and advantage arithmetic), incurring negligible additional cost (anchor hashing $\approx 0.01$ s/iteration, step-advantage arithmetic $\approx 0.53$ s, together $< 0.2\%$ of iteration time). No auxiliary model forward passes are required, and the total rollout and memory footprint are identical to GRPO (Feng et al., 16 May 2025).
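Step 4 of the workflow can be sketched as a per-step scalar loss term. The function name, the non-batched form, and the simple log-ratio KL penalty against a reference policy are illustrative assumptions; the paper's exact objective may differ in these details:

```python
import math

def gigpo_step_loss(logp_new, logp_old, a_macro, a_micro,
                    omega=1.0, eps=0.1, beta=0.01, logp_ref=None):
    """Sketch of the per-step GiGPO objective (to be minimized).

    Combines macro and micro advantages, applies the PPO-style clipped
    surrogate on the importance ratio, and optionally adds a KL penalty
    against a frozen reference policy.
    """
    advantage = a_macro + omega * a_micro          # A = A^E + omega * A^S
    ratio = math.exp(logp_new - logp_old)          # importance ratio
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    kl = (logp_new - logp_ref) if logp_ref is not None else 0.0
    return -(surrogate - beta * kl)                # negate: loss, not reward
```

With unchanged log-probabilities the ratio is 1 and the loss reduces to the negated combined advantage; when the ratio drifts outside $[1 - \epsilon, 1 + \epsilon]$, clipping caps the incentive exactly as in PPO.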

4. Theoretical Properties

GiGPO maintains several properties critical for scalability in LLM agent applications:

  • Critic-free learning: No value-function or critic networks are trained; both macro and micro advantages are derived solely from group statistics.
  • Low memory overhead: Storage is restricted to $N$ trajectories of length $T$ and a hash map over encountered states (maximum size $NT$); identical to GRPO.
  • Stable convergence: The method preserves the PPO-style clipped objective. Fixed normalization ($F_{\mathrm{norm}} = 1$) is optionally employed to avoid "difficulty bias."
  • Algorithmic complexity: Rollout and memory are on par with GRPO, with negligible additional cost from anchor-state grouping and advantage computations.

5. Empirical Results and Benchmarking

GiGPO was evaluated on ALFWorld and WebShop agent benchmarks using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct LLMs. Shared hyperparameters included group size $N = 8$, KL-penalty coefficient $\beta = 0.01$, learning rate $10^{-6}$, $\omega = 1$, $\gamma = 0.95$, and clipping $\epsilon = 0.1$. Against baselines including prompting (Qwen, ReAct, Reflexion), PPO (actor-critic), and group-based critic-free methods, GiGPO outperformed GRPO in all settings:

| Model | Task | GRPO success % | GiGPO (unbiased) success % | Improvement |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | ALFWorld | 72.8 | 86.1 | +13.3 pp |
| Qwen2.5-1.5B-Instruct | WebShop | 75.8 | 83.5 | +7.7 pp (>9% relative) |
| Qwen2.5-7B-Instruct | ALFWorld | 77.6 | 90.2 | +12.6 pp |
| Qwen2.5-7B-Instruct | WebShop | 79.3 | 86.2 | +6.9 pp (>9% relative) |

Both the standard-deviation-normalized ($F_{\mathrm{norm}} = \mathrm{std}$) and unbiased ($F_{\mathrm{norm}} = 1$) variants of GiGPO consistently outperformed the baselines. Ablations demonstrated that removing either the episode-level or the step-level component leads to substantial performance degradation, indicating that both hierarchical signals are essential. The choice of $F_{\mathrm{norm}}$ has task-specific effects: normalizing by the standard deviation may over-amplify gradients in low-variance groups, while fixing $F_{\mathrm{norm}} = 1$ increases stability on harder tasks.

6. Limitations and Prospects for Extension

GiGPO currently relies on exact state matching for anchor-state grouping; thus, in high-noise or partially observable environments, repeated states may be rare, in which case GiGPO degenerates gracefully to vanilla GRPO—losing micro advantages but retaining baseline stability. Possible avenues for advancement include:

  • Approximate state-matching by leveraging learned state embeddings or graph-based isomorphism to group “similar” (rather than identical) contexts.
  • Integration of advanced single-turn group RL techniques (e.g., dynamic sampling, higher-order clipping; DAPO) to further enhance GiGPO.
  • Extension of anchor-grouping to multi-agent systems or hierarchical RL, where repeated subgoals are present.
  • Combining GiGPO with intrinsic exploration strategies to address more severe reward sparsity.

A plausible implication is that, by disentangling global and local credit assignment mechanisms in a scalable, critic-free manner, GiGPO provides a structured foundation for LLM agent RL optimization under challenging long-horizon and sparse-reward regimes (Feng et al., 16 May 2025).

References (1)
