Hierarchy-of-Groups Policy Optimization

Updated 4 July 2026

Hierarchy-of-Groups Policy Optimization is a critic-free, group-based RL method that organizes steps into nested, history-aware groups to address context inconsistency.
It employs a k-step context operator to form hierarchical groups that balance bias and variance in long-horizon credit assignment using adaptive weighted aggregation.
Empirical results on platforms like ALFWorld and WebShop show HGPO’s practical improvements without extra models or rollouts, highlighting its scalability and efficiency.

Hierarchy-of-Groups Policy Optimization (HGPO) is a critic-free, group-based reinforcement learning method for long-horizon agentic tasks in which an LLM interacts with an environment over many turns. It was proposed to address historical context inconsistency in stepwise relative-advantage estimation: in long-horizon settings, two steps can share the same current state while differing in the recent interaction history retained by the memory module, so state-only grouping can compare samples that are not actually conditioned on the same effective prompt. HGPO resolves this by assigning each step to multiple nested groups indexed by context depth, computing a distinct relative advantage in each group, and aggregating those advantages with adaptive weights, thereby targeting a favorable bias-variance trade-off without extra models or rollouts (He et al., 26 Feb 2026).

1. Long-horizon agentic RL and the motivation for hierarchical groups

HGPO is formulated for long-horizon interaction settings such as ALFWorld and WebShop, where an LLM agent $\pi_\theta$ receives a task example $\bm{x}$, observes states $\bm{s}_t \in \mathcal{S}$, generates textual actions $\bm{a}_t \in \mathcal{V}^n$, and accumulates sparse delayed reward over a trajectory

$\tau = \{(\bm{s}_{1}, \bm{a}_{1}), \ldots, (\bm{s}_{T}, \bm{a}_{T})\}.$

In a trajectory-wise formulation, the policy conditions on the full history,

$\pi_{\theta}(\bm{a}_{t}|\bm{s}_{0:t}, \bm{x}),$

but as the horizon grows this creates context explosion: prompt length grows with the number of turns, making RL training expensive and less scalable. Recent stepwise methods avoid that by using a memory module that retains only the most recent $K \ll T$ interactions, so the prompt length remains approximately bounded (He et al., 26 Feb 2026).
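A bounded memory of this kind can be sketched with a fixed-length queue. The snippet is a minimal illustration rather than the paper's implementation; the value of `K` and the `(observation, action)` tuples are hypothetical.

```python
from collections import deque

K = 2  # memory length (hypothetical value chosen for illustration)
memory = deque(maxlen=K)  # the oldest turn is dropped once K turns are stored

# Hypothetical interaction turns: (observation, action)
for turn in [("obs1", "act1"), ("obs2", "act2"), ("obs3", "act3")]:
    memory.append(turn)

# Only the K most recent turns would be placed into the next prompt.
prompt_history = list(memory)
print(prompt_history)
```

Because the queue never holds more than `K` turns, the history portion of the prompt stays bounded no matter how long the episode runs.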

The central observation behind HGPO is that this memory mechanism changes what it means for two decision points to be comparable. In a stepwise setup, the effective conditioning context is not only the current observation $\bm{s}_t$, but also the recent historical interactions preserved in memory. Two steps with the same current state may therefore correspond to different effective prompts. HGPO treats this mismatch as a structural source of bias in stepwise group-based RL, rather than as a secondary implementation detail (He et al., 26 Feb 2026).

This positions HGPO within the broader trajectory of group-based RL for LLM agents. Standard group methods such as GRPO compare whole trajectories, while later stepwise methods compare steps that share a current state. HGPO keeps the critic-free, group-relative paradigm but makes grouping explicitly history-aware, which the paper presents as the key requirement for long-horizon agentic credit assignment (He et al., 26 Feb 2026).

2. Context inconsistency and the failure mode of naive stepwise grouping

The paper contrasts three comparison targets. At the coarsest level, a trajectory-level group $G_\tau$ yields the trajectory-relative advantage

$A^{T}(\tau_i) = \left( R({\tau_i}) - \frac{1}{|G_{\tau}|} \sum\nolimits_{j\in G_{\tau}} R({\tau_j}) \right)/\sigma_{G_{\tau}},$

where $\sigma_{G_{\tau}}$ is the standard deviation of rewards within $G_{\tau}$. This assigns the same advantage to every step of a trajectory and is therefore coarse for long-horizon credit assignment (He et al., 26 Feb 2026).
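The trajectory-relative advantage can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the population standard deviation and the small `eps` stabilizer are choices made here, not conventions confirmed by the paper, and the returns are made-up values.

```python
from statistics import mean, pstdev

def trajectory_advantages(returns, eps=1e-8):
    """Center and scale trajectory returns within one group."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in returns]

# Four hypothetical rollouts of the same task: two succeed, two fail.
advs = trajectory_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)
```

Every step of a given trajectory would inherit that trajectory's single value, which is why the signal is coarse for long episodes.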

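A finer stepwise alternative groups together steps with the same current state. If $G_{S}(\tilde{\bm{s}})$ is the set of steps sharing state $\tilde{\bm{s}}$, the step-level relative advantage is

$A^{S}(\bm{a}_t^{(i)}) = \left( R_t^{(i)} - \frac{1}{|G_{S}(\tilde{\bm{s}})|} \sum\nolimits_{(j,n)\in G_{S}(\tilde{\bm{s}})} R_n^{(j)} \right)/\sigma_{G_{S}(\tilde{\bm{s}})},$

where $R_t^{(i)}$ is the return-like quantity attached to step $t$ of trajectory $i$, $\tilde{\bm{s}} = \bm{s}_t^{(i)}$, and $\sigma_{G_{S}(\tilde{\bm{s}})}$ is the standard deviation of that quantity within the group.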

This improves granularity but introduces the failure mode HGPO is designed to correct: steps can share a current state while differing in retained history, so the comparison group is not prompt-consistent (He et al., 26 Feb 2026).
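The failure mode is easy to reproduce on toy data. In the sketch below, all step records are hypothetical; grouping by current state alone places three steps in one group even though they carry two different retained histories.

```python
from collections import defaultdict

# Hypothetical steps: (trajectory id, step index, retained history, current state)
steps = [
    (0, 2, ("kitchen", "open fridge"), "holding apple"),
    (1, 3, ("garden", "pick apple"), "holding apple"),
    (2, 2, ("kitchen", "open fridge"), "holding apple"),
]

# State-only grouping: the retained history is ignored when forming the key.
by_state = defaultdict(list)
for traj_id, t, history, state in steps:
    by_state[state].append((traj_id, t, history))

group = by_state["holding apple"]
distinct_histories = {history for _, _, history in group}
print(len(group), len(distinct_histories))
```

A group whose members disagree on history compares actions that were sampled under different effective prompts, which is the inconsistency HGPO targets.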

The paper formalizes the ideal comparator as an Oracle group, whose members share both the same current state and identical historical context. Oracle groups produce the lowest-bias comparison, but they are empirically rare and small. Using only Oracle groups would therefore reduce bias at the cost of sharply increased variance. HGPO is built around this trade-off: ordinary state-only groups are large but biased, while Oracle-like groups are precise but sparse (He et al., 26 Feb 2026).

Empirically, the paper reports that both trajectory-level and state-only step-level estimators are biased relative to Oracle advantages, with trajectory-level bias substantially larger. This is the paper’s direct evidence that context inconsistency is not merely theoretical; it materially degrades policy optimization in long-horizon agents (He et al., 26 Feb 2026).

3. Formal construction of the hierarchy of groups

HGPO defines historical context through a $k$-step context operator. For step $t$ of trajectory $i$,

$\mathcal{C}_{k}(\bm{s}_t^{(i)}) = (\bm{s}_{t-k}^{(i)}, \ldots, \bm{s}_{t-1}^{(i)}, \bm{s}_t^{(i)}),$

with $0 \le k \le K$, where $K$ is the memory length. For $k = 0$, this reduces to the current state alone; for larger $k$, it includes increasingly long suffixes of the recent state history (He et al., 26 Feb 2026).

Using this operator, the $k$-th hierarchical group for step $\bm{s}_t^{(i)}$ is

$G^{H}_{k}(\bm{s}_t^{(i)}) = \{ (j, n) \in \mathcal{I} : \mathcal{C}_{k}(\bm{s}_n^{(j)}) = \mathcal{C}_{k}(\bm{s}_t^{(i)}) \},$

with index set

$\mathcal{I} = \{ (j, n) : 1 \le j \le N,\ 1 \le n \le T \},$

where $N$ is the number of rollouts. Thus, a level-$k$ group contains all steps whose current state and previous $k$ states exactly match those of the anchor step (He et al., 26 Feb 2026).

These groups form a nested chain,

$G^{H}_{K}(\bm{s}_t^{(i)}) \subseteq G^{H}_{K-1}(\bm{s}_t^{(i)}) \subseteq \cdots \subseteq G^{H}_{0}(\bm{s}_t^{(i)}).$

The hierarchy is therefore monotone in context specificity: increasing $k$ makes groups more history-consistent and smaller. In the paper’s interpretation, $k = 0$ reproduces ordinary state-only step grouping, while $k = K$ approximates Oracle grouping under the retained memory budget (He et al., 26 Feb 2026).

A practical consequence is that HGPO uses exact equality of recent state sequences as its grouping rule. The implementation performs context-aware grouping offline via hashmap lookups over existing rollouts. The method does not rely on embedding similarity or approximate nearest-neighbor matching in its main formulation (He et al., 26 Feb 2026).
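Exact-match grouping by hash lookup can be sketched as follows. The function names, the toy trajectories, and the handling of early steps (the context is simply truncated when fewer than `k` predecessors exist) are assumptions of this illustration, not details taken from the paper.

```python
from collections import defaultdict

def context_key(states, t, k):
    """Current state plus up to k preceding states, as a hashable tuple."""
    return tuple(states[max(0, t - k): t + 1])

def hierarchical_groups(trajectories, K):
    """Return groups[k][key] -> list of (trajectory index, step index)."""
    groups = [defaultdict(list) for _ in range(K + 1)]
    for i, states in enumerate(trajectories):
        for t in range(len(states)):
            for k in range(K + 1):
                groups[k][context_key(states, t, k)].append((i, t))
    return groups

# Three hypothetical rollouts that all end in state "c" via different histories.
trajs = [["a", "b", "c"], ["a", "d", "c"], ["x", "b", "c"]]
groups = hierarchical_groups(trajs, K=2)

g0 = groups[0][("c",)]           # same current state only
g1 = groups[1][("b", "c")]       # same current state and one predecessor
g2 = groups[2][("a", "b", "c")]  # same current state and two predecessors
print(g0, g1, g2)
```

The three lookups shrink from three members to two to one, showing how deeper context levels trade group size for history consistency.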

4. Hierarchical advantage aggregation and the optimization objective

At each hierarchy level $k$, HGPO computes a group-relative advantage using the same centering-and-normalization template as GRPO-style methods: $A^{H}_{k}(\bm{a}_t^{(i)}) = \left( R_t^{(i)} - \frac{1}{|G^{H}_{k}|} \sum\nolimits_{(j,n)\in G^{H}_{k}} R_n^{(j)} \right)/\sigma_{G^{H}_{k}}.$ Here $R_t^{(i)}$ is the per-step return-like quantity used for comparison, $G^{H}_{k}$ is the $k$-th hierarchical group of the step, and $\sigma_{G^{H}_{k}}$ is the standard deviation within the $k$-th hierarchical group (He et al., 26 Feb 2026).

The final HGPO advantage is an adaptive weighted combination of the level-specific estimates $A^{H}_{k}$: $A^{H}(\bm{a}_t^{(i)}) = \sum_{k=0}^{K} w_k A^{H}_{k}(\bm{a}_t^{(i)}),$ with

$w_k = \frac{(k+1)^{\alpha}}{\sum_{k'=0}^{K} (k'+1)^{\alpha}}, \quad \alpha \ge 0.$

The implementation additionally omits groups with zero advantage in this aggregation. When $\alpha = 0$, the estimator averages uniformly across hierarchy levels; increasing $\alpha$ shifts weight toward deeper, more context-consistent groups (He et al., 26 Feb 2026).
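The effect of the weighting exponent can be seen in a small sketch. It assumes weights proportional to `(k + 1) ** alpha`, a form chosen here because it is uniform at `alpha = 0` and favors deeper levels as `alpha` grows; the exact rule, and the renormalization over the remaining levels after zero-advantage levels are dropped, should be checked against the paper.

```python
def aggregate(level_advs, alpha=1.0):
    """Combine per-level advantages A_0..A_K of one step into a single value."""
    # Drop levels whose advantage is exactly zero (e.g. singleton groups).
    kept = [(k, a) for k, a in enumerate(level_advs) if a != 0.0]
    if not kept:
        return 0.0
    # Assumed weight form: proportional to (k + 1) ** alpha, renormalized.
    weights = [(k + 1) ** alpha for k, _ in kept]
    total = sum(weights)
    return sum(w * a for w, (_, a) in zip(weights, kept)) / total

level_advs = [0.3, 0.6, 0.9]               # hypothetical values for k = 0, 1, 2
uniform = aggregate(level_advs, alpha=0.0)  # plain mean of the three levels
deep = aggregate(level_advs, alpha=2.0)     # pulled toward the k = 2 value
print(uniform, deep)
```

With these numbers the `alpha = 2.0` result lies closer to the deepest level's value than the uniform average does.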

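This aggregated advantage enters a PPO/GRPO-style clipped objective with KL regularization: $\mathcal{J}_{\mathrm{HGPO}}(\theta) = \mathbb{E}\left[ \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \min\left( \rho_{\theta}(\bm{a}_t^{(i)}) A^{H}(\bm{a}_t^{(i)}),\ \mathrm{clip}\left(\rho_{\theta}(\bm{a}_t^{(i)}), 1-\epsilon, 1+\epsilon\right) A^{H}(\bm{a}_t^{(i)}) \right) \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right),$ with importance ratio

$\rho_{\theta}(\bm{a}_t^{(i)}) = \frac{\pi_{\theta}(\bm{a}_t^{(i)} | \bm{s}_t^{(i)}, \bm{x})}{\pi_{\theta_{\mathrm{old}}}(\bm{a}_t^{(i)} | \bm{s}_t^{(i)}, \bm{x})}.$

Here $\epsilon$ is the clipping range, $\beta$ is the KL coefficient, and $\pi_{\mathrm{ref}}$ is the reference policy.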

The method therefore remains critic-free and uses the same rollouts as GRPO/GiGPO, changing only how stepwise advantages are constructed (He et al., 26 Feb 2026).
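The clipped surrogate for a single step can be written in a few lines. This sketch covers only the clipped term and leaves out the KL regularizer; the function name and the clipping range `eps=0.2` are illustrative assumptions.

```python
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for one action, given log-probabilities."""
    ratio = math.exp(logp_new - logp_old)            # importance ratio
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip to [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)

# Ratio e > 1 + eps: a positive advantage is capped, a negative one is not.
capped = clipped_term(1.0, 0.0, advantage=1.0)
uncapped = clipped_term(1.0, 0.0, advantage=-1.0)
print(capped, uncapped)
```

Only the advantage fed into this term differs between HGPO and other group-relative methods; the surrogate itself is the standard one.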

The algorithmic loop is correspondingly simple. Each iteration copies the current policy to $\pi_{\theta_{\mathrm{old}}}$, samples a task and $N$ parallel environments, rolls out trajectories, performs context-aware hierarchical grouping by hash lookup, computes $A^{H}_{k}$ and then $A^{H}$, and finally updates the policy by maximizing $\mathcal{J}_{\mathrm{HGPO}}(\theta)$. The paper reports no extra value model, no auxiliary estimator, and no extra rollouts (He et al., 26 Feb 2026).
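The advantage-construction part of this loop (grouping, per-level normalization, aggregation) can be sketched end to end on already-collected rollouts. Everything here is illustrative: the function name, the truncated context for early steps, the population standard deviation, and the assumed `(k + 1) ** alpha` weighting are not confirmed details of the paper's implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hgpo_advantages(trajectories, returns, K, alpha=1.0, eps=1e-8):
    """trajectories[i][t]: hashable state; returns[i][t]: per-step return.
    Returns adv[i][t], one aggregated advantage per step."""
    # 1) Hash every step into one group per context depth k = 0..K.
    groups = [defaultdict(list) for _ in range(K + 1)]
    for i, states in enumerate(trajectories):
        for t in range(len(states)):
            for k in range(K + 1):
                key = tuple(states[max(0, t - k): t + 1])
                groups[k][key].append((i, t))
    # 2) Center and scale returns inside each group.
    level_adv = defaultdict(dict)  # (i, t) -> {k: advantage}
    for k in range(K + 1):
        for members in groups[k].values():
            vals = [returns[i][t] for i, t in members]
            mu, sigma = mean(vals), pstdev(vals)
            for (i, t), v in zip(members, vals):
                level_adv[(i, t)][k] = (v - mu) / (sigma + eps)
    # 3) Combine levels per step, skipping zero-advantage levels.
    adv = [[0.0] * len(states) for states in trajectories]
    for (i, t), per_level in level_adv.items():
        kept = {k: a for k, a in per_level.items() if a != 0.0}
        if kept:
            z = sum((k + 1) ** alpha for k in kept)
            adv[i][t] = sum((k + 1) ** alpha * a for k, a in kept.items()) / z
    return adv

# Hypothetical rollouts and per-step returns.
trajs = [["a", "b", "c"], ["a", "d", "c"], ["x", "b", "c"]]
rets = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
adv = hgpo_advantages(trajs, rets, K=2)
print(adv[0][2], adv[1][2])
```

No value network or extra rollout appears anywhere in the sketch; it only re-reads the collected trajectories, which is consistent with the low overhead described above.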

5. Bias-variance trade-off, empirical behavior, and benchmark results

The paper’s main formal result is Proposition 1, which states a bias-variance trade-off for the hierarchical estimator. Let $B_k = \mathrm{Bias}[A^{H}_{k}]$ and $V_k = \mathrm{Var}[A^{H}_{k}]$. Under the assumed monotonicity conditions

$B_0 \ge B_1 \ge \cdots \ge B_K \ge 0$

and

$V_0 \le V_1 \le \cdots \le V_K,$

the aggregated estimator satisfies

$B_K \le \mathrm{Bias}[A^{H}] \le B_0,$

together with

$\mathrm{Var}[A^{H}] \le V_K.$

The result is not a convergence theorem; it is a trade-off justification showing that HGPO interpolates between the biased low-variance estimator at $k = 0$ and the lower-bias higher-variance estimator at large $k$ (He et al., 26 Feb 2026).
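The interpolation claim can be checked with a short derivation. It is a sketch under assumptions made here, not the paper's proof: the aggregated estimator is taken to be a convex combination $A^{H} = \sum_{k} w_k A^{H}_{k}$ with fixed weights $w_k \ge 0$, $\sum_k w_k = 1$ (the zero-advantage omission used in practice would make the weights data-dependent), $A^{\star}$ denotes the Oracle advantage, and $B_k$, $V_k$ denote the bias and variance of the level-$k$ estimator.

```latex
\begin{align*}
\mathrm{Bias}[A^{H}]
  &= \mathbb{E}\Big[\sum_{k=0}^{K} w_k A^{H}_{k}\Big] - A^{\star}
   = \sum_{k=0}^{K} w_k \big(\mathbb{E}[A^{H}_{k}] - A^{\star}\big)
   = \sum_{k=0}^{K} w_k B_k, \\
\min_{k} B_k
  &\le \sum_{k=0}^{K} w_k B_k \le \max_{k} B_k, \\
\sqrt{\mathrm{Var}[A^{H}]}
  &\le \sum_{k=0}^{K} w_k \sqrt{V_k} \le \max_{k} \sqrt{V_k}.
\end{align*}
```

The first line uses linearity of expectation and $\sum_k w_k = 1$; the last uses the triangle inequality for standard deviations. Neither step needs the levels to be independent, so the aggregated bias and variance never exceed those of the worst individual level.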

The empirical evaluation uses ALFWorld and WebShop with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, under the same computational constraints as the baselines. For Qwen2.5-1.5B-Instruct, the paper reports that HGPO improves over GiGPO on both ALFWorld and WebShop in each of the two configurations it compares, and it summarizes these gains as average improvements over GiGPO per benchmark and configuration (He et al., 26 Feb 2026).

For Qwen2.5-7B-Instruct, gains are smaller but generally favorable, especially in success rate. The paper explicitly notes that HGPO helps smaller models more, attributing this to their longer, more redundant, and noisier rollouts, which intensify context-inconsistency bias (He et al., 26 Feb 2026).

The ablations are especially diagnostic. Removing the higher-level groups and keeping only the lowest level yields notable degradation, especially on ALFWorld, where the paper reports drops in both in-distribution and out-of-distribution success. Uniform weighting across levels can be competitive in the smaller setting the paper tests but becomes worse at larger ones, while weighting deeper levels too heavily can overemphasize small high-context groups and hurt performance. Adding the trajectory-level advantage $A^{T}$ mostly hurts, which the paper interprets as evidence that trajectory-level estimates are too biased to be useful. Oracle-only training also fails, reinforcing the claim that pure low-bias grouping is not enough because variance becomes too severe (He et al., 26 Feb 2026).

The practical overhead is reported as minimal. The experimental setup fixes a group size and a number of rollout groups per rollout, whose product gives the total number of parallel environments, along with separate maximum step counts for ALFWorld and WebShop, rollout and validation temperatures, and a prompt history length. The paper reports a small average runtime overhead over both GRPO and GiGPO, described as a minor fraction of total execution time, with only slight memory overhead from additional hashing lookups (He et al., 26 Feb 2026).

6. Relation to adjacent group-based methods, nomenclature, and limitations

HGPO sits within a rapidly growing family of group-based policy optimizers, but its defining move is specifically the use of historical-context-consistent hierarchical groups. GRPO provides trajectory-level grouping, and later theoretical work shows that the GRPO policy gradient is a second-order U-statistic, with a finite-sample error analysis and a universal group-size scaling law; that result is about flat prompt-local grouping rather than history-aware hierarchical step grouping (Zhou et al., 1 Mar 2026). GiGPO introduces a two-level structure for LLM agents—episode-level groups plus repeated-state step-level groups—but it groups steps by anchor state rather than by explicit historical-context depth (Feng et al., 16 May 2025). HGPO can therefore be understood as a history-aware generalization of stepwise group-relative advantage estimation (He et al., 26 Feb 2026).

Other neighboring methods use “groups” in different senses. HAPO is directly relevant to group-relative RL in sparse-reward RLVR, but it is explicitly a single-level group-relative optimization framework with conditional teacher injection, not a hierarchy-of-groups method (Wu et al., 11 Mar 2026). "Fibration Policy Optimization" develops a broader multi-scale stability-control framework—domain, prompt group, trajectory, token—through Fiber Bundle Gating and the Fibration Gating Hierarchy, which is closely related in spirit to hierarchical grouping but is framed algebraically rather than through historical-context-matched step groups (Li et al., 9 Mar 2026). By contrast, "Harmonized Group Policy Optimization" in graph recommendation uses the same acronym HGPO but denotes degree-based group-relative optimization with a cross-group variance penalty, not hierarchy-of-groups policy optimization (Luo et al., 18 May 2025).

The limitations of HGPO are immediate from its construction. It assumes exact matching of recent raw states; if an agent uses summarized memory rather than raw divisible history, straightforward hierarchical grouping becomes intractable. The paper therefore suggests future grouping based on embedding similarity of memory for summarized-memory agents. It also uses a simple deterministic weighting rule over hierarchy levels, leaving uncertainty-aware or learned weighting open. Finally, its theory is a bias-variance proposition rather than a full convergence analysis, and some notation, especially the precise semantics of per-step return symbols, remains less fully formalized than the grouping mechanism itself (He et al., 26 Feb 2026).

In that sense, HGPO is best regarded as a concrete answer to a specific pathology of long-horizon agentic RL: group-relative comparisons must be aligned with the actual conditioning context of the policy. Its hierarchy is not an architectural hierarchy of managers and subpolicies, but a hierarchy of comparability classes indexed by shared historical context. That choice of hierarchy is the method’s central technical contribution (He et al., 26 Feb 2026).