Hierarchy-of-Groups Policy Optimization

Updated 4 July 2026
  • Hierarchy-of-Groups Policy Optimization is a critic-free, group-based RL method that organizes steps into nested, history-aware groups to address context inconsistency.
  • It employs a k-step context operator to form hierarchical groups that balance bias and variance in long-horizon credit assignment using adaptive weighted aggregation.
  • Empirical results on benchmarks such as ALFWorld and WebShop show practical improvements for HGPO without extra models or rollouts, indicating that the method is scalable and efficient.

Hierarchy-of-Groups Policy Optimization (HGPO) is a critic-free, group-based reinforcement learning method for long-horizon agentic tasks in which a large language model (LLM) interacts with an environment over many turns. It was proposed to address historical context inconsistency in stepwise relative-advantage estimation. In long-horizon settings, two steps can share the same current state while differing in the recent interaction history retained by the memory module, so state-only grouping can compare samples that are not actually conditioned on the same effective prompt. HGPO resolves this by assigning each step to multiple nested groups indexed by context depth, computing a distinct relative advantage in each group, and aggregating those advantages with adaptive weights, thereby targeting a favorable bias-variance trade-off without extra models or rollouts (He et al., 26 Feb 2026).

1. Long-horizon agentic RL and the motivation for hierarchical groups

HGPO is formulated for long-horizon interaction settings such as ALFWorld and WebShop, where an LLM agent $\pi_\theta$ receives a task example $\bm{x}$, observes states $\bm{s}_t \in \mathcal{S}$, generates textual actions $\bm{a}_t \in \mathcal{V}^n$, and accumulates sparse delayed reward over a trajectory

$$\tau = \{(\bm{s}_{1}, \bm{a}_{1}), \ldots, (\bm{s}_{T}, \bm{a}_{T})\}.$$

In a trajectory-wise formulation, the policy conditions on the full history,

$$\pi_{\theta}(\bm{a}_{t} \mid \bm{s}_{0:t}, \bm{x}),$$

but as the horizon grows this creates context explosion: prompt length grows with the number of turns, making RL training expensive and less scalable. Recent stepwise methods avoid that by using a memory module that retains only the most recent $K \ll T$ interactions, so the prompt length remains approximately bounded (He et al., 26 Feb 2026).
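The bounded-memory idea can be illustrated with a minimal Python sketch. The function names (`make_memory`, `build_prompt`, `rollout_prompts`) and the prompt layout are hypothetical and chosen only for illustration; they are not taken from the paper or from any agent framework.

```python
from collections import deque

def make_memory(k):
    """Bounded memory keeping only the k most recent (state, action) pairs."""
    return deque(maxlen=k)

def build_prompt(task, memory, state):
    """Assemble a step prompt from the task, the retained history, and the current state."""
    lines = [f"Task: {task}"]
    for past_state, past_action in memory:
        lines.append(f"Obs: {past_state} | Act: {past_action}")
    lines.append(f"Obs: {state}")
    return "\n".join(lines)

def rollout_prompts(task, states, actions, k):
    """Return the prompt seen at every step when only k past interactions are kept."""
    memory = make_memory(k)
    prompts = []
    for state, action in zip(states, actions):
        prompts.append(build_prompt(task, memory, state))
        memory.append((state, action))  # the oldest entry is dropped automatically
    return prompts
```

With `k` fixed, every prompt contains at most `k + 2` lines regardless of how long the episode runs, which is the sense in which prompt length stays approximately bounded.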

The central observation behind HGPO is that this memory mechanism changes what it means for two decision points to be comparable. In a stepwise setup, the effective conditioning context is not only the current observation $\bm{s}_t$, but also the recent historical interactions preserved in memory. Two steps with the same current state may therefore correspond to different effective prompts. HGPO treats this mismatch as a structural source of bias in stepwise group-based RL, rather than as a secondary implementation detail (He et al., 26 Feb 2026).

This positions HGPO within the broader trajectory of group-based RL for LLM agents. Standard group methods such as GRPO compare whole trajectories, while later stepwise methods compare steps that share a current state. HGPO keeps the critic-free, group-relative paradigm but makes grouping explicitly history-aware, which the paper presents as the key requirement for long-horizon agentic credit assignment (He et al., 26 Feb 2026).

2. Context inconsistency and the failure mode of naive stepwise grouping

The paper contrasts three comparison targets. At the coarsest level, a trajectory-level group $G_\tau$ yields the trajectory-relative advantage

$$A^{T}(\tau_i) = \left( R(\tau_i) - \frac{1}{|G_{\tau}|} \sum\nolimits_{j\in G_{\tau}} R(\tau_j) \right)/\sigma_{G_{\tau}},$$

where $\sigma_{G_{\tau}}$ is the standard deviation of rewards within $G_\tau$. This assigns the same advantage to every step of a trajectory and is therefore coarse for long-horizon credit assignment (He et al., 26 Feb 2026).
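As a small illustration of the trajectory-relative advantage, the following Python sketch centers and normalizes a list of trajectory returns. The function name, the use of the population standard deviation, and the stabilizing constant `eps` are assumptions made for this sketch, not details confirmed by the paper.

```python
from statistics import mean, pstdev

def trajectory_advantages(returns, eps=1e-8):
    """Group-relative advantage for each trajectory in one rollout group.

    returns: list of scalar trajectory returns R(tau_i).
    eps: hypothetical constant guarding against division by zero
         when all returns in the group are identical.
    """
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; the source does not specify which variant
    return [(r - mu) / (sigma + eps) for r in returns]
```

Every step of trajectory `i` would receive the same value `trajectory_advantages(returns)[i]`, which is why this estimator is coarse for long-horizon credit assignment.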

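A finer stepwise alternative groups together steps with the same current state. If $G_{S}(\bm{s})$ is the set of steps sharing state $\bm{s}$, the step-level relative advantage is

$$A^{S}(\bm{a}_t) = \left( R(\bm{a}_t) - \frac{1}{|G_{S}(\bm{s}_t)|} \sum\nolimits_{j \in G_{S}(\bm{s}_t)} R(\bm{a}_j) \right)/\sigma_{G_{S}(\bm{s}_t)},$$

where $R(\cdot)$ is the per-step return used for comparison and $\sigma_{G_{S}(\bm{s}_t)}$ is its standard deviation within the group.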

This improves granularity but introduces the failure mode HGPO is designed to correct: steps can share a current state while differing in retained history, so the comparison group is not prompt-consistent (He et al., 26 Feb 2026).
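State-only step grouping can be sketched in a few lines of Python. The data layout (a flat list of `(state, step_return)` pairs), the function name, and `eps` are illustrative assumptions; states are assumed to be hashable, for example strings.

```python
from collections import defaultdict
from statistics import mean, pstdev

def step_advantages(steps, eps=1e-8):
    """Step-level relative advantage using state-only groups.

    steps: list of (state, step_return) pairs pooled over all rollouts.
    Returns one advantage per step, in the same order as the input.
    """
    groups = defaultdict(list)
    for idx, (state, _) in enumerate(steps):
        groups[state].append(idx)  # group key is the current state only

    adv = [0.0] * len(steps)
    for members in groups.values():
        rets = [steps[i][1] for i in members]
        mu, sigma = mean(rets), pstdev(rets)
        for i in members:
            adv[i] = (steps[i][1] - mu) / (sigma + eps)
    return adv
```

Because the key ignores how each step was reached, two steps with the same state but different retained histories land in the same group, which is exactly the prompt-inconsistent comparison described above. Singleton groups yield an advantage of zero.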

The paper formalizes the ideal comparator as an Oracle group, whose members share both the same current state and identical historical context. Oracle groups produce the lowest-bias comparison, but they are empirically rare and small. Using only Oracle groups would therefore reduce bias at the cost of sharply increased variance. HGPO is built around this trade-off: ordinary state-only groups are large but biased, while Oracle-like groups are precise but sparse (He et al., 26 Feb 2026).

Empirically, the paper reports that both trajectory-level and state-only step-level estimators are biased relative to Oracle advantages, with trajectory-level bias substantially larger. This is the paper’s direct evidence that context inconsistency is not merely theoretical; it materially degrades policy optimization in long-horizon agents (He et al., 26 Feb 2026).

3. Formal construction of the hierarchy of groups

HGPO defines historical context through a $k$-step context operator. For a step with current state $\bm{s}_t$,

$$\mathcal{C}_k(\bm{s}_t) = (\bm{s}_{t-k}, \ldots, \bm{s}_{t-1}, \bm{s}_t),$$

with $0 \le k \le K$, where $K$ is the memory length. For $k = 0$, this reduces to the current state alone; for larger $k$, it includes increasingly long suffixes of the recent state history (He et al., 26 Feb 2026).

Using this operator, the $k$-th hierarchical group for step $\bm{s}^{(i)}_t$ (step $t$ of trajectory $\tau_i$) is

$$G^{H}_{k}(\bm{s}^{(i)}_t) = \left\{ (\bm{s}^{(j)}_n, \bm{a}^{(j)}_n) : (j, n) \in \mathcal{I}_k(\bm{s}^{(i)}_t) \right\},$$

with index set

$$\mathcal{I}_k(\bm{s}^{(i)}_t) = \left\{ (j, n) : \mathcal{C}_k(\bm{s}^{(j)}_n) = \mathcal{C}_k(\bm{s}^{(i)}_t) \right\}.$$

Thus, a level-$k$ group contains all steps whose current state and previous $k$ states exactly match those of the anchor step (He et al., 26 Feb 2026).

These groups form a nested chain,

$$G^{H}_{K} \subseteq G^{H}_{K-1} \subseteq \cdots \subseteq G^{H}_{0}.$$

The hierarchy is therefore monotone in context specificity: increasing $k$ makes groups more history-consistent and smaller. In the paper’s interpretation, $k = 0$ reproduces ordinary state-only step grouping, while $k = K$ approximates Oracle grouping under the retained memory budget (He et al., 26 Feb 2026).

A practical consequence is that HGPO uses exact equality of recent state sequences as its grouping rule. The implementation performs context-aware grouping offline via hashmap lookups over existing rollouts. The method does not rely on embedding similarity or approximate nearest-neighbor matching in its main formulation (He et al., 26 Feb 2026).
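The exact-match grouping by hash lookup can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the function names, the representation of a trajectory as a list of hashable states, and the choice to skip depths for which a step has too little history are all assumptions of this sketch.

```python
from collections import defaultdict

def context_key(states, t, k):
    """Tuple of the last k + 1 states ending at step t, or None if history is too short."""
    if t - k < 0:
        return None  # assumption: a step joins no group at depths it cannot fill
    return tuple(states[t - k : t + 1])

def hierarchical_groups(trajectories, max_depth):
    """Build one hash map per context depth k = 0..max_depth.

    trajectories: list of state lists, one list per rollout.
    Returns a list `levels` where levels[k] maps a context key to the
    list of (trajectory index, step index) pairs sharing that context.
    """
    levels = [defaultdict(list) for _ in range(max_depth + 1)]
    for i, states in enumerate(trajectories):
        for t in range(len(states)):
            for k in range(max_depth + 1):
                key = context_key(states, t, k)
                if key is not None:
                    levels[k][key].append((i, t))
    return levels
```

For three rollouts `["A","B","C"]`, `["A","B","C"]`, and `["D","B","C"]`, the final step of each shares a depth-0 and depth-1 group, but only the first two share a depth-2 group, so deeper groups are subsets of shallower ones.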

4. Hierarchical advantage aggregation and the optimization objective

At each hierarchy level $k$, HGPO computes a group-relative advantage using the same centering-and-normalization template as GRPO-style methods:

$$A^{H}_{k}(\bm{a}^{(i)}_t) = \left( R(\bm{a}^{(i)}_t) - \frac{1}{|G^{H}_{k}|} \sum\nolimits_{(j,n) \in G^{H}_{k}} R(\bm{a}^{(j)}_n) \right)/\sigma_{G^{H}_{k}}.$$

Here $R(\cdot)$ is the per-step return-like quantity used for comparison, and $\sigma_{G^{H}_{k}}$ is the standard deviation within the $k$-th hierarchical group $G^{H}_{k}$ of the step (He et al., 26 Feb 2026).

The final HGPO advantage is an adaptive weighted combination of these level-specific estimates:

$$A^{H}(\bm{a}^{(i)}_t) = \sum_{k=0}^{K} w_k \, A^{H}_{k}(\bm{a}^{(i)}_t),$$

with nonnegative weights $w_k$ that sum to one and are set by a deterministic rule controlled by a single weighting hyperparameter.

The implementation additionally omits groups with zero advantage in this aggregation. In the uniform-weight setting of that rule, the estimator averages uniformly across hierarchy levels; increasing the weighting hyperparameter shifts weight toward deeper, more context-consistent groups (He et al., 26 Feb 2026).
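The aggregation step can be sketched in Python under stated assumptions. The power-law rule in `level_weights`, the parameter name `alpha`, the tolerance `tol`, and the renormalization of weights after dropping zero-advantage levels are all hypothetical choices made for illustration; the source only establishes that the weights are deterministic, can be uniform, and can be shifted toward deeper levels.

```python
def level_weights(num_levels, alpha):
    """Hypothetical rule: weight of depth k proportional to (k + 1) ** alpha.

    alpha = 0 gives uniform weights; larger alpha favors deeper levels.
    """
    raw = [(k + 1) ** alpha for k in range(num_levels)]
    total = sum(raw)
    return [w / total for w in raw]

def aggregate(level_advantages, alpha, tol=1e-12):
    """Weighted combination of the per-level advantages of a single step.

    Levels whose advantage is numerically zero are skipped, and the
    remaining weights are renormalized (an assumption of this sketch).
    """
    weights = level_weights(len(level_advantages), alpha)
    kept = [(w, a) for w, a in zip(weights, level_advantages) if abs(a) > tol]
    if not kept:
        return 0.0
    norm = sum(w for w, _ in kept)
    return sum(w * a for w, a in kept) / norm
```

For example, `aggregate([0.6, 0.0, 1.2], alpha=0)` drops the middle level and averages the other two, while a larger `alpha` would pull the result toward the deepest level’s estimate.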

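This aggregated advantage enters a PPO/GRPO-style clipped objective with KL regularization:

$$\mathcal{J}_{\text{HGPO}}(\theta) = \mathbb{E}\left[ \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \min\left( \rho_\theta(\bm{a}^{(i)}_t)\, A^{H}(\bm{a}^{(i)}_t),\ \operatorname{clip}\left(\rho_\theta(\bm{a}^{(i)}_t), 1-\epsilon, 1+\epsilon\right) A^{H}(\bm{a}^{(i)}_t) \right) \right] - \beta\, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),$$

with importance ratio

$$\rho_\theta(\bm{a}^{(i)}_t) = \frac{\pi_\theta(\bm{a}^{(i)}_t \mid \bm{s}^{(i)}_t, \bm{x})}{\pi_{\theta_{\text{old}}}(\bm{a}^{(i)}_t \mid \bm{s}^{(i)}_t, \bm{x})},$$

where $N$ is the number of rollout trajectories, $\epsilon$ is the clipping range, and $\beta$ is the KL coefficient.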

The method therefore remains critic-free and uses the same rollouts as GRPO/GiGPO, changing only how stepwise advantages are constructed (He et al., 26 Feb 2026).
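The clipped surrogate that consumes these advantages follows the familiar PPO pattern, sketched below in plain Python. The function names and the default `clip_eps=0.2` are illustrative assumptions, the inputs are per-step log-probabilities rather than token-level ones, and the KL regularization term is omitted for brevity.

```python
import math

def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one step.

    logp_new / logp_old: log-probability of the action under the current
    and the rollout policy; their difference gives the importance ratio.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

def surrogate(logps_new, logps_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate over a batch of steps (to be maximized)."""
    terms = [
        clipped_term(n, o, a, clip_eps)
        for n, o, a in zip(logps_new, logps_old, advantages)
    ]
    return sum(terms) / len(terms)
```

Only the `advantages` argument differs between group-relative methods in this family; the surrogate itself is unchanged, which matches the statement that HGPO alters only how stepwise advantages are constructed.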

The algorithmic loop is correspondingly simple. Each iteration copies the current policy to $\pi_{\theta_{\text{old}}}$, samples a task and $N$ parallel environments, rolls out trajectories, performs context-aware hierarchical grouping by hash lookup, computes the level-wise advantages $A^{H}_{k}$ and then the aggregated advantage $A^{H}$, and finally updates the policy by maximizing the clipped objective. The paper reports no extra value model, no auxiliary estimator, and no extra rollouts (He et al., 26 Feb 2026).

5. Bias-variance trade-off, empirical behavior, and benchmark results

The paper’s main formal result is Proposition 1, which states a bias-variance trade-off for the hierarchical estimator. Let $B_k$ and $V_k$ denote the bias and variance of the level-$k$ advantage estimator. Under the assumed monotonicity conditions, in which bias does not increase and variance does not decrease as the context depth $k$ grows, the proposition bounds the bias and the variance of the aggregated estimator in terms of these level-wise quantities.

The result is not a convergence theorem; it is a trade-off justification showing that HGPO interpolates between the biased low-variance estimator at $k = 0$ and the lower-bias higher-variance estimator at large $k$ (He et al., 26 Feb 2026).
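To see why a weighted combination of level-wise estimators interpolates in this way, the following elementary bounds may help. They are a sketch under simplifying assumptions and are not a restatement of Proposition 1: the weights $w_k$ are treated as fixed, nonnegative, and summing to one; $\hat{A}_k$ denotes a generic level-$k$ estimator of a common target; $B_k = |\mathrm{Bias}(\hat{A}_k)|$ and $V_k = \mathrm{Var}(\hat{A}_k)$ are assumed to satisfy $B_0 \ge \cdots \ge B_K$ and $V_0 \le \cdots \le V_K$; and the omission of zero-advantage groups is ignored.

```latex
\begin{align*}
\left| \mathrm{Bias}\Big( \sum_{k=0}^{K} w_k \hat{A}_k \Big) \right|
  &= \left| \sum_{k=0}^{K} w_k \, \mathrm{Bias}(\hat{A}_k) \right|
   \le \sum_{k=0}^{K} w_k B_k
   \le B_0, \\[4pt]
\sqrt{ \mathrm{Var}\Big( \sum_{k=0}^{K} w_k \hat{A}_k \Big) }
  &\le \sum_{k=0}^{K} w_k \sqrt{V_k}
   \le \sqrt{V_K}.
\end{align*}
```

The first line uses linearity of expectation and the triangle inequality; the second uses the fact that the standard deviation satisfies the triangle inequality. Shifting weight toward large $k$ moves the bias bound $\sum_k w_k B_k$ toward $B_K$ while moving the variance bound toward $V_K$, which is the trade-off in miniature.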

The empirical evaluation uses ALFWorld and WebShop with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, under the same computational constraints as the baselines. For Qwen2.5-1.5B-Instruct, the paper reports that HGPO improves over GiGPO on both ALFWorld and WebShop in each of the two configurations it evaluates, and it summarizes the corresponding average improvements over GiGPO for each benchmark and configuration; the exact figures are given in the paper (He et al., 26 Feb 2026).

For Qwen2.5-7B-Instruct, gains are smaller but generally favorable, especially in success rate. The paper explicitly notes that HGPO helps smaller models more, attributing this to their longer, more redundant, and noisier rollouts, which intensify context-inconsistency bias (He et al., 26 Feb 2026).

The ablations are especially diagnostic. Removing the higher-level groups and keeping only the lowest level yields notable degradation, especially on ALFWorld, with reported drops in both in-distribution and out-of-distribution success. Uniform weighting can be competitive in the smallest setting tested, but becomes worse at larger ones, while an overly aggressive weighting can overemphasize small high-context groups and hurt performance. Adding the trajectory-level advantage $A^{T}$ mostly hurts, which the paper interprets as evidence that trajectory-level estimates are too biased to be useful. Oracle-only training also fails, reinforcing the claim that pure low-bias grouping is not enough because variance becomes too severe (He et al., 26 Feb 2026).

The practical overhead is reported as minimal. The paper specifies the group size and the number of rollout groups per rollout (and hence the total number of parallel environments), the maximum number of steps for ALFWorld and for WebShop, several default hyperparameter settings, the rollout and validation temperatures, and the prompt history length. It reports a small average runtime overhead over both GRPO and GiGPO, described as a minor fraction of total execution time, with only slight memory overhead from additional hashing lookups (He et al., 26 Feb 2026).

6. Relation to adjacent group-based methods, nomenclature, and limitations

HGPO sits within a rapidly growing family of group-based policy optimizers, but its defining move is specifically the use of historical-context-consistent hierarchical groups. GRPO provides trajectory-level grouping, and later theoretical work shows that the GRPO policy gradient is a second-order U-statistic, with a finite-sample error analysis and a universal group-size scaling law; that result is about flat prompt-local grouping rather than history-aware hierarchical step grouping (Zhou et al., 1 Mar 2026). GiGPO introduces a two-level structure for LLM agents—episode-level groups plus repeated-state step-level groups—but it groups steps by anchor state rather than by explicit historical-context depth (Feng et al., 16 May 2025). HGPO can therefore be understood as a history-aware generalization of stepwise group-relative advantage estimation (He et al., 26 Feb 2026).

Other neighboring methods use “groups” in different senses. HAPO is directly relevant to group-relative RL in sparse-reward RLVR, but it is explicitly a single-level group-relative optimization framework with conditional teacher injection, not a hierarchy-of-groups method (Wu et al., 11 Mar 2026). "Fibration Policy Optimization" develops a broader multi-scale stability-control framework—domain, prompt group, trajectory, token—through Fiber Bundle Gating and the Fibration Gating Hierarchy, which is closely related in spirit to hierarchical grouping but is framed algebraically rather than through historical-context-matched step groups (Li et al., 9 Mar 2026). By contrast, "Harmonized Group Policy Optimization" in graph recommendation uses the same acronym HGPO but denotes degree-based group-relative optimization with a cross-group variance penalty, not hierarchy-of-groups policy optimization (Luo et al., 18 May 2025).

The limitations of HGPO are immediate from its construction. It assumes exact matching of recent raw states; if an agent uses summarized memory rather than raw divisible history, straightforward hierarchical grouping becomes intractable. The paper therefore suggests future grouping based on embedding similarity of memory for summarized-memory agents. It also uses a simple deterministic weighting rule, leaving uncertainty-aware or learned weighting open. Finally, its theory is a bias-variance proposition rather than a full convergence analysis, and some notation, especially the precise semantics of per-step return symbols, remains less fully formalized than the grouping mechanism itself (He et al., 26 Feb 2026).

In that sense, HGPO is best regarded as a concrete answer to a specific pathology of long-horizon agentic RL: group-relative comparisons must be aligned with the actual conditioning context of the policy. Its hierarchy is not an architectural hierarchy of managers and subpolicies, but a hierarchy of comparability classes indexed by shared historical context. That choice of hierarchy is the method’s central technical contribution (He et al., 26 Feb 2026).
