Implicit Hierarchical GRPO

Updated 18 June 2026
  • Implicit Hierarchical GRPO is a framework for reinforcement learning that decouples tool invocation from execution, enhancing hierarchical reasoning and credit assignment.
  • It employs a two-level policy architecture that unifies high-level decision making with low-level token prediction through explicit and implicit policy formulations.
  • Empirical results demonstrate improved mathematical reasoning and agentic task performance, with reduced inference interruptions and enhanced generalization.

Implicit Hierarchical Group-Relative Policy Optimization (IH-GRPO) is a framework for reinforcement learning (RL) in LLMs that enables hierarchical reasoning and improved credit assignment during tool-integrated tasks such as mathematical problem solving and long-horizon agentic tasks. Central to IH-GRPO is the decoupling of tool invocation from actual execution, a delayed-execution formalism that enhances reasoning coherence and expressivity by allowing models to control when contextually generated code is executed. IH-GRPO leverages hierarchical control to unify high-level decision-making (whether to continue reasoning or invoke tool execution) with low-level token prediction, while supporting bias-robust RL updates through an implicit loss that matches explicit hierarchical structures.

1. Problem Formulation: Decoupling Tool Invocation from Execution

Traditional LLM tool integrations execute code immediately upon generation, which can disrupt the logical structure of multi-step reasoning. IH-GRPO formalizes a trajectory as a sequence that interleaves text tokens $t_{li}$, code tokens $c_{li}$, and sparse execution signals $e_l$. Rather than executing each code block immediately, execution is deferred until the model emits a special “tool-execute” token. The delayed-execution trajectory for an input $q$ is

$$\tilde\tau = (t_{11}, c_{11}, \ldots, t_{1s}, c_{1s}, e_1, o_1, \ldots, t_{n1})$$

where $e_l$ batches all unexecuted code since the last execution trigger, runs it, and appends the observed result $o_l$ to the context. The policy $\pi_\theta$ induces a conditional likelihood as:
\begin{equation} P_{\pi_\theta}(\tilde\tau\mid q) = \prod_{l=1}^{n} \left[\prod_{i=1}^{s_l} \pi_\theta(t_{li} \mid q, \mathcal H_{<t_{li}})\, \pi_\theta(c_{li} \mid q, t_{li}, \mathcal H_{<c_{li}}) \right]\, \pi_\theta(e_l\mid q, \mathcal H_{<e_l}), \label{eq:delayed_traj} \end{equation}
making execution orchestration an explicit component of the learned policy (Wang et al., 18 May 2026).
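As a hedged illustration of the delayed-execution idea, the Python sketch below buffers code segments and runs them only when an execute signal arrives. The `DelayedExecutor` class, the `<execute>` marker, and the `toy_sandbox` helper are hypothetical names chosen for this example and are not taken from Wang et al.; a real system would replace `exec` with an isolated sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EXECUTE = "<execute>"  # hypothetical marker standing in for the tool-execute token


@dataclass
class DelayedExecutor:
    """Buffers code segments and runs them only when an execute signal arrives."""
    run_code: Callable[[str], str]  # assumed sandbox interface: source -> observation
    pending: List[str] = field(default_factory=list)
    context: List[Tuple[str, str]] = field(default_factory=list)

    def feed(self, kind: str, payload: str = "") -> None:
        if kind == "text":
            self.context.append(("text", payload))
        elif kind == "code":
            # Code enters the context but is NOT executed yet.
            self.context.append(("code", payload))
            self.pending.append(payload)
        elif kind == EXECUTE:
            # Batch all code generated since the last trigger and run it once.
            batch = "\n".join(self.pending)
            self.pending.clear()
            observation = self.run_code(batch)
            self.context.append(("execute", ""))
            self.context.append(("observation", observation))
        else:
            raise ValueError(f"unknown segment kind: {kind}")


def toy_sandbox(source: str) -> str:
    """Stand-in executor: runs the batch and reports the value bound to `result`."""
    scope: dict = {}
    exec(source, scope)  # illustration only; not safe for untrusted code
    return str(scope.get("result"))


executor = DelayedExecutor(run_code=toy_sandbox)
executor.feed("text", "Let x be 3.")
executor.feed("code", "x = 3")
executor.feed("text", "Square it and add one.")
executor.feed("code", "result = x * x + 1")
executor.feed(EXECUTE)  # both code segments run together here
kinds = [kind for kind, _ in executor.context]
print(kinds, executor.context[-1])
```

In this toy run the two code segments are separated by text yet executed as one batch, so the second segment can use the variable defined by the first without an intermediate interruption.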

2. Hierarchical Policy Structure

IH-GRPO introduces an explicit two-level policy, consisting of:

  • High-level policy $\pi_{\rm high}(a_{\rm high}\mid S)$: chooses between “continue” ($C$) and “execute” ($E$) actions, parameterized by $\theta_0$ with

$$\pi_{\rm high}(C\mid S) = \sigma(\theta_0), \qquad \pi_{\rm high}(E\mid S) = 1-\sigma(\theta_0),$$

where $\sigma$ denotes the sigmoid function.

  • Low-level policy $\pi_{\rm low}(i\mid S)$: governs autoregressive token emission for either text or code, with vocabulary size $V$,

$$\pi_{\rm low}(i\mid S) = \frac{\exp(\theta_i)}{\sum_{u=1}^{V}\exp(\theta_u)}, \quad i\in\{1,\dots,V\}.$$

The overall explicit joint probability is given by:
\begin{equation} \pi_E(i\mid S) = \begin{cases} \sigma(\theta_0)\,\frac{\exp(\theta_i)}{\sum_{u=1}^{V}\exp(\theta_u)}, & i\in\{1,\dots,V\}\ (\text{continue})\\ 1-\sigma(\theta_0), & i=0\ (\text{execute}) \end{cases} \label{eq:explicit_joint} \end{equation}
To avoid architectural modifications and maintain single-stream autoregression, IH-GRPO parameterizes an “implicit” policy as:
\begin{equation} \pi_I(i\mid S) = \frac{\exp(\beta_i)}{\sum_{s=0}^{V}\exp(\beta_s)} \quad (i=0,\ldots,V) \label{eq:implicit_joint} \end{equation}
and seeks to match its behavior to the explicit hierarchical policy.
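The following NumPy sketch evaluates both parameterizations and checks that a flat softmax can represent the same distribution as the explicit two-level policy. The function names are illustrative. The particular choice $\beta_i = \theta_i$ for $i \ge 1$ and $\beta_0 = \log\sum_u \exp(\theta_u) - \theta_0$ is one matching assignment worked out for this example, not a construction stated in the source. It shows only that the two forms can coincide at a given point, and says nothing about whether they stay matched under gradient updates.

```python
import numpy as np


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))


def softmax(z: np.ndarray) -> np.ndarray:
    shifted = z - np.max(z)  # shift for numerical stability
    e = np.exp(shifted)
    return e / e.sum()


def explicit_joint(theta0: float, theta: np.ndarray) -> np.ndarray:
    """pi_E over indices 0..V: index 0 is execute, indices 1..V are tokens."""
    p_continue = sigmoid(theta0)
    return np.concatenate(([1.0 - p_continue], p_continue * softmax(theta)))


def implicit_joint(beta: np.ndarray) -> np.ndarray:
    """pi_I: a single flat softmax over V + 1 logits."""
    return softmax(beta)


def matching_implicit_logits(theta0: float, theta: np.ndarray) -> np.ndarray:
    """One assignment of beta for which pi_I equals pi_E (assumption of this sketch)."""
    log_sum_exp = np.log(np.sum(np.exp(theta - theta.max()))) + theta.max()
    return np.concatenate(([log_sum_exp - theta0], theta))


theta0 = 0.3
theta = np.array([1.0, -0.5, 2.0])
pi_E = explicit_joint(theta0, theta)
pi_I = implicit_joint(matching_implicit_logits(theta0, theta))
print(pi_E, pi_I)
```

With this assignment the softmax denominator factors as $(1 + e^{-\theta_0})\sum_u e^{\theta_u}$, which is why index 0 receives probability $1-\sigma(\theta_0)$ and each token index receives $\sigma(\theta_0)$ times its low-level probability.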

3. Surrogate Loss and Implicit Hierarchy

To achieve dynamic equivalence between the implicit and explicit policies, IH-GRPO introduces a surrogate token-wise loss $L_I'$ that incorporates a correction term based on the softmax-normalized parameters:
\begin{align} L_I'(\beta_i) =\;& -A\Bigl[\beta_i-\log\sum_{s=0}^{V} e^{\beta_s}\Bigr] -A\,\mathrm{sg}(\gamma)\,\log Z +\bigl(A\log Z - f\cdot\beta_0\bigr)\,\mathbb{I}\{i\ge 1\}, \label{eq:surrogate_loss} \end{align}
with the auxiliary quantities $Z$, $\gamma$, and $f$ defined as in Wang et al. The loss ensures that a single update step yields the same post-update distribution $\pi_I$ as a policy-gradient update to $\pi_E$, preserving the effectiveness of hierarchical control without needing to design a multi-head LLM or two-module architecture (Wang et al., 18 May 2026).

4. Algorithmic Structure and Implementation

IH-GRPO is embedded within a Group-Relative Policy Optimization (GRPO) loop. The key steps are:

  1. Generate batches of rollouts, interleaving text/code tokens and execute signals.
  2. Calculate rewards using external verifiers.
  3. Compute group-relative advantages and filter ineffective turns.
  4. Update policy parameters by maximizing an objective that combines the GRPO clipped-advantage loss with the implicit surrogate correction, scaled by a surrogate weight.
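The advantage and clipping steps of the loop above can be sketched as follows. This is the standard GRPO-style computation simplified to one scalar log-probability per rollout. The function names, the `eps` constant, and the clip range of 0.2 are assumptions of this example, and the implicit surrogate term and the turn filter are omitted because their exact forms are not given here.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within one group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def clipped_objective(logp_new, logp_old, advantages, clip: float = 0.2) -> float:
    """PPO-style clipped surrogate, averaged over the rollouts in the group."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    # Taking the minimum gives a pessimistic bound on the improvement.
    return float(np.minimum(unclipped, clipped).mean())


# Four rollouts for one prompt, scored 1 (verified correct) or 0 by a verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
on_policy = clipped_objective([-1.0] * 4, [-1.0] * 4, adv)
print(adv, on_policy)
```

A group whose rollouts all receive the same reward yields zero advantages under this normalization, so it contributes no gradient signal.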

Details:

  • No architectural changes: only an extra loss term is introduced.
  • PPO-style optimization with AdamW, batch size 16, minibatch size 128; the AdamW coefficients, learning rate, and surrogate weight follow Wang et al. (18 May 2026).
  • Open-source backbones: Qwen3-1.7B, 4B, and 8B.
  • Up to 5 interaction turns, 8K output tokens, 16K prompt tokens.
  • Training performed on 8×H20 GPUs with vLLM for autoregressive sampling (Wang et al., 18 May 2026).

5. Hierarchical Grouping in Long-Horizon Tasks

A related but distinct use of implicit hierarchy in GRPO targets long-horizon agentic tasks, where “Hierarchy-of-Groups Policy Optimization” (HGPO) constructs nested groups per step, indexed by historical context length $k$. For each rollout trajectory and a fixed memory length, a $k$-step context operator defines the grouping: the level-$k$ group clusters all step indices whose $k$-step contexts match. The relative advantages computed within each group (Eq. 4 in He et al.) are aggregated with adaptive weights to obtain low-bias, low-variance credit for updates (He et al., 26 Feb 2026).
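A minimal sketch of grouping by historical context is given below. It assumes that the k-step context of a step is the current state together with up to k preceding states, and it uses fixed user-supplied level weights in place of the adaptive weights, whose form is not given here. All function names are hypothetical and the code is not taken from He et al.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Sequence, Tuple

Step = Tuple[int, int]  # (trajectory index, step index)


def context_key(states: Sequence[Hashable], t: int, k: int) -> Tuple[Hashable, ...]:
    """Assumed k-step context: the current state plus up to k preceding states."""
    return tuple(states[max(0, t - k): t + 1])


def hierarchical_advantages(
    trajectories: List[Sequence[Hashable]],
    returns: List[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-8,
) -> Dict[Step, float]:
    """Weighted sum over levels k of group-normalized advantages."""
    advantages: Dict[Step, float] = defaultdict(float)
    for k, weight in enumerate(weights):
        # Level-k groups: steps across all rollouts that share a k-step context.
        groups: Dict[Tuple[Hashable, ...], List[Step]] = defaultdict(list)
        for i, states in enumerate(trajectories):
            for t in range(len(states)):
                groups[context_key(states, t, k)].append((i, t))
        for members in groups.values():
            values = [returns[i][t] for i, t in members]
            mu, sigma = mean(values), pstdev(values)
            for i, t in members:
                advantages[(i, t)] += weight * (returns[i][t] - mu) / (sigma + eps)
    return dict(advantages)


trajectories = [["a", "b"], ["a", "c"], ["d", "b"]]
returns = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
adv = hierarchical_advantages(trajectories, returns, weights=[0.5, 0.5])
print(adv)
```

In this toy data the level-1 groups for the second step are all singletons and contribute zero advantage, which illustrates how groups at higher k shrink as the required shared history grows.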

6. Empirical Evaluation and Benchmarks

IH-GRPO demonstrates marked improvements on out-of-domain mathematical reasoning and general benchmarks. On six math sets (AIME 24/25, MATH500, AMC 23, HMMT Feb 25, Olympiad), with average@8 (or @32) evaluation:

  • Absolute gains over the strongest baseline (SimpleTIR C/D, Dr.GRPO D, DAPO D, EH-GRPO D):
    • Qwen3-1.7B: +1.87 (44.77 → 46.64)
    • Qwen3-4B: +2.16 (61.60 → 63.76)
    • Qwen3-8B: +2.53 (63.75 → 66.28)
  • On zero-tool general tasks (MMLU-Pro, LogiQA, Date Understanding, Formal Fallacies, Logical Deduction), Qwen3-8B rises from 62.9% to 79.4%.
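For readers unfamiliar with the metric, the sketch below assumes that average@k means the mean accuracy over k sampled responses per problem, averaged across problems. That is a common reading, but the exact definition is not spelled out here. The helper names are illustrative, and the second helper simply recomputes the absolute gains from the before and after scores listed above.

```python
from typing import Sequence


def average_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Percentage accuracy: per-problem fraction of correct samples, then mean."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


def absolute_gain(baseline: float, method: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(method - baseline, 2)


# Two problems with k = 2 samples each: one half right, one fully right.
toy_score = average_at_k([[True, False], [True, True]])

# Reported averages for the three backbones (strongest baseline, IH-GRPO).
reported = {
    "Qwen3-1.7B": (44.77, 46.64),
    "Qwen3-4B": (61.60, 63.76),
    "Qwen3-8B": (63.75, 66.28),
}
gains = {name: absolute_gain(base, ours) for name, (base, ours) in reported.items()}
print(toy_score, gains)
```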

For hierarchical group-based variants in agentic tasks (ALFWorld, WebShop), gains of 2–4% absolute were observed over GRPO, RLOO, PPO, and GiGPO, with better generalization and less performance degradation out of distribution. Ablations further show that removing hierarchical grouping degrades ALFWorld performance (He et al., 26 Feb 2026).

7. Ablations, Insights, and Limitations

Empirical ablations show:

  • Decoupled tool execution leads to a sharp reduction in inference interruptions (4.98% → 0.68%) and more diverse tool-usage patterns.
  • IH-GRPO is robust to hyperparameter variation within the tested range.
  • Tool-token and prompt format are immaterial to performance.
  • Increasing turn and data budgets yields monotonic improvement.
  • Compared to SimpleTIR, IH-GRPO attains higher accuracy with fewer invalid tool runs.

Limitations and open problems include:

  • High-k hierarchical groups in agentic tasks can be small, introducing variance.
  • Fixed adaptive weighting schemes may not fully capture the true bias/variance trade-off; weights learned online and extensions to summarized-memory agents are plausible avenues.
  • Hierarchical grouping assumes raw context tokens are accessible; alternative strategies are needed for summarized or embedding-based memories (Wang et al., 18 May 2026, He et al., 26 Feb 2026).

IH-GRPO provides a lightweight, policy-compatible approach to hierarchical tool execution and credit assignment. Through explicit decoupling of invocation and execution, and implicit hierarchical surrogate losses, it achieves strong, robust improvements on mathematical and agentic LLM tasks without incurring the complexity of multi-controller or external planner systems.
