
Hi-GRPO: Hierarchical Group-Relative Policy Optimization

Updated 12 December 2025
  • Hi-GRPO is a reinforcement learning framework that integrates group-relative advantage estimation with hierarchical decomposition to achieve interpretable policy optimization.
  • It employs a two-stage content moderation pipeline and a multi-level soft-margin reward design to enhance classification accuracy and fine-grained credit assignment.
  • For long-horizon RL, Hi-GRPO uses a group-in-group strategy to balance macro and micro-level grouping, improving global coherence without requiring an auxiliary critic.

Hierarchical Group-Relative Policy Optimization (Hi-GRPO) refers to a class of reinforcement learning (RL) algorithms and moderation frameworks that combine group-based policy gradient optimization with hierarchical structural decomposition—either in the label space, the policy evaluation process, or both. Two principal settings for Hi-GRPO have been established: (1) multimodal content moderation with hierarchical labeling and rule-aligned reasoning (Li et al., 5 Aug 2025), and (2) long-horizon agent RL with nested episode- and step-level grouping for fine-grained credit assignment (Feng et al., 16 May 2025). Both share core design features: group-relative advantage estimation, hierarchical decomposition, critic-free optimization, and improved interpretability or credit assignment.

1. Hierarchical Labeling and Moderation Pipeline

In the context of multimodal content moderation, Hi-GRPO supports a two-stage hierarchical moderation workflow. The input is a multimodal note $N = (T, V)$, comprising both text and visual elements. The initial stage uses a lightweight binary classifier $f_1(N;\theta_1)$ to distinguish between "safe" and "risky" notes, optimizing the following supervised cross-entropy loss:

$$\mathcal{L}_{\mathrm{SFT}} = -\mathbb{E}_{(N,s)\sim\mathcal{D}} \left[\log P_{\theta_1}(s\mid N)\right].$$

Stage 1 is calibrated for high recall on risky content, rapidly excluding approximately 80% of safe notes.
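As a concrete illustration, the following is a minimal sketch of the Stage-1 objective, assuming a PyTorch-style binary classifier over fused note features; the classifier interface and variable names are illustrative, not from the cited work:

import torch
import torch.nn.functional as F

def stage1_sft_loss(classifier, note_features, safety_labels):
    """Supervised cross-entropy for the Stage-1 safe/risky classifier.

    classifier    : nn.Module mapping fused note features to 2 logits
                    (0 = safe, 1 = risky)  -- illustrative interface
    note_features : tensor of shape (B, d) with fused text/visual features
    safety_labels : tensor of shape (B,) with {0, 1} labels
    """
    logits = classifier(note_features)           # (B, 2)
    # L_SFT = -E[log P_theta1(s | N)]
    return F.cross_entropy(logits, safety_labels)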

Only notes labeled "risky" proceed to Stage 2, which employs a stronger multimodal LLM (e.g., Qwen2-VL-7B) to execute a hierarchical, path-based classification over an $L$-level taxonomy. At taxonomy level $l$ ($1 \leq l \leq L$), the classifier predicts a child category conditioned on the parent predicted at the previous level, supplied with the concatenated rule definitions for the current scope. The output specification is enforced as

<think>…CoT reasoning…</think><answer>full path or No Risk</answer>

This format guarantees interpretability and facilitates human review (Li et al., 5 Aug 2025).
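A minimal sketch of the level-by-level path prediction loop is shown below; the model interface, taxonomy structure, and rule-definition lookup are assumptions introduced for illustration, not APIs from the cited work:

def classify_path(mllm, note, taxonomy, rules, num_levels):
    """Walk the taxonomy top-down, predicting one child category per level.

    mllm     : callable taking a prompt string and the note, returning a
               category name (illustrative stand-in for the Stage-2 model)
    taxonomy : dict mapping a parent category to its list of child categories
    rules    : dict mapping a category to its rule definition text
    """
    path = []
    parent = "ROOT"
    for level in range(1, num_levels + 1):
        children = taxonomy[parent]
        # Concatenate the rule definitions for the categories in scope.
        scope_rules = "\n".join(f"{c}: {rules[c]}" for c in children)
        prompt = (
            f"Level-{level} candidates and rule definitions:\n{scope_rules}\n"
            "Output <think>…</think><answer>…</answer>."
        )
        child = mllm(prompt, note)       # predicted child at this level
        path.append(child)
        parent = child
    return path                          # e.g. the full path, root to leaf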

2. Multi-Level Soft-Margin Reward Design

Hi-GRPO integrates a structure-aware, multi-level soft-margin reward to reflect the granularity and semantic proximity of hierarchical misclassifications. For a predicted path $\hat P = (\hat y^{(1)}, \dots, \hat y^{(L)})$ with ground truth $P = (y^{(1)}, \dots, y^{(L)})$, the per-level reward at level $l$ is defined as:

$$R_{\mathrm{acc}}^{(l)} = \begin{cases} +1, & \hat y^{(l)} = y^{(l)} \\ -2^{\,l-1}, & \hat y^{(l)} \in \mathrm{sibling}(y^{(l)}) \\ 0, & \text{otherwise.} \end{cases}$$

Sibling-category errors are penalized with exponentially increasing severity at finer hierarchy levels (e.g., $-8$ at $l = 4$). The per-level rewards are averaged across levels,

$$R_{\mathrm{acc}} = \frac{1}{L} \sum_{l=1}^{L} R_{\mathrm{acc}}^{(l)},$$

and combined with a format reward (for correct output structure) to obtain the final composite reward:

$$R_{\mathrm{final}} = R_{\mathrm{acc}} + R_{\mathrm{format}}.$$

This reward shaping is instrumental in promoting both taxonomic fidelity and interpretable rationales (Li et al., 5 Aug 2025).
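The following is a small sketch of this reward under the definitions above; the sibling lookup, the format check, and the value of the format reward are illustrative assumptions:

def soft_margin_reward(pred_path, true_path, siblings, format_ok):
    """Multi-level soft-margin reward R_final = R_acc + R_format.

    pred_path, true_path : lists of category names, one per taxonomy level
    siblings             : dict mapping a category to the set of its siblings
    format_ok            : whether the <think>/<answer> structure was respected
    """
    per_level = []
    for l, (pred, true) in enumerate(zip(pred_path, true_path), start=1):
        if pred == true:
            per_level.append(1.0)
        elif pred in siblings[true]:
            per_level.append(-2.0 ** (l - 1))   # e.g. -8 at level 4
        else:
            per_level.append(0.0)
    r_acc = sum(per_level) / len(per_level)
    r_format = 1.0 if format_ok else 0.0        # format reward value is an assumption
    return r_acc + r_format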

3. Group-Relative Policy Optimization: Formulation and Training

Hi-GRPO employs group-relative advantage to optimize the classification policy, eliminating the need for an auxiliary critic. For each input $N$, $G$ candidate paths $\{\hat P_i\}_{i=1}^{G}$ are sampled; the corresponding rewards $R_i$ are computed and normalized within the group:

$$A_i = \frac{R_i - \mu_R}{\sigma_R}, \qquad \mu_R = \frac{1}{G}\sum_{j=1}^{G} R_j, \qquad \sigma_R = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (R_j - \mu_R)^2}.$$

The policy objective maximizes the expected advantage-weighted log-probability:

$$\mathcal{L}_{\mathrm{GRPO}} = -\,\mathbb{E}_{N\sim\mathcal{D}} \left[ \frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_{\theta_2}(\hat P_i \mid N) \right].$$

Optimization proceeds via AdamW over one or more epochs. The process is summarized below:

for epoch in {1, ..., E}:
  for each batch B of notes in D_stage2:
    ℓ ← 0
    for each note N in B:
      generate G candidate paths {P_i} ~ π_θ2(·|N)
      compute rewards R_i for each P_i
      compute group mean μ_R and std σ_R
      compute advantages A_i = (R_i - μ_R) / σ_R
      accumulate loss ℓ += -(1/G) · Σ_i A_i · log π_θ2(P_i|N)
    update θ2 ← θ2 - α · ∇_θ2 ℓ
Group normalization serves as a variance-reducing baseline, and no value function is learned (Li et al., 5 Aug 2025).
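As a concrete counterpart to the pseudocode above, the sketch below computes group-normalized advantages and the resulting loss for a single note, assuming the per-candidate sequence log-probabilities have already been obtained from the policy; the epsilon guard on the standard deviation is an assumption not specified in the papers:

import torch

def grpo_loss_for_note(rewards, log_probs, eps=1e-8):
    """Group-relative loss for one note.

    rewards   : tensor of shape (G,) with R_i for each sampled path
    log_probs : tensor of shape (G,) with log pi_theta2(P_i | N)
    """
    mu = rewards.mean()
    sigma = rewards.std(unbiased=False)
    advantages = (rewards - mu) / (sigma + eps)          # A_i
    # Advantages act as fixed weights, so they carry no gradient.
    loss = -(advantages.detach() * log_probs).mean()     # (1/G) sum_i A_i log pi
    return loss

# Example: G = 4 sampled paths for one note
rewards = torch.tensor([1.0, -2.0, 0.5, 1.0])
log_probs = torch.randn(4, requires_grad=True)
grpo_loss_for_note(rewards, log_probs).backward()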

4. Two-Level Grouping for Long-Horizon RL (GiGPO Setting)

Hi-GRPO extends to RL for long-horizon LLM agents via a two-level grouping mechanism, denoted "Group-in-Group Policy Optimization" (GiGPO) (Feng et al., 16 May 2025). Macro grouping is performed over $N$ parallel full-length trajectories $\tau_i = \{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})\}_{t=1}^{T}$ started from the same initial state, with trajectory-level return

$$R(\tau_i) = \sum_{t=1}^{T} r_t^{(i)}$$

and episode-wise normalized advantage

$$A^{E}(\tau_i) = \frac{R(\tau_i) - \mathrm{mean}(\{R(\tau_j)\}_{j=1}^{N})}{F_{\mathrm{norm}}(\{R(\tau_j)\}_{j=1}^{N})}.$$

Micro (step-level) grouping uses anchor states $\tilde s$ shared across trajectories, forming sets $G^{S}(\tilde s)$ of all action/discounted-return pairs observed at state $\tilde s$:

$$A^{S}(a_t^{(i)}) = \frac{R_t^{(i)} - \mathrm{mean}(\{R_t^{(j)}\})}{F_{\mathrm{norm}}(\{R_t^{(j)}\})}, \qquad R_t^{(i)} = \sum_{k=t}^{T} \gamma^{\,k-t} r_k^{(i)}.$$

The final advantage combines both levels:

$$A(a_t^{(i)}) = A^{E}(\tau_i) + \omega\, A^{S}(a_t^{(i)}), \qquad \omega \geq 0.$$

The update objective resembles PPO but uses these advantages and operates without a critic network. The procedure introduces negligible overhead (<0.2% per training iteration) and degrades gracefully to GRPO if anchor-state redundancy vanishes (Feng et al., 16 May 2025).
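Below is a minimal sketch of the two-level advantage computation under the definitions above, assuming trajectories are given as lists of (state_key, reward) steps; the hashable anchor-state keys, the choice of standard deviation as the normalizer, and the handling of singleton anchor groups are illustrative assumptions:

from collections import defaultdict
import numpy as np

def gigpo_advantages(trajectories, gamma=0.99, omega=1.0, eps=1e-8):
    """Two-level (episode + step) group-relative advantages.

    trajectories : list of N trajectories; each is a list of (state_key, reward)
                   tuples, where state_key identifies the anchor state
    Returns, per trajectory, a list of combined advantages A(a_t^{(i)}).
    """
    # Episode level: normalize total returns across the N parallel trajectories.
    returns = np.array([sum(r for _, r in traj) for traj in trajectories])
    a_episode = (returns - returns.mean()) / (returns.std() + eps)

    # Step level: discounted return-to-go at each step, grouped by anchor state.
    groups = defaultdict(list)              # state_key -> [(i, t, R_t^{(i)}), ...]
    for i, traj in enumerate(trajectories):
        rtg, acc = [], 0.0
        for _, reward in reversed(traj):
            acc = reward + gamma * acc
            rtg.append(acc)
        rtg.reverse()
        for t, (state_key, _) in enumerate(traj):
            groups[state_key].append((i, t, rtg[t]))

    # Combine: A = A^E + omega * A^S; A^S is left at zero when an anchor state
    # appears in only one trajectory (an assumption about the degenerate case).
    combined = [[a_episode[i]] * len(traj) for i, traj in enumerate(trajectories)]
    for members in groups.values():
        if len(members) < 2:
            continue
        vals = np.array([v for _, _, v in members])
        a_step = (vals - vals.mean()) / (vals.std() + eps)
        for (i, t, _), a in zip(members, a_step):
            combined[i][t] += omega * a
    return combined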

5. Policy Alignment via Rule-Based Prompting

To ensure output alignment with evolving platform-level moderation policies, Hi-GRPO incorporates full rule definitions for each taxonomy category (and its siblings) into the prompt at each hierarchical stage. This enables the model's chain-of-thought to reference the precise policy language at inference time. The prompt template structures reasoning and output format:

System: Given X, choose most appropriate path from taxonomy.
Category Taxonomy & Rule Definitions:
  Level-1: ... definition ...
  Level-2: ... definition ...
  ...
Instructions:
  • Output <think>…</think> <answer>…</answer>.
User: Content = {Image+Text}
This approach ensures explanations and predictions are robust to policy changes and transparent for human review (Li et al., 5 Aug 2025).

6. Empirical Evaluation and Properties

Empirical studies demonstrate significant advances in both content moderation and RL agent training. In moderation:

  • Hi-Guard achieves 84.11% classification accuracy on generalization sets—outperforming baseline SFT by +12.13 percentage points (Li et al., 5 Aug 2025).
  • Ablation studies show substantial incremental gains from hierarchical labeling, rule-based prompting, and soft-margin rewards, with cumulative GPU time reduced by 22.7%.
  • Human moderation preference for Hi-Guard’s chain-of-thought outputs reaches 73.3% versus 15.4% for RLVR.
  • Online deployment with 10% traffic yields 79.14% recall, 51.09% precision, and a 56.38% reduction in manual review rates; final human review required for 0.24% of content.

In long-horizon agent RL:

  • GiGPO (Hi-GRPO) delivers >12% success rate improvement on ALFWorld and >9% on WebShop benchmarks over GRPO, with markedly improved per-step credit assignment and low memory overhead (Feng et al., 16 May 2025).
  • Ablations establish that step-level grouping ($A^S$) is essential for complex tasks, while episode-level grouping ($A^E$) ensures global policy coherence.

Hi-GRPO thus generalizes group-based critic-free policy optimization to both hierarchical label spaces and temporally extended agent environments, establishing a paradigm for scalable, interpretable, and policy-aligned RL and content moderation (Li et al., 5 Aug 2025, Feng et al., 16 May 2025).
