
Hi-GRPO: Hierarchical Group-Relative Policy Optimization

Updated 12 December 2025
  • Hi-GRPO is a reinforcement learning framework that integrates group-relative advantage estimation with hierarchical decomposition to achieve interpretable policy optimization.
  • It employs a two-stage content moderation pipeline and a multi-level soft-margin reward design to enhance classification accuracy and fine-grained credit assignment.
  • For long-horizon RL, Hi-GRPO uses a group-in-group strategy to balance macro and micro-level grouping, improving global coherence without requiring an auxiliary critic.

Hierarchical Group-Relative Policy Optimization (Hi-GRPO) refers to a class of reinforcement learning (RL) algorithms and moderation frameworks that combine group-based policy gradient optimization with hierarchical structural decomposition—either in the label space, the policy evaluation process, or both. Two principal settings for Hi-GRPO have been established: (1) multimodal content moderation with hierarchical labeling and rule-aligned reasoning (Li et al., 5 Aug 2025), and (2) long-horizon agent RL with nested episode- and step-level grouping for fine-grained credit assignment (Feng et al., 16 May 2025). Both share core design features: group-relative advantage estimation, hierarchical decomposition, critic-free optimization, and improved interpretability or credit assignment.

1. Hierarchical Labeling and Moderation Pipeline

In the context of multimodal content moderation, Hi-GRPO supports a two-stage hierarchical moderation workflow. The input is a multimodal note $N = (T, V)$, comprising both text and visual elements. The initial stage uses a lightweight binary classifier $f_1(N;\theta_1)$ to distinguish between "safe" and "risky" notes, optimizing the following supervised cross-entropy loss:

$$\mathcal{L}_{\mathrm{SFT}} = -\mathbb{E}_{(N,s)\sim\mathcal{D}} \left[\log P_{\theta_1}(s\mid N)\right].$$

Stage 1 is calibrated for high recall on risky content, rapidly excluding approximately 80% of safe notes.
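As a concrete illustration, the following is a minimal sketch of the Stage-1 objective, assuming a PyTorch-style binary classifier over fused note features; the classifier interface and variable names are illustrative, not from the cited work:

import torch
import torch.nn.functional as F

def stage1_sft_loss(classifier, note_features, safety_labels):
    """Supervised cross-entropy for the Stage-1 safe/risky classifier.

    classifier    : nn.Module mapping fused note features to 2 logits
                    (0 = safe, 1 = risky)  -- illustrative interface
    note_features : tensor of shape (B, d) with fused text/visual features
    safety_labels : tensor of shape (B,) with {0, 1} labels
    """
    logits = classifier(note_features)           # (B, 2)
    # L_SFT = -E[log P_theta1(s | N)]
    return F.cross_entropy(logits, safety_labels)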

Only notes labeled "risky" proceed to Stage 2, which employs a stronger multimodal LLM (e.g., Qwen2-VL-7B) to execute a hierarchical, path-based classification over an $L$-level taxonomy. At taxonomy level $l$ ($1 \leq l \leq L$), the classifier predicts a child category conditioned on the parent predicted at the previous level, supplied with the concatenated rule definitions for the current scope. The output specification is enforced as

<think>…CoT reasoning…</think><answer>full path or No Risk</answer>

This format guarantees interpretability and facilitates human review (Li et al., 5 Aug 2025).
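A minimal sketch of the level-by-level path prediction loop is shown below; the model interface, taxonomy structure, and rule-definition lookup are assumptions introduced for illustration, not APIs from the cited work:

def classify_path(mllm, note, taxonomy, rules, num_levels):
    """Walk the taxonomy top-down, predicting one child category per level.

    mllm     : callable taking a prompt string and the note, returning a
               category name (illustrative stand-in for the Stage-2 model)
    taxonomy : dict mapping a parent category to its list of child categories
    rules    : dict mapping a category to its rule definition text
    """
    path = []
    parent = "ROOT"
    for level in range(1, num_levels + 1):
        children = taxonomy[parent]
        # Concatenate the rule definitions for the categories in scope.
        scope_rules = "\n".join(f"{c}: {rules[c]}" for c in children)
        prompt = (
            f"Level-{level} candidates and rule definitions:\n{scope_rules}\n"
            "Output <think>…</think><answer>…</answer>."
        )
        child = mllm(prompt, note)       # predicted child at this level
        path.append(child)
        parent = child
    return path                          # e.g. the full path, root to leaf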

2. Multi-Level Soft-Margin Reward Design

Hi-GRPO integrates a structure-aware, multi-level soft-margin reward to reflect the granularity and semantic proximity of hierarchical misclassifications. For a predicted path $\hat P = (\hat y^{(1)}, \dots, \hat y^{(L)})$ with ground truth $P = (y^{(1)}, \dots, y^{(L)})$, the per-level reward at level $l$ is defined as:

$$R_{\mathrm{acc}}^{(l)} = \begin{cases} +1, & \hat y^{(l)} = y^{(l)} \\ -2^{\,l-1}, & \hat y^{(l)} \in \mathrm{sibling}(y^{(l)}) \\ 0, & \text{otherwise.} \end{cases}$$

Sibling-category errors are penalized with exponentially increasing severity at finer hierarchy levels (e.g., $-8$ at $l = 4$). The per-level rewards are averaged across levels,

$$R_{\mathrm{acc}} = \frac{1}{L} \sum_{l=1}^{L} R_{\mathrm{acc}}^{(l)},$$

and combined with a format reward (for correct output structure) to obtain the final composite reward:

$$R_{\mathrm{final}} = R_{\mathrm{acc}} + R_{\mathrm{format}}.$$

This reward shaping is instrumental in promoting both taxonomic fidelity and interpretable rationales (Li et al., 5 Aug 2025).
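The following is a small sketch of this reward under the definitions above; the sibling lookup, the format check, and the value of the format reward are illustrative assumptions:

def soft_margin_reward(pred_path, true_path, siblings, format_ok):
    """Multi-level soft-margin reward R_final = R_acc + R_format.

    pred_path, true_path : lists of category names, one per taxonomy level
    siblings             : dict mapping a category to the set of its siblings
    format_ok            : whether the <think>/<answer> structure was respected
    """
    per_level = []
    for l, (pred, true) in enumerate(zip(pred_path, true_path), start=1):
        if pred == true:
            per_level.append(1.0)
        elif pred in siblings[true]:
            per_level.append(-2.0 ** (l - 1))   # e.g. -8 at level 4
        else:
            per_level.append(0.0)
    r_acc = sum(per_level) / len(per_level)
    r_format = 1.0 if format_ok else 0.0        # format reward value is an assumption
    return r_acc + r_format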

3. Group-Relative Policy Optimization: Formulation and Training

Hi-GRPO employs group-relative advantage to optimize the classification policy, eliminating the need for an auxiliary critic. For each input $N$, $G$ candidate paths $\{\hat P_i\}_{i=1}^{G}$ are sampled; the corresponding rewards $R_i$ are computed and normalized within the group:

$$A_i = \frac{R_i - \mu_R}{\sigma_R}, \qquad \mu_R = \frac{1}{G}\sum_{j=1}^{G} R_j, \qquad \sigma_R = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (R_j - \mu_R)^2}.$$

The policy objective maximizes the expected advantage-weighted log-probability:

$$\mathcal{L}_{\mathrm{GRPO}} = -\,\mathbb{E}_{N\sim\mathcal{D}} \left[ \frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_{\theta_2}(\hat P_i \mid N) \right].$$

Optimization proceeds via AdamW over one or more epochs. The process is summarized below:

for epoch in {1, ..., E}:
  for each batch B of notes in D_stage2:
    ℓ ← 0
    for each note N in B:
      generate G candidate paths {P_i} ~ π_θ2(·|N)
      compute rewards R_i for each P_i
      compute group mean μ_R and std σ_R
      compute advantages A_i = (R_i - μ_R) / σ_R
      accumulate loss ℓ += -(1/G) · Σ_i A_i · log π_θ2(P_i|N)
    update θ2 ← θ2 - α · ∇_θ2 ℓ
Group normalization serves as a variance-reducing baseline, and no value function is learned (Li et al., 5 Aug 2025).
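As a concrete counterpart to the pseudocode above, the sketch below computes group-normalized advantages and the resulting loss for a single note, assuming the per-candidate sequence log-probabilities have already been obtained from the policy; the epsilon guard on the standard deviation is an assumption not specified in the papers:

import torch

def grpo_loss_for_note(rewards, log_probs, eps=1e-8):
    """Group-relative loss for one note.

    rewards   : tensor of shape (G,) with R_i for each sampled path
    log_probs : tensor of shape (G,) with log pi_theta2(P_i | N)
    """
    mu = rewards.mean()
    sigma = rewards.std(unbiased=False)
    advantages = (rewards - mu) / (sigma + eps)          # A_i
    # Advantages act as fixed weights, so they carry no gradient.
    loss = -(advantages.detach() * log_probs).mean()     # (1/G) sum_i A_i log pi
    return loss

# Example: G = 4 sampled paths for one note
rewards = torch.tensor([1.0, -2.0, 0.5, 1.0])
log_probs = torch.randn(4, requires_grad=True)
grpo_loss_for_note(rewards, log_probs).backward()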

4. Two-Level Grouping for Long-Horizon RL (GiGPO Setting)

Hi-GRPO extends to RL for long-horizon LLM agents via a two-level grouping mechanism, denoted "Group-in-Group Policy Optimization" (GiGPO) (Feng et al., 16 May 2025). Macro grouping is performed over $N$ parallel full-length trajectories $\tau_i = \{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})\}_{t=1}^{T}$ started from the same initial state, with trajectory-level return

$$R(\tau_i) = \sum_{t=1}^{T} r_t^{(i)}$$

and episode-wise normalized advantage

$$A^{E}(\tau_i) = \frac{R(\tau_i) - \mathrm{mean}(\{R(\tau_j)\}_{j=1}^{N})}{F_{\mathrm{norm}}(\{R(\tau_j)\}_{j=1}^{N})}.$$

Micro (step-level) grouping uses anchor states $\tilde s$ shared across trajectories, forming sets $G^{S}(\tilde s)$ of all action/discounted-return pairs observed at state $\tilde s$:

$$A^{S}(a_t^{(i)}) = \frac{R_t^{(i)} - \mathrm{mean}(\{R_t^{(j)}\})}{F_{\mathrm{norm}}(\{R_t^{(j)}\})}, \qquad R_t^{(i)} = \sum_{k=t}^{T} \gamma^{\,k-t} r_k^{(i)}.$$

The final advantage combines both levels:

$$A(a_t^{(i)}) = A^{E}(\tau_i) + \omega\, A^{S}(a_t^{(i)}), \qquad \omega \geq 0.$$

The update objective resembles PPO but uses these advantages and operates without a critic network. The procedure introduces negligible overhead (<0.2% per training iteration) and degrades gracefully to GRPO if anchor-state redundancy vanishes (Feng et al., 16 May 2025).
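Below is a minimal sketch of the two-level advantage computation under the definitions above, assuming trajectories are given as lists of (state_key, reward) steps; the hashable anchor-state keys, the choice of standard deviation as the normalizer, and the handling of singleton anchor groups are illustrative assumptions:

from collections import defaultdict
import numpy as np

def gigpo_advantages(trajectories, gamma=0.99, omega=1.0, eps=1e-8):
    """Two-level (episode + step) group-relative advantages.

    trajectories : list of N trajectories; each is a list of (state_key, reward)
                   tuples, where state_key identifies the anchor state
    Returns, per trajectory, a list of combined advantages A(a_t^{(i)}).
    """
    # Episode level: normalize total returns across the N parallel trajectories.
    returns = np.array([sum(r for _, r in traj) for traj in trajectories])
    a_episode = (returns - returns.mean()) / (returns.std() + eps)

    # Step level: discounted return-to-go at each step, grouped by anchor state.
    groups = defaultdict(list)              # state_key -> [(i, t, R_t^{(i)}), ...]
    for i, traj in enumerate(trajectories):
        rtg, acc = [], 0.0
        for _, reward in reversed(traj):
            acc = reward + gamma * acc
            rtg.append(acc)
        rtg.reverse()
        for t, (state_key, _) in enumerate(traj):
            groups[state_key].append((i, t, rtg[t]))

    # Combine: A = A^E + omega * A^S; A^S is left at zero when an anchor state
    # appears in only one trajectory (an assumption about the degenerate case).
    combined = [[a_episode[i]] * len(traj) for i, traj in enumerate(trajectories)]
    for members in groups.values():
        if len(members) < 2:
            continue
        vals = np.array([v for _, _, v in members])
        a_step = (vals - vals.mean()) / (vals.std() + eps)
        for (i, t, _), a in zip(members, a_step):
            combined[i][t] += omega * a
    return combined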

5. Policy Alignment via Rule-Based Prompting

To ensure output alignment with evolving platform-level moderation policies, Hi-GRPO incorporates full rule definitions for each taxonomy category (and its siblings) into the prompt at each hierarchical stage. This enables the model's chain-of-thought to reference the precise policy language at inference time. The prompt template structures reasoning and output format:

System: Given X, choose most appropriate path from taxonomy.
Category Taxonomy & Rule Definitions:
  Level-1: ... definition ...
  Level-2: ... definition ...
  ...
Instructions:
  • Output <think>…</think> <answer>…</answer>.
User: Content = {Image+Text}
This approach ensures explanations and predictions are robust to policy changes and transparent for human review (Li et al., 5 Aug 2025).

6. Empirical Evaluation and Properties

Empirical studies demonstrate significant advances in both content moderation and RL agent training. In moderation:

  • Hi-Guard achieves 84.11% classification accuracy on generalization sets—outperforming baseline SFT by +12.13 percentage points (Li et al., 5 Aug 2025).
  • Ablation studies show substantial incremental gains from hierarchical labeling, rule-based prompting, and soft-margin rewards, with cumulative GPU time reduced by 22.7%.
  • Human moderation preference for Hi-Guard’s chain-of-thought outputs reaches 73.3% versus 15.4% for RLVR.
  • Online deployment with 10% traffic yields 79.14% recall, 51.09% precision, and a 56.38% reduction in manual review rates; final human review required for 0.24% of content.

In long-horizon agent RL:

  • GiGPO (Hi-GRPO) delivers >12% success rate improvement on ALFWorld and >9% on WebShop benchmarks over GRPO, with markedly improved per-step credit assignment and low memory overhead (Feng et al., 16 May 2025).
  • Ablations establish that step-level grouping ($A^S$) is essential for complex tasks, while episode-level grouping ($A^E$) ensures global policy coherence.

Hi-GRPO thus generalizes group-based critic-free policy optimization to both hierarchical label spaces and temporally extended agent environments, establishing a paradigm for scalable, interpretable, and policy-aligned RL and content moderation (Li et al., 5 Aug 2025, Feng et al., 16 May 2025).
