Hi-GRPO: Hierarchical Group-Relative Policy Optimization
- Hi-GRPO is a reinforcement learning framework that integrates group-relative advantage estimation with hierarchical decomposition to achieve interpretable policy optimization.
- It employs a two-stage content moderation pipeline and a multi-level soft-margin reward design to enhance classification accuracy and fine-grained credit assignment.
- For long-horizon RL, Hi-GRPO uses a group-in-group strategy to balance macro and micro-level grouping, improving global coherence without requiring an auxiliary critic.
Hierarchical Group-Relative Policy Optimization (Hi-GRPO) refers to a class of reinforcement learning (RL) algorithms and moderation frameworks that combine group-based policy gradient optimization with hierarchical structural decomposition—either in the label space, the policy evaluation process, or both. Two principal settings for Hi-GRPO have been established: (1) multimodal content moderation with hierarchical labeling and rule-aligned reasoning (Li et al., 5 Aug 2025), and (2) long-horizon agent RL with nested episode- and step-level grouping for fine-grained credit assignment (Feng et al., 16 May 2025). Both share core design features: group-relative advantage estimation, hierarchical decomposition, critic-free optimization, and improved interpretability or credit assignment.
1. Hierarchical Labeling and Moderation Pipeline
In the context of multimodal content moderation, Hi-GRPO supports a two-stage hierarchical moderation workflow. The input is a multimodal note $N$, comprising both text and visual elements. The initial stage uses a lightweight binary classifier to distinguish between "safe" and "risky" notes, optimizing the supervised cross-entropy loss

$$\mathcal{L}_{1} = -\big[\,y \log p_{\theta_1}(\text{risky} \mid N) + (1-y)\log\big(1 - p_{\theta_1}(\text{risky} \mid N)\big)\big],$$

where $y \in \{0,1\}$ is the ground-truth risk label. Stage 1 is calibrated for high recall on risky content, rapidly excluding approximately 80% of safe notes.
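As a concrete illustration of the Stage-1 objective, the following is a minimal PyTorch sketch; the two-logit head, the class re-weighting used to favor recall, and the names (`stage1_loss`) are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of the Stage-1 safe/risky objective (illustrative names only).
import torch
import torch.nn.functional as F

def stage1_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over safe (0) / risky (1) labels.

    logits: (batch, 2) scores from a lightweight multimodal classifier.
    labels: (batch,) integer labels, 1 = risky.
    """
    return F.cross_entropy(logits, labels)

# Recall on the risky class can be favored by re-weighting the classes, e.g.
# F.cross_entropy(logits, labels, weight=torch.tensor([1.0, 4.0]))
# (the weight values here are hypothetical).
```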
Only notes labeled "risky" proceed to Stage 2, which employs a stronger multimodal LLM (e.g., Qwen2-VL-7B) to execute a hierarchical, path-based classification over an $L$-level taxonomy. For taxonomy level $\ell$ ($\ell = 1, \dots, L$), the classifier predicts a child category $c_\ell$ conditioned on the parent $c_{\ell-1}$ from the previous level, supplied with concatenated rule definitions for the current scope. The output specification is enforced as:
```
<think>…CoT reasoning…</think><answer>full path or No Risk</answer>
```
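Because reward computation downstream depends on this format being respected, a small parser is useful in practice. The sketch below is illustrative only; the regex, the function name (`parse_moderation_output`), and the example taxonomy path are assumptions, not part of the paper.

```python
# Illustrative parser for the enforced <think>…</think><answer>…</answer> format.
import re

_PATTERN = re.compile(
    r"<think>(?P<cot>.*?)</think>\s*<answer>(?P<ans>.*?)</answer>", re.S
)

def parse_moderation_output(text: str):
    """Return (chain_of_thought, answer) or None when the format is violated.

    A None result can be mapped to a zero format reward during training.
    """
    match = _PATTERN.search(text)
    if match is None:
        return None
    return match.group("cot").strip(), match.group("ans").strip()

# Example (hypothetical taxonomy path):
# parse_moderation_output("<think>The image shows ...</think><answer>Risk > Fraud > Phishing</answer>")
# -> ("The image shows ...", "Risk > Fraud > Phishing")
```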
2. Multi-Level Soft-Margin Reward Design
Hi-GRPO integrates a structure-aware, multi-level soft-margin reward that reflects the granularity and semantic proximity of hierarchical misclassifications. For a predicted path $\hat{P} = (\hat{c}_1, \dots, \hat{c}_L)$ with ground-truth path $P = (c_1, \dots, c_L)$, the per-level reward $r_\ell$ grants full credit for an exact match at level $\ell$, a reduced soft-margin credit when the predicted category is a sibling of the correct one, and zero otherwise; sibling errors are penalized with exponentially increasing severity at finer hierarchy levels. The per-level rewards are averaged across the $L$ levels, $R_{\text{acc}} = \frac{1}{L}\sum_{\ell=1}^{L} r_\ell$, and combined with a format reward (for correct output structure) to obtain the final composite reward. This reward shaping is instrumental in promoting both taxonomic fidelity and interpretable rationales (Li et al., 5 Aug 2025).
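The following Python sketch shows one way such a reward could be computed. The specific margin values, the sibling lookup, and the additive format bonus are assumptions; only the per-level scoring, level averaging, and format-reward structure follow the description above.

```python
# Hedged sketch of a multi-level soft-margin reward; margins, sibling test,
# and the additive format bonus are assumptions, not the paper's constants.
def soft_margin_reward(pred_path, gold_path, siblings, format_ok,
                       base_margin=0.5, level_decay=0.5, format_weight=0.5):
    """Average per-level rewards over the taxonomy depth, then add a format bonus.

    pred_path / gold_path: lists of category names, root to leaf.
    siblings: dict mapping a gold category to its sibling categories.
    """
    per_level = []
    for level, (pred, gold) in enumerate(zip(pred_path, gold_path), start=1):
        if pred == gold:
            per_level.append(1.0)
        elif pred in siblings.get(gold, set()):
            # Sibling mistakes keep partial credit, shrinking at finer levels.
            per_level.append(base_margin * (level_decay ** (level - 1)))
        else:
            per_level.append(0.0)
    accuracy_reward = sum(per_level) / max(len(gold_path), 1)
    return accuracy_reward + (format_weight if format_ok else 0.0)
```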
3. Group-Relative Policy Optimization: Formulation and Training
Hi-GRPO employs a group-relative advantage to optimize the classification policy, eliminating the need for an auxiliary critic. For each input note $N$, $G$ candidate paths $\{P_i\}_{i=1}^{G}$ are sampled from the policy $\pi_{\theta_2}$; the corresponding rewards $R_i$ are computed and intra-group normalized as

$$A_i = \frac{R_i - \mu_R}{\sigma_R},$$

where $\mu_R$ and $\sigma_R$ are the mean and standard deviation of the rewards within the group. The policy objective maximizes the expected advantage-weighted log-probability

$$J(\theta_2) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_{\theta_2}(P_i \mid N)\right].$$

Optimization proceeds via AdamW over one or more epochs. The process is summarized below:
```
for epoch in {1, ..., E}:
    for each batch B in D_stage2:
        for each note N in B:
            generate G candidate paths {P_i} ~ πθ2(·|N)
            compute rewards R_i for each P_i
            compute group mean μ_R and std σ_R
            compute advantages A_i = (R_i - μ_R) / σ_R
            accumulate loss ℓ += - (1/G) · ∑_i A_i · log πθ2(P_i|N)
        update θ2 ← θ2 - α · ∇_θ2 ℓ
```
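For concreteness, a PyTorch-style sketch of the per-note group-relative loss is given below; the tensor shapes, the epsilon term, and the function name are illustrative choices rather than the reference code.

```python
# Illustrative group-relative loss for one note; shapes and eps are assumptions.
import torch

def group_relative_loss(log_probs: torch.Tensor, rewards: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """log_probs: (G,) summed log πθ2(P_i|N) for the G sampled paths.
    rewards:   (G,) composite rewards R_i for the same paths.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Advantages are treated as constants; only log-probabilities carry gradient.
    return -(advantages.detach() * log_probs).mean()

# Example with G = 4 candidate paths:
# loss = group_relative_loss(log_probs, torch.tensor([1.0, 0.5, 0.0, 1.5]))
# loss.backward(); optimizer.step()   # optimizer = torch.optim.AdamW(...)
```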
4. Two-Level Grouping for Long-Horizon RL (GiGPO Setting)
Hi-GRPO extends to RL for long-horizon LLM agents via a two-level grouping mechanism, denoted "Group-in-Group Policy Optimization" (GiGPO) (Feng et al., 16 May 2025). Macro (episode-level) grouping is performed over $N$ parallel full-length trajectories $\{\tau_i\}_{i=1}^{N}$ started from the same initial state, with trajectory-level return $R(\tau_i)$ and episode-wise normalized advantage

$$A^{E}_i = \frac{R(\tau_i) - \operatorname{mean}\{R(\tau_j)\}_{j=1}^{N}}{\operatorname{std}\{R(\tau_j)\}_{j=1}^{N}}.$$

Micro (step-level) grouping uses anchor states shared across trajectories: for an anchor state $s$, the set of all action/discounted-return pairs observed at $s$ is collected, with discounted returns $R^{t}_i = \sum_{k \ge t} \gamma^{\,k-t} r^{k}_i$; normalizing within this set yields a step-level advantage $A^{S}_{i,t}$. The final advantage combines both, $A_{i,t} = A^{E}_i + \omega\, A^{S}_{i,t}$ with mixing weight $\omega$. The update objective resembles PPO but uses these advantages and operates without a critic network. The procedure introduces negligible overhead (<0.2% per training iteration) and degrades gracefully to GRPO if anchor-state redundancy vanishes (Feng et al., 16 May 2025).
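A hedged Python sketch of the two-level advantage computation follows; the anchor-state key, the discount factor, and the mixing weight ω are assumptions, and only the grouping logic mirrors the description above.

```python
# Sketch of combined episode- and step-level advantages under assumed
# trajectory representation [(state_key, action, reward), ...].
from collections import defaultdict
import statistics

def gigpo_advantages(trajectories, gamma=0.95, omega=1.0, eps=1e-8):
    """All trajectories start from the same initial state.
    Returns per-step advantages, one list per trajectory."""
    # Episode-level (macro) grouping over total returns.
    episode_returns = [sum(r for _, _, r in traj) for traj in trajectories]
    mu_e = statistics.mean(episode_returns)
    sigma_e = statistics.pstdev(episode_returns) + eps

    # Step-level (micro) grouping: pool discounted returns at shared anchor states.
    anchor_groups = defaultdict(list)
    discounted = []
    for traj in trajectories:
        g, returns = 0.0, [0.0] * len(traj)
        for t in reversed(range(len(traj))):
            g = traj[t][2] + gamma * g
            returns[t] = g
        discounted.append(returns)
        for t, (state_key, _, _) in enumerate(traj):
            anchor_groups[state_key].append(returns[t])

    advantages = []
    for ti, traj in enumerate(trajectories):
        a_e = (episode_returns[ti] - mu_e) / sigma_e
        step_adv = []
        for t, (state_key, _, _) in enumerate(traj):
            group = anchor_groups[state_key]
            a_s = (discounted[ti][t] - statistics.mean(group)) / (statistics.pstdev(group) + eps)
            step_adv.append(a_e + omega * a_s)
        advantages.append(step_adv)
    return advantages
```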
5. Policy Alignment via Rule-Based Prompting
To ensure output alignment with evolving platform-level moderation policies, Hi-GRPO incorporates full rule definitions for each taxonomy category (and its siblings) into the prompt at each hierarchical stage. This enables the model's chain-of-thought to reference the precise policy language at inference time. The prompt template structures reasoning and output format:
```
System: Given X, choose most appropriate path from taxonomy.
Category Taxonomy & Rule Definitions:
Level-1: ... definition ...
Level-2: ... definition ...
...
Instructions:
• Output <think>…</think> <answer>…</answer>.
User: Content = {Image+Text}
```
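A minimal prompt-builder sketch along these lines is shown below; the rulebook dictionary, field names, and function signature are assumptions rather than the deployed schema.

```python
# Illustrative prompt builder following the template above (assumed schema).
def build_stage2_prompt(content_desc: str, level: int, candidates: dict) -> str:
    """candidates: {category_name: rule_definition} for the current taxonomy scope."""
    rules = "\n".join(f"Level-{level}: {name}: {rule}" for name, rule in candidates.items())
    return (
        "System: Given the content, choose the most appropriate path from the taxonomy.\n"
        f"Category Taxonomy & Rule Definitions:\n{rules}\n"
        "Instructions:\n"
        "• Output <think>…</think> <answer>…</answer>.\n"
        f"User: Content = {content_desc}"
    )
```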
6. Empirical Evaluation and Properties
Empirical studies demonstrate significant advances in both content moderation and RL agent training. In moderation:
- Hi-Guard achieves 84.11% classification accuracy on generalization sets—outperforming baseline SFT by +12.13 percentage points (Li et al., 5 Aug 2025).
- Ablation studies show substantial incremental gains from hierarchical labeling, rule-based prompting, and soft-margin rewards, with cumulative GPU time reduced by 22.7%.
- Human moderation preference for Hi-Guard’s chain-of-thought outputs reaches 73.3% versus 15.4% for RLVR.
- Online deployment with 10% traffic yields 79.14% recall, 51.09% precision, and a 56.38% reduction in manual review rates; final human review required for 0.24% of content.
In long-horizon agent RL:
- GiGPO (Hi-GRPO) delivers >12% success rate improvement on ALFWorld and >9% on WebShop benchmarks over GRPO, with markedly improved per-step credit assignment and low memory overhead (Feng et al., 16 May 2025).
- Ablations establish that step-level grouping ($A^S$) is essential for complex tasks, while episode-level grouping ($A^E$) ensures global policy coherence.
Hi-GRPO thus generalizes group-based critic-free policy optimization to both hierarchical label spaces and temporally extended agent environments, establishing a paradigm for scalable, interpretable, and policy-aligned RL and content moderation (Li et al., 5 Aug 2025, Feng et al., 16 May 2025).