HC-GRPO: Cost-Aware Group Policy Optimization
- The paper introduces HC-GRPO, an RL algorithm that uses group-relative performance to eliminate the need for a learned value critic in cost-aware policy adaptation.
- HC-GRPO integrates heterogeneous costs from navigation, queries, and memory retrieval to balance operational efficiency in high-dimensional, partially observable environments.
- Empirical outcomes show HC-GRPO reduces task costs and improves success rates, outperforming standard PPO-based methods in simulated embodied search tasks.
HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization) is a reinforcement learning (RL) algorithm designed for optimizing multimodal LLM (MLLM) agents engaged in complex embodied search tasks. Unlike traditional Proximal Policy Optimization (PPO), HC-GRPO operates by grouping trajectory rollouts per instruction, exploiting relative performance among these rollouts to eliminate the necessity of a learned value critic. This mechanism facilitates efficient, cost-aware policy adaptation, focusing on the optimal navigation of heterogeneous operational costs—including physical movement, social interaction via queries, and cognitive memory retrieval—within high-dimensional, partially observable environments (Zhou et al., 21 Dec 2025).
1. Problem Formulation
HC-GRPO addresses the challenge of reasoning under ambiguous instructions by integrating heterogeneous actions and their associated costs into a unified RL framework. The state space is implicitly defined via the MLLM's internal context; the state at timestep $t$ is the multimodal interaction history $h_t$. The action space comprises:
- $\texttt{Move}(l)$: physical relocation to location $l$,
- $\texttt{Ask}(q)$: clarification question $q$ posed to the user,
- $\texttt{Retrieve}$: episodic memory retrieval,
- $\texttt{Stop}$: terminal "I found it" action.
The cost function $c(a_t)$, reflecting the heterogeneity of physical, cognitive, and social acts, is as follows:

$$c(a_t) = \begin{cases} c_{\text{move}} \cdot d(l), & a_t = \texttt{Move}(l) \\ c_{\text{ask}} \cdot \left(1 + \eta\, n_{\text{ask}}\right), & a_t = \texttt{Ask}(q) \\ c_{\text{ret}}, & a_t = \texttt{Retrieve} \\ 0, & a_t = \texttt{Stop} \end{cases}$$

Here, $d(l)$ is the physical distance to the target location, $n_{\text{ask}}$ counts prior queries, and the cost coefficients ($c_{\text{move}}$, $c_{\text{ask}}$, $c_{\text{ret}}$, $\eta$) reflect operational trade-offs. The core objective maximizes the expected net return over trajectories, combining sparse success rewards with the weighted sum of cumulative costs:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R_{\text{succ}}(\tau) - \lambda \sum_{t} c(a_t) \right],$$

with hyperparameter $\lambda$ mediating the efficiency-versus-success trade-off.
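To make the cost accounting concrete, the sketch below computes per-action costs and the net trajectory return for the formulation above; the function names, dictionary-based action encoding, linear query-fatigue term, and all coefficient defaults are illustrative assumptions rather than values from the paper.

```python
from dataclasses import dataclass

@dataclass
class CostConfig:
    """Illustrative cost coefficients; the paper's actual values are not reproduced here."""
    c_move: float = 1.0   # cost per unit of travelled distance
    c_ask: float = 1.0    # base cost of one clarification query
    c_ret: float = 0.5    # cost of one episodic-memory retrieval
    eta: float = 0.5      # query-fatigue growth per prior query
    lam: float = 1.0      # efficiency-vs-success trade-off (lambda)

def action_cost(action: dict, n_prior_asks: int, cfg: CostConfig) -> float:
    """Heterogeneous per-action cost c(a_t) for Move / Ask / Retrieve / Stop."""
    kind = action["type"]
    if kind == "move":
        return cfg.c_move * action["distance"]              # scales with physical distance
    if kind == "ask":
        return cfg.c_ask * (1.0 + cfg.eta * n_prior_asks)   # grows with query fatigue
    if kind == "retrieve":
        return cfg.c_ret
    return 0.0                                              # terminal Stop action carries no cost

def net_return(trajectory: list, success: bool, r_succ: float, cfg: CostConfig) -> float:
    """Sparse success reward minus the lambda-weighted cumulative cost of a rollout."""
    n_asks, total_cost = 0, 0.0
    for a in trajectory:
        total_cost += action_cost(a, n_asks, cfg)
        n_asks += a["type"] == "ask"
    return (r_succ if success else 0.0) - cfg.lam * total_cost
```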
2. HC-GRPO Algorithmic Framework
Distinguishing itself from single-trajectory RL paradigms, HC-GRPO samples a group of $G$ trajectories per instruction, forming a group-based performance context. For a fixed query $q$, the trajectories $\{\tau_1, \dots, \tau_G\}$ yield net returns $\{R_1, \dots, R_G\}$, with group mean $\mu = \frac{1}{G}\sum_{i=1}^{G} R_i$ and standard deviation $\sigma$. The relative advantage for each sample is

$$\hat{A}_i = \frac{R_i - \mu}{\sigma},$$

ensuring $\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i = 0$ within each group, so the baseline introduces no bias.
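A minimal sketch of the group-relative advantage computation, assuming the $G$ net returns for one instruction are already collected in a tensor; the small $\epsilon$ added to the denominator is a common numerical-stability convention and an assumption here.

```python
import torch

def group_relative_advantages(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize net returns within one group of G rollouts for the same instruction.

    returns: shape (G,), the net return R_i of each rollout in the group.
    """
    mu, sigma = returns.mean(), returns.std()
    return (returns - mu) / (sigma + eps)   # zero-mean, unit-scale group advantages

# Example: G = 4 rollouts for one instruction
adv = group_relative_advantages(torch.tensor([1.2, -0.4, 0.9, 0.1]))
print(adv, adv.sum())  # the advantages sum to (numerically) zero within the group
```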
The policy optimization objective uses the PPO-style clipped surrogate, where the importance sampling ratio is $r_{i,t}(\theta) = \dfrac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})}$:

$$\mathcal{L}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \right] - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{SFT}}\big).$$

$\pi_{\text{SFT}}$ denotes the frozen supervised fine-tuning (SFT) policy; the KL-divergence penalty regularizes updates so the learned policy remains close to this initialization.
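A per-token PyTorch sketch of this objective, assuming log-probabilities under the current, rollout (old), and frozen SFT policies are precomputed; the tensor shapes, variable names, default coefficients, and the k3-style KL estimator are assumptions made for illustration.

```python
import torch

def hc_grpo_loss(logp_new: torch.Tensor,    # (G, T) log pi_theta(a_t | s_t)
                 logp_old: torch.Tensor,    # (G, T) log pi_theta_old(a_t | s_t)
                 logp_sft: torch.Tensor,    # (G, T) log pi_SFT(a_t | s_t)
                 advantages: torch.Tensor,  # (G,) group-relative advantages
                 clip_eps: float = 0.2,     # illustrative default
                 kl_beta: float = 0.01) -> torch.Tensor:  # illustrative default
    """Clipped surrogate plus KL penalty toward the frozen SFT reference, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio r_t(theta)
    adv = advantages.unsqueeze(1)                                # broadcast one advantage per rollout over its tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()
    log_ratio_ref = logp_sft - logp_new                          # log (pi_SFT / pi_theta)
    kl = log_ratio_ref.exp() - log_ratio_ref - 1.0               # k3 estimator of KL(pi_theta || pi_SFT)
    return -surrogate + kl_beta * kl.mean()
```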
Pseudocode Overview
The algorithm iterates through RL epochs; for each instruction in a batch, $G$ rollouts are generated and their group-relative advantages computed. Policy parameters $\theta$ are then updated by gradient steps on the objective $\mathcal{L}(\theta)$, with no requirement for a separate value-function network.
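A sketch of the outer loop under the same assumptions; `rollout`, `trajectory_return`, and `score_logprobs` are hypothetical helpers standing in for environment interaction, return computation, and batched log-probability scoring, and `group_relative_advantages` / `hc_grpo_loss` refer to the sketches above.

```python
import torch

def train_hc_grpo(policy, old_policy, sft_policy, optimizer, instructions,
                  group_size: int, epochs: int, cfg):
    """Critic-free HC-GRPO loop; rollout / trajectory_return / score_logprobs are hypothetical helpers."""
    for _ in range(epochs):
        for instruction in instructions:
            # 1. Sample G rollouts for the same instruction and score their net returns.
            trajectories = [rollout(old_policy, instruction) for _ in range(group_size)]
            returns = torch.tensor([trajectory_return(t, cfg) for t in trajectories])

            # 2. Group-relative advantages replace a learned value critic.
            advantages = group_relative_advantages(returns)

            # 3. Clipped-surrogate update with a KL penalty toward the frozen SFT policy.
            logp_new, logp_old, logp_sft = score_logprobs(policy, old_policy, sft_policy, trajectories)
            loss = hc_grpo_loss(logp_new, logp_old, logp_sft, advantages)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Refresh the behaviour policy between epochs.
        old_policy.load_state_dict(policy.state_dict())
```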
3. Theoretical Properties
Key theoretical attributes of HC-GRPO include:
- Unbiased Baseline: The empirical group mean $\mu$ serves as an action-independent baseline $b(q)$; since $\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ b(q)\, \nabla_\theta \log \pi_\theta(\tau) \right] = 0$ for any baseline that does not depend on $\tau$, subtracting it preserves the unbiasedness of policy gradients (a brief derivation sketch follows this list).
- Variance Reduction: Per-group normalization yields lower-variance gradient estimates than single-trajectory critic-based estimates, which is especially beneficial in the high-dimensional chain-of-thought (CoT) spaces produced by MLLMs.
- KL-Regularization: The explicit KL divergence $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ between the current policy and the SFT reference keeps updates within a trust region, following the theoretical underpinnings of KL-constrained PPO.
- Critic Elimination: HC-GRPO dispenses with the value network $V_\phi(s)$ traditionally employed in PPO, substituting group-level empirical baselines.
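For completeness, a brief sketch of the standard baseline-unbiasedness argument referenced above, written in generic policy-gradient notation (the symbols are conventional, not taken from the paper):

```latex
% For any baseline b(q) that does not depend on the sampled trajectory tau:
\begin{align*}
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ b(q)\, \nabla_\theta \log \pi_\theta(\tau) \right]
  &= b(q) \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, d\tau
   = b(q) \int \nabla_\theta \pi_\theta(\tau)\, d\tau \\
  &= b(q)\, \nabla_\theta \int \pi_\theta(\tau)\, d\tau
   = b(q)\, \nabla_\theta 1 = 0 .
\end{align*}
% Hence subtracting the group mean (the empirical baseline) from each return R_i
% leaves the policy-gradient estimate unbiased while reducing its variance.
```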
4. Implementation Protocols and Hyperparameters
The backbone MLLM is Qwen2.5-VL-7B. The training protocol comprises two stages:
- SFT Stage: AdamW optimizer with a cosine learning-rate schedule (min 0.1), batch size 16, 1 epoch.
- HC-GRPO Stage: batch size 8, 3 epochs; further hyperparameters include the learning rate, discount factor $\gamma$, KL penalty coefficient $\beta$, PPO clip range $\epsilon$, group size $G$, and cost trade-off $\lambda$.
Cost coefficients for movement, queries, and retrieval ($c_{\text{move}}$, $c_{\text{ask}}$, $c_{\text{ret}}$), the query-fatigue factor $\eta$, and the reward magnitudes are fixed scalars; the scheme additionally includes an entropy bonus and a format-penalty cost of $0.1$.
5. Empirical Outcomes
HC-GRPO’s efficacy is substantiated via extensive experiments in the AI2-THOR simulated environment. ESearch-R1, trained using HC-GRPO, achieves a success rate of 61.5%, surpassing the best ReAct baseline (60.0%). The mean total task cost (TTC) is halved (from approximately 3.3 to 1.6), and the success-weighted-by-cost (SwC) metric increases from 0.36 to 0.59, demonstrating marked improvements in operational efficiency.
Ablative analyses indicate that omitting dialogue reduces the success rate (SR) to 10.5%, while excluding memory components yields SR 52.0% and raises TTC to 2.3. Training without HC-GRPO (SFT only) results in SR 59.2% and TTC 2.3.
Sensitivity studies confirm that the policy retains superior cost-weighted performance across broad ranges of the cost hyperparameters, highlighting meta-policy generalization. Qualitatively, emergent strategies prioritize minimal, targeted disambiguation (one Ask or memory lookup) before movement, reflecting an efficient, human-like cost-aware search heuristic.
6. Context and Significance
HC-GRPO constitutes a substantial departure from critic-based on-policy RL in high-cost, multimodal domains. By aligning optimization with the relative efficacy of reasoning-action trajectories under explicit cost structures, the method advances the ability of MLLM agents to operate strategically under real-world constraints. This innovation is particularly salient given the operational asymmetry and cost diversity inherent in embodied instruction-following tasks.
Validations in the ESearch-R1 system demonstrate considerable practical gains, underscoring the robustness of group-relative baseline techniques and their suitability for RL fine-tuning of large, generative multimodal models in interactive, physical contexts (Zhou et al., 21 Dec 2025).