HC-GRPO: Cost-Aware Group Policy Optimization
- The paper introduces HC-GRPO, an RL algorithm that uses group-relative performance to eliminate the need for a learned value critic in cost-aware policy adaptation.
- HC-GRPO integrates heterogeneous costs from navigation, queries, and memory retrieval to balance operational efficiency in high-dimensional, partially observable environments.
- Empirical outcomes show HC-GRPO reduces task costs and improves success rates, outperforming standard PPO-based methods in simulated embodied search tasks.
HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization) is a reinforcement learning (RL) algorithm designed for optimizing multimodal LLM (MLLM) agents engaged in complex embodied search tasks. Unlike traditional Proximal Policy Optimization (PPO), HC-GRPO operates by grouping trajectory rollouts per instruction, exploiting relative performance among these rollouts to eliminate the necessity of a learned value critic. This mechanism facilitates efficient, cost-aware policy adaptation, focusing on the optimal navigation of heterogeneous operational costs—including physical movement, social interaction via queries, and cognitive memory retrieval—within high-dimensional, partially observable environments (Zhou et al., 21 Dec 2025).
1. Problem Formulation
HC-GRPO addresses the challenge of reasoning under ambiguous instructions by integrating heterogeneous actions and their associated costs into a unified RL framework. The state space is implicitly defined via the MLLM's internal context; the state at timestep $t$ is the multimodal interaction history $h_t$. The action space comprises:
- $\texttt{Move}(l)$: physical relocation to location $l$,
- $\texttt{Ask}(q)$: clarification question $q$ posed to the user,
- $\texttt{Retrieve}$: episodic memory retrieval,
- $\texttt{Stop}$: terminal "I found it" action.
The cost function $c(a_t)$, reflecting the heterogeneity of physical, cognitive, and social acts, is as follows:

$$c(a_t) = \begin{cases} c_{\text{move}} \cdot d(l), & a_t = \texttt{Move}(l) \\ c_{\text{ask}} \cdot \left(1 + \eta\, n_{\text{ask}}\right), & a_t = \texttt{Ask}(q) \\ c_{\text{ret}}, & a_t = \texttt{Retrieve} \\ 0, & a_t = \texttt{Stop} \end{cases}$$

Here, $d(l)$ is the physical distance to the target location, $n_{\text{ask}}$ counts prior queries, and the cost coefficients ($c_{\text{move}}$, $c_{\text{ask}}$, $c_{\text{ret}}$, $\eta$) reflect operational trade-offs. The core objective maximizes the expected net return over trajectories, combining sparse success rewards with the weighted sum of cumulative costs:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R_{\text{succ}}(\tau) - \lambda \sum_{t} c(a_t) \right],$$

with hyperparameter $\lambda$ mediating the efficiency-versus-success trade-off.
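To make the cost accounting concrete, the sketch below computes per-action costs and the net trajectory return for the formulation above; the function names, dictionary-based action encoding, linear query-fatigue term, and all coefficient defaults are illustrative assumptions rather than values from the paper.

```python
from dataclasses import dataclass

@dataclass
class CostConfig:
    """Illustrative cost coefficients; the paper's actual values are not reproduced here."""
    c_move: float = 1.0   # cost per unit of travelled distance
    c_ask: float = 1.0    # base cost of one clarification query
    c_ret: float = 0.5    # cost of one episodic-memory retrieval
    eta: float = 0.5      # query-fatigue growth per prior query
    lam: float = 1.0      # efficiency-vs-success trade-off (lambda)

def action_cost(action: dict, n_prior_asks: int, cfg: CostConfig) -> float:
    """Heterogeneous per-action cost c(a_t) for Move / Ask / Retrieve / Stop."""
    kind = action["type"]
    if kind == "move":
        return cfg.c_move * action["distance"]              # scales with physical distance
    if kind == "ask":
        return cfg.c_ask * (1.0 + cfg.eta * n_prior_asks)   # grows with query fatigue
    if kind == "retrieve":
        return cfg.c_ret
    return 0.0                                              # terminal Stop action carries no cost

def net_return(trajectory: list, success: bool, r_succ: float, cfg: CostConfig) -> float:
    """Sparse success reward minus the lambda-weighted cumulative cost of a rollout."""
    n_asks, total_cost = 0, 0.0
    for a in trajectory:
        total_cost += action_cost(a, n_asks, cfg)
        n_asks += a["type"] == "ask"
    return (r_succ if success else 0.0) - cfg.lam * total_cost
```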
2. HC-GRPO Algorithmic Framework
Distinguishing itself from single-trajectory RL paradigms, HC-GRPO samples a group of $G$ trajectories per instruction, forming a group-based performance context. For a fixed query $q$, the trajectories $\{\tau_1, \dots, \tau_G\}$ yield net returns $\{R_1, \dots, R_G\}$, with group mean $\mu = \frac{1}{G}\sum_{i=1}^{G} R_i$ and standard deviation $\sigma$. The relative advantage for each sample is

$$\hat{A}_i = \frac{R_i - \mu}{\sigma},$$

ensuring $\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i = 0$ within each group, so the baseline introduces no bias.
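A minimal sketch of the group-relative advantage computation, assuming the $G$ net returns for one instruction are already collected in a tensor; the small $\epsilon$ added to the denominator is a common numerical-stability convention and an assumption here.

```python
import torch

def group_relative_advantages(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize net returns within one group of G rollouts for the same instruction.

    returns: shape (G,), the net return R_i of each rollout in the group.
    """
    mu, sigma = returns.mean(), returns.std()
    return (returns - mu) / (sigma + eps)   # zero-mean, unit-scale group advantages

# Example: G = 4 rollouts for one instruction
adv = group_relative_advantages(torch.tensor([1.2, -0.4, 0.9, 0.1]))
print(adv, adv.sum())  # the advantages sum to (numerically) zero within the group
```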
The policy optimization objective uses the PPO-style clipped surrogate, where the importance sampling ratio is $r_{i,t}(\theta) = \dfrac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})}$:

$$\mathcal{L}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \right] - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{SFT}}\big).$$

$\pi_{\text{SFT}}$ denotes the frozen supervised fine-tuning (SFT) policy; the KL-divergence penalty regularizes updates so the learned policy remains close to this initialization.
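A per-token PyTorch sketch of this objective, assuming log-probabilities under the current, rollout (old), and frozen SFT policies are precomputed; the tensor shapes, variable names, default coefficients, and the k3-style KL estimator are assumptions made for illustration.

```python
import torch

def hc_grpo_loss(logp_new: torch.Tensor,    # (G, T) log pi_theta(a_t | s_t)
                 logp_old: torch.Tensor,    # (G, T) log pi_theta_old(a_t | s_t)
                 logp_sft: torch.Tensor,    # (G, T) log pi_SFT(a_t | s_t)
                 advantages: torch.Tensor,  # (G,) group-relative advantages
                 clip_eps: float = 0.2,     # illustrative default
                 kl_beta: float = 0.01) -> torch.Tensor:  # illustrative default
    """Clipped surrogate plus KL penalty toward the frozen SFT reference, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio r_t(theta)
    adv = advantages.unsqueeze(1)                                # broadcast one advantage per rollout over its tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()
    log_ratio_ref = logp_sft - logp_new                          # log (pi_SFT / pi_theta)
    kl = log_ratio_ref.exp() - log_ratio_ref - 1.0               # k3 estimator of KL(pi_theta || pi_SFT)
    return -surrogate + kl_beta * kl.mean()
```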
Pseudocode Overview
The algorithm iterates through RL epochs; for each instruction in a batch, $G$ rollouts are generated and their group-relative advantages computed. Policy parameters $\theta$ are then updated by gradient steps on the objective $\mathcal{L}(\theta)$, with no requirement for a separate value-function network.
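A sketch of the outer loop under the same assumptions; `rollout`, `trajectory_return`, and `score_logprobs` are hypothetical helpers standing in for environment interaction, return computation, and batched log-probability scoring, and `group_relative_advantages` / `hc_grpo_loss` refer to the sketches above.

```python
import torch

def train_hc_grpo(policy, old_policy, sft_policy, optimizer, instructions,
                  group_size: int, epochs: int, cfg):
    """Critic-free HC-GRPO loop; rollout / trajectory_return / score_logprobs are hypothetical helpers."""
    for _ in range(epochs):
        for instruction in instructions:
            # 1. Sample G rollouts for the same instruction and score their net returns.
            trajectories = [rollout(old_policy, instruction) for _ in range(group_size)]
            returns = torch.tensor([trajectory_return(t, cfg) for t in trajectories])

            # 2. Group-relative advantages replace a learned value critic.
            advantages = group_relative_advantages(returns)

            # 3. Clipped-surrogate update with a KL penalty toward the frozen SFT policy.
            logp_new, logp_old, logp_sft = score_logprobs(policy, old_policy, sft_policy, trajectories)
            loss = hc_grpo_loss(logp_new, logp_old, logp_sft, advantages)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Refresh the behaviour policy between epochs.
        old_policy.load_state_dict(policy.state_dict())
```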
3. Theoretical Properties
Key theoretical attributes of HC-GRPO include:
- Unbiased Baseline: The empirical group mean $\mu$ serves as an action-independent baseline $b(q)$; since $\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ b(q)\, \nabla_\theta \log \pi_\theta(\tau) \right] = 0$ for any baseline that does not depend on $\tau$, subtracting it preserves the unbiasedness of policy gradients (a brief derivation sketch follows this list).
- Variance Reduction: Per-group normalization yields lower-variance gradient estimates than single-trajectory critic-based estimates, which is especially beneficial in the high-dimensional chain-of-thought (CoT) spaces produced by MLLMs.
- KL-Regularization: The explicit KL divergence $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ between the current policy and the SFT reference keeps updates within a trust region, following the theoretical underpinnings of KL-constrained PPO.
- Critic Elimination: HC-GRPO dispenses with the value network $V_\phi(s)$ traditionally employed in PPO, substituting group-level empirical baselines.
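For completeness, a brief sketch of the standard baseline-unbiasedness argument referenced above, written in generic policy-gradient notation (the symbols are conventional, not taken from the paper):

```latex
% For any baseline b(q) that does not depend on the sampled trajectory tau:
\begin{align*}
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ b(q)\, \nabla_\theta \log \pi_\theta(\tau) \right]
  &= b(q) \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, d\tau
   = b(q) \int \nabla_\theta \pi_\theta(\tau)\, d\tau \\
  &= b(q)\, \nabla_\theta \int \pi_\theta(\tau)\, d\tau
   = b(q)\, \nabla_\theta 1 = 0 .
\end{align*}
% Hence subtracting the group mean (the empirical baseline) from each return R_i
% leaves the policy-gradient estimate unbiased while reducing its variance.
```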
4. Implementation Protocols and Hyperparameters
The backbone MLLM is Qwen2.5-VL-7B. The training protocol comprises two stages:
- SFT Stage: AdamW optimizer with a cosine learning-rate schedule (min 0.1), batch size 16, 1 epoch.
- HC-GRPO Stage: batch size 8, 3 epochs; further hyperparameters include the learning rate, discount factor $\gamma$, KL penalty coefficient $\beta$, PPO clip range $\epsilon$, group size $G$, and cost trade-off $\lambda$.
Cost coefficients for movement, queries, and retrieval ($c_{\text{move}}$, $c_{\text{ask}}$, $c_{\text{ret}}$), the query-fatigue factor $\eta$, and the reward magnitudes are fixed scalars; the scheme additionally includes an entropy bonus and a format-penalty cost of $0.1$.
5. Empirical Outcomes
HC-GRPO’s efficacy is substantiated via extensive experiments in the AI2-THOR simulated environment. ESearch-R1, trained using HC-GRPO, achieves a success rate of 61.5%, surpassing the best ReAct baseline (60.0%). The mean total task cost (TTC) is halved (from approximately 3.3 to 1.6), and the success-weighted-by-cost (SwC) metric increases from 0.36 to 0.59, demonstrating marked improvements in operational efficiency.
Ablative analyses indicate that omitting dialogue reduces the success rate (SR) to 10.5%, while excluding memory components yields SR 52.0% and raises TTC to 2.3. Training without HC-GRPO (SFT only) results in SR 59.2% and TTC 2.3.
Sensitivity studies confirm that the policy retains superior cost-weighted performance across broad ranges of the cost hyperparameters, highlighting meta-policy generalization. Qualitatively, emergent strategies prioritize minimal, targeted disambiguation (one Ask or memory lookup) before movement, reflecting an efficient, human-like cost-aware search heuristic.
6. Context and Significance
HC-GRPO constitutes a substantial departure from critic-based on-policy RL in high-cost, multimodal domains. By aligning optimization with the relative efficacy of reasoning-action trajectories under explicit cost structures, the method advances the ability of MLLM agents to operate strategically under real-world constraints. This innovation is particularly salient given the operational asymmetry and cost diversity inherent in embodied instruction-following tasks.
Validations in the ESearch-R1 system demonstrate considerable practical gains, underscoring the robustness of group-relative baseline techniques and their suitability for RL fine-tuning of large, generative multimodal models in interactive, physical contexts (Zhou et al., 21 Dec 2025).