HC-GRPO: Heterogeneous Cost-Aware Policy Optimization
- The paper introduces HC-GRPO, a critic-free, cost-aware reinforcement learning framework that integrates diverse, scaled costs directly into policy updates.
- It employs group relative advantage estimation and per-component normalization to enhance task success and cost efficiency in embodied AI applications.
- Empirical results demonstrate substantial improvements, including halved operational cost and markedly higher success-weighted-by-cost (SwC) scores in complex multimodal settings.
Heterogeneous Cost-Aware Group Relative Policy Optimization (HC-GRPO) is a reinforcement learning (RL) paradigm designed to optimize complex, often multi-modal agents—such as multimodal LLMs (MLLMs)—in environments where actions incur heterogeneous, non-uniform costs. The framework generalizes Group Relative Policy Optimization (GRPO) by explicitly incorporating multiple, differently scaled cost signals into the policy update, enabling RL agents to rationally trade off reward maximization against a portfolio of real-world constraints and operational costs. HC-GRPO was introduced to address challenges in embodied AI, particularly interactive reasoning and planning under ambiguous instructions, where agents must efficiently combine physical action, dialogue, and memory access—all bearing sharply differing costs (Zhou et al., 21 Dec 2025). Its design synthesizes advances from constrained RL (Girgis et al., 5 Feb 2026) and emphasizes critic-free, group-based advantage estimation, providing stability and scalability in settings with long reasoning traces and high-dimensional policies.
1. Motivation and Problem Context
HC-GRPO is motivated by embodied agents tasked with cost-sensitive decision making in partially observed, ambiguous environments. For example, in embodied search, an agent receives open-ended instructions (e.g., “Find the red cup”) and must decide whether to physically explore (expensive), ask clarifying questions (moderate human attention cost), or consult internal memory (low cost). Standard reinforcement learning approaches, such as PPO with value critics, are often unstable or inefficient for such domains, especially for long-horizon, chain-of-thought reasoning with large models. HC-GRPO is formulated to:
- Directly integrate heterogeneous, per-action costs (e.g., navigation distance, number of dialogue queries, memory retrievals) into the reward structure.
- Encourage emergence of cost-efficient, succinct reasoning by rewarding trajectories that achieve task success with minimal cumulative cost.
- Eliminate the need for separate critic networks by leveraging relative groupwise returns, improving optimization stability in MLLMs (Zhou et al., 21 Dec 2025, Girgis et al., 5 Feb 2026).
2. Heterogeneous Cost Model and POMDP Formulation
In HC-GRPO, the environment is modeled as a Partially Observable Markov Decision Process (POMDP):
- State: $s_t$ (e.g., world configuration, hidden from the agent).
- Observation: $o_t$ (egocentric vision, dialogue, episodic memory).
- Action: $a_t \in \mathcal{A}$, spanning navigation, dialogue queries, and memory retrieval.
- Transition: Deterministic or stochastic, depending on action type.
- Reward and Costs: The agent collects a sparse reward $R_{\text{task}}$ for task completion and incurs explicit heterogeneous costs per action:
  - Navigation: $c_{\text{nav}}(a_t)$,
  - Dialogue: $c_{\text{ask}}(a_t)$,
  - Memory: $c_{\text{mem}}(a_t)$.

The expected return is

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R_{\text{task}} - \sum_t c(a_t) \right],$$

where $c(a_t) = w_{\text{nav}}\, c_{\text{nav}}(a_t) + w_{\text{ask}}\, c_{\text{ask}}(a_t) + w_{\text{mem}}\, c_{\text{mem}}(a_t)$. Heterogeneity is enforced through the per-modality weights (e.g., $w_{\text{nav}} > w_{\text{ask}} > w_{\text{mem}}$) (Zhou et al., 21 Dec 2025).
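To make the cost model concrete, here is a minimal Python sketch of the hybrid return computation; the weight values and the `hybrid_return` helper are illustrative assumptions, not taken from the paper.

```python
# Illustrative, assumed cost weights enforcing heterogeneity: physical
# navigation is the most expensive modality, memory access the cheapest.
COST_WEIGHTS = {"nav": 1.0, "ask": 0.3, "mem": 0.05}

def hybrid_return(success: bool, steps: list[dict], r_task: float = 1.0) -> float:
    """Sparse task reward minus the weighted sum of per-action costs.

    `steps` holds one dict per action, e.g. [{"nav": 2.4}, {"ask": 1.0}].
    """
    total_cost = sum(COST_WEIGHTS[k] * v for step in steps for k, v in step.items())
    return (r_task if success else 0.0) - total_cost

# A successful episode with one navigation move, one clarifying question,
# and two memory lookups: 1.0 - (2.4 + 0.3 + 0.05 + 0.05) = -1.8.
print(hybrid_return(True, [{"nav": 2.4}, {"ask": 1.0}, {"mem": 1.0}, {"mem": 1.0}]))
```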
3. Critic-Free Group Relative Advantage Estimation
A central feature of HC-GRPO is its use of group relative advantage estimation. For each instruction in a minibatch, $G$ full trajectory rollouts are sampled under the current policy $\pi_\theta$:
- Each trajectory $i$ yields a hybrid cost-sensitive reward $R_i$ (task reward minus weighted costs).
- Compute the group mean $\mu$ and standard deviation $\sigma$ over $\{R_1, \dots, R_G\}$.
- Group-relative advantage for trajectory $i$:

$$A_i = \frac{R_i - \mu}{\sigma + \epsilon},$$

with $\epsilon$ added for numerical stability.
This approach directly replaces the need for a value critic and supports stable optimization even for long-horizon reasoning chains in LLMs (Zhou et al., 21 Dec 2025, Girgis et al., 5 Feb 2026).
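A minimal NumPy sketch of this critic-free advantage computation, assuming a single rollout group; the function and variable names are mine, not the papers'.

```python
import numpy as np

def group_relative_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize each trajectory's hybrid return against the statistics of
    its own rollout group (all G rollouts answer the same instruction),
    removing the need for a learned value critic."""
    return (returns - returns.mean()) / (returns.std() + eps)

# G = 6 rollouts for one instruction; above-average returns get positive advantage.
R = np.array([0.8, -0.4, 0.2, 0.9, -1.1, 0.1])
print(group_relative_advantages(R))
```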
4. Scalarized Advantage Construction for Heterogeneous Costs
In environments with multiple, variably scaled costs, naive scalarization or component-wise standardization can corrupt the intended trade-off between reward and constraint satisfaction due to variance-induced advantage warping (Girgis et al., 5 Feb 2026). HC-GRPO employs the following construction:
- For each trajectory $i$, standardize the reward and each cost component $k$ individually within the group:

$$\tilde{R}_i = \frac{R_i - \mu_R}{\sigma_R + \epsilon}, \qquad \tilde{C}_{k,i} = \frac{C_{k,i} - \mu_{C_k}}{\sigma_{C_k} + \epsilon}.$$

- Scalarized group advantage:

$$A_i = \tilde{R}_i - \sum_k \lambda_k\, \tilde{C}_{k,i}.$$

Each standardized term has unit variance, preserving the semantics of the Lagrange multipliers $\lambda_k$ regardless of raw value scales.
A plausible implication is that per-component normalization—potentially with clipping, smoothing, and individualized learning rates—is essential to avoid optimization pathologies when cost signals differ in scale, sparsity, or stationarity (Girgis et al., 5 Feb 2026).
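A sketch of this construction in NumPy, under the assumption that costs arrive as a $(G, K)$ matrix of $K$ components per rollout; shapes and names are illustrative.

```python
import numpy as np

def scalarized_group_advantage(rewards, costs, lambdas, eps=1e-8):
    """rewards: (G,) task rewards; costs: (G, K) cost components per rollout;
    lambdas: (K,) Lagrange multipliers.

    Each column is standardized separately within the group before
    scalarization, so every term enters with unit variance and the
    multipliers keep their intended meaning across cost scales."""
    def z(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)
    return z(rewards) - z(costs) @ lambdas  # shape (G,)

# G = 4 rollouts, K = 2 cost components on very different raw scales.
adv = scalarized_group_advantage(
    rewards=np.array([1.0, 0.0, 1.0, 0.0]),
    costs=np.array([[120.0, 0.1], [40.0, 0.3], [80.0, 0.0], [200.0, 0.2]]),
    lambdas=np.array([0.5, 0.5]),
)
print(adv)
```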
5. Policy Optimization and Dual Updates
Using scalarized groupwise advantages, HC-GRPO updates the policy and multipliers as follows:
- Policy update: Employs a PPO-style clipped surrogate loss, with KL regularization toward the reference policy $\pi_{\text{ref}}$ (e.g., the SFT checkpoint):

$$\mathcal{L}(\theta) = -\,\mathbb{E}_i\left[ \min\left( r_i(\theta)\, A_i,\ \operatorname{clip}\!\left(r_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) A_i \right) \right] + \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),$$

where $r_i(\theta)$ is the importance ratio between the current and rollout policies.
- Multiplier (dual) updates: For each constraint $k$ with budget $d_k$:

$$\lambda_k \leftarrow \max\!\left(0,\ \lambda_k + \eta_k\left( \bar{C}_k - d_k \right)\right),$$

where $\bar{C}_k$ is the group-average value of cost component $k$. Adapting the step size $\eta_k$ per constraint ensures stability against heterogeneity in constraint scales.
Pseudocode for the complete update loop is provided in both (Girgis et al., 5 Feb 2026) (for multi-constraint environments) and (Zhou et al., 21 Dec 2025) (for the cost-aware embodied search scenario).
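As a rough PyTorch illustration of one combined step (the papers' pseudocode is authoritative; the clipping coefficient, KL estimator, and tensor shapes below are assumptions):

```python
import torch

def hcgrpo_step(logp_new, logp_old, logp_ref, adv,
                lambdas, avg_costs, budgets, etas,
                clip_eps=0.2, kl_coef=0.05):
    """One policy loss plus projected dual ascent update.

    logp_*: (T,) token log-probs under the new/old/reference policies;
    adv: scalar group-relative advantage broadcast over tokens;
    lambdas, avg_costs, budgets, etas: (K,) per-constraint quantities.
    """
    # PPO-style clipped surrogate on the scalarized group advantage.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Simple per-token KL penalty toward the reference (e.g., SFT) policy.
    kl = (logp_new - logp_ref).mean()
    policy_loss = -surrogate.mean() + kl_coef * kl

    # Projected dual ascent: raise lambda_k while constraint k is violated,
    # with a per-constraint step size eta_k.
    with torch.no_grad():
        lambdas = torch.clamp(lambdas + etas * (avg_costs - budgets), min=0.0)
    return policy_loss, lambdas
```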
6. Empirical and Practical Impact
HC-GRPO demonstrates several empirical benefits over standard PPO and actor-critic methods in cost-sensitive reasoning and embodied AI:
- In the ESearch-R1 agent for AI2-THOR, HC-GRPO improves task success rates (61.5% vs. ~60% for strong ReAct agents) and halves total operational cost (from 3.3 to 1.6 on average; >50% reduction) (Zhou et al., 21 Dec 2025).
- Success-weighted-by-cost (SwC) rises to 0.59 vs. 0.36 for conventional ReAct, confirming improved cost efficiency.
- Ablations show that the availability of all cost pathways (Ask, GetMemory, Navigate) and online cost-aware tuning are critical for achieving maximum efficiency.
- Qualitative analysis reveals that the agent prioritizes zero-cost reasoning modes (memory recall), resorts to targeted human queries, and only invokes costly navigation as a last resort, confirming that HC-GRPO induces a "think before you move" reasoning style.
In robotics and gridworld settings, the same per-component normalization and scalarized advantage machinery is necessary for both stable constraint enforcement and increased task success under explicit cost constraints (Girgis et al., 5 Feb 2026).
7. Extensions and Related Group Optimization Frameworks
GRPO and its cost/constrained variants, such as HC-GRPO, have prompted further advances in policy optimization for reasoning and alignment in LLMs. Group Distributionally Robust Optimization (GDRO) introduces adversarial data and compute allocation mechanisms, dynamically reallocating sampling and rollout budgets to difficult reasoning groups, yielding additional gains in pass@k accuracy and robustness without increasing overall sample complexity (Panaganti et al., 27 Jan 2026). Both Prompt-GDRO and Rollout-GDRO demonstrate that principled dynamic grouping and resource control synergize with the group-based advantage estimation of (HC-)GRPO, supporting curricula that focus optimization on the evolving "reasoning frontier."
In summary, HC-GRPO is a scalable, principled framework for critic-free RL in domains with heterogeneous cost signals and explicit behavioral constraints, providing robust cost-sensitive reasoning in large-scale embodied and reasoning agents (Zhou et al., 21 Dec 2025, Girgis et al., 5 Feb 2026, Panaganti et al., 27 Jan 2026).