
HC-GRPO: Cost-Aware Group Policy Optimization

Updated 28 December 2025
  • The paper introduces HC-GRPO, an RL algorithm that uses group-relative performance to eliminate the need for a learned value critic in cost-aware policy adaptation.
  • HC-GRPO integrates heterogeneous costs from navigation, queries, and memory retrieval to balance operational efficiency in high-dimensional, partially observable environments.
  • Empirical outcomes show HC-GRPO reduces task costs and improves success rates, outperforming standard PPO-based methods in simulated embodied search tasks.

HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization) is a reinforcement learning (RL) algorithm designed for optimizing multimodal LLM (MLLM) agents engaged in complex embodied search tasks. Unlike traditional Proximal Policy Optimization (PPO), HC-GRPO groups trajectory rollouts per instruction and exploits relative performance among these rollouts, eliminating the need for a learned value critic. This mechanism facilitates efficient, cost-aware policy adaptation, explicitly trading off heterogeneous operational costs—physical movement, social interaction via queries, and cognitive memory retrieval—within high-dimensional, partially observable environments (Zhou et al., 21 Dec 2025).

1. Problem Formulation

HC-GRPO addresses the challenge of reasoning under ambiguous instructions by integrating heterogeneous actions and their associated costs into a unified RL framework. The state space is implicitly defined via the MLLM's internal context; the multimodal history at timestep $t$ is $h_t = (\text{observations}_{0:t}, \text{CoT}_{0:t-1}, \text{actions}_{0:t-1})$. The action space $\mathcal{A}$ comprises the following (a minimal code sketch follows the list):

  • $\operatorname{Navigate}(\ell)$: Physical relocation to location $\ell$,
  • $\operatorname{Ask}(q)$: Clarification question $q$ posed to the user,
  • $\operatorname{GetMemory}(k)$: Episodic memory retrieval keyed by $k$,
  • $\operatorname{Found}(\text{target})$: Terminal "I found it" action.
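
To make the interface concrete, here is a minimal sketch of how this action space might be encoded in Python; the class and field names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative encoding of the action space A; names are assumptions.

@dataclass
class Navigate:
    location: str      # target location l

@dataclass
class Ask:
    question: str      # clarification question q posed to the user

@dataclass
class GetMemory:
    key: str           # episodic memory key k

@dataclass
class Found:
    target: str        # terminal "I found it" declaration

Action = Union[Navigate, Ask, GetMemory, Found]
```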

The cost function $C(a_t)$, reflecting the heterogeneity of physical, cognitive, and social acts, is as follows:

$$C(a_t) = \begin{cases} c_{\mathrm{nav}}\, d(p_t, p_{t+1}) & \text{if } a_t = \operatorname{Navigate}, \\ c_{\mathrm{ask}}\, (1 + \alpha N_{\mathrm{ask}}(t)) & \text{if } a_t = \operatorname{Ask}, \\ c_{\mathrm{mem}} & \text{if } a_t = \operatorname{GetMemory}, \\ 0 & \text{otherwise}. \end{cases}$$

Here, $d(\cdot, \cdot)$ is the physical distance, $N_{\mathrm{ask}}(t)$ counts prior queries, and the cost coefficients ($c_{\mathrm{nav}} > c_{\mathrm{ask}} \gg c_{\mathrm{mem}}$) reflect operational trade-offs. The core objective maximizes the expected net return over trajectories, combining the sparse success reward $G(\tau) = R_{\mathrm{task}}(\tau)$ with the weighted cumulative cost:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ G(\tau) - \lambda\, C(\tau) \right]$$

with the hyperparameter $\lambda$ mediating the efficiency-versus-success trade-off.
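
As a concrete illustration, the sketch below evaluates $C(a_t)$ and the net return using the coefficients reported in Section 4; the `Action` classes come from the earlier sketch, and the Euclidean-distance helper is an assumption.

```python
import math

# Coefficients from Section 4 (c_nav > c_ask >> c_mem); lambda = 1.0.
C_NAV, C_ASK, C_MEM = 1.0, 0.5, 0.01
ALPHA, LAM = 0.2, 1.0

def step_cost(action, p_t, p_next, n_ask_so_far):
    """Per-step heterogeneous cost C(a_t)."""
    if isinstance(action, Navigate):
        return C_NAV * math.dist(p_t, p_next)       # c_nav * d(p_t, p_{t+1})
    if isinstance(action, Ask):
        return C_ASK * (1 + ALPHA * n_ask_so_far)   # query-fatigue escalation
    if isinstance(action, GetMemory):
        return C_MEM
    return 0.0                                      # Found and others are free

def net_return(task_reward, step_costs):
    """Trajectory objective G(tau) - lambda * C(tau)."""
    return task_reward - LAM * sum(step_costs)
```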

2. HC-GRPO Algorithmic Framework

Distinguishing itself from single-trajectory RL paradigms, HC-GRPO samples $G$ trajectories per instruction, forming a group-based performance context. For a fixed query $q$, the $G$ trajectories $\{\tau_i\}$ yield rewards $r_i = R_{\mathrm{task}}(\tau_i) - \lambda\, C(\tau_i)$, with group mean $\mu_r$ and standard deviation $\sigma_r$. The relative advantage for each sample is:

$$A_i = \frac{r_i - \mu_r}{\sigma_r + \varepsilon}$$

ensuring $\mathbb{E}[A_i] = 0$ for unbiasedness.
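
A minimal NumPy sketch of this normalization (the sample rewards are illustrative, not reported values):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mu_r) / (sigma_r + eps) over one group of G rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: G = 8 net rewards r_i = R_task - lambda * C for one instruction.
adv = group_relative_advantages([0.4, -1.9, 0.7, -2.4, 0.2, -1.6, 0.5, -2.1])
print(adv.mean())  # ~0: the group mean acts as a critic-free baseline
```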

The policy optimization objective uses the PPO-style clipped surrogate, where the importance sampling ratio is $\rho_i = \frac{\pi_\theta(\tau_i \mid q)}{\pi_{\mathrm{old}}(\tau_i \mid q)}$:

$$\mathcal{L}(\theta) = \mathbb{E}_{q \sim \mathcal{D}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i \right) \right] - \beta\, D_{\mathrm{KL}}\left[ \pi_\theta(\cdot \mid q)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid q) \right]$$

$\pi_{\mathrm{ref}}$ denotes the frozen supervised fine-tuning (SFT) policy; the KL term regularizes the current policy to remain close to this initialization.
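
A PyTorch-style sketch of this objective for one group follows. The trajectory log-probabilities are assumed to be precomputed sums over each rollout's tokens, and the single-sample KL estimate is a simplification, not necessarily the paper's estimator.

```python
import torch

def hc_grpo_loss(logp_new, logp_old, logp_ref, advantages,
                 clip_eps=0.2, beta=0.1):
    """Clipped surrogate with KL penalty over one group of G trajectories.

    logp_* hold each trajectory's summed log-probability under the current,
    rollout-time, and frozen SFT reference policies (shape [G]).
    """
    rho = torch.exp(logp_new - logp_old)                      # ratio rho_i
    surrogate = torch.min(
        rho * advantages,
        torch.clamp(rho, 1 - clip_eps, 1 + clip_eps) * advantages,
    ).mean()
    kl = (logp_new - logp_ref).mean()   # crude sample-based KL(pi_theta || pi_ref)
    return -(surrogate - beta * kl)     # negated: maximize via gradient descent
```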

Pseudocode Overview

The algorithm iterates through RL epochs; for each instruction in a batch, $G$ rollouts are generated and their group-relative advantages computed. Policy parameters are updated by gradient ascent on the objective $\mathcal{L}(\theta)$, with no requirement for a separate value-function network.
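
The toy loop below wires the pieces together end-to-end with a stand-in linear policy and random rewards; it demonstrates only the update mechanics, not the paper's actual MLLM rollout machinery, and reuses `hc_grpo_loss` from the sketch above.

```python
import torch

torch.manual_seed(0)
G, num_epochs = 8, 3
policy = torch.nn.Linear(4, 2)            # stand-in for the MLLM policy head
opt = torch.optim.AdamW(policy.parameters(), lr=2e-6)

for epoch in range(num_epochs):
    ctx = torch.randn(G, 4)               # dummy per-rollout context h_t
    dist = torch.distributions.Categorical(logits=policy(ctx))
    acts = dist.sample()
    logp_new = dist.log_prob(acts)        # current-policy log-probs, shape [G]
    logp_old = logp_new.detach()          # first inner step: old == current
    logp_ref = logp_new.detach()          # stand-in for the frozen SFT policy
    rewards = torch.randn(G)              # stand-in for R_task - lambda * C
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    loss = hc_grpo_loss(logp_new, logp_old, logp_ref, adv)
    loss.backward()
    opt.step(); opt.zero_grad()
```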

3. Theoretical Properties

Key theoretical attributes of HC-GRPO include:

  • Unbiased Baseline: The empirical group mean provides a baseline such that $\mathbb{E}_i[A_i] = 0$, preserving the unbiasedness of policy gradients.
  • Variance Reduction: Per-group normalization yields lower-variance gradient estimates than single-trajectory critic estimates, which is especially beneficial in the high-dimensional chain-of-thought (CoT) spaces generated by MLLMs.
  • KL-Regularization: Explicit KL divergence between the current policy and the SFT reference $\pi_{\mathrm{ref}}$ keeps policy updates within a trust region, following the theoretical underpinnings of KL-constrained PPO.
  • Critic Elimination: HC-GRPO dispenses with the value network $V_\phi(h)$ traditionally employed in PPO, substituting group empirical baselines.

4. Implementation Protocols and Hyperparameters

The backbone MLLM is Qwen2.5-VL-7B. The training protocol comprises two stages:

  • SFT Stage: AdamW optimizer, learning rate $1 \times 10^{-5}$, batch size 16, 1 epoch, cosine learning rate decay (min 0.1).
  • HC-GRPO Stage: Learning rate $2 \times 10^{-6}$, batch size 8, 3 epochs, discount factor $\gamma = 0.99$, KL penalty $\beta = 0.1$, PPO clip $\epsilon = 0.2$, group size $G = 8$, cost trade-off $\lambda = 1.0$.

Cost parameters are $c_{\mathrm{nav}} = 1.0$, $c_{\mathrm{ask}} = 0.5$, $c_{\mathrm{mem}} = 0.01$, and query fatigue $\alpha = 0.2$. Rewards are $R_{\mathrm{success}} = +1.0$ and $R_{\mathrm{fail}} = -0.1$. The scheme additionally includes an entropy bonus with coefficient $c_2 = 0.01$ and a format-penalty cost of $0.1$.
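
For reference, the reported settings can be consolidated into a single configuration object; the dataclass below is an organizational convenience, not the paper's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HCGRPOConfig:
    # SFT stage
    sft_lr: float = 1e-5
    sft_batch_size: int = 16
    sft_epochs: int = 1
    # HC-GRPO stage
    rl_lr: float = 2e-6
    rl_batch_size: int = 8
    rl_epochs: int = 3
    gamma: float = 0.99          # discount factor
    kl_beta: float = 0.1         # KL penalty coefficient
    clip_eps: float = 0.2        # PPO clip epsilon
    group_size: int = 8          # G
    lam: float = 1.0             # cost trade-off lambda
    # Costs and rewards
    c_nav: float = 1.0
    c_ask: float = 0.5
    c_mem: float = 0.01
    alpha: float = 0.2           # query fatigue
    r_success: float = 1.0
    r_fail: float = -0.1
    entropy_coef: float = 0.01   # c_2
    format_penalty: float = 0.1
```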

5. Empirical Outcomes

HC-GRPO’s efficacy is substantiated via extensive experiments in the AI2-THOR simulated environment. ESearch-R1, trained using HC-GRPO, achieves a success rate of 61.5%, surpassing the best ReAct baseline (60.0%). The mean total task cost (TTC) is halved (from approximately 3.3 to 1.6), and the success-weighted-by-cost (SwC) metric increases from 0.36 to 0.59, demonstrating marked improvements in operational efficiency.

Ablative analyses indicate that omitting dialogue reduces the success rate (SR) to 10.5%, while excluding memory components yields SR 52.0% and raises TTC to 2.3. Training without HC-GRPO (SFT only) results in SR 59.2% and TTC 2.3.

Sensitivity studies confirm that the policy retains superior cost-weighted performance across broad ranges of $c_{\mathrm{nav}}$ and $c_{\mathrm{ask}}$, highlighting meta-policy generalization. Qualitatively, emergent strategies prioritize minimal, targeted disambiguation (one Ask or memory lookup) before movement—reflecting an efficient, human-like cost-aware search heuristic.

6. Context and Significance

HC-GRPO constitutes a substantial departure from critic-based on-policy RL in high-cost, multimodal domains. By aligning optimization with the relative efficacy of reasoning-action trajectories under explicit cost structures, the method advances the ability of MLLM agents to operate strategically under real-world constraints. This innovation is particularly salient given the operational asymmetry and cost diversity inherent in embodied instruction-following tasks.

Validations in the ESearch-R1 system demonstrate considerable practical gains, underscoring the robustness of group-relative baseline techniques and their suitability for RL fine-tuning of large, generative multimodal models in interactive, physical contexts (Zhou et al., 21 Dec 2025).
