MHGPO: Multi-Agent Policy Optimization
- MHGPO is a reinforcement learning framework designed for optimizing LLM-based multi-agent systems using a critic-free, group-relative advantage estimation approach.
- It employs innovative group rollout sampling strategies, such as Independent Sampling, Fork-on-First, and Round-Robin, to balance accuracy and computational overhead.
- Empirical evaluations show MHGPO achieves superior F1 scores on tasks like HotpotQA while reducing memory usage and improving training stability.
Multi-Agent Heterogeneous Group Policy Optimization (MHGPO) is a reinforcement learning framework designed for optimizing LLM-based multi-agent systems (MAS) in a cooperative, parameter-shared, decentralized-execution setting. Unlike conventional multi-agent reinforcement learning (MARL) algorithms such as Multi-Agent Proximal Policy Optimization (MAPPO), which rely on critic networks to estimate value functions and facilitate policy updates, MHGPO eschews value-based critics altogether. Instead, it employs a group-based relative advantage estimation within batches of heterogeneous rollout trajectories, realizing a stable, memory-efficient, and scalable optimization regime. Empirical evidence demonstrates superior task performance and computational efficiency compared to MAPPO on LLM-based multi-agent search systems (Chen et al., 3 Jun 2025).
1. Formal Problem Definition
MHGPO formalizes the MAS setting as a parameter-shared, cooperative-training, decentralized-execution MARL problem in which heterogeneous agents are all instantiated from a single LLM backbone. The system state $s_t$ is the current prompt delivered to the next pipeline agent, and each agent emits a variable-length token sequence $a_t$ that constitutes its action. The deterministic transition function is determined by prompt concatenation and agent routing rules. Reward assignment consists of a global shared reward (e.g., F1 score against gold answers) retroactively distributed along the agent chain, plus step-level agent-specific penalties for malformed outputs. All agents are parameterized by the joint policy $\pi_\theta$. The optimization objective is to maximize the cumulative expected (undiscounted) return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} r_t\Big].$$
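The minimal sketch below illustrates this formalization under illustrative assumptions (generic agent callables, a plain concatenation transition, and a `check_format` penalty hook are stand-ins, not the authors' implementation): states are prompts, transitions are deterministic prompt concatenation with routing, and the shared end-of-chain reward plus step-level penalties is assigned to every agent in the chain.

```python
# Illustrative sketch of the MAS formalization (not the authors' implementation).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    agent_id: int
    prompt: str      # state s_t delivered to this pipeline agent
    output: str      # variable-length token sequence a_t emitted by the agent
    penalty: float   # agent-specific penalty for malformed output (<= 0)

def run_pipeline(question: str,
                 agents: List[Callable[[str], str]],
                 check_format: Callable[[int, str], float]) -> List[Step]:
    """Roll out one trajectory through the fixed agent chain."""
    steps, prompt = [], question
    for i, agent in enumerate(agents):
        output = agent(prompt)                      # sample from pi_theta(a_t | s_t)
        steps.append(Step(i, prompt, output, check_format(i, output)))
        prompt = prompt + "\n" + output             # deterministic transition: concatenation + routing
    return steps

def assign_rewards(steps: List[Step], shared_reward: float) -> List[float]:
    """Distribute the shared end-of-chain reward (e.g., final-answer F1)
    back along the chain and add each agent's step-level penalty."""
    return [shared_reward + s.penalty for s in steps]

# Toy usage with stub agents
agents = [lambda p: "rewritten query", lambda p: "selected snippet", lambda p: "final answer"]
traj = run_pipeline("Who wrote Hamlet?", agents, check_format=lambda i, out: 0.0)
print(assign_rewards(traj, shared_reward=0.7))
```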
2. Critic-Free Algorithmic Framework
MHGPO eliminates the need for value-network or critic function approximation by employing group-based advantage estimation (Group-Relative Advantage, GRA). Each mini-batch training iteration proceeds as follows:
- Group Rollout Sampling: For each input question $q$, sample a group of $G$ full trajectories from the rollout (old) policy $\pi_{\theta_{\text{old}}}$.
- Backward Reward Propagation: Assign the shared reward $R$ to each final answer and propagate it backward along the agent chain via averaging. Agent-specific penalties are added to obtain the total reward $r_{i,g}$ for each agent $i$ and trajectory $g$.
- Group-Relative Advantage Calculation: for agent $i$ and rollout $g$,
  $$\hat{A}_{i,g} = \frac{r_{i,g} - \operatorname{mean}(\mathcal{G}_i)}{\operatorname{std}(\mathcal{G}_i)}, \quad \text{with } \mathcal{G}_i = \{r_{i,1}, \ldots, r_{i,G}\},$$
  where the group $\mathcal{G}_i$ indexes the collection of rollouts that share a sampling context.
- Critic-Free Policy Gradient Update: Using PPO-style clipping and a KL penalty toward the reference policy $\pi_{\text{ref}}$, the loss is
  $$\mathcal{L}(\theta) = -\,\mathbb{E}\Big[\min\big(\rho_t\, \hat{A}_{i,g},\; \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, \hat{A}_{i,g}\big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big],$$
  where $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. Minimal sketches of the advantage computation and this update follow this list and Section 4, respectively.
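The following is a minimal sketch of the group-relative advantage step, assuming the standard within-group mean/std normalization written above (function and variable names are illustrative):

```python
# Minimal GRA sketch: normalize one agent's per-rollout total rewards within
# the group that shares a sampling context, so no learned critic is required.
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), total rewards r_{i,g} for one agent's group of G rollouts.
    Returns the standardized advantages A_hat_{i,g}."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: one agent's group of G = 4 rollouts
print(group_relative_advantage(np.array([0.61, 0.40, 0.75, 0.40])))
```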
3. Group Rollout Sampling Methodologies
MHGPO’s scalability and stability are shaped by the choice of group rollout sampling strategy; all strategies rely on a common fork utility that handles pipeline branching:
- Independent Sampling (IS): For each agent $i$ and each input question $q$, fork only at agent $i$, producing $G$ continuations; this is repeated for every agent, yielding homogeneous per-agent groups. Low variance per agent and no cross-agent coupling, at the cost of additional rollouts per sample.
- Fork-on-First (FoF): Forks only at the first pipeline agent, so downstream agents receive $G$ distinct inputs and groups are always of size $G$. Captures full pipeline dependencies and reaches higher ultimate accuracy, at the cost of $G$ full-pipeline rollouts per question.
- Round-Robin (RR): Randomly selects the fork point per input question from a categorical distribution $\{p_1, \ldots, p_N\}$ over agents, with post-hoc re-grouping to guarantee equal group sizes. Allows tuning between compute cost and variance, with lower per-sample agent-call overhead than FoF.
Sampling strategy selection thus trades off per-agent variance, inter-agent dependency modeling, and computational overhead; the fork-point logic of the three strategies is sketched below.
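The sketch below illustrates how each strategy could choose fork points for an $N$-agent pipeline; the function names and interfaces are assumptions made for exposition, not the paper's fork utility.

```python
# Illustrative fork-point selection for the three group rollout strategies.
# Each returned list of agent indices says where the pipeline is branched
# into G continuations for one input question.
import random
from typing import List

def independent_sampling(n_agents: int) -> List[int]:
    # IS: fork separately at every agent, giving one homogeneous
    # G-sized group per agent (no cross-agent coupling).
    return list(range(n_agents))

def fork_on_first(n_agents: int) -> List[int]:
    # FoF: fork only at the first agent; all downstream agents inherit
    # G distinct inputs, so full pipeline dependencies are captured.
    return [0]

def round_robin(n_agents: int, probs: List[float]) -> List[int]:
    # RR: sample a single fork point per question from a categorical
    # distribution over agents; re-grouping afterwards keeps group sizes equal.
    return [random.choices(range(n_agents), weights=probs, k=1)[0]]

# Example with a 3-agent pipeline and RR biased toward early forks
print(independent_sampling(3), fork_on_first(3), round_robin(3, [0.5, 0.3, 0.2]))
```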
4. Policy Update Mechanism
MHGPO combines the above mechanisms into a PPO-inspired gradient update, reusing the clipped surrogate objective and KL penalty defined in Section 2 while eschewing the learning and maintenance of a value function. The KL penalty coefficient $\beta$ and PPO clipping parameter $\epsilon$ serve the same stabilizing function as in single-agent PPO, and the absence of a value-function network eliminates memory-intensive critic computations.
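A minimal sketch of this critic-free update is given below, assuming token-level log-probabilities and a k3-style KL estimate toward a frozen reference policy; both are assumptions for illustration, not the authors' exact implementation.

```python
# Sketch of a PPO-style clipped loss weighted by group-relative advantages,
# with a KL penalty toward a frozen reference policy (no value network).
import torch

def mhgpo_style_loss(logp_new: torch.Tensor,   # log pi_theta(a|s), shape (T,)
                     logp_old: torch.Tensor,   # log pi_theta_old(a|s)
                     logp_ref: torch.Tensor,   # log pi_ref(a|s)
                     advantage: torch.Tensor,  # broadcast GRA for this rollout
                     clip_eps: float = 0.2,
                     kl_coef: float = 0.01) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                       # rho_t
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    surrogate = torch.min(unclipped, clipped).mean()
    # Non-negative k3-style estimate of KL(pi_theta || pi_ref)
    log_r = logp_ref - logp_new
    kl = (torch.exp(log_r) - log_r - 1).mean()
    return -(surrogate - kl_coef * kl)
```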
5. Empirical and Theoretical Comparisons with MAPPO
Comparison with MAPPO—the primary baseline—highlights key theoretical and practical distinctions:
| Algorithm | Critic Needed | Memory Usage | Training Stability | Compute Cost | Final F1 (HotpotQA) |
|---|---|---|---|---|---|
| MAPPO | Yes | High | Less stable | High | 46.40% |
| MHGPO-IS | No | Low | Most stable, lower accuracy ceiling | Moderate | 45.58% |
| MHGPO-FoF | No | Low | High, slower convergence | Moderate | 49.43% |
| MHGPO-RR | No | Low | Balanced | Lowest | 49.72% |
- MAPPO requires a full-size critic network for each agent, incurring up to 30–40% higher GPU memory usage and greater training instability, particularly in heterogeneous output regimes.
- MHGPO’s critic-free paradigm not only reduces hardware demands but also delivers smoother convergence curves and less collapse across agents (Chen et al., 3 Jun 2025).
6. Experimental Protocol and Quantitative Outcomes
Evaluations were conducted in a multi-agent search system (MASS) with a three-agent pipeline: Rewriter (query generation), Reranker (snippet selection), and Answerer (final synthesis). The system used Contriever-based retrieval from Wikipedia, and was benchmarked on HotpotQA (in-domain), 2WikiMultiHopQA, and MuSiQue (out-of-domain) using metrics such as Accuracy, Exact Match (EM), and F1.
Key experimental hyperparameters included a batch size of 512, one RL epoch per step, PPO epochs = 1, a fixed group size $G$, and the RR fork-probability distribution described above. Results demonstrated that all MHGPO variants outperformed MAPPO on in-domain and out-of-domain tasks, with MHGPO-FoF and MHGPO-RR achieving the highest F1 scores (49.43% and 49.72%, respectively) on HotpotQA, compared to MAPPO's 46.40%. The IS strategy converged fastest but to a lower final score, FoF reached the highest final accuracy with slower convergence, and RR offered near-FoF accuracy at 15–20% lower rollout overhead (Chen et al., 3 Jun 2025).
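For concreteness, the shared reward signal is an F1 score against the gold answer; the sketch below shows a standard HotpotQA-style token-level F1 (the normalization here is deliberately minimal and is a simplification, not necessarily the paper's exact scorer).

```python
# Standard token-overlap F1 between a predicted and a gold answer string.
from collections import Counter

def f1_reward(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_reward("the eiffel tower in paris", "Eiffel Tower"))  # partial credit ~0.57
```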
7. Ablation, Sensitivity, and Design Trade-Offs
- Group Size ($G$): Increasing $G$ reduces estimator variance at the expense of linear growth in rollout cost; a moderate $G$ was found to offer a balanced trade-off (see the numerical sketch after this list).
- Sampling Strategy Effects: IS minimizes agent interference but fails to capture full pipeline dependencies. FoF maximizes global coupling for the highest accuracy at moderate cost. RR strikes a balance between accuracy (near FoF) and rollout efficiency.
- RR Probability Tuning ($\{p_i\}$): Biasing RR toward early fork points (i.e., assigning higher probability to forking at the first agents) stabilizes downstream agents by enforcing early coupling.
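The following quick numerical illustration (not drawn from the paper) shows the group-size trade-off: with simulated rewards, the noise of the group-mean baseline used in GRA shrinks as $G$ grows, while rollout cost grows linearly.

```python
# Simulated demonstration that the empirical group-mean baseline becomes less
# noisy as the group size G increases (values are synthetic, for illustration).
import numpy as np

rng = np.random.default_rng(0)
true_mean = 0.5  # hypothetical expected reward
for G in (2, 4, 8, 16):
    # Std of the group-mean baseline across many simulated groups of size G
    baselines = rng.normal(true_mean, 0.2, size=(10_000, G)).mean(axis=1)
    print(f"G={G:2d}  baseline std ~ {baselines.std():.3f}  (rollout cost grows linearly in G)")
```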
8. Scope, Limitations, and Prospects
Experiments to date focus on a fixed three-agent QA pipeline. Adaptation to larger agent pools, cyclic topologies, or dynamic agent assignment is not yet demonstrated. Training does not incorporate warm-up phases such as supervised fine-tuning (SFT); integration with SFT or direct preference optimization (DPO) is posited as a possible enhancement. Other prospective extensions include dynamic or hierarchical group sizing, handling branching MAS graph structures, and incorporating off-policy correction to leverage historical rollouts and improve sample efficiency.
A plausible implication is that the group advantage approach in MHGPO could generalize to MAS with more intricate coordination flows, variable agent roles, or open-ended dialogue architectures, contingent on future empirical validation.
MHGPO thus constitutes an efficient, scalable, and empirically validated alternative to critic-dependent MARL approaches in LLM-driven MAS, leveraging heterogeneity-aware group-based optimization principles for end-to-end policy improvement (Chen et al., 3 Jun 2025).