MHGPO: Multi-Agent Heterogeneous Group Policy Optimization

Updated 27 November 2025
  • MHGPO is a reinforcement learning framework designed for optimizing LLM-based multi-agent systems using a critic-free, group-relative advantage estimation approach.
  • It employs innovative group rollout sampling strategies, such as Independent Sampling, Fork-on-First, and Round-Robin, to balance accuracy and computational overhead.
  • Empirical evaluations show MHGPO achieves superior F1 scores on tasks like HotpotQA while reducing memory usage and improving training stability.

Multi-Agent Heterogeneous Group Policy Optimization (MHGPO) is a reinforcement learning framework designed for optimizing LLM-based multi-agent systems (MAS) in a cooperative, parameter-shared, decentralized-execution setting. Unlike conventional multi-agent reinforcement learning (MARL) algorithms such as Multi-Agent Proximal Policy Optimization (MAPPO), which rely on critic networks to estimate value functions and facilitate policy updates, MHGPO eschews value-based critics altogether. Instead, it employs a group-based relative advantage estimation within batches of heterogeneous rollout trajectories, realizing a stable, memory-efficient, and scalable optimization regime. Empirical evidence demonstrates superior task performance and computational efficiency compared to MAPPO on LLM-based multi-agent search systems (Chen et al., 3 Jun 2025).

1. Formal Problem Definition

MHGPO formalizes the MAS setting as a parameter-shared, cooperative-training, decentralized-execution MARL problem, where $n$ heterogeneous agents $A_1, \dots, A_n$ are all instantiated from a single LLM backbone. System states $s \in S$ represent the current prompt delivered to the next pipeline agent, while each agent $A_k$ emits a variable-length token sequence $o_k$, with actions $a^t_k = o^t_k \in \mathcal{V}$ composing the output. The deterministic transition function $P(s' \mid s, a_k)$ is determined by prompt concatenation and agent routing rules. Reward assignment consists of a global shared reward $R^{\rm shared}$ (e.g., F1 score against gold answers) retroactively distributed along the agent chain, plus step-level agent-specific penalties $R_k^{\rm spe}(q_k, o_k)$ for malformed outputs. All agents are parameterized by the joint policy $\pi_\theta(o \mid q)$. The optimization objective is to maximize the cumulative expected (undiscounted) return:

$$J(\theta)=\mathbb{E}_{q\sim D,\;o\sim\pi_\theta}\left[\sum_{k=1}^n\sum_{t=1}^{T_k}R^t_{k}\right]$$
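The setting above can be illustrated with a minimal rollout sketch in Python. This is not the authors' implementation: `llm_agent`, `agent_prompts`, `f1_score`, and `format_penalty` are hypothetical placeholders standing in for the shared LLM backbone, the per-agent prompt templates, the shared-reward metric, and the step-level penalty.

```python
# Minimal sketch of the pipeline rollout and reward assignment described above.
# All names are illustrative placeholders, not the paper's code.

def rollout(question, agent_prompts, llm_agent, gold_answer, f1_score, format_penalty):
    """Run the agent pipeline once and assign rewards.

    The state s is the running prompt; each agent A_k appends its output o_k
    (prompt concatenation gives the deterministic transition P(s' | s, a_k)).
    The shared reward R^shared is the F1 of the final answer, propagated to
    every agent in the chain, plus a per-agent penalty R_k^spe for malformed output.
    """
    state = question
    outputs, penalties = [], []
    for template in agent_prompts:                 # A_1, ..., A_n share one LLM backbone
        prompt = template.format(state=state)      # assumes a "{state}" slot in the template
        o_k = llm_agent(prompt)                    # variable-length token sequence o_k
        outputs.append(o_k)
        penalties.append(format_penalty(o_k))      # step-level R_k^spe
        state = state + "\n" + o_k                 # deterministic transition via concatenation

    shared = f1_score(outputs[-1], gold_answer)    # R^shared from the final answer
    rewards = [shared + p for p in penalties]      # total per-agent rewards R_k
    return outputs, rewards
```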

2. Critic-Free Algorithmic Framework

MHGPO eliminates the need for value-network or critic function approximation by employing a group-based advantage estimation (Group-Relative Advantage, GRA). Each mini-batch training iteration follows:

  • Group Rollout Sampling: For each input question $q$, sample $G$ full trajectories $\{\mathcal{T}_i\}_{i=1}^G$ under a reference policy $\theta_{\rm ref}$.
  • Backward Reward Propagation: Assign shared reward $R^{\rm shared}_i$ to each final answer and propagate rewards backward via averaging. Agent-specific penalties $R^{\rm spe}_{k,i}$ are added to obtain total $R_{k,i}$ per agent and trajectory.
  • Group-Relative Advantage Calculation (a sketch follows this list):

$$\Delta_{k,i} = \hat{A}_{k,i} = \frac{R_{k,i} - \mu_g}{\sigma_g}$$

with

$$\mu_g = \frac{1}{|g|} \sum_{(l,j)\in g} R_{l,j}\,, \qquad \sigma_g = \sqrt{\frac{1}{|g|} \sum_{(l,j)\in g}\left(R_{l,j} - \mu_g\right)^2},$$

where the group $g = m_{k,i}$ indexes the collection of rollouts that share a sampling context.

  • Critic-Free Policy Gradient Update: Using PPO-style clipping and reference policy KL-penalization, the loss is

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{q,\,\{o_{k,i}\}}\left[\frac{1}{n}\sum_{k=1}^n \frac{1}{G_k}\sum_{i=1}^{G_k} \sum_{t=1}^{T_{k,i}} \min\left(r^t_{k,i}\,\Delta^t_{k,i},\;\mathrm{clip}\left(r^t_{k,i},\,1-\epsilon,\,1+\epsilon\right)\Delta^t_{k,i}\right)\right] + \beta\, D_{\rm KL}\left(\pi_\theta\,\|\,\pi_{\rm ref}\right)$$

where $r^t_{k,i} = \dfrac{\pi_\theta(o^t_{k,i}\mid q,\,o^{<t}_{k,i})}{\pi_{\rm ref}(o^t_{k,i}\mid q,\,o^{<t}_{k,i})}$.
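The group-relative advantage step reduces to a few lines of NumPy. The sketch below assumes the per-(agent, rollout) total rewards and their group indices have already been collected; the function name and the small `eps` added for numerical stability are illustrative choices, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, group_ids, eps=1e-8):
    """Sketch of the GRA step: normalize each total reward R_{k,i} against the
    mean and standard deviation of its group g = m_{k,i} (no critic involved)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    group_ids = np.asarray(group_ids)
    advantages = np.empty_like(rewards)
    for g in np.unique(group_ids):
        mask = group_ids == g
        mu, sigma = rewards[mask].mean(), rewards[mask].std()
        advantages[mask] = (rewards[mask] - mu) / (sigma + eps)   # Delta_{k,i}
    return advantages

# Example: two groups of two rollouts each.
adv = group_relative_advantages([0.8, 0.2, 0.5, 0.7], [0, 0, 1, 1])
```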

3. Group Rollout Sampling Methodologies

MHGPO’s scalability and stability are shaped by the choice of group rollout sampling strategy. Each strategy relies on a base fork utility that handles pipeline branching:

  • Independent Sampling (IS): For each agent $A_i$ and each $q$, fork only at $A_i$, producing $G$ continuations; this is repeated for $i = 1,\dots, n$, resulting in homogeneous groups. Low variance per agent, no cross-agent coupling, $nG$ rollouts per sample.
  • Fork-on-First (FoF): Forks only at the first pipeline agent, so downstream agents receive $G$ distinct inputs and groups are always of size $G$. Captures full pipeline dependencies, $nG$ agent calls per batch, higher ultimate accuracy.
  • Round-Robin (RR): Randomly selects the fork point per $q$ using categorical probabilities $p_1,\dots,p_n$, with post-hoc re-grouping to guarantee equal group sizes. Allows tuning between compute cost and variance, with typical overhead of $(n-1)+G$ agent calls per sample (fork-point selection is illustrated in the sketch below).

Sampling strategy selection affects agent variance, inter-agent dependency modeling, and computational overhead.
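As an illustration of the Round-Robin strategy, the sketch below samples a fork point per question from a categorical distribution over the $n$ pipeline agents. The function and variable names are illustrative, and the post-hoc re-grouping step is omitted.

```python
import numpy as np

def sample_fork_point(probs, rng=None):
    """Round-Robin fork-point selection: choose which pipeline agent to fork at,
    according to categorical probabilities (p_1, ..., p_n)."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    return int(rng.choice(len(probs), p=probs / probs.sum()))

# With the RR probabilities used in the experiments, (0.7, 0.1, 0.2), the fork
# lands at the first agent 70% of the time, the second 10%, and the third 20%.
fork_agent = sample_fork_point([0.7, 0.1, 0.2])
```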

4. Policy Update Mechanism

MHGPO combines the above mechanisms into a PPO-inspired gradient update step, eschewing the learning and maintenance of a value function:

$$\theta \leftarrow \theta - \alpha\,\nabla_\theta\left[-\frac{1}{n}\sum_{k,i,t}\min\left(r^t_{k,i}\Delta^t_{k,i},\;\mathrm{clip}\left(r^t_{k,i},\,1-\epsilon,\,1+\epsilon\right)\Delta^t_{k,i}\right) + \beta\,D_{\rm KL}\left(\pi_\theta\,\|\,\pi_{\rm ref}\right)\right]$$

The KL penalty coefficient $\beta$ and the PPO clipping parameter $\epsilon$ serve the same stabilizing function as in single-agent PPO. The absence of a value-function network eliminates the need for memory-intensive critic computations.
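A minimal token-level sketch of this loss in PyTorch, assuming the per-token log-probabilities under the current and reference policies and the broadcast group-relative advantages are already computed; the function name is illustrative, and the KL term uses a common sample-based estimator rather than anything specified in the paper.

```python
import torch

def clipped_group_relative_loss(logp_theta, logp_ref, advantages, eps=0.2, beta=0.01):
    """Clipped PPO-style surrogate with a KL penalty toward the reference policy.

    logp_theta, logp_ref : per-token log-probabilities, shape (num_tokens,)
    advantages           : group-relative advantage assigned to each token
    """
    ratio = torch.exp(logp_theta - logp_ref)                      # r^t_{k,i}
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Sample-based estimate of KL(pi_theta || pi_ref) (the "k3" estimator
    # commonly used with group-relative methods; an assumption here).
    log_ratio_ref = logp_ref - logp_theta
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1.0).mean()

    return -surrogate + beta * kl
```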

5. Empirical and Theoretical Comparisons with MAPPO

Comparison with MAPPO—the primary baseline—highlights key theoretical and practical distinctions:

| Algorithm | Critic Needed | Memory Usage | Training Stability | Compute Cost | Final F1 (HotpotQA) |
|---|---|---|---|---|---|
| MAPPO | Yes | High | Less stable | High | 46.40% |
| MHGPO-IS | No | Low | Most stable, lower accuracy ceiling | Moderate | 45.58% |
| MHGPO-FoF | No | Low | High, slower convergence | Moderate | 49.43% |
| MHGPO-RR | No | Low | Balanced | Lowest | 49.72% |
  • MAPPO requires a full-size critic network for each agent, incurring up to 30–40% higher GPU memory usage and greater training instability, particularly in heterogeneous output regimes.
  • MHGPO’s critic-free paradigm not only reduces hardware demands but also delivers smoother convergence curves and less collapse across agents (Chen et al., 3 Jun 2025).

6. Experimental Protocol and Quantitative Outcomes

Evaluations were conducted in a multi-agent search system (MASS) with a three-agent pipeline: Rewriter (query generation), Reranker (snippet selection), and Answerer (final synthesis). The system used Contriever-based retrieval from Wikipedia, and was benchmarked on HotpotQA (in-domain), 2WikiMultiHopQA, and MuSiQue (out-of-domain) using metrics such as Accuracy, Exact Match (EM), and F1.

Key experimental hyperparameters included batch size 512, one RL epoch per step, PPO epochs = 1, group size $G=4$, and RR rollout probabilities $(0.7, 0.1, 0.2)$. Results demonstrated that all MHGPO variants outperformed MAPPO on in-domain and out-of-domain tasks, with MHGPO-FoF and MHGPO-RR achieving the highest F1 scores (49.43% and 49.72% respectively) on HotpotQA, compared to MAPPO’s 46.40%. The IS strategy converged fastest but to a lower final score, FoF reached the highest final accuracy with slower convergence, and RR offered near-FoF accuracy at 15–20% lower rollout overhead (Chen et al., 3 Jun 2025).
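For reference, the reported hyperparameters can be collected into a configuration sketch. The key names below are illustrative placeholders rather than the authors' actual configuration schema.

```python
# Hypothetical training configuration mirroring the reported hyperparameters.
mhgpo_config = {
    "batch_size": 512,
    "rl_epochs_per_step": 1,
    "ppo_epochs": 1,
    "group_size_G": 4,
    "rollout_strategy": "round_robin",     # "independent", "fork_on_first", or "round_robin"
    "rr_fork_probs": (0.7, 0.1, 0.2),      # Rewriter, Reranker, Answerer
    "agents": ("Rewriter", "Reranker", "Answerer"),
}
```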

7. Ablation, Sensitivity, and Design Trade-Offs

  • Group Size ($G$): Increasing $G$ reduces estimator variance at the expense of linear growth in rollout cost (see the cost sketch after this list). $G=4$ was found to offer a balanced trade-off.
  • Sampling Strategy Effects: IS minimizes agent interference but fails to capture full pipeline dependencies. FoF maximizes global coupling for highest accuracy at moderate cost. RR strikes a balance between accuracy (near FoF) and rollout efficiency.
  • RR Probability Tuning ($\{p_i\}$): Biasing RR toward early forks (e.g., $p_1=0.7$) stabilizes downstream agents by enforcing early coupling.
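To make the rollout-cost trade-off concrete, the sketch below evaluates the per-question agent-call counts quoted in Section 3 ($nG$ for IS and FoF, $(n-1)+G$ for RR). It is an illustrative calculation only; actual overhead also depends on sequence lengths and pipeline details.

```python
def rollout_cost(strategy, n_agents, group_size):
    """Approximate agent calls per question, using the counts quoted above."""
    if strategy in ("independent", "fork_on_first"):
        return n_agents * group_size            # nG
    if strategy == "round_robin":
        return (n_agents - 1) + group_size      # (n - 1) + G
    raise ValueError(f"unknown strategy: {strategy}")

# Three-agent pipeline (Rewriter, Reranker, Answerer) with G = 4:
for s in ("independent", "fork_on_first", "round_robin"):
    print(s, rollout_cost(s, n_agents=3, group_size=4))    # 12, 12, 6
```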

8. Scope, Limitations, and Prospects

Experiments to date focus on a fixed three-agent QA pipeline. Adaptation to larger agent pools, cyclic topologies, or dynamic agent assignment is not yet demonstrated. Training does not incorporate warm-up phases such as supervised fine-tuning (SFT); integration with SFT or direct preference optimization (DPO) is posited as a possible enhancement. Other prospective extensions include dynamic or hierarchical group sizing, handling branching MAS graph structures, and incorporating off-policy correction to leverage historical rollouts and improve sample efficiency.

A plausible implication is that the group advantage approach in MHGPO could generalize to MAS with more intricate coordination flows, variable agent roles, or open-ended dialogue architectures, contingent on future empirical validation.

MHGPO thus constitutes an efficient, scalable, and empirically validated alternative to critic-dependent MARL approaches in LLM-driven MAS, leveraging heterogeneity-aware group-based optimization principles for end-to-end policy improvement (Chen et al., 3 Jun 2025).
