MHGPO: Multi-Agent Policy Optimization
- MHGPO is a reinforcement learning framework designed for optimizing LLM-based multi-agent systems using a critic-free, group-relative advantage estimation approach.
- It employs innovative group rollout sampling strategies, such as Independent Sampling, Fork-on-First, and Round-Robin, to balance accuracy and computational overhead.
- Empirical evaluations show MHGPO achieves superior F1 scores on tasks like HotpotQA while reducing memory usage and improving training stability.
Multi-Agent Heterogeneous Group Policy Optimization (MHGPO) is a reinforcement learning framework designed for optimizing LLM-based multi-agent systems (MAS) in a cooperative, parameter-shared, decentralized-execution setting. Unlike conventional multi-agent reinforcement learning (MARL) algorithms such as Multi-Agent Proximal Policy Optimization (MAPPO), which rely on critic networks to estimate value functions and facilitate policy updates, MHGPO eschews value-based critics altogether. Instead, it employs a group-based relative advantage estimation within batches of heterogeneous rollout trajectories, realizing a stable, memory-efficient, and scalable optimization regime. Empirical evidence demonstrates superior task performance and computational efficiency compared to MAPPO on LLM-based multi-agent search systems (Chen et al., 3 Jun 2025).
1. Formal Problem Definition
MHGPO formalizes the MAS setting as a parameter-shared, cooperative-training, decentralized-execution MARL problem in which heterogeneous agents are all instantiated from a single LLM backbone. The system state $s_t$ is the current prompt delivered to the next pipeline agent, and each agent emits a variable-length token sequence $a_t$ that constitutes its action. The deterministic transition function is determined by prompt concatenation and agent routing rules. Reward assignment consists of a global shared reward (e.g., F1 score against gold answers) retroactively distributed along the agent chain, plus step-level agent-specific penalties for malformed outputs. All agents are parameterized by the joint policy $\pi_\theta$. The optimization objective is to maximize the cumulative expected (undiscounted) return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} r_t\Big].$$
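The minimal sketch below illustrates this formalization under illustrative assumptions (generic agent callables, a plain concatenation transition, and a `check_format` penalty hook are stand-ins, not the authors' implementation): states are prompts, transitions are deterministic prompt concatenation with routing, and the shared end-of-chain reward plus step-level penalties is assigned to every agent in the chain.

```python
# Illustrative sketch of the MAS formalization (not the authors' implementation).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    agent_id: int
    prompt: str      # state s_t delivered to this pipeline agent
    output: str      # variable-length token sequence a_t emitted by the agent
    penalty: float   # agent-specific penalty for malformed output (<= 0)

def run_pipeline(question: str,
                 agents: List[Callable[[str], str]],
                 check_format: Callable[[int, str], float]) -> List[Step]:
    """Roll out one trajectory through the fixed agent chain."""
    steps, prompt = [], question
    for i, agent in enumerate(agents):
        output = agent(prompt)                      # sample from pi_theta(a_t | s_t)
        steps.append(Step(i, prompt, output, check_format(i, output)))
        prompt = prompt + "\n" + output             # deterministic transition: concatenation + routing
    return steps

def assign_rewards(steps: List[Step], shared_reward: float) -> List[float]:
    """Distribute the shared end-of-chain reward (e.g., final-answer F1)
    back along the chain and add each agent's step-level penalty."""
    return [shared_reward + s.penalty for s in steps]

# Toy usage with stub agents
agents = [lambda p: "rewritten query", lambda p: "selected snippet", lambda p: "final answer"]
traj = run_pipeline("Who wrote Hamlet?", agents, check_format=lambda i, out: 0.0)
print(assign_rewards(traj, shared_reward=0.7))
```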
2. Critic-Free Algorithmic Framework
MHGPO eliminates the need for value-network or critic function approximation by employing group-based advantage estimation (Group-Relative Advantage, GRA). Each mini-batch training iteration proceeds as follows:
- Group Rollout Sampling: For each input question $q$, sample a group of $G$ full trajectories from the rollout (old) policy $\pi_{\theta_{\text{old}}}$.
- Backward Reward Propagation: Assign the shared reward $R$ to each final answer and propagate it backward along the agent chain via averaging. Agent-specific penalties are added to obtain the total reward $r_{i,g}$ for each agent $i$ and trajectory $g$.
- Group-Relative Advantage Calculation: for agent $i$ and rollout $g$,
  $$\hat{A}_{i,g} = \frac{r_{i,g} - \operatorname{mean}(\mathcal{G}_i)}{\operatorname{std}(\mathcal{G}_i)}, \quad \text{with } \mathcal{G}_i = \{r_{i,1}, \ldots, r_{i,G}\},$$
  where the group $\mathcal{G}_i$ indexes the collection of rollouts that share a sampling context.
- Critic-Free Policy Gradient Update: Using PPO-style clipping and a KL penalty toward the reference policy $\pi_{\text{ref}}$, the loss is
  $$\mathcal{L}(\theta) = -\,\mathbb{E}\Big[\min\big(\rho_t\, \hat{A}_{i,g},\; \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, \hat{A}_{i,g}\big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big],$$
  where $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. Minimal sketches of the advantage computation and this update follow this list and Section 4, respectively.
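The following is a minimal sketch of the group-relative advantage step, assuming the standard within-group mean/std normalization written above (function and variable names are illustrative):

```python
# Minimal GRA sketch: normalize one agent's per-rollout total rewards within
# the group that shares a sampling context, so no learned critic is required.
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), total rewards r_{i,g} for one agent's group of G rollouts.
    Returns the standardized advantages A_hat_{i,g}."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: one agent's group of G = 4 rollouts
print(group_relative_advantage(np.array([0.61, 0.40, 0.75, 0.40])))
```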
3. Group Rollout Sampling Methodologies
MHGPO’s scalability and stability are shaped by the choice of group rollout sampling strategy; all strategies rely on a common fork utility that handles pipeline branching:
- Independent Sampling (IS): For each agent $i$ and each input question $q$, fork only at agent $i$, producing $G$ continuations; this is repeated for every agent, yielding homogeneous per-agent groups. Low variance per agent and no cross-agent coupling, at the cost of additional rollouts per sample.
- Fork-on-First (FoF): Forks only at the first pipeline agent, so downstream agents receive $G$ distinct inputs and groups are always of size $G$. Captures full pipeline dependencies and reaches higher ultimate accuracy, at the cost of $G$ full-pipeline rollouts per question.
- Round-Robin (RR): Randomly selects the fork point per input question from a categorical distribution $\{p_1, \ldots, p_N\}$ over agents, with post-hoc re-grouping to guarantee equal group sizes. Allows tuning between compute cost and variance, with lower per-sample agent-call overhead than FoF.
Sampling strategy selection thus trades off per-agent variance, inter-agent dependency modeling, and computational overhead; the fork-point logic of the three strategies is sketched below.
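The sketch below illustrates how each strategy could choose fork points for an $N$-agent pipeline; the function names and interfaces are assumptions made for exposition, not the paper's fork utility.

```python
# Illustrative fork-point selection for the three group rollout strategies.
# Each returned list of agent indices says where the pipeline is branched
# into G continuations for one input question.
import random
from typing import List

def independent_sampling(n_agents: int) -> List[int]:
    # IS: fork separately at every agent, giving one homogeneous
    # G-sized group per agent (no cross-agent coupling).
    return list(range(n_agents))

def fork_on_first(n_agents: int) -> List[int]:
    # FoF: fork only at the first agent; all downstream agents inherit
    # G distinct inputs, so full pipeline dependencies are captured.
    return [0]

def round_robin(n_agents: int, probs: List[float]) -> List[int]:
    # RR: sample a single fork point per question from a categorical
    # distribution over agents; re-grouping afterwards keeps group sizes equal.
    return [random.choices(range(n_agents), weights=probs, k=1)[0]]

# Example with a 3-agent pipeline and RR biased toward early forks
print(independent_sampling(3), fork_on_first(3), round_robin(3, [0.5, 0.3, 0.2]))
```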
4. Policy Update Mechanism
MHGPO combines the above mechanisms into a PPO-inspired gradient update, reusing the clipped surrogate objective and KL penalty defined in Section 2 while eschewing the learning and maintenance of a value function. The KL penalty coefficient $\beta$ and PPO clipping parameter $\epsilon$ serve the same stabilizing function as in single-agent PPO, and the absence of a value-function network eliminates memory-intensive critic computations.
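A minimal sketch of this critic-free update is given below, assuming token-level log-probabilities and a k3-style KL estimate toward a frozen reference policy; both are assumptions for illustration, not the authors' exact implementation.

```python
# Sketch of a PPO-style clipped loss weighted by group-relative advantages,
# with a KL penalty toward a frozen reference policy (no value network).
import torch

def mhgpo_style_loss(logp_new: torch.Tensor,   # log pi_theta(a|s), shape (T,)
                     logp_old: torch.Tensor,   # log pi_theta_old(a|s)
                     logp_ref: torch.Tensor,   # log pi_ref(a|s)
                     advantage: torch.Tensor,  # broadcast GRA for this rollout
                     clip_eps: float = 0.2,
                     kl_coef: float = 0.01) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                       # rho_t
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    surrogate = torch.min(unclipped, clipped).mean()
    # Non-negative k3-style estimate of KL(pi_theta || pi_ref)
    log_r = logp_ref - logp_new
    kl = (torch.exp(log_r) - log_r - 1).mean()
    return -(surrogate - kl_coef * kl)
```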
5. Empirical and Theoretical Comparisons with MAPPO
Comparison with MAPPO—the primary baseline—highlights key theoretical and practical distinctions:
| Algorithm | Critic Needed | Memory Usage | Training Stability | Compute Cost | Final F1 (HotpotQA) |
|---|---|---|---|---|---|
| MAPPO | Yes | High | Less stable | High | 46.40% |
| MHGPO-IS | No | Low | Most stable, lower accuracy ceiling | Moderate | 45.58% |
| MHGPO-FoF | No | Low | High, slower convergence | Moderate | 49.43% |
| MHGPO-RR | No | Low | Balanced | Lowest | 49.72% |
- MAPPO requires a full-size critic network for each agent, incurring up to 30–40% higher GPU memory usage and greater training instability, particularly in heterogeneous output regimes.
- MHGPO’s critic-free paradigm not only reduces hardware demands but also delivers smoother convergence curves and less collapse across agents (Chen et al., 3 Jun 2025).
6. Experimental Protocol and Quantitative Outcomes
Evaluations were conducted in a multi-agent search system (MASS) with a three-agent pipeline: Rewriter (query generation), Reranker (snippet selection), and Answerer (final synthesis). The system used Contriever-based retrieval from Wikipedia, and was benchmarked on HotpotQA (in-domain), 2WikiMultiHopQA, and MuSiQue (out-of-domain) using metrics such as Accuracy, Exact Match (EM), and F1.
Key experimental hyperparameters included a batch size of 512, one RL epoch per step, PPO epochs = 1, a fixed group size $G$, and the RR fork-probability distribution described above. Results demonstrated that all MHGPO variants outperformed MAPPO on in-domain and out-of-domain tasks, with MHGPO-FoF and MHGPO-RR achieving the highest F1 scores (49.43% and 49.72%, respectively) on HotpotQA, compared to MAPPO's 46.40%. The IS strategy converged fastest but to a lower final score, FoF reached the highest final accuracy with slower convergence, and RR offered near-FoF accuracy at 15–20% lower rollout overhead (Chen et al., 3 Jun 2025).
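For concreteness, the shared reward signal is an F1 score against the gold answer; the sketch below shows a standard HotpotQA-style token-level F1 (the normalization here is deliberately minimal and is a simplification, not necessarily the paper's exact scorer).

```python
# Standard token-overlap F1 between a predicted and a gold answer string.
from collections import Counter

def f1_reward(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_reward("the eiffel tower in paris", "Eiffel Tower"))  # partial credit ~0.57
```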
7. Ablation, Sensitivity, and Design Trade-Offs
- Group Size ($G$): Increasing $G$ reduces estimator variance at the expense of linear growth in rollout cost; a moderate $G$ was found to offer a balanced trade-off (see the numerical sketch after this list).
- Sampling Strategy Effects: IS minimizes agent interference but fails to capture full pipeline dependencies. FoF maximizes global coupling for the highest accuracy at moderate cost. RR strikes a balance between accuracy (near FoF) and rollout efficiency.
- RR Probability Tuning ($\{p_i\}$): Biasing RR toward early fork points (i.e., assigning higher probability to forking at the first agents) stabilizes downstream agents by enforcing early coupling.
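The following quick numerical illustration (not drawn from the paper) shows the group-size trade-off: with simulated rewards, the noise of the group-mean baseline used in GRA shrinks as $G$ grows, while rollout cost grows linearly.

```python
# Simulated demonstration that the empirical group-mean baseline becomes less
# noisy as the group size G increases (values are synthetic, for illustration).
import numpy as np

rng = np.random.default_rng(0)
true_mean = 0.5  # hypothetical expected reward
for G in (2, 4, 8, 16):
    # Std of the group-mean baseline across many simulated groups of size G
    baselines = rng.normal(true_mean, 0.2, size=(10_000, G)).mean(axis=1)
    print(f"G={G:2d}  baseline std ~ {baselines.std():.3f}  (rollout cost grows linearly in G)")
```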
8. Scope, Limitations, and Prospects
Experiments to date focus on a fixed three-agent QA pipeline. Adaptation to larger agent pools, cyclic topologies, or dynamic agent assignment is not yet demonstrated. Training does not incorporate warm-up phases such as supervised fine-tuning (SFT); integration with SFT or direct preference optimization (DPO) is posited as a possible enhancement. Other prospective extensions include dynamic or hierarchical group sizing, handling branching MAS graph structures, and incorporating off-policy correction to leverage historical rollouts and improve sample efficiency.
A plausible implication is that the group advantage approach in MHGPO could generalize to MAS with more intricate coordination flows, variable agent roles, or open-ended dialogue architectures, contingent on future empirical validation.
MHGPO thus constitutes an efficient, scalable, and empirically validated alternative to critic-dependent MARL approaches in LLM-driven MAS, leveraging heterogeneity-aware group-based optimization principles for end-to-end policy improvement (Chen et al., 3 Jun 2025).