M-GRPO: Multi-Agent Group Relative Policy Optimization

Updated 21 November 2025
  • The paper introduces group-normalized advantage estimation that stabilizes training by normalizing rewards within agent groups.
  • It employs a clipped surrogate loss with KL regularization to ensure smooth, robust policy updates in varying multi-agent interactions.
  • It demonstrates enhanced cooperation and efficiency across spatial public goods, hierarchical LLM systems, and general Markov games.

Multi-Agent Group Relative Policy Optimization (M-GRPO) is a family of reinforcement learning algorithms that extend Proximal Policy Optimization (PPO) to structured multi-agent domains. These methods use group-relative, rather than absolute or critic-based, advantage estimates and are tailored for scenarios where distributed, cooperative, or hierarchical agent interactions are central. M-GRPO includes variants with global cooperation constraints for structured population games, hierarchical multi-agent orchestration in tool-assisted LLM systems, and scalable learning in general Markov games. Its key algorithmic innovation is the normalization of reward signals within agent groups or trajectories, enabling efficient and stable policy optimization for a wide spectrum of multi-agent systems (Yang et al., 7 Oct 2025, Hong et al., 17 Nov 2025, Han et al., 24 Jun 2025).

1. Theoretical Foundations and Motivation

M-GRPO emerges from the intersection of deep multi-agent reinforcement learning and the limitations of classical methods such as PPO in cooperative or heterogeneous multi-agent settings. Traditional PPO computes per-agent or centralized absolute advantages, which may lead to credit assignment pathologies, sample inefficiency, and instability—especially in nonstationary, vertically or horizontally structured agent teams.

M-GRPO addresses these by (i) estimating policy advantages relative to the performance of a group of candidate rollouts, (ii) dynamically shaping incentives with group or global context, and (iii) supporting decoupled or marginal-benefit–guided policy updates. This paradigm generalizes smoothly across spatial population games, hierarchical LLM-based AI systems, and standard multi-agent Markov games (Yang et al., 7 Oct 2025, Hong et al., 17 Nov 2025, Han et al., 24 Jun 2025).

2. Core Algorithmic Components

The M-GRPO framework is unified by the following central mechanisms:

Group-Normalized Advantage Estimation:

For each agent or node, a batch of $G$ candidate actions or sub-trajectories is sampled under the reference (typically the “old”) policy. The returns $\{R^g\}_{g=1}^{G}$ are aggregated to compute the group mean $\mu$ and standard deviation $\sigma$, yielding group-normalized advantages:

$$\hat{A}^g = \frac{R^g - \mu}{\sigma}$$

This approach is used both in spatial population settings (Yang et al., 7 Oct 2025) and in node-wise optimization of hierarchical LLM systems (Hong et al., 17 Nov 2025, Han et al., 24 Jun 2025), removing scale biases and promoting fair credit assignment.
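As a minimal, self-contained sketch (not code from the cited papers), the group-normalized advantage computation can be written as follows; the small `eps` term is an added numerical safeguard:

```python
import numpy as np

def group_normalized_advantages(returns, eps=1e-8):
    """Normalize the G candidate returns of one agent or node to zero mean and
    unit standard deviation, yielding critic-free advantage estimates."""
    returns = np.asarray(returns, dtype=np.float64)
    mu, sigma = returns.mean(), returns.std()
    return (returns - mu) / (sigma + eps)  # eps guards against identical returns

# Example with G = 8 candidate rollouts for a single agent
print(group_normalized_advantages([1.0, 0.5, 2.0, 0.0, 1.5, 0.5, 1.0, 2.5]))
```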

Clipped Objective with Frozen Reference:

M-GRPO extends the PPO clipped surrogate loss:

$$\mathcal{L}_{\text{clip}}(\theta) = \mathbb{E}_g\left[\min\left(r^g(\theta)\,\hat{A}^g,\; \text{clip}\left(r^g(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}^g\right)\right]$$

where $r^g(\theta)$ is the likelihood ratio of candidate action $a^g$ under the updated versus the old policy. In spatial games, an additional reference-anchored KL penalty discourages abrupt policy changes:

$$\mathcal{L}_{\text{KL}}(\theta) = -\beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\theta_\text{ref}}\right)$$

This encourages smoother updates and prevents collapse to degenerate behaviors (Yang et al., 7 Oct 2025).
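The sketch below combines the clipped surrogate with the reference-anchored KL penalty in PyTorch; the function name, the per-sample log-ratio approximation of the KL term, and the default `beta` are assumptions made for illustration rather than the exact implementation of any of the cited papers.

```python
import torch

def mgrpo_clipped_loss(logp_new, logp_old, advantages,
                       eps=0.2, beta=0.01, logp_ref=None):
    """Clipped surrogate loss over a group of candidates, with an optional
    reference-anchored KL penalty (crude per-sample approximation)."""
    ratio = torch.exp(logp_new - logp_old)                      # r^g(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()                # maximize the surrogate
    if logp_ref is not None:
        # crude Monte Carlo stand-in for beta * D_KL(pi_theta || pi_ref)
        loss = loss + beta * (logp_new - logp_ref).mean()
    return loss

# Example with dummy values for G = 4 candidates
adv = torch.tensor([1.2, -0.3, 0.8, -1.7])
lp_new = torch.tensor([-1.0, -2.1, -0.7, -1.5], requires_grad=True)
lp_old = torch.tensor([-1.1, -2.0, -0.8, -1.4])
mgrpo_clipped_loss(lp_new, lp_old, adv).backward()
```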

Global and Marginal-Benefit Incentive Schemes:

Population-based M-GRPO introduces a Global Cooperation Constraint (GCC) that modulates payoffs for cooperating agents based on the global cooperation rate, incentivizing sustainable cooperation while suppressing extremes (Yang et al., 7 Oct 2025). The Markov game formulation (Han et al., 24 Jun 2025) implements a variance-aware marginal-benefit selection: among all agents or decision nodes, only the $K$ with the highest group-return variance are updated per iteration, further stabilizing learning and improving sample efficiency; a minimal sketch of this selection follows.
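In the sketch below, group returns are assumed to be already collected per agent; the function name and the use of raw variance as the marginal-benefit proxy are illustrative choices, not the exact criterion of Han et al.

```python
import numpy as np

def select_agents_by_group_variance(group_returns, k):
    """Variance-aware marginal-benefit selection: return the K agent ids whose
    G candidate returns show the highest variance, used here as a proxy for
    the marginal benefit of updating that agent."""
    variances = {agent: float(np.var(returns)) for agent, returns in group_returns.items()}
    return sorted(variances, key=variances.get, reverse=True)[:k]

# Example: agent "b" has the most spread-out returns, so it is selected first
print(select_agents_by_group_variance(
    {"a": [1.0, 1.1, 0.9], "b": [0.0, 2.0, 4.0], "c": [1.0, 1.0, 1.0]}, k=2))
```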

Hierarchical and Trajectory-Aligned Credit Assignment:

In hierarchical multi-agent LLM systems, main (planner) and sub-agent (executor) policies are jointly optimized with separate group-relative advantage normalization and carefully aligned trajectories, ensuring coherent updates even under asynchronous rollouts and variable sub-agent invocation frequencies (Hong et al., 17 Nov 2025).

3. Formalization Across Domains

3.1 Spatial Public Goods and Population Games

In the GRPO-GCC variant for spatial public goods games (Yang et al., 7 Oct 2025), the per-agent/group optimization is embedded in a structured lattice, with cooperation rates $g$ and payoffs modulated according to global states. Pseudocode for each training epoch involves candidate sample generation, payoff adjustment including the GCC incentive (sketched below), normalization, gradient steps, and reference-policy synchronization, supporting the stable emergence of sustained high cooperation.
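One way the GCC payoff adjustment could look is sketched below; the $g(1-g)$-shaped bonus and the `weight` parameter are purely illustrative assumptions standing in for the exact incentive defined by Yang et al.

```python
def gcc_adjusted_payoff(raw_payoff, is_cooperator, global_coop_rate, weight=0.5):
    """Hypothetical GCC incentive: cooperators receive a bonus that peaks at
    intermediate global cooperation and vanishes at both extremes, so
    cooperation is rewarded when sustainable without over-subsidizing
    saturated or collapsed states."""
    if not is_cooperator:
        return raw_payoff
    g = global_coop_rate
    return raw_payoff + weight * g * (1.0 - g)

print(gcc_adjusted_payoff(1.0, True, 0.5))   # mid-range cooperation -> maximal bonus
print(gcc_adjusted_payoff(1.0, True, 1.0))   # full cooperation -> no extra bonus
```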

3.2 Hierarchical Multi-Agent LLM Systems

The hierarchical extension of M-GRPO (Hong et al., 17 Nov 2025) partitions the system into a main agent $\mathcal{M}$ and $n$ sub-agents $\{\mathcal{S}_i\}_{i=1}^{n}$. For each query, $K$ rollouts are performed for both levels, rewards are computed, and group-normalized advantages are assigned. Sub-agent trajectory counts are synchronized via duplication or truncation to enable batch updates despite variance in invocation, a crucial requirement for vertical pipelines with dynamic tool usage.
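A simple realization of this trajectory-count synchronization is sketched below, assuming padding is done by cycling existing rollouts (the exact duplication rule is an assumption of this sketch):

```python
def align_rollout_counts(trajectories, k):
    """Truncate or pad (by cycling existing rollouts) a sub-agent's trajectory
    list to exactly K entries so main- and sub-agent batches stay aligned."""
    if not trajectories:
        raise ValueError("sub-agent produced no trajectories to align")
    if len(trajectories) >= k:
        return trajectories[:k]
    return [trajectories[i % len(trajectories)] for i in range(k)]

# A sub-agent invoked only twice still contributes K = 4 aligned trajectories
print(align_rollout_counts(["t1", "t2"], k=4))   # ['t1', 't2', 't1', 't2']
```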

Optimization proceeds independently for main and sub-agents on their respective servers, with communication limited to minimal statistics (e.g., log-probabilities, reward scalars). This architecture scales to large systems while avoiding backpropagation across server boundaries and heterogeneous agents.
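As an assumed illustration of such a minimal exchange format (field names are hypothetical; the text specifies only log-probabilities and reward scalars as shared quantities):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RolloutStats:
    """Assumed message schema for the minimal statistics exchanged between the
    main-agent and sub-agent training servers."""
    query_id: str
    agent_id: str
    old_logprobs: List[float]   # per-candidate log-probs under the frozen policy
    rewards: List[float]        # scalar rewards for the K aligned rollouts
```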

3.3 General Markov Games with Heterogeneous Agents

JoyAgents-R1 (Han et al., 24 Jun 2025) implements M-GRPO in the context of cooperative Markov games with $N$ heterogeneous agents. Node-wise Monte Carlo sampling enables per-agent group statistics, and a marginal-benefit-driven selection identifies which agents to update at each step, sharply reducing sample and compute costs compared to exhaustive all-agent updates. An adaptive memory evolution mechanism uses GRPO-derived rewards to reinforce or decay memory entries, further accelerating convergence without the need for auxiliary learning signals.
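A heavily simplified sketch of reward-driven memory evolution follows; the additive reinforcement, multiplicative decay, and pruning threshold are assumptions for illustration and do not reproduce the exact JoyAgents-R1 mechanism:

```python
def evolve_memory(memory, grpo_rewards, decay=0.9, prune_below=1e-3):
    """Reinforce memory entries whose associated group-normalized reward is
    positive, decay the rest, and prune entries whose score has collapsed."""
    for key, reward in grpo_rewards.items():
        score = memory.get(key, 0.0)
        memory[key] = score + reward if reward > 0 else score * decay
    return {k: v for k, v in memory.items() if v > prune_below}

memory = {"tool_hint": 0.5, "stale_plan": 0.01}
print(evolve_memory(memory, {"tool_hint": 0.8, "stale_plan": -0.4}))
```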

4. Empirical Evaluation and Performance Benchmarks

Empirical results across domains consistently show the benefits of M-GRPO:

GRPO-GCC achieves rapid onset of cooperation ($>80\%$ at $r \geq 3.6$, $100\%$ at $r = 5.0$) and high stability on $200 \times 200$ torus lattices, outperforming baseline GRPO, Q-learning, and the Fermi rule in cooperation threshold, convergence speed, and variance.

On GAIA, XBench-DeepSearch, and WebWalkerQA, M-GRPO secures gains of 5–8 percentage points in accuracy over both single-agent and main-only co-training baselines. Trajectory-aligned co-training accelerates and stabilizes learning.

| Benchmark | Single-Agent GRPO | Main-Only GRPO | M-GRPO (Co-training) |
|--------------------|-------------------|----------------|----------------------|
| GAIA | 42% | 47% | ~55% |
| XBench-DeepSearch | 48% | 52% | 60% |
| WebWalkerQA | 45% | 49% | 57% |

M-GRPO attains 10–15% higher accuracy with 30% fewer reasoning steps compared to naive GRPO or PPO-based baselines.

Ablation studies confirm that both hierarchical credit assignment and trajectory synchronization are essential for robust performance and sample efficiency.

5. Implementation and Algorithm Variants

Key hyperparameters for M-GRPO variants include learning rates ($\alpha$ typically $10^{-4}$ to $10^{-5}$), clipping ranges ($\epsilon = 0.1$–$0.2$), KL-penalty weights, group/candidate counts (e.g., $G = 8$), and training horizons (typically 1,000+ iterations) (Yang et al., 7 Oct 2025, Hong et al., 17 Nov 2025, Han et al., 24 Jun 2025).
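For orientation, the reported ranges can be collected into a single configuration sketch; `kl_beta` is an assumed placeholder since the text does not state its value:

```python
# Representative hyperparameter ranges reported across the M-GRPO variants;
# exact values differ per paper.
MGRPO_DEFAULTS = {
    "learning_rate": 1e-4,   # reported range: 1e-5 to 1e-4
    "clip_epsilon": 0.2,     # reported range: 0.1 to 0.2
    "kl_beta": 0.01,         # KL-penalty weight (value assumed for illustration)
    "group_size": 8,         # G candidate rollouts per agent or node
    "iterations": 1000,      # typical training horizon (1,000+ iterations)
}
```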

Variants within the M-GRPO family differ on:

  • Reward shaping signals (e.g., GCC for cooperation, accuracy/format in reasoning tasks)
  • Update selection (all-agent vs. marginal-benefit node selection / top-$K$)
  • Structures handled (horizontal agent swarms, vertical planner–executor hierarchies, memory evolution)
  • Server architecture (centralized, decoupled/multi-server with minimal coordination)

Generic pseudocode frameworks involve (i) candidate sample generation, (ii) per-group normalization, (iii) surrogate clipping objectives, (iv) policy and reference parameter maintenance, and (v) joint but decoupled policy updates.
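A generic skeleton of steps (i)–(iii) is sketched below, with the environment- and model-specific pieces left as caller-supplied callables (all names here are placeholders for illustration); reference-policy maintenance (iv) and decoupled update coordination (v) are left to the caller.

```python
import numpy as np

def mgrpo_generic_step(agent_ids, sample_candidates, compute_return,
                       apply_clipped_update, G=8):
    """One generic M-GRPO iteration.

    Caller-supplied callables:
      sample_candidates(agent_id, G) -> list of G candidate actions/trajectories
      compute_return(agent_id, candidate) -> scalar return
      apply_clipped_update(agent_id, candidates, advantages) -> one clipped
        surrogate step (e.g., via the loss sketched in Section 2).
    """
    for agent_id in agent_ids:
        candidates = sample_candidates(agent_id, G)                  # (i) candidate generation
        returns = np.array([compute_return(agent_id, c) for c in candidates])
        adv = (returns - returns.mean()) / (returns.std() + 1e-8)    # (ii) group normalization
        apply_clipped_update(agent_id, candidates, adv)              # (iii) clipped update
```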

6. Broader Implications and Open Directions

M-GRPO sets a new sample-efficient, credit-robust baseline for deep reinforcement learning in complex multi-agent systems. By obviating the need for critic networks and adopting a normalization-first approach, it alleviates instability and reward-misalignment risks. Its readily extensible architecture supports heterogeneous agent groups, nonstationary environments, and asynchronous, large-scale deployments.

A plausible implication is that M-GRPO could underlie next-generation orchestrated LLM agent frameworks, multi-robot collectives, and socio-technical simulations where group-level coordination and scalable training are paramount. Further generalizations may encompass dynamic group sizes, automated hyperparameter adaptation, and integration with meta-RL or population-based training techniques (Yang et al., 7 Oct 2025, Hong et al., 17 Nov 2025, Han et al., 24 Jun 2025).
