MAC-PO: Multi-Agent Regret-weighted Replay

Updated 22 June 2026

The paper introduces MAC-PO, a novel prioritized replay framework that minimizes collective regret in cooperative multi-agent settings.
It employs a closed-form priority integrating TD error, optimality gap, on-policy likelihood, and joint-action probabilities to boost sample efficiency.
Empirical results on Predator-Prey and SMAC benchmarks demonstrate superior win rates and faster convergence compared to established MARL methods.

Multi-Agent Regret-weighted Replay via Collective Priority Optimization (MAC-PO) is a prioritized experience replay framework specifically designed for cooperative multi-agent reinforcement learning (MARL) under the centralized training and decentralized execution (CTDE) paradigm. MAC-PO rigorously formulates the replay prioritization problem as a collective regret minimization over sampling weights, yielding a closed-form expression for sample priorities that integrates temporal-difference (TD) error, optimality gap, on-policy likelihood, and novel joint action factors unique to multi-agent settings. Empirical evaluation demonstrates MAC-PO's efficacy on benchmark environments, where it consistently surpasses established alternatives in sample efficiency and final performance (Mei et al., 2023).

1. Multi-Agent Markov Game Setting and Experience Replay

The MARL context considered is a decentralized partially observable Markov decision process (Dec-POMDP) operating under CTDE:

Agents and Actions: The agent set is $A = \{1,\ldots, n\}$, with each agent $a$ selecting $u_a \in U^a$. The joint action is $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$.
State and Transitions: Global state $s\in S$ evolves via $P(s'|s, u)$, with all agents sharing reward $r(s, u)$ and a common discount factor $\gamma \in [0, 1)$.
Observations and Policies: Each agent receives local observations $z \sim O(s, a)$ and maintains a private history $\tau_a \in (Z \times U^a)^*$. Decentralized policies $\pi_a(u_a \mid \tau_a)$ define the joint policy $\pi = (\pi_1, \ldots, \pi_n)$.
Experience Replay: A finite buffer $\mathcal{D}$ stores transitions $(s, u, r, s')$, generated off-policy. Standard approaches sample transitions uniformly or with fixed priorities. In MARL, uniform experience replay is sub-optimal, as it ignores sample importance and inter-agent policy dependencies.
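
As a point of reference for the replay setting described above, here is a minimal sketch of a finite buffer of joint transitions with both uniform and priority-proportional sampling. The class name `JointReplayBuffer` and its methods are hypothetical and are not taken from the paper or from PyMARL2; the priorities passed to `sample_weighted` are left to the caller.

```python
import random
from collections import deque


class JointReplayBuffer:
    """Finite FIFO buffer of joint transitions (s, u, r, s_next). Illustrative only."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transition once capacity is reached
        self.storage = deque(maxlen=capacity)

    def add(self, s, u, r, s_next):
        # u is the joint action: one entry per agent
        self.storage.append((s, tuple(u), r, s_next))

    def sample_uniform(self, batch_size, rng=random):
        # every stored transition is equally likely (sampling with replacement)
        return rng.choices(list(self.storage), k=batch_size)

    def sample_weighted(self, batch_size, priorities, rng=random):
        # transitions are drawn with probability proportional to `priorities`,
        # which must have one non-negative entry per stored transition
        return rng.choices(list(self.storage), weights=priorities, k=batch_size)
```

Uniform sampling treats every transition alike, which is the behavior the text calls sub-optimal; the weighted variant is where a learned priority would plug in.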

2. Regret Minimization and the Weighted Bellman-Error Objective

MAC-PO's central objective is to minimize policy regret, narrowing the gap between the expected return of the current joint policy and a nominal optimal joint policy.

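Policy Regret: For joint policy $\pi$, the expected (discounted) return is $\eta(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, u_t)\right]$, with regret defined as

$\mathrm{regret}(\pi) = \eta(\pi^*) - \eta(\pi)$

where $\pi^*$ is optimal.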
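
To make the regret definition concrete, the following sketch estimates it from sampled episodes: the discounted return of each episode is computed from the shared team rewards, and regret is the difference between the average return of a reference (optimal) policy and that of the current policy. The function names are hypothetical, and in practice the optimal policy's episodes are not available, which is why the method works with a surrogate bound instead.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode of shared team rewards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total  # Horner-style accumulation from the last step
    return total


def estimated_regret(optimal_episodes, policy_episodes, gamma):
    """Monte Carlo estimate of (return of reference policy) - (return of current policy)."""
    def mean_return(episodes):
        return sum(discounted_return(e, gamma) for e in episodes) / len(episodes)

    return mean_return(optimal_episodes) - mean_return(policy_episodes)
```

For example, with `gamma = 0.5`, a reference episode with rewards `[1, 1]` and a policy episode with rewards `[0, 1]` give returns 1.5 and 0.5, so the estimated regret is 1.0.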

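Weighted Bellman-Error Minimization: At iteration $k$, with current estimate $Q_{k-1}$, MAC-PO fits $Q_k$ by minimizing a prioritized, weighted Bellman error over transitions in the replay buffer $\mathcal{D}$:

$Q_k = \arg\min_{Q} \; \mathbb{E}_{(s,u) \sim \mathcal{D}}\left[ w_k(s,u)\, \big(Q(s,u) - \mathcal{B}^* Q_{k-1}(s,u)\big)^2 \right]$

s.t. $\mathbb{E}_{(s,u) \sim \mathcal{D}}[w_k(s,u)] = 1$ and $w_k(s,u) \ge 0$, where $\mathcal{B}^* Q_{k-1}$ is the Bellman target and $w_k(s,u)$ are sampling-priority weights to be optimized.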
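
A minimal NumPy sketch of a weighted Bellman-error loss over a minibatch follows. It assumes the weights are constrained to be non-negative with unit mean, which is a common normalization and an assumption here rather than a quotation of the paper's constraint; the function name and argument names are hypothetical.

```python
import numpy as np


def weighted_bellman_loss(q_values, targets, weights):
    """Mean of w_i * (Q_i - y_i)^2, with weights clipped at 0 and rescaled to mean 1.

    q_values: current Q estimates for the sampled (state, joint action) pairs
    targets:  Bellman targets for the same pairs
    weights:  unnormalized sampling priorities (at least one must be positive)
    """
    q_values = np.asarray(q_values, dtype=float)
    targets = np.asarray(targets, dtype=float)
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)  # non-negativity (assumed)
    w = w / w.mean()                                          # unit mean (assumed)
    return float(np.mean(w * (q_values - targets) ** 2))
```

With equal weights this reduces to the ordinary mean squared Bellman error; unequal weights shift the fit toward the transitions judged more important.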

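Meta-optimization: The core problem is to choose the weights $w_k$ that minimize the regret of the resulting policy:

$\min_{w_k} \; \eta(\pi^*) - \eta(\pi_k)$

where $\eta(\cdot)$ is the expected discounted return, $\pi^*$ is the optimal joint policy, and $\pi_k$ is the Boltzmann policy induced by $Q_k$.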
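
The Boltzmann policy mentioned above is a softmax over action values. The sketch below computes it for a single agent and, assuming agents act independently given their own utilities (a factorization assumed for illustration, not stated in the text), multiplies per-agent probabilities to get a joint-action probability. The temperature parameter and both function names are hypothetical.

```python
import numpy as np


def boltzmann_policy(q_values, temperature=1.0):
    """Softmax over one agent's action values; lower temperature is greedier."""
    z = np.asarray(q_values, dtype=float) / temperature
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def joint_action_probability(per_agent_q, joint_action, temperature=1.0):
    """Probability of a joint action under independent per-agent Boltzmann policies."""
    probs = [
        boltzmann_policy(q, temperature)[u]
        for q, u in zip(per_agent_q, joint_action)
    ]
    return float(np.prod(probs))
```

For two agents with identical values for both of their actions, each local policy is uniform and every joint action has probability 0.25.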

3. Regret Relaxation, Lagrangian Duality, and Optimal Priorities

MAC-PO formulates a tight surrogate upper bound for policy regret, based on discounted Q-value errors:

Surrogate Loss: By Kakade's lemma and Jensen's inequality, regret is bounded as

$u_a \in U^a$ 9

replaced by $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 0.

Lagrangian Formulation: With $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 1 and duals $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 2:

$u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 3

Implicit Differentiation: The Q-iteration solution is analytically differentiated w.r.t. $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 4 via the implicit function theorem; $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 5 involves the Bellman differences.
Closed-form Priority: Karush-Kuhn-Tucker (KKT) conditions yield the optimal sampling priority (Theorem 1):

$u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 6

where

$u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 7

$u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 8 is a normalization factor; $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$ 9 is negligible for low return-to-state probability.

4. Practical Prioritization: Exact and Approximate Methods

The computation of optimal priorities in the multi-agent setting involves non-trivial joint-policy dependencies:

Exact MAC-PO: For each transition $s\in S$ 0:

$s\in S$ 1

with $s\in S$ 2.

Approximation Scheme: To avoid the $s\in S$ $s \in S$ 3 cost of computing full joint probabilities, MAC-PO introduces a three-level partition based on Theorem 2:
- High: Exactly one agent’s local policy has $s\in S$ 4 and all others $s\in S$ 5;
- Low: $s\in S$ 6 or $s\in S$ 7;
- Medium: Otherwise.

Assigning weights $s\in S$ 8 to these levels, final transition priority is computed by multiplying Bellman error, value enhancement, and level weight.
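
The final multiplication step of the approximate scheme can be sketched as below. Several things here are assumptions made for illustration: the level weights in `LEVEL_WEIGHTS` are hypothetical placeholders, the value-enhancement term is taken to be an exponential of the negative optimality gap, and the transition's level is supplied by the caller rather than derived from per-agent probabilities, since the classification thresholds are not reproduced here.

```python
import math

# Hypothetical level weights, chosen only so that high > medium > low.
LEVEL_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.1}


def approximate_priority(td_error, optimality_gap, level, level_weights=LEVEL_WEIGHTS):
    """Priority = Bellman error * value enhancement * level weight.

    td_error:       Q estimate minus Bellman target for the transition
    optimality_gap: Q estimate minus an estimate of the optimal Q value
    level:          "high", "medium", or "low" (classification done elsewhere)
    """
    bellman_term = abs(td_error)
    value_term = math.exp(-abs(optimality_gap))  # assumed functional form
    return bellman_term * value_term * level_weights[level]
```

Because only a table lookup replaces the joint-probability computation, the per-transition cost no longer depends on the size of the joint action space.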

5. Implementation and Optimization Procedure

The MAC-PO algorithm proceeds as follows:

Initialization: Set parameter vectors $\theta$, target $\theta^-$, and empty replay buffer $\mathcal{D}$.
Sampling: Collect trajectories $\tau$ under the $\epsilon$-greedy policy induced by $Q_\theta$; store in $\mathcal{D}$.
Minibatch Update:
- Sample $B$ transitions uniformly.
- For each $(s, u)$ pair:
  - Compute $Q_\theta(s, u)$, Bellman target $y$.
  - Compute per-agent Boltzmann policies $\pi_a$.
  - Estimate $Q^*$ (unrestricted mixing or greedy maximizer).
  - Compute $w_k(s, u)$ via closed-form or approximated priority.
- Apply gradient step on weighted loss: $\sum_{(s,u)} w_k(s, u)\,\big(Q_\theta(s, u) - y\big)^2$.
- Periodically update $\theta^- \leftarrow \theta$.
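
The minibatch update above can be sketched on a tabular joint Q-table, which keeps the example runnable without a neural network. This is a simplified illustration under stated assumptions: `priority_fn` is a hypothetical placeholder for the closed-form or approximated priority, weights are normalized to unit mean, and the Q-table and its frozen target copy stand in for the online and target networks.

```python
import numpy as np


def prioritized_update(q, q_target, batch, gamma, lr, priority_fn):
    """One weighted TD update on a tabular joint Q-table (modified in place).

    q, q_target: float arrays of shape (n_states, n_joint_actions)
    batch:       list of (s, u, r, s_next, done) with integer s, u, s_next
    priority_fn: maps (td_error, q_value) to a non-negative weight (placeholder)
    Returns the weighted loss before the update.
    """
    s, u, r, s_next, done = map(np.array, zip(*batch))
    done = done.astype(float)
    # Bellman target computed from the frozen target table
    y = r + gamma * (1.0 - done) * q_target[s_next].max(axis=1)
    td = q[s, u] - y
    w = np.array([priority_fn(e, v) for e, v in zip(td, q[s, u])], dtype=float)
    w = w / max(w.mean(), 1e-8)  # normalize weights to unit mean
    loss = float(np.mean(w * td ** 2))
    # gradient step on mean(w * td^2); ufunc.at accumulates repeated (s, u) pairs
    np.subtract.at(q, (s, u), lr * 2.0 * w * td / len(batch))
    return loss
```

A caller would copy `q` into `q_target` at a fixed interval to mirror the periodic target update, for example with `q_target[:] = q`.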

Core implementation details: networks employ Adam optimizer, target updates every 200 episodes, replay size $r(s, u)$ 5, minibatch $r(s, u)$ 6, learning rate $r(s, u)$ 7, TD- $r(s, u)$ 8 of $r(s, u)$ 9, with code built atop the PyMARL2 QMIX/WQMIX framework using NVIDIA 2080Ti GPUs.

6. Theoretical Analysis and Ingredients of Optimal Priority

Four key ingredients characterize optimal MAC-PO priorities:

Bellman-error ( $\gamma \in [0, 1)$ 0): Emphasizes transitions where the Q-estimate is inconsistent with the Bellman target.
Value-enhancement ( $\gamma \in [0, 1)$ 1): Downweights transitions where $\gamma \in [0, 1)$ 2 is distant from optimal; highlights areas near the global optimum.
On-policiness ( $\gamma \in [0, 1)$ 3): Favors transitions prevalent under the current policy, modulating off-policy bias.
Joint-action Probabilities: The term $\gamma \in [0, 1)$ 4 biases toward transitions where one agent’s decision is the unlikely bottleneck, a property unique to MARL.

This combination provably reduces a regret surrogate upper bound, systematically biasing policy improvement toward minimizers of the true regret functional.

7. Empirical Validation and Results

MAC-PO was evaluated on two principal benchmarks:

Predator-Prey: A grid-world with 8 predators and 8 prey; performance is measured as average episodic return.
SMAC: StarCraft Multi-Agent Challenge maps, including 3s_vs_5z (standard), 5m_vs_6m (hard), and MMM2 (super-hard); performance is the win rate over 32 evaluation episodes.

Baselines Compared

Category	Algorithms
Single-agent replay adapted to MARL	PER, PSER, DisCor, ReMERN
Value-decomposition MARL	QMIX, WQMIX, QPLEX
Actor-critic MARL	VDAC, FOP, DOP

Key Empirical Findings

Win rate improvement: On SMAC, MAC-PO improves final win rates by 4–16% relative to the next-best baseline and shows reduced variance.
Sample efficiency: On Predator-Prey, MAC-PO achieves faster convergence and higher asymptotic returns, particularly under harsh conditions ( $\gamma \in [0, 1)$ 5).
Approximation efficiency: The approximate priority scheme performs within a few percent of optimal while dramatically reducing computational cost.
Ablation study: Excluding the joint-probability term on the hardest SMAC map drops win rate by ~18%, affirming its necessity in MARL prioritization.

8. Additional Formalism and Implementation Notes

Key computational primitives include:

Bellman operator: $(\mathcal{B}^* Q)(s, u) = r(s, u) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, u)}\left[\max_{u'} Q(s', u')\right]$.
Boltzmann policy: $\pi(u \mid s) = \exp\big(Q(s, u)/\beta\big) \big/ \sum_{u'} \exp\big(Q(s, u')/\beta\big)$, with temperature $\beta > 0$.
Weighted loss–priority equivalence: Non-uniform sampling with weight $w$ is equivalent in expectation to uniform sampling with per-sample loss scaled by $w$ [Fujimoto et al. 2020, as cited in (Mei et al., 2023)].
Replay and training details: Batch size 128, buffer 10,000, learning rate $z \sim O(s, a)$ 1, Adam optimizer, target update every 200 episodes, TD- $z \sim O(s, a)$ 2.
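
The weighted loss–priority equivalence listed above can be checked numerically. The sketch below compares the expected per-sample loss under priority-proportional sampling with the expected loss under uniform sampling after rescaling each loss by its normalized weight; both function names are hypothetical, and the equivalence shown is in expectation only.

```python
import numpy as np


def expected_loss_prioritized(losses, weights):
    """Expected loss when sample i is drawn with probability w_i / sum(w)."""
    p = np.asarray(weights, dtype=float) / np.sum(weights)
    return float(np.sum(p * np.asarray(losses, dtype=float)))


def expected_loss_uniform_reweighted(losses, weights):
    """Expected loss under uniform sampling with each loss scaled by w_i / mean(w)."""
    w = np.asarray(weights, dtype=float)
    return float(np.mean(w / w.mean() * np.asarray(losses, dtype=float)))
```

For losses `[1, 2, 3]` and weights `[1, 1, 2]`, both functions evaluate to 2.25, which is why an implementation can sample uniformly and apply the priority as a loss weight instead of maintaining a prioritized sampler.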

MAC-PO thus establishes a rigorous, regret-minimizing framework for prioritized multi-agent replay, integrating centralized regret criteria with agent-wise decentralized policies. The combination of classical temporal-difference measures and new multi-agent joint-action structures allows MAC-PO to attain consistent performance advantages over baseline methods across challenging collaborative domains (Mei et al., 2023).

References

Mei et al. (2023). MAC-PO: Multi-Agent Experience Replay via Collective Priority Optimization.