
MAC-PO: Multi-Agent Regret-weighted Replay

Updated 22 June 2026
  • The paper introduces MAC-PO, a novel prioritized replay framework that minimizes collective regret in cooperative multi-agent settings.
  • It employs a closed-form priority integrating TD error, optimality gap, on-policy likelihood, and joint-action probabilities to boost sample efficiency.
  • Empirical results on Predator-Prey and SMAC benchmarks demonstrate superior win rates and faster convergence compared to established MARL methods.

Multi-Agent Regret-weighted Replay via Collective Priority Optimization (MAC-PO) is a prioritized experience replay framework specifically designed for cooperative multi-agent reinforcement learning (MARL) under the centralized training and decentralized execution (CTDE) paradigm. MAC-PO rigorously formulates the replay prioritization problem as a collective regret minimization over sampling weights, yielding a closed-form expression for sample priorities that integrates temporal-difference (TD) error, optimality gap, on-policy likelihood, and novel joint action factors unique to multi-agent settings. Empirical evaluation demonstrates MAC-PO's efficacy on benchmark environments, where it consistently surpasses established alternatives in sample efficiency and final performance (Mei et al., 2023).

1. Multi-Agent Markov Game Setting and Experience Replay

The MARL context considered is a decentralized partially observable Markov decision process (Dec-POMDP) operating under CTDE:

  • Agents and Actions: The agent set is $A = \{1, \ldots, n\}$, with each agent $a$ selecting $u_a \in U^a$. The joint action is $u = (u_1, \ldots, u_n) \in U^1 \times \cdots \times U^n$.
  • State and Transitions: The global state $s \in S$ evolves via $P(s' \mid s, u)$, with all agents sharing the reward $r(s, u)$ and a common discount factor $\gamma \in [0, 1)$.
  • Observations and Policies: Each agent receives local observations $z \sim O(s, a)$ and maintains a private history $\tau_a \in (Z \times U^a)^*$. Decentralized policies $\pi_a(u_a \mid \tau_a)$ define the joint policy $\pi(u \mid \tau) = \prod_{a \in A} \pi_a(u_a \mid \tau_a)$.
  • Experience Replay: A finite buffer $\mathcal{D}$ stores transitions $(s, u, r, s')$, generated off-policy. Standard approaches sample transitions uniformly or with fixed priorities. In MARL, uniform experience replay is sub-optimal because it ignores sample importance and inter-agent policy dependencies.
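
The factorization of the joint policy into decentralized per-agent policies can be illustrated with a minimal sketch. It assumes each agent's policy is a Boltzmann (softmax) distribution over that agent's own action values; the function names and the example numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Softmax over one agent's action values (shifted for numerical stability)."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

def joint_action_probability(per_agent_q, joint_action, temperature=1.0):
    """Probability of a joint action when agents act independently.

    Returns the product of the per-agent probabilities and the list of
    individual probabilities that entered the product.
    """
    probs = [boltzmann_policy(q, temperature)[u]
             for q, u in zip(per_agent_q, joint_action)]
    return float(np.prod(probs)), probs

# Two agents with three actions each (illustrative values).
per_agent_q = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 2.0])]
joint_prob, individual = joint_action_probability(per_agent_q, joint_action=(0, 2))
```

The joint probability shrinks multiplicatively as agents are added, which is one reason per-agent probabilities are tracked separately rather than only through their product.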

2. Regret Minimization and the Weighted Bellman-Error Objective

MAC-PO's central objective is to minimize policy regret, narrowing the gap between the expected return of the current joint policy and a nominal optimal joint policy.

  • Policy Regret: For joint policy $\pi$, expected (discounted) return is $\eta(\pi)$, with regret defined as

$$\mathrm{Regret}(\pi) = \eta(\pi^*) - \eta(\pi)$$

where $\pi^*$ is optimal.

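  • Weighted Bellman-Error Minimization: At iteration $k$, with current estimate $Q_{k-1}$, MAC-PO fits $Q_k$ by minimizing a prioritized, weighted Bellman error:

$$Q_k = \arg\min_{Q} \; \mathbb{E}_{(s,u) \sim \mu}\left[ w_k(s,u) \left( Q(s,u) - \mathcal{B}^* Q_{k-1}(s,u) \right)^2 \right]$$

s.t. $\mathbb{E}_{\mu}[w_k(s,u)] = 1$ and $w_k(s,u) \ge 0$, where $\mu$ is the data distribution of the replay buffer, $\mathcal{B}^* Q_{k-1}$ is the Bellman target, and $w_k$ are sampling-priority weights to be optimized.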

  • Meta-optimization: The core problem is to optimize $w_k$:

$$\min_{w_k} \; \mathrm{Regret}(\pi_k)$$

with $\pi_k$ the Boltzmann policy induced by $Q_k$.
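
A weighted Bellman error of this kind can be computed for a batch as in the following minimal sketch. It assumes the weights are non-negative and rescaled to have mean one over the batch; the function name and the example values are hypothetical.

```python
import numpy as np

def weighted_bellman_loss(q_pred, bellman_target, weights):
    """Batch mean of w * (Q - target)^2 with weights rescaled to mean one."""
    q_pred = np.asarray(q_pred, dtype=float)
    bellman_target = np.asarray(bellman_target, dtype=float)
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0):
        raise ValueError("priority weights must be non-negative")
    if w.sum() == 0:
        raise ValueError("at least one weight must be positive")
    w = w / w.mean()  # assumed normalization: weights average to one
    return float(np.mean(w * (q_pred - bellman_target) ** 2))

q_pred = [1.0, 2.0, 0.5]
target = [1.5, 2.0, 1.5]
uniform_loss = weighted_bellman_loss(q_pred, target, [1.0, 1.0, 1.0])
focused_loss = weighted_bellman_loss(q_pred, target, [0.0, 0.0, 3.0])
```

With uniform weights this reduces to the ordinary mean squared Bellman error; concentrating the weight on the third transition makes the loss depend on that transition alone.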

3. Regret Relaxation, Lagrangian Duality, and Optimal Priorities

MAC-PO formulates a tight surrogate upper bound for policy regret, based on discounted Q-value errors:

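  • Surrogate Loss: By Kakade's lemma and Jensen's inequality, regret is bounded above by a term that grows with the expected absolute Q-value error $|Q_k - Q^*|$ under the discounted state-action distribution $d^{\pi_k}$ of the current policy. The regret objective is therefore replaced by

$$\min_{w_k} \; \mathbb{E}_{(s,u) \sim d^{\pi_k}}\left[ \left| Q_k - Q^* \right|(s,u) \right].$$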

  • Lagrangian Formulation: With the constraints $\mathbb{E}_{\mu}[w_k] = 1$, $w_k \ge 0$ and duals $\lambda$, $\nu$:

$$\mathcal{L}(w_k; \lambda, \nu) = \mathbb{E}_{d^{\pi_k}}\left[ \left| Q_k - Q^* \right| \right] + \lambda \left( \mathbb{E}_{\mu}[w_k] - 1 \right) - \nu^{\top} w_k$$

  • Implicit Differentiation: The Q-iteration solution is analytically differentiated w.r.t. $w_k$ via the implicit function theorem; $\partial Q_k / \partial w_k$ involves the Bellman differences.
  • Closed-form Priority: Karush-Kuhn-Tucker (KKT) conditions yield the optimal sampling priority (Theorem 1):

$$w_k(s,u) = \frac{1}{Z^*} \left( E_k(s,u) + \epsilon_k(s,u) \right)$$

where

$$E_k(s,u) = \frac{d^{\pi_k}(s,u)}{\mu(s,u)} \, \left| Q_k - \mathcal{B}^* Q_{k-1} \right|(s,u) \, \exp\left( -\left| Q_k - Q^* \right|(s,u) \right) \left( 1 + \sum_{i=1}^{n} \prod_{j \ne i} \pi_j(u_j \mid \tau_j) - n \prod_{i=1}^{n} \pi_i(u_i \mid \tau_i) \right),$$

$Z^*$ is a normalization factor; $\epsilon_k(s,u)$ is negligible when the probability of returning to an already visited state is low.
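
A priority that combines TD error, optimality gap, on-policy likelihood, and a joint-action factor can be sketched as follows. The specific form of the joint-action factor used here, $1 + \sum_i \prod_{j \ne i} p_j - n \prod_i p_i$ with $p_i$ the probability agent $i$ assigns to its chosen action, is an assumption chosen to match the described behaviour (largest when exactly one agent's action is unlikely); the function names are hypothetical and the residual term is ignored.

```python
import numpy as np

def joint_action_factor(action_probs):
    """Assumed form: 1 + sum_i prod_{j != i} p_j - n * prod_i p_i.

    Equivalently 1 + sum_i (1 - p_i) * prod_{j != i} p_j.
    """
    p = np.asarray(action_probs, dtype=float)
    n = p.size
    leave_one_out = np.array([np.prod(np.delete(p, i)) for i in range(n)])
    return float(1.0 + leave_one_out.sum() - n * np.prod(p))

def unnormalized_priority(td_error, optimality_gap, on_policy_ratio, action_probs):
    """Product of the four factors; the residual term is dropped (assumption)."""
    return (on_policy_ratio
            * abs(td_error)
            * np.exp(-abs(optimality_gap))
            * joint_action_factor(action_probs))

def normalize(priorities):
    """Divide by the sum so the priorities form a sampling distribution."""
    pr = np.asarray(priorities, dtype=float)
    return pr / pr.sum()

all_likely = joint_action_factor([0.9, 0.9, 0.9])
one_unlikely = joint_action_factor([0.1, 0.9, 0.9])
two_unlikely = joint_action_factor([0.1, 0.1, 0.9])
```

Under this assumed form the factor is largest for the transition with a single unlikely agent, and with one agent it reduces to $2 - p$.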

4. Practical Prioritization: Exact and Approximate Methods

The computation of optimal priorities in the multi-agent setting involves non-trivial joint-policy dependencies:

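  • Exact MAC-PO: For each transition $(s, u, r, s')$:

$$w_k(s,u) \propto \left| Q_k - \mathcal{B}^* Q_{k-1} \right|(s,u) \, \exp\left( -\left| Q_k - Q^* \right|(s,u) \right) \left( 1 + \sum_{i=1}^{n} \prod_{j \ne i} \pi_j(u_j \mid \tau_j) - n \prod_{i=1}^{n} \pi_i(u_i \mid \tau_i) \right)$$

with $\pi_i(u_i \mid \tau_i)$ the per-agent Boltzmann policies.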

  • Approximation Scheme: To avoid the cost of computing full joint probabilities, MAC-PO introduces a three-level partition based on Theorem 2:
    • High: Exactly one agent’s local policy assigns a low probability to its chosen action and all others assign a high probability;
    • Low: All agents assign a high probability to their chosen actions, or two or more agents assign a low probability;
    • Medium: Otherwise.

Each level is assigned a fixed weight, and the final transition priority is computed by multiplying the Bellman error, the value enhancement, and the level weight.
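
A three-level scheme of this kind might be sketched as follows. The probability thresholds, the level weights, and the exact rule for the Low level are all assumptions made for illustration, as are the function names; the value enhancement is taken to be $\exp(-|Q_k - Q^*|)$.

```python
import numpy as np

# Hypothetical thresholds and level weights, chosen only for illustration.
LOW_PROB, HIGH_PROB = 0.2, 0.8
LEVEL_WEIGHTS = {"high": 2.0, "medium": 1.5, "low": 1.0}

def priority_level(action_probs, low=LOW_PROB, high=HIGH_PROB):
    """Classify a transition from the per-agent probabilities of the chosen actions."""
    p = np.asarray(action_probs, dtype=float)
    n_low = int(np.sum(p <= low))
    n_high = int(np.sum(p >= high))
    if n_low == 1 and n_high == p.size - 1:
        return "high"    # exactly one unlikely agent, all others likely
    if n_high == p.size or n_low >= 2:
        return "low"     # assumed rule: everyone likely, or several unlikely
    return "medium"

def approximate_priority(td_error, optimality_gap, action_probs):
    """Bellman error x value enhancement x level weight."""
    level = priority_level(action_probs)
    return abs(td_error) * float(np.exp(-abs(optimality_gap))) * LEVEL_WEIGHTS[level]

example = approximate_priority(td_error=1.0, optimality_gap=0.0,
                               action_probs=[0.1, 0.9, 0.9])
```

The classification only needs counts of low and high per-agent probabilities, so it avoids forming any product over agents.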

5. Implementation and Optimization Procedure

The MAC-PO algorithm proceeds as follows:

  1. Initialization: Set parameter vectors $\theta$, target $\theta^-$, and empty replay buffer $\mathcal{D}$.
  2. Sampling: Collect trajectories $\tau$ under the $\epsilon$-greedy policy induced by $Q_\theta$; store in $\mathcal{D}$.
  3. Minibatch Update:
    • Sample a minibatch of transitions uniformly.
    • For each $(s, u)$ pair:
      • Compute $Q_k(s, u)$ and the Bellman target $\mathcal{B}^* Q_{k-1}(s, u)$.
      • Compute per-agent Boltzmann policies $\pi_i(u_i \mid \tau_i)$.
      • Estimate $Q^*$ (unrestricted mixing or greedy maximizer).
      • Compute $w_k(s, u)$ via closed-form or approximated priority.
    • Apply gradient step on weighted loss: $\sum_{(s,u)} w_k(s,u) \left( Q_k(s,u) - \mathcal{B}^* Q_{k-1}(s,u) \right)^2$.
    • Periodically update $\theta^- \leftarrow \theta$.
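
The update steps above can be sketched on a toy tabular problem. This is not the authors' implementation: the Q-table stands in for the neural networks, the random buffer replaces collected trajectories, and `priority_fn` is a placeholder that uses only the TD-error magnitude rather than the full priority. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_joint_actions, gamma, lr = 4, 3, 0.99, 0.1
q = np.zeros((n_states, n_joint_actions))   # online estimate (stand-in for theta)
q_target = q.copy()                          # target copy (stand-in for theta^-)

# Random transitions (s, u, r, s') standing in for collected experience.
buffer = [(rng.integers(n_states), rng.integers(n_joint_actions),
           float(rng.normal()), rng.integers(n_states)) for _ in range(64)]

def priority_fn(td_error):
    """Placeholder priority: TD-error magnitude only (an assumption)."""
    return abs(td_error) + 1e-6

def minibatch_update(q, q_target, buffer, batch_size=16):
    # Uniform sampling from the buffer; priorities enter through the loss weights.
    idx = rng.choice(len(buffer), size=batch_size, replace=False)
    batch = [buffer[i] for i in idx]
    targets = np.array([r + gamma * q_target[s2].max() for _, _, r, s2 in batch])
    preds = np.array([q[s, u] for s, u, _, _ in batch])
    td = preds - targets
    w = np.array([priority_fn(e) for e in td])
    w = w / w.mean()                         # rescale weights to mean one
    for (s, u, _, _), w_i, e in zip(batch, w, td):
        q[s, u] -= lr * w_i * e              # gradient of 0.5 * w * (Q - y)^2
    return float(np.mean(w * td ** 2))

for step in range(200):
    loss = minibatch_update(q, q_target, buffer)
    if (step + 1) % 50 == 0:                 # periodic target synchronization
        q_target = q.copy()
```

Sampling uniformly and weighting the loss keeps the sampling code simple; swapping `priority_fn` for a richer priority leaves the rest of the loop unchanged.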

Core implementation details: networks are trained with the Adam optimizer, target networks are updated every 200 episodes, the replay buffer size is 10,000, and the minibatch size is 128 (values as listed in Section 8), with the learning rate and TD($\lambda$) parameter as specified in Mei et al. (2023). The code is built atop the PyMARL2 QMIX/WQMIX framework and run on NVIDIA 2080Ti GPUs.

6. Theoretical Analysis and Ingredients of Optimal Priority

Four key ingredients characterize optimal MAC-PO priorities:

  1. Bellman-error ($|Q_k - \mathcal{B}^* Q_{k-1}|$): Emphasizes transitions where the Q-estimate is inconsistent with the Bellman target.
  2. Value-enhancement ($\exp(-|Q_k - Q^*|)$): Downweights transitions where $Q_k$ is distant from optimal; highlights areas near the global optimum.
  3. On-policiness ($d^{\pi_k}/\mu$): Favors transitions prevalent under the current policy, modulating off-policy bias.
  4. Joint-action Probabilities: The term $1 + \sum_{i} \prod_{j \ne i} \pi_j - n \prod_{i} \pi_i$ biases toward transitions where one agent’s decision is the unlikely bottleneck, a property unique to MARL.

This combination provably reduces a regret surrogate upper bound, systematically biasing policy improvement toward minimizers of the true regret functional.

7. Empirical Validation and Results

MAC-PO was evaluated on two principal benchmarks:

  • Predator-Prey: A grid-world with 8 predators and 8 prey; performance is measured as average episodic return.
  • SMAC: StarCraft Multi-Agent Challenge maps, including 3s_vs_5z (standard), 5m_vs_6m (hard), and MMM2 (super-hard); performance is the win rate over 32 evaluation episodes.

Baselines Compared

| Category | Algorithms |
| --- | --- |
| Single-agent replay adapted to MARL | PER, PSER, DisCor, ReMERN |
| Value-decomposition MARL | QMIX, WQMIX, QPLEX |
| Actor-critic MARL | VDAC, FOP, DOP |

Key Empirical Findings

  • Win rate improvement: On SMAC, MAC-PO improves final win rates by 4–16% relative to the next-best baseline and shows reduced variance.
  • Sample efficiency: On Predator-Prey, MAC-PO achieves faster convergence and higher asymptotic returns, particularly under harsh conditions.
  • Approximation efficiency: The approximate priority scheme performs within a few percent of optimal while dramatically reducing computational cost.
  • Ablation study: Excluding the joint-probability term on the hardest SMAC map drops win rate by ~18%, affirming its necessity in MARL prioritization.

8. Additional Formalism and Implementation Notes

Key computational primitives include:

  • Bellman operator: $\mathcal{B}^* Q(s, u) = r(s, u) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, u)}\left[ \max_{u'} Q(s', u') \right]$.
  • Boltzmann policy: $\pi_a(u_a \mid \tau_a) = \exp(Q_a(\tau_a, u_a)/T) / \sum_{u'_a} \exp(Q_a(\tau_a, u'_a)/T)$, with temperature $T > 0$.
  • Weighted loss–priority equivalence: Non-uniform sampling with weight $w_k$ is equivalent to uniform sampling with per-sample loss scaled by $w_k$ [Fujimoto et al. 2020, as cited in (Mei et al., 2023)].
  • Replay and training details: Batch size 128, buffer 10,000, Adam optimizer, target update every 200 episodes, with the learning rate and TD($\lambda$) parameter as specified in Mei et al. (2023).
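
The equivalence between prioritized sampling and loss weighting can be checked numerically on a toy example. The sketch below compares expected gradients exactly, with no random sampling; it treats each stored transition as contributing a scalar gradient with respect to one shared parameter, and all values are made up for illustration.

```python
import numpy as np

errors = np.array([0.5, -1.0, 2.0, 0.1])   # Q - target for four stored transitions
weights = np.array([1.0, 2.0, 4.0, 1.0])   # unnormalized priorities
grads = errors                              # gradient of 0.5 * error^2 per transition

# (a) Prioritized sampling: draw i with probability w_i / sum(w), plain loss.
sampling_probs = weights / weights.sum()
expected_grad_prioritized = float(np.sum(sampling_probs * grads))

# (b) Uniform sampling: draw i with probability 1/N, loss scaled by w_i / mean(w).
scaled = weights / weights.mean()
expected_grad_uniform = float(np.mean(scaled * grads))
```

Both expectations come out identical because $w_i / \sum_j w_j = (1/N) \cdot w_i / \bar{w}$, so the two schemes differ only in gradient variance, not in the expected update.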

MAC-PO thus establishes a rigorous, regret-minimizing framework for prioritized multi-agent replay, integrating centralized regret criteria with agent-wise decentralized policies. The combination of classical temporal-difference measures and new multi-agent joint-action structures allows MAC-PO to attain consistent performance advantages over baseline methods across challenging collaborative domains (Mei et al., 2023).
