MAC-PO: Multi-Agent Regret-weighted Replay
- The paper introduces MAC-PO, a novel prioritized replay framework that minimizes collective regret in cooperative multi-agent settings.
- It employs a closed-form priority integrating TD error, optimality gap, on-policy likelihood, and joint-action probabilities to boost sample efficiency.
- Empirical results on Predator-Prey and SMAC benchmarks demonstrate superior win rates and faster convergence compared to established MARL methods.
Multi-Agent Regret-weighted Replay via Collective Priority Optimization (MAC-PO) is a prioritized experience replay framework specifically designed for cooperative multi-agent reinforcement learning (MARL) under the centralized training and decentralized execution (CTDE) paradigm. MAC-PO rigorously formulates the replay prioritization problem as a collective regret minimization over sampling weights, yielding a closed-form expression for sample priorities that integrates temporal-difference (TD) error, optimality gap, on-policy likelihood, and novel joint action factors unique to multi-agent settings. Empirical evaluation demonstrates MAC-PO's efficacy on benchmark environments, where it consistently surpasses established alternatives in sample efficiency and final performance (Mei et al., 2023).
1. Multi-Agent Markov Game Setting and Experience Replay
The MARL context considered is a decentralized partially observable Markov decision process (Dec-POMDP) operating under CTDE:
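- Agents and Actions: The agent set is $\mathcal{N} = \{1, \dots, n\}$, with each agent $i \in \mathcal{N}$ selecting an action $a_i \in \mathcal{A}_i$. The joint action is $\mathbf{a} = (a_1, \dots, a_n)$.
- State and Transitions: The global state $s \in \mathcal{S}$ evolves via the transition kernel $P(s' \mid s, \mathbf{a})$, with all agents sharing a reward $r(s, \mathbf{a})$ and a common discount factor $\gamma \in [0, 1)$.
- Observations and Policies: Each agent $i$ receives local observations $o_i$ and maintains a private action-observation history $\tau_i$. Decentralized policies $\pi_i(a_i \mid \tau_i)$ define the joint policy $\pi(\mathbf{a} \mid \boldsymbol{\tau}) = \prod_{i=1}^{n} \pi_i(a_i \mid \tau_i)$.
- Experience Replay: A finite buffer $\mathcal{D}$ stores transitions $(s, \mathbf{a}, r, s')$, generated off-policy. Standard approaches sample transitions uniformly or with fixed priorities. In MARL, uniform experience replay is sub-optimal, as it ignores sample importance and inter-agent policy dependencies.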
2. Regret Minimization and the Weighted Bellman-Error Objective
MAC-PO's central objective is to minimize policy regret, narrowing the gap between the expected return of the current joint policy and a nominal optimal joint policy.
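- Policy Regret: For a joint policy $\pi$, the expected discounted return is $\eta(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$, with regret defined as
$$\mathrm{Regret}(\pi) = \eta(\pi^*) - \eta(\pi),$$
where $\pi^*$ is optimal.
- Weighted Bellman-Error Minimization: At iteration $k$, with current estimate $Q_{k-1}$, MAC-PO fits $Q_k$ by minimizing a prioritized, weighted Bellman error:
$$Q_k = \arg\min_{Q} \; \mathbb{E}_{\mu}\left[ w_k(s, \mathbf{a}) \left( Q(s, \mathbf{a}) - \mathcal{B}^* Q_{k-1}(s, \mathbf{a}) \right)^2 \right]$$
s.t. $\mathbb{E}_{\mu}[w_k(s, \mathbf{a})] = 1$ and $w_k(s, \mathbf{a}) \ge 0$, where $\mu$ is the data distribution of the replay buffer, $\mathcal{B}^* Q_{k-1}$ is the Bellman target, and $w_k$ are sampling-priority weights to be optimized.
- Meta-optimization: The core problem is to optimize $w_k$:
$$\min_{w_k} \; \eta(\pi^*) - \eta(\pi_k),$$
with $\pi_k$ the Boltzmann policy induced by $Q_k$.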
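The weighted Bellman-error objective can be illustrated with a small sketch in plain Python. It is not the authors' implementation: the helper names `normalize_weights` and `weighted_bellman_error` are hypothetical, and the sketch assumes the priority weights are constrained to be non-negative with mean one over the sampled batch.

```python
from typing import List, Sequence


def normalize_weights(raw: Sequence[float]) -> List[float]:
    """Clip weights to be non-negative and rescale them to have mean 1.

    Assumption of this sketch: the constraint on the priority weights is
    non-negativity plus unit mean over the batch.
    """
    clipped = [max(0.0, w) for w in raw]
    total = sum(clipped)
    if total == 0.0:
        return [1.0] * len(clipped)  # fall back to uniform weights
    n = len(clipped)
    return [w * n / total for w in clipped]


def weighted_bellman_error(
    q_values: Sequence[float],
    targets: Sequence[float],
    raw_weights: Sequence[float],
) -> float:
    """Mean of w * (Q - target)^2 over the batch, with normalized weights."""
    weights = normalize_weights(raw_weights)
    squared = [(q - y) ** 2 for q, y in zip(q_values, targets)]
    return sum(w * e for w, e in zip(weights, squared)) / len(squared)


# Toy batch: Q-estimates and Bellman targets for three transitions.
q = [1.0, 2.0, 0.5]
y = [1.5, 2.0, 1.5]
loss_uniform = weighted_bellman_error(q, y, [1.0, 1.0, 1.0])
loss_weighted = weighted_bellman_error(q, y, [1.0, 1.0, 2.0])
```

Raising the weight on the third transition, which has the largest squared error, increases its share of the loss, so `loss_weighted` exceeds `loss_uniform`.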
3. Regret Relaxation, Lagrangian Duality, and Optimal Priorities
MAC-PO formulates a tight surrogate upper bound for policy regret, based on discounted Q-value errors:
- Surrogate Loss: By Kakade's lemma and Jensen's inequality, regret is bounded as
9
replaced by 0.
- Lagrangian Formulation: With 1 and duals 2:
3
- Implicit Differentiation: The Q-iteration solution is analytically differentiated w.r.t. 4 via the implicit function theorem; 5 involves the Bellman differences.
- Closed-form Priority: Karush-Kuhn-Tucker (KKT) conditions yield the optimal sampling priority (Theorem 1):
6
where
7
8 is a normalization factor; 9 is negligible for low return-to-state probability.
4. Practical Prioritization: Exact and Approximate Methods
The computation of optimal priorities in the multi-agent setting involves non-trivial joint-policy dependencies:
- Exact MAC-PO: For each transition 0:
1
with 2.
- Approximation Scheme: To avoid the 3 cost of computing full joint probabilities, MAC-PO introduces a three-level partition based on Theorem 2:
- High: Exactly one agent’s local policy has 4 and all others 5;
- Low: 6 or 7;
- Medium: Otherwise.
Assigning weights 8 to these levels, final transition priority is computed by multiplying Bellman error, value enhancement, and level weight.
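The final multiplication step can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the level of each transition is taken as given (the classification thresholds are not reproduced here), the numeric level weights in `LEVEL_WEIGHTS` are hypothetical, the Bellman error enters through its absolute value, and the result is normalized into a sampling distribution.

```python
from typing import Dict, List, Sequence

# Hypothetical level weights chosen only for illustration.
LEVEL_WEIGHTS: Dict[str, float] = {"high": 1.0, "medium": 0.5, "low": 0.1}


def approx_priorities(
    bellman_errors: Sequence[float],
    value_enhancements: Sequence[float],
    levels: Sequence[str],
    level_weights: Dict[str, float] = LEVEL_WEIGHTS,
) -> List[float]:
    """Multiply |Bellman error|, value enhancement, and level weight,
    then normalize so the priorities sum to 1."""
    raw = [
        abs(err) * enh * level_weights[lvl]
        for err, enh, lvl in zip(bellman_errors, value_enhancements, levels)
    ]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)  # uniform fallback
    return [p / total for p in raw]


priorities = approx_priorities(
    bellman_errors=[0.5, -1.0, 2.0],
    value_enhancements=[1.0, 0.5, 0.25],
    levels=["high", "medium", "low"],
)
```

In this toy batch the third transition has the largest Bellman error but still receives the smallest priority, because its value-enhancement factor and level weight are both small.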
5. Implementation and Optimization Procedure
The MAC-PO algorithm proceeds as follows:
- Initialization: Set parameter vectors 9, target 0, and empty replay buffer 1.
- Sampling: Collect trajectories 2 under 3-greedy policy 4; store in 5.
- Minibatch Update:
- Sample 6 transitions uniformly.
- For each 7 pair:
- Compute 8, Bellman target 9.
- Compute per-agent Boltzmann policies 0.
- Estimate 1 (unrestricted mixing or greedy maximizer).
- Compute 2 via closed-form or approximated priority.
- Apply gradient step on weighted loss: 3.
- Periodically update 4.
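The per-agent Boltzmann step in the procedure above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names `boltzmann` and `joint_action_probability`, the default temperature of 1.0, and the assumption that the joint-action probability factorizes as a product of per-agent probabilities are all choices made for this sketch.

```python
import math
from typing import List, Sequence


def boltzmann(q_values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over one agent's Q-values, shifted by the max for stability."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def joint_action_probability(
    per_agent_q: Sequence[Sequence[float]],
    joint_action: Sequence[int],
    temperature: float = 1.0,
) -> float:
    """Probability of a joint action, assuming independent per-agent policies."""
    prob = 1.0
    for q_i, a_i in zip(per_agent_q, joint_action):
        prob *= boltzmann(q_i, temperature)[a_i]
    return prob


# Two agents with two actions each.
per_agent_q = [[0.0, 0.0], [0.0, math.log(3.0)]]
p_joint = joint_action_probability(per_agent_q, joint_action=(0, 1))
```

Here agent 0 is indifferent between its actions (probability 0.5 each) and agent 1 prefers action 1 with probability 0.75, so the joint action `(0, 1)` has probability 0.375.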
Core implementation details: networks employ Adam optimizer, target updates every 200 episodes, replay size 5, minibatch 6, learning rate 7, TD-8 of 9, with code built atop the PyMARL2 QMIX/WQMIX framework using NVIDIA 2080Ti GPUs.
6. Theoretical Analysis and Ingredients of Optimal Priority
Four key ingredients characterize optimal MAC-PO priorities:
- Bellman-error (0): Emphasizes transitions where the Q-estimate is inconsistent with the Bellman target.
- Value-enhancement (1): Downweights transitions where 2 is distant from optimal; highlights areas near the global optimum.
- On-policiness (3): Favors transitions prevalent under the current policy, modulating off-policy bias.
- Joint-action Probabilities: The term 4 biases toward transitions where one agent’s decision is the unlikely bottleneck, a property unique to MARL.
This combination provably reduces a regret surrogate upper bound, systematically biasing policy improvement toward minimizers of the true regret functional.
7. Empirical Validation and Results
MAC-PO was evaluated on two principal benchmarks:
- Predator-Prey: A grid-world with 8 predators and 8 prey; performance is measured as average episodic return.
- SMAC: StarCraft Multi-Agent Challenge maps, including 3s_vs_5z (standard), 5m_vs_6m (hard), and MMM2 (super-hard); performance is the win rate over 32 evaluation episodes.
Baselines Compared
| Category | Algorithms |
|---|---|
| Single-agent replay adapted to MARL | PER, PSER, DisCor, ReMERN |
| Value-decomposition MARL | QMIX, WQMIX, QPLEX |
| Actor-critic MARL | VDAC, FOP, DOP |
Key Empirical Findings
- Win rate improvement: On SMAC, MAC-PO improves final win rates by 4–16% relative to the next-best baseline and shows reduced variance.
- Sample efficiency: On Predator-Prey, MAC-PO achieves faster convergence and higher asymptotic returns, particularly under harsh conditions (5).
- Approximation efficiency: The approximate priority scheme performs within a few percent of optimal while dramatically reducing computational cost.
- Ablation study: Excluding the joint-probability term on the hardest SMAC map drops win rate by ~18%, affirming its necessity in MARL prioritization.
8. Additional Formalism and Implementation Notes
Key computational primitives include:
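- Bellman operator: $\mathcal{B}^* Q(s, \mathbf{a}) = r(s, \mathbf{a}) + \gamma \, \mathbb{E}_{s'}\left[\max_{\mathbf{a}'} Q(s', \mathbf{a}')\right]$.
- Boltzmann policy: $\pi(\mathbf{a} \mid s) = \exp\left(Q(s, \mathbf{a}) / T\right) / \sum_{\mathbf{a}'} \exp\left(Q(s, \mathbf{a}') / T\right)$, with temperature $T > 0$.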
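- Weighted loss–priority equivalence: Non-uniform sampling with probability proportional to a weight $w(s, \mathbf{a})$ is equivalent, in expectation, to uniform sampling with the per-sample loss scaled by the normalized weight $w(s, \mathbf{a})$ [Fujimoto et al. 2020, as cited in (Mei et al., 2023)].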
- Replay and training details: Batch size 128, buffer 10,000, learning rate 1, Adam optimizer, target update every 200 episodes, TD-2.
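The weighted loss–priority equivalence listed above can be checked numerically on a toy batch. The sketch below is an assumption-laden illustration, not code from the paper: the function names are hypothetical, the weights are treated as constants, and the comparison is made on exact expectations over a finite batch rather than on sampled minibatches.

```python
from typing import Sequence


def expected_loss_prioritized(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Expected loss when transition i is sampled with probability w_i / sum(w)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return sum(p * l for p, l in zip(probs, losses))


def expected_loss_reweighted(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Expected loss under uniform sampling, with each loss scaled by w_i / mean(w)."""
    n = len(losses)
    mean_w = sum(weights) / n
    return sum((w / mean_w) * l for w, l in zip(weights, losses)) / n


losses = [0.25, 0.0, 1.0, 4.0]
weights = [1.0, 0.5, 2.0, 0.5]
prioritized = expected_loss_prioritized(losses, weights)
reweighted = expected_loss_reweighted(losses, weights)
```

Both quantities reduce to $\sum_i w_i \ell_i / \sum_i w_i$, so they agree up to floating-point error. Because the weights are held constant, the same identity carries over to the gradients of the per-sample losses by linearity.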
MAC-PO thus establishes a rigorous, regret-minimizing framework for prioritized multi-agent replay, integrating centralized regret criteria with agent-wise decentralized policies. The combination of classical temporal-difference measures and new multi-agent joint-action structures allows MAC-PO to attain consistent performance advantages over baseline methods across challenging collaborative domains (Mei et al., 2023).