Centralized Advising & Decentralized Pruning (CADP)
- CADP is a novel cooperative multi-agent reinforcement learning framework that uses centralized advising via cross-attention during training to enhance joint-policy exploration.
- It employs a KL-divergence based pruning loss to smoothly transition from inter-agent communication to strict decentralized execution.
- Empirical results on SMAC and GRF benchmarks demonstrate CADP’s superiority over standard CTDE methods with higher win rates and improved sample efficiency.
Centralized Advising and Decentralized Pruning (CADP) is a novel framework in cooperative multi-agent reinforcement learning (MARL) that enhances the exploitation of global information during centralized training while meeting the practical requirement of policy decentralization during execution. CADP extends the dominant Centralized Training with Decentralized Execution (CTDE) paradigm by introducing a formal mechanism for explicit message exchange—termed "centralized advising"—during training and a smooth transition to strictly independent local policies—termed "decentralized pruning"—for evaluation and deployment (Zhou et al., 2023).
1. Motivation and Context
Standard CTDE frameworks rely on the independence of agent policies $\pi_i(a_i \mid \tau_i)$, where each agent $i$ conditions its local policy solely on its individual observation-action history $\tau_i$, with global state injected only through a centralized mixing or critic network. While CTDE allows the use of global state for value or advantage calculation, agents cannot access teammates’ hidden states or intermediate belief structures during training, leading to inefficient joint-policy exploration and potentially suboptimal convergence. CADP addresses this shortcoming by facilitating advice exchange via latent message passing between agents at training time, followed by a principled removal of communication dependencies to retain strict decentralization at test time (Zhou et al., 2023).
2. Formal Problem Specification
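CADP is formulated within the cooperative Decentralized Partially Observable Markov Decision Process (Dec-POMDP) setting, with agents $i \in \{1, \dots, n\}$, global states $s \in S$, individual actions $a_i \in A_i$, transition kernel $P(s' \mid s, \mathbf{a})$, team reward $r(s, \mathbf{a})$, local observations $o_i$, and local histories $\tau_i$. The objective is to optimize the joint $Q$-function:
$$Q^{\boldsymbol{\pi}}(s, \mathbf{a}) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\ \mathbf{a}_0 = \mathbf{a}\right]$$
where $\gamma \in [0, 1)$ is the discount factor. In CTDE, each decentralized local policy $\pi_i(a_i \mid \tau_i)$ is trained individually, while CADP relaxes this restriction at training by permitting cross-agent attention and advice.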
3. Centralized Advising Module
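During training, each agent $i$ processes its local observation history $\tau_i$ using a GRU encoder:
$$h_i^t = \mathrm{GRU}(o_i^t, h_i^{t-1})$$
This hidden state is projected into query $q_i$, key $k_i$, and value $v_i$ embeddings:
$$q_i = W_q h_i^t, \quad k_i = W_k h_i^t, \quad v_i = W_v h_i^t$$
For each pair $(i, j)$, a cross-attention coefficient is computed:
$$\alpha_{ij} = \frac{\exp\left(q_i^\top k_j / \sqrt{d}\right)}{\sum_{l=1}^{n} \exp\left(q_i^\top k_l / \sqrt{d}\right)}$$
where $d$ is the embedding dimension. Teammate advice is aggregated as:
$$m_i = \sum_{j=1}^{n} \alpha_{ij} v_j$$
Agent $i$ then calculates its action-value as:
$$Q_i(\tau_i, \cdot) = f(m_i)$$
where $f$ is the agent's output head. These local Q-values are integrated by a value mixing network (e.g., QMIX) into $Q_{tot}$, using standard TD loss:
$$L_{TD} = \mathbb{E}\left[\left(y - Q_{tot}(s, \mathbf{a}; \theta)\right)^2\right]$$
where $y = r + \gamma \max_{\mathbf{a}'} Q_{tot}(s', \mathbf{a}'; \theta^{-})$ and $\theta^{-}$ denotes the target-network parameters.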
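The advising step can be illustrated with a minimal pure-Python sketch. It assumes scaled dot-product attention over all agents (self included) and takes the query, key, and value vectors as given; the function names `softmax` and `centralized_advice` and the toy embeddings are hypothetical and not taken from Zhou et al. (2023).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def centralized_advice(queries, keys, values):
    """Return (attention, advice) for n agents.

    queries, keys: n vectors of length d; values: n vectors of a common length.
    attention[i][j] is the weight agent i places on agent j (self included).
    """
    n = len(queries)
    scale = math.sqrt(len(queries[0]))  # assumed scaled dot-product attention
    attention, advice = [], []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / scale
            for j in range(n)
        ]
        alpha_i = softmax(scores)
        attention.append(alpha_i)
        # Advice is the attention-weighted sum of all agents' value vectors.
        advice.append([
            sum(alpha_i[j] * values[j][c] for j in range(n))
            for c in range(len(values[0]))
        ])
    return attention, advice


# Toy example: 3 agents with made-up 2-dimensional embeddings.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
attention, advice = centralized_advice(q, k, v)
```

In this toy input, the third agent's query scores every key equally, so its advice is the plain average of all value vectors, while the first agent weights its own value most heavily.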
4. Decentralized Pruning Mechanism
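To ensure that final policies are fully decentralized, CADP imposes a pruning loss to steer the cross-attention coefficients to one-hot vectors focused on self: $\alpha_i \to e_i$, where $\alpha_i = (\alpha_{i1}, \dots, \alpha_{in})$ is agent $i$'s attention distribution and $e_i$ is the one-hot vector at index $i$. The Kullback-Leibler divergence-based pruning loss is
$$L_p = \sum_{i=1}^{n} D_{KL}\left(e_i \,\|\, \alpha_i\right)$$
with time-adaptive weighting
$$\lambda(t) = \lambda \cdot \mathbb{1}[t \ge T_p]$$
where $T_p$ is the pruning threshold and $\lambda$ the pruning coefficient, yielding total loss
$$L = L_{TD} + \lambda(t)\, L_p$$
During execution, cross-agent communication is dropped, and agents use their own value embedding $v_i$ exclusively.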
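If the KL term takes the one-hot target as its first argument, it simplifies considerably. With the convention $0 \log 0 = 0$, $D_{KL}(e_i \,\|\, \alpha_i) = \sum_j e_{ij} \log (e_{ij} / \alpha_{ij}) = -\log \alpha_{ii}$, so the loss only penalizes low self-attention. The sketch below computes this quantity under stated assumptions: the sum over agents, the step-function schedule, and the names `pruning_loss`, `pruning_weight`, `total_loss`, `start_step`, and `coeff` are illustrative choices, not details confirmed by the source.

```python
import math


def pruning_loss(attention, eps=1e-12):
    """Sum over agents of KL(e_i || alpha_i), which equals -log(alpha_ii).

    attention[i][j] is the weight agent i places on agent j.
    eps guards against log(0) when self-attention vanishes.
    """
    return sum(-math.log(max(row[i], eps)) for i, row in enumerate(attention))


def pruning_weight(step, start_step, coeff):
    """Assumed schedule: no pruning before start_step, constant weight after."""
    return coeff if step >= start_step else 0.0


def total_loss(td_loss, attention, step, start_step, coeff):
    """TD loss plus the weighted pruning term."""
    weight = pruning_weight(step, start_step, coeff)
    return td_loss + weight * pruning_loss(attention)


# Fully pruned attention (identity matrix) incurs no penalty,
# while uniform attention over two agents costs log(2) per agent.
pruned = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss_pruned = pruning_loss(pruned)
loss_uniform = pruning_loss(uniform)
```

Because the penalty depends only on the diagonal entries, the gradient pushes each agent's self-attention toward 1, which is what makes dropping teammate messages at execution time nearly lossless.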
5. Training Workflow
A stylized outline of the CADP training process is as follows:
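- Initialize networks and replay buffer.
- For each timestep:
  - Agents observe $o_i^t$, update hidden state $h_i^t$, and compute query, key, and value embeddings $q_i$, $k_i$, $v_i$.
  - Agents receive $k_j$, $v_j$ from all teammates and compute the attention coefficients $\alpha_{ij}$ and aggregated advice $m_i$.
  - Each agent computes $Q_i$ and selects $a_i$.
  - Joint action $\mathbf{a}$ is executed and the transition is stored in the buffer.
- Periodically sample minibatches, compute the TD loss $L_{TD}$, and update networks.
- From pruning threshold $T_p$ onward, add the weighted pruning loss $\lambda L_p$ to the loss.
- Execution uses only local streams $h_i$, $v_i$ for action selection (Zhou et al., 2023).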
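The difference between the training-time and execution-time forward passes in the workflow above can be sketched as a single function with a mode switch. This is an illustration under assumptions rather than the authors' implementation: linear projections, a linear output head, the parameter names `W_q`, `W_k`, `W_v`, `W_out`, and all numeric values are hypothetical. The two modes appear to correspond to the CADP(C) and CADP(D) variants reported in the evaluation.

```python
import math


def linear(weights, x):
    """Apply a weight matrix (list of rows) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def agent_q_values(i, hidden, params, centralized):
    """Q-values of agent i given all agents' hidden states.

    centralized=True  -> attend over every agent's key/value (training).
    centralized=False -> use only agent i's own value vector (execution).
    """
    v_i = linear(params["W_v"], hidden[i])
    if centralized:
        q_i = linear(params["W_q"], hidden[i])
        keys = [linear(params["W_k"], h) for h in hidden]
        vals = [linear(params["W_v"], h) for h in hidden]
        scale = math.sqrt(len(q_i))
        alpha = softmax(
            [sum(a * b for a, b in zip(q_i, k)) / scale for k in keys]
        )
        advice = [
            sum(alpha[j] * vals[j][c] for j in range(len(hidden)))
            for c in range(len(v_i))
        ]
    else:
        # Equivalent to one-hot self-attention: the advice is the own value.
        advice = v_i
    return linear(params["W_out"], advice)


def greedy_action(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


identity = [[1.0, 0.0], [0.0, 1.0]]
params = {
    "W_q": identity,
    "W_k": identity,
    "W_v": identity,
    "W_out": [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]],  # 3 hypothetical actions
}
hidden = [[1.0, 0.0], [0.0, 1.0]]
q_dec = agent_q_values(0, hidden, params, centralized=False)
q_cen = agent_q_values(0, hidden, params, centralized=True)
```

In decentralized mode the result for agent 0 is unaffected by any change to the other agent's hidden state, which is the property the pruning loss is meant to make safe to rely on at deployment.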
6. Empirical Evaluation
CADP was evaluated on StarCraft II micromanagement (SMAC) and Google Research Football (GRF) benchmarks. Metrics include the average test win rate over 5 seeds. Results are shown for the QMIX backbone:
SMAC win-rate (mean±std):
| Method | 5m_vs_6m | corridor | 3s5z_vs_3s6z |
|---|---|---|---|
| QMIX (CTDE) | 0.43±0.13 | 0.70±0.35 | 0.24±0.36 |
| QMIX + CADP(C) | 0.68±0.08 | 0.85±0.04 | 0.94±0.03 |
| QMIX + CADP(D) | 0.68±0.08 | 0.84±0.03 | 0.93±0.03 |
GRF win-rate (mean±std):
| Method | 3v1_keeper | counterattack |
|---|---|---|
| QMIX (CTDE) | 0.58±0.21 | 0.24±0.13 |
| QMIX + CADP(C/D) | 0.77±0.00 | 0.64±0.15 |
Ablations show that CADP’s superiority persists despite reduced agent field-of-view and is robust to different values of the pruning parameters. The CADP mechanism provides consistent improvements when integrated with other backbones including VDN, QPLEX, and MAPPO (Zhou et al., 2023).
7. Key Findings and Impact
- Centralized advising during training facilitates more efficient joint-policy exploration than standard CTDE or teacher-student distillation, leveraging richer global interactions.
- Smooth KL-based pruning achieves strictly decentralized execution with negligible performance loss.
- CADP consistently outperforms leading CTDE and teacher-student methods in both StarCraft II and GRF tasks.
- The framework uses only lightweight modules—cross-attention-based advising and a KL divergence loss—yet yields considerable increases in sample efficiency and final performance.
- The approach addresses the core “not centralized enough” limitation of CTDE by transparently trading cross-agent reliance at training for decentralized deployment compliance (Zhou et al., 2023).