Centralized Advising & Decentralized Pruning (CADP)

Updated 6 May 2026
  • CADP is a novel cooperative multi-agent reinforcement learning framework that uses centralized advising via cross-attention during training to enhance joint-policy exploration.
  • It employs a KL-divergence based pruning loss to smoothly transition from inter-agent communication to strict decentralized execution.
  • Empirical results on SMAC and GRF benchmarks demonstrate CADP’s superiority over standard CTDE methods with higher win rates and improved sample efficiency.

Centralized Advising and Decentralized Pruning (CADP) is a novel framework in cooperative multi-agent reinforcement learning (MARL) that enhances the exploitation of global information during centralized training while meeting the practical requirement of policy decentralization during execution. CADP extends the dominant Centralized Training with Decentralized Execution (CTDE) paradigm by introducing a formal mechanism for explicit message exchange—termed "centralized advising"—during training and a smooth transition to strictly independent local policies—termed "decentralized pruning"—for evaluation and deployment (Zhou et al., 2023).

1. Motivation and Context

Standard CTDE frameworks rely on the independence of agent policies $\pi^n(u^n|\tau^n)$, where each agent $n$ conditions its local policy solely on its individual observation-action history $\tau^n$, with global state $s$ injected only through a centralized mixing or critic network. While CTDE allows the use of global state for value or advantage calculation, agents cannot access teammates’ hidden states or intermediate belief structures during training, leading to inefficient joint-policy exploration and potentially suboptimal convergence. CADP addresses this shortcoming by facilitating advice exchange via latent message passing between agents at training time, followed by a principled removal of communication dependencies to retain strict decentralization at test time (Zhou et al., 2023).

2. Formal Problem Specification

CADP is formulated within the cooperative Decentralized Partially Observable Markov Decision Process (Dec-POMDP) setting, $\langle \mathcal{A}, \mathcal{S}, \mathcal{U}, P, r, \Omega, O, \gamma \rangle$, with $N$ agents $\mathcal{A} = \{1, \ldots, N\}$, global states $s \in \mathcal{S}$, individual actions $u^n_t \in \mathcal{U}$, transition kernel $P(s_{t+1}|s_t, \mathbf{u}_t)$, team reward $r(s_t, \mathbf{u}_t)$, local observations $o^n_t \in \Omega$ generated by the observation function $O$, and local histories $\tau^n$. The objective is to optimize the joint $Q$-function:

$$Q_{tot}(s_t, \mathbf{u}_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t, \mathbf{u}_t\right]$$

In CTDE, each decentralized local policy $\pi^n(u^n|\tau^n)$ is trained individually, while CADP relaxes this restriction at training by permitting cross-agent attention and advice.
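The discounted team return inside the joint $Q$-function can be illustrated with a few lines of Python. This is a minimal sketch: the function name `discounted_return` and the toy reward sequence are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Toy episode: three shared team rewards and a discount factor of 0.5.
episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The joint $Q$-value is the expectation of this quantity over trajectories that start from a given state and joint action.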

3. Centralized Advising Module

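During training, each agent $n$ processes its local observation history $\tau^n$ using a GRU encoder:

$$h^n_t = \mathrm{GRU}(o^n_t, u^n_{t-1}, h^n_{t-1})$$

This hidden state is projected into query $q^n_t$, key $k^n_t$, and value $v^n_t$ embeddings:

$$q^n_t = W_q h^n_t, \quad k^n_t = W_k h^n_t, \quad v^n_t = W_v h^n_t$$

For each pair $(n, m)$, a cross-attention coefficient is computed:

$$\alpha^{n,m}_t = \frac{\exp\left(q^n_t \cdot k^m_t / \sqrt{d}\right)}{\sum_{j=1}^{N} \exp\left(q^n_t \cdot k^j_t / \sqrt{d}\right)}$$

Teammate advice is aggregated as:

$$z^n_t = \sum_{m=1}^{N} \alpha^{n,m}_t v^m_t$$

Agent $n$ then calculates its action-value as:

$$Q^n(\tau^n, \cdot) = f_Q(z^n_t)$$

These local Q-values are integrated by a value mixing network (e.g., QMIX), using standard TD loss:

$$\mathcal{L}_{TD} = \left(y - Q_{tot}(s_t, \mathbf{u}_t)\right)^2$$

where $y = r_t + \gamma \max_{\mathbf{u}'} Q^{-}_{tot}(s_{t+1}, \mathbf{u}')$. Here $W_q$, $W_k$, $W_v$ are learned projection matrices, $d$ is the key dimension, $f_Q$ is the agent's output head, and $Q^{-}_{tot}$ is the target network.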
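The advising step can be sketched in plain Python as scaled dot-product attention over the agents' hidden states. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`advise`, `matvec`, `softmax`), the $1/\sqrt{d}$ scaling, and the identity projection matrices in the usage lines are all hypothetical choices.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [dot(row, h) for row in W]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def advise(hidden, W_q, W_k, W_v):
    """Return (alpha, z): attention rows and aggregated advice per agent."""
    q = [matvec(W_q, h) for h in hidden]
    k = [matvec(W_k, h) for h in hidden]
    v = [matvec(W_v, h) for h in hidden]
    scale = math.sqrt(len(k[0]))  # assumed scaled dot-product attention
    alpha, z = [], []
    for n in range(len(hidden)):
        # Agent n attends over every agent m, including itself.
        a_n = softmax([dot(q[n], k[m]) / scale for m in range(len(hidden))])
        z_n = [sum(a * v_m[j] for a, v_m in zip(a_n, v))
               for j in range(len(v[0]))]
        alpha.append(a_n)
        z.append(z_n)
    return alpha, z


# Toy usage: three agents, 2-d hidden states, identity projections (assumed).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, z = advise(hidden, identity, identity, identity)
```

Each row of `alpha` is a probability distribution over teammates, so each `z[n]` is a convex combination of the value embeddings.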

4. Decentralized Pruning Mechanism

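To ensure that final policies are fully decentralized, CADP imposes a pruning loss to steer each agent's cross-attention coefficients $\alpha^n = (\alpha^{n,1}, \ldots, \alpha^{n,N})$ to a one-hot vector focused on self: $\alpha^n \to e_n$, where $e_n$ is the one-hot vector at index $n$. The Kullback-Leibler divergence-based pruning loss is

$$\mathcal{L}_{p} = \sum_{n=1}^{N} D_{KL}\left(e_n \,\|\, \alpha^n\right) = -\sum_{n=1}^{N} \log \alpha^{n,n}$$

with time-adaptive weighting

$$\lambda(t) = 0 \ \text{for } t < T, \qquad \lambda(t) > 0 \ \text{for } t \geq T,$$

where $T$ is the pruning threshold, yielding total loss

$$\mathcal{L} = \mathcal{L}_{TD} + \lambda(t)\, \mathcal{L}_{p}$$

During execution, cross-agent communication is dropped, and agents use their own value embedding $v^n$ exclusively.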
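A minimal sketch of the pruning objective follows. It assumes the KL divergence is taken from the one-hot target to the attention row, which reduces to the negative log of each agent's self-attention weight, and it assumes a simple step schedule for the weight. The function names and the constant post-threshold weight are hypothetical, not details confirmed by the source.

```python
import math


def pruning_loss(alpha):
    """Sum over agents of KL(e_n || alpha[n]) = -log(alpha[n][n])."""
    eps = 1e-12  # guards against log(0) when self-attention vanishes
    return sum(-math.log(max(alpha[n][n], eps)) for n in range(len(alpha)))


def pruning_weight(step, threshold, weight):
    """Assumed schedule: no pruning before the threshold, constant after."""
    return weight if step >= threshold else 0.0


def total_loss(td_loss, alpha, step, threshold, weight):
    return td_loss + pruning_weight(step, threshold, weight) * pruning_loss(alpha)


# Uniform attention over two agents is penalised; one-hot self-attention is not.
uniform = [[0.5, 0.5], [0.5, 0.5]]
self_only = [[1.0, 0.0], [0.0, 1.0]]
loss_uniform = pruning_loss(uniform)
loss_self = pruning_loss(self_only)
```

With this form the loss is zero exactly when every agent attends only to itself, which is the condition needed for communication-free execution.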

5. Training Workflow

A stylized outline of the CADP training process is as follows:

  • Initialize networks and replay buffer.
  • For each timestep:
    • Agents observe $o^n_t$, update the hidden state $h^n_t$, and compute the query, key, and value embeddings $q^n_t$, $k^n_t$, $v^n_t$.
    • Agents receive $k^m_t$, $v^m_t$ from all teammates and compute the attention coefficients $\alpha^{n,m}_t$ and the aggregated advice $z^n_t$.
    • Each agent computes $Q^n$ and selects $u^n_t$.
    • The joint action $\mathbf{u}_t$ is executed and the transition is stored in the buffer.
  • Periodically sample minibatches, compute $\mathcal{L}_{TD}$, and update networks.
  • From pruning threshold $T$ onward, add the weighted pruning loss $\lambda(t)\, \mathcal{L}_{p}$ to the loss.
  • Execution uses only the local streams $v^n_t$ for action selection (Zhou et al., 2023).
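The final step works because, once an agent's attention row is one-hot on itself, the aggregated advice collapses to the agent's own value embedding, so teammates' messages can be dropped without changing the output. The check below is a small sketch of that identity; the helper `aggregate` and the toy embeddings are illustrative assumptions.

```python
def aggregate(alpha_n, values):
    """Attention-weighted sum of value embeddings for one agent."""
    dim = len(values[0])
    return [sum(a * v[j] for a, v in zip(alpha_n, values)) for j in range(dim)]


# Toy value embeddings for three agents (hypothetical numbers).
values = [[0.5, -1.0], [2.0, 3.0], [-4.0, 0.25]]

for n, own_value in enumerate(values):
    one_hot = [1.0 if m == n else 0.0 for m in range(len(values))]
    # With fully pruned attention, advice equals the agent's own embedding.
    assert aggregate(one_hot, values) == own_value
```

If pruning leaves residual weight on teammates, the two quantities differ slightly, which would show up as a gap between centralized and decentralized evaluation.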

6. Empirical Evaluation

CADP was evaluated on StarCraft II micromanagement (SMAC) and Google Research Football (GRF) benchmarks. The reported metric is the average test win rate over 5 seeds. Results for the QMIX backbone are shown below.

SMAC win-rate (mean±std):

| Method | 5m_vs_6m | corridor | 3s5z_vs_3s6z |
| --- | --- | --- | --- |
| QMIX (CTDE) | 0.43±0.13 | 0.70±0.35 | 0.24±0.36 |
| QMIX + CADP(C) | 0.68±0.08 | 0.85±0.04 | 0.94±0.03 |
| QMIX + CADP(D) | 0.68±0.08 | 0.84±0.03 | 0.93±0.03 |

GRF win-rate (mean±std):

| Method | 3v1_keeper | counterattack |
| --- | --- | --- |
| QMIX (CTDE) | 0.58±0.21 | 0.24±0.13 |
| QMIX + CADP(C/D) | 0.77±0.00 | 0.64±0.15 |
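To make the size of the gaps concrete, the absolute win-rate gains of the CADP rows over the QMIX baseline can be computed directly from the mean values in the tables above. The snippet is plain arithmetic on those reported means: it uses the CADP(D) row for SMAC and ignores the standard deviations, and the dictionary layout and function name are illustrative.

```python
# (QMIX mean, QMIX + CADP mean) per task, copied from the tables above.
smac = {
    "5m_vs_6m": (0.43, 0.68),
    "corridor": (0.70, 0.84),
    "3s5z_vs_3s6z": (0.24, 0.93),
}
grf = {
    "3v1_keeper": (0.58, 0.77),
    "counterattack": (0.24, 0.64),
}


def gains(table):
    """Absolute win-rate gain of CADP over the QMIX baseline, per task."""
    return {task: round(cadp - qmix, 2) for task, (qmix, cadp) in table.items()}


smac_gains = gains(smac)
grf_gains = gains(grf)
```

The largest gain in these tables is on 3s5z_vs_3s6z, and the smallest is on corridor, where the baseline also has the highest mean.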

Ablations show that CADP’s superiority persists despite reduced agent field-of-view and is robust to different settings of the pruning hyperparameters. The CADP mechanism provides consistent improvements when integrated with other backbones including VDN, QPLEX, and MAPPO (Zhou et al., 2023).

7. Key Findings and Impact

  • Centralized advising during training facilitates more efficient joint-policy exploration than standard CTDE or teacher-student distillation, leveraging richer global interactions.
  • Smooth KL-based pruning achieves strictly decentralized execution with negligible performance loss.
  • CADP consistently outperforms leading CTDE and teacher-student methods in both StarCraft II and GRF tasks.
  • The framework uses only lightweight modules—cross-attention-based advising and a KL divergence loss—yet yields considerable increases in sample efficiency and final performance.
  • The approach addresses the core “not centralized enough” limitation of CTDE by transparently trading cross-agent reliance at training for decentralized deployment compliance (Zhou et al., 2023).