
Centralized Training with Decentralized Execution

Updated 4 December 2025
  • Centralized Training with Decentralized Execution is a framework where agents benefit from full global information during training to overcome challenges of partial observability.
  • It utilizes methods like value-decomposition and centralized-critic actor-critic algorithms to optimize joint rewards and improve coordination across agents.
  • CTDE has shown significant empirical success in domains such as StarCraft II micromanagement, transportation, and smart grid management by balancing training efficiency and execution constraints.

Centralized Training with Decentralized Execution (CTDE) is a foundational paradigm in multi-agent reinforcement learning (MARL) that exploits centralized information during the learning phase to facilitate coordination and credit assignment but imposes strict locality constraints at test time, ensuring that each agent executes its policy using only local observations or histories. This framework is particularly effective for cooperative Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs), where agents share a joint reward but are only partially informed about the global state during execution. CTDE has enabled progress in domains such as StarCraft II micromanagement, transportation infrastructure management, and resilient electric vehicle charging, and has motivated algorithmic advances in value-decomposition, actor-critic architectures, and policy distillation.

1. Formal Definition and Theoretical Foundations

In the canonical setting, a team of $n$ agents interacts in a Dec-POMDP with a global, hidden state $s \in \mathcal{S}$, local observations $o_i \in \mathcal{O}_i$, actions $a_i \in \mathcal{A}_i$, and a shared reward $r(s, \mathbf{a})$. Each agent $i$ maintains a policy $\pi_i(a_i \mid \tau_i)$, where $\tau_i$ is its action-observation history. The joint objective is to maximize the expected cumulative discounted reward:

$$V^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s, \boldsymbol{\pi}\right]$$

In CTDE, centralized information (e.g., the full state $s$ or the joint observations $\boldsymbol{\tau}$) is harnessed during training to optimize policies or credit assignment, but the policies $\pi_i$ must be executable based solely on agent-specific information at test time (Zhao et al., 2022, Amato, 4 Sep 2024).
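
As a concrete illustration of the execution-time constraint, the minimal Python sketch below rolls out decentralized policies in a generic Dec-POMDP-style environment and accumulates the discounted team return. The `env` interface (`reset`/`step` returning per-agent observations and a single shared reward) and the `policies` callables are illustrative assumptions, not an API from the cited papers.

```python
def rollout_return(env, policies, gamma=0.99, max_steps=100):
    """Discounted team return under decentralized execution:
    agent i selects a_i from its own action-observation history tau_i only."""
    obs = env.reset()                      # assumed: list of local observations o_i
    histories = [[o] for o in obs]         # tau_i starts with the first o_i
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        # Decentralized execution: agent i sees only tau_i, never the global state s.
        actions = [pi(tau) for pi, tau in zip(policies, histories)]
        obs, reward, done = env.step(actions)   # assumed: shared team reward r(s, a)
        ret += discount * reward
        discount *= gamma
        for tau, a, o in zip(histories, actions, obs):
            tau.extend([a, o])             # append (a_i, o_i) to tau_i
        if done:
            break
    return ret
```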

CTDE-based algorithms rely formally on the Individual–Global–Max (IGM) principle, which ensures that joint actions maximizing the centralized value function correspond to the greedy choices of the local value functions. For value-decomposition methods, this is expressed as:

$$\arg\max_{\mathbf{a}} Q_{\mathrm{tot}}(s, \mathbf{a}) = \Bigl( \arg\max_{a_1} Q_1(o_1, a_1), \ldots, \arg\max_{a_n} Q_n(o_n, a_n) \Bigr)$$

This condition guarantees decentralizability of execution while permitting sophisticated joint optimization during training (Amato, 4 Sep 2024).
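
The toy Python check below, using randomly generated per-agent Q-tables (all names and dimensions are illustrative assumptions), shows how an additive decomposition (the VDN form introduced in Section 2) satisfies IGM: the decentralized per-agent argmaxes coincide with the joint argmax of $Q_{\mathrm{tot}}$.

```python
from itertools import product

import numpy as np

# Toy per-agent Q-values Q_i(o_i, .) for a fixed joint observation (illustrative).
n_agents, n_actions = 3, 4
rng = np.random.default_rng(0)
local_q = [rng.standard_normal(n_actions) for _ in range(n_agents)]

# Decentralized greedy execution: each agent argmaxes its own Q_i.
greedy_joint = tuple(int(np.argmax(q)) for q in local_q)

# Additive (VDN-style) Q_tot: the sum of the local Q-values.
def q_tot(joint_action):
    return sum(q[a] for q, a in zip(local_q, joint_action))

# Centralized joint maximization over all |A|^n joint actions.
best_joint = max(product(range(n_actions), repeat=n_agents), key=q_tot)
assert best_joint == greedy_joint  # IGM holds for the additive decomposition
```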

2. Core Algorithmic Classes

Several algorithmic frameworks instantiate the CTDE paradigm. The two principal families are value-factorization and centralized-critic actor-critic methods.

Value-Decomposition Approaches

  • VDN: Approximates the joint action-value as a sum of local Q-values:

$$Q_{\mathrm{tot}}(h, \mathbf{a}) \approx \sum_{i=1}^{n} Q_i(h_i, a_i)$$

  • QMIX: Employs a monotonic mixing network $Q_{\mathrm{tot}}(h, s, \mathbf{a}) = f_{\mathrm{mix}}(Q_1(h_1, a_1), \ldots, Q_n(h_n, a_n); s)$, enforcing $\partial Q_{\mathrm{tot}} / \partial Q_i \geq 0$, with the global state $s$ as input (Amato, 4 Sep 2024); a minimal mixing-network sketch is given after this list.
  • QTRAN/QPLEX: Relax IGM’s constraints, incorporating additional decomposition or linearity conditions to extend representability (Amato, 4 Sep 2024, Zhou et al., 2023).
  • CTDS: Adds a “teacher” network with access to global state to generate per-agent Q-values, which are then distilled into decentralized “student” Q-networks solely dependent on local information, improving coordination under extreme partial observability (Zhao et al., 2022).
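
A minimal PyTorch sketch of a QMIX-style monotonic mixer follows. It is a simplification (a single hypernetwork layer rather than the published two-layer mixer, and toy dimensions), intended only to show how state-conditioned, non-negative mixing weights enforce $\partial Q_{\mathrm{tot}}/\partial Q_i \geq 0$.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Simplified QMIX-style mixer: Q_tot = sum_i w_i(s) * Q_i + b(s), with w_i >= 0."""
    def __init__(self, n_agents, state_dim):
        super().__init__()
        self.hyper_w = nn.Linear(state_dim, n_agents)  # mixing weights generated from s
        self.hyper_b = nn.Linear(state_dim, 1)         # state-dependent bias

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) chosen Q_i(h_i, a_i); state: (batch, state_dim)
        w = torch.abs(self.hyper_w(state))             # non-negative => monotone in each Q_i
        b = self.hyper_b(state)
        return (agent_qs * w).sum(dim=1, keepdim=True) + b   # Q_tot(h, s, a)

# Usage during training: mix the chosen per-agent Q-values into Q_tot for the TD loss.
mixer = MonotonicMixer(n_agents=3, state_dim=10)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 10))   # shape (8, 1)
```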

Centralized-Critic Actor-Critic Methods

  • MADDPG/COMA/MAPPO: Each agent is equipped with a decentralized actor $\pi_i(a_i \mid o_i)$, while a (possibly shared) centralized critic $Q^c(s, a_1, \ldots, a_n)$ leverages the global state and joint actions during training. Policy gradients are computed as:

$$\nabla_{\theta_i} J = \mathbb{E}\bigl[ \nabla_{\theta_i} \log \pi_i(a_i \mid o_i) \, Q^c(s, a_1, \ldots, a_n) \bigr]$$

Execution requires only each agent's actor network, preserving strict decentralization (Sharma et al., 2021, Shojaeighadikolaei et al., 18 Apr 2024); a minimal sketch of this gradient update appears after this list.

  • Personalized Training and Distilled Execution (PTDE): Constructs agent-personalized global information via a hypernetwork during training and then distills it into lightweight student networks for execution, achieving superior performance while maintaining decentralization (Chen et al., 2022).
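
The PyTorch sketch below illustrates the centralized-critic policy gradient above for discrete actions; the network sizes, the REINFORCE-style update, and the concatenation of the state with raw action indices are simplifying assumptions for illustration, not a faithful re-implementation of MADDPG, COMA, or MAPPO.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, state_dim, n_actions = 2, 6, 12, 4

# Decentralized actors: pi_i(a_i | o_i) conditions only on the local observation.
actors = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                        nn.Linear(32, n_actions)) for _ in range(n_agents)]
# Centralized critic: Q^c(s, a_1, ..., a_n) sees the global state and joint action.
critic = nn.Sequential(nn.Linear(state_dim + n_agents, 32), nn.ReLU(),
                       nn.Linear(32, 1))

obs = torch.randn(n_agents, obs_dim)       # local observations o_i
state = torch.randn(state_dim)             # global state s (training only)

dists = [torch.distributions.Categorical(logits=actor(o))
         for actor, o in zip(actors, obs)]
actions = torch.stack([d.sample() for d in dists])

# The critic evaluates the joint action in the global state (training time only).
q_joint = critic(torch.cat([state, actions.float()])).squeeze()

# Policy-gradient term: grad_theta_i log pi_i(a_i | o_i) * Q^c(s, a_1, ..., a_n).
actor_loss = -sum(d.log_prob(a) for d, a in zip(dists, actions)) * q_joint.detach()
actor_loss.backward()   # gradients flow into the actors only; the critic value is detached
```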

3. Information Sharing, Critic Centralization, and Factorization

Critic parameterization during training is central to CTDE methodology. Variants include:

  • Centralized Critics: Input joint histories or full state/action tuples; they facilitate bias-free policy gradients in fully observable settings, but introduce bias and excess variance under partial observability unless conditioned on the agent histories $\boldsymbol{\tau}$ (Lyu et al., 26 Aug 2024, Lyu et al., 2021). Empirical results show that history-based critics are necessary to avoid representational collapse and performance degradation in partially observable tasks.
  • Decentralized Critics: Each agent learns its own value function on local information. Such methods avoid variance inflation due to sampling over joint agent histories/actions but may converge more slowly or to suboptimal equilibria due to nonstationarity (Lyu et al., 2021).
  • Hybrid/Conditional Factorization: Approaches like MACPF employ a chain-rule factorization, $\pi(\mathbf{a} \mid \boldsymbol{\tau}) = \prod_i \pi_i(a_i \mid \boldsymbol{\tau}, a_{1:i-1})$, permitting fully centralized optimization and guaranteeing the existence of an independent factorized policy with no return loss for decentralized execution (Wang et al., 2022); a minimal sketch of this factorization follows this list.
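
A minimal PyTorch sketch of such a chain-rule (auto-regressive) joint policy is given below; the per-agent linear heads, dimensions, and sequential sampling order are illustrative assumptions, and the dependence on previously chosen actions is precisely the part that MACPF-style methods remove or distill away for decentralized execution.

```python
import torch
import torch.nn as nn

n_agents, hist_dim, n_actions = 3, 8, 5
# Head i consumes the joint history tau plus the i previously chosen actions a_{1:i-1}.
heads = [nn.Linear(hist_dim + i, n_actions) for i in range(n_agents)]

def sample_joint(tau):
    """Sample a joint action sequentially; log-probabilities combine via the chain rule."""
    actions, log_prob = [], 0.0
    for head in heads:
        prev = torch.tensor(actions, dtype=torch.float32)                    # a_{1:i-1}
        dist = torch.distributions.Categorical(logits=head(torch.cat([tau, prev])))
        a = dist.sample()
        log_prob = log_prob + dist.log_prob(a)           # log pi_i(a_i | tau, a_{1:i-1})
        actions.append(int(a))
    return actions, log_prob

joint_action, logp = sample_joint(torch.randn(hist_dim))   # e.g., [2, 0, 4]
```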

Theoretical analyses show that while centralized critics induce lower bias in fully observable or only mildly partially observable settings, they can inflate variance, particularly through multi-action or multi-observation variance terms. Decentralized critics maintain lower variance at the possible cost of increased nonstationarity (Lyu et al., 2021, Lyu et al., 26 Aug 2024).

4. Extensions: Distillation, Curriculum, and Communication

Contemporary research proposes additional structural enhancements atop the CTDE scaffold:

  • Policy and Value Distillation: Approaches such as CTEDD and teacher-student frameworks perform centralized training (often with maximum-entropy or stochastic exploration) and subsequently supervise decentralized policy networks to mimic the optimal centralized behaviors, with strong empirical retention of the centralized performance in the distilled agents (Chen, 2019, Zhao et al., 2022); a minimal distillation sketch is given after this list.
  • Progressive Communication Pruning (CADP, TACO): Explicit communication channels or attention-based information flow are permitted during early or intermediate training phases for more effective coordination, followed by explicit pruning or reconstruction losses that enforce pure decentralization at execution. Such curricula systematically bridge the gap between full communication and strict local execution, and maintain or exceed the performance of CTDE baselines (Li et al., 2023, Zhou et al., 2023).
  • Generative Inference: In settings where communication may be unreliable or absent at execution, centralized training includes generative models (e.g., conditional WGANs) to help agents infer missing peer observations from local context, providing robustness to network disruptions (Corder et al., 2019).
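
The sketch below illustrates the teacher-student distillation step in this spirit; the single linear layers, batch sizes, and mean-squared-error objective are toy assumptions rather than the published CTDS or PTDE architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, state_dim, n_actions = 6, 12, 4
teacher = nn.Linear(obs_dim + state_dim, n_actions)   # centralized: sees the global state s
student = nn.Linear(obs_dim, n_actions)               # decentralized: local observation only

obs = torch.randn(32, obs_dim)      # batch of local observations o_i
state = torch.randn(32, state_dim)  # matching global states, available during training

with torch.no_grad():
    target_q = teacher(torch.cat([obs, state], dim=1))   # teacher Q-values as fixed targets

# Distillation: the student mimics the teacher from local information alone,
# so it can be deployed at execution time without any centralized input.
distill_loss = F.mse_loss(student(obs), target_q)
distill_loss.backward()   # updates the student network only
```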

5. Empirical Results and Practical Considerations

CTDE methods have achieved state-of-the-art results across a spectrum of multi-agent benchmarks: SMAC, ma-gym Combat, Google Research Football, particle environments, transportation network management, and EV charging control (Amato, 4 Sep 2024, Saifullah et al., 23 Jan 2024, Shojaeighadikolaei et al., 18 Apr 2024). Experimental findings demonstrate:

  • Centralization in training (regardless of the specific mix of critic centralization, value-mixing, or explicit communication) dramatically improves learning speed, final team win rates, coordination, and robustness under partial observability.
  • In domains with heavy partial observability or tight cooperation requirements, advanced CTDE variants (e.g., CTDS, PTDE, CADP, MACPF) achieve 10–30%+ higher win rates or cost reductions compared to prior baselines (Zhao et al., 2022, Chen et al., 2022, Zhou et al., 2023, Wang et al., 2022).
  • Curriculum designs (progressive pruning or distillation) allow for the seamless transition from centralized capabilities in training to strict autonomy at runtime with negligible or only marginal performance loss (Chen et al., 2022, Li et al., 2023).

A summary of notable empirical outcomes is provided:

| Benchmark / Domain | CTDE Variant | Empirical Benefit |
| --- | --- | --- |
| SMAC (StarCraft II) | CADP, CTDS, PTDE, MACPF | 10–30% increase in win rate on hard maps |
| GRF, LTR | PTDE, CADP | Substantial performance gains, >80% PRR |
| EV charging | CTDE-DDPG | 36% TV reduction, 9% cost reduction |
| Infrastructure management | DDMAC-CTDE | 8% cost savings over CBM, strict constraint adherence |

The above benefits are robust to the choice of value-decomposition or actor-critic base algorithm, provided the CTDE architecture is respected.

6. Limitations, Trade-Offs, and Open Directions

Despite demonstrated success, CTDE presents unresolved challenges:

  • Variance–Bias Trade-Offs: Centralized critics offer reduced bias but potentially catastrophic variance under severe partial observability or many-agent regimes (Lyu et al., 26 Aug 2024, Lyu et al., 2021). Hybrid or conditional critics and new variance-reduction strategies are active research areas.
  • Representational Scope: IGM or monotonicity constraints in value decomposition can unduly restrict the class of learnable joint policies. Approaches such as MACPF and CADP expand this space through richer dependency structures or explicit centralized advising (Wang et al., 2022, Zhou et al., 2023).
  • Training Scalability: Centralized critics and mixers scale poorly with the agent population size because the joint observation and action spaces grow with the number of agents. Factorized or sparse critics, as in DDMAC-CTDE, mitigate this but may trade off learning-signal richness (Saifullah et al., 23 Jan 2024).
  • Transition to Decentralization: Progressive pruning, curriculum learning, or staged distillation must be structured to avoid loss of the coordination abilities fostered by centralization in training (Li et al., 2023, Zhou et al., 2023).

Future research aims to:

  • Systematically investigate partial or hybrid centralization, e.g., via graph neural networks or adaptive critic factorization.
  • Develop sample-efficient, scalable protocols for training with large numbers of agents or in high-dimensional spaces.
  • Refine curriculum and distillation techniques to further close any remaining performance gaps between training and decentralized execution (Chen et al., 2022).
  • Integrate CTDE with real-world constraints such as privacy, communication limits, asynchronous observations, and adversarial agents (Shojaeighadikolaei et al., 18 Apr 2024, Saifullah et al., 23 Jan 2024).

7. Applications and Benchmark Impact

The CTDE paradigm is now standard in cooperative MARL, driving advances across diverse application areas:

  • StarCraft II Micromanagement: Nearly all high-performing teams on SMAC employ CTDE-based value decomposition or actor-critic architectures for robust credit assignment and coordination (Zhou et al., 2023, Chen et al., 2022).
  • Large-Scale Infrastructure: DDMAC-CTDE achieves near-optimal cost minimization and constraint satisfaction for transportation asset management, scaling to joint state spaces on the order of $10^{96}$ to $10^{248}$ states (Saifullah et al., 23 Jan 2024).
  • Smart Grid and EV Charging: CTDE-DDPG results in both smoother and fairer charging curves and cost profiles, outperforming decentralized DDPG baselines by substantial margins (Shojaeighadikolaei et al., 18 Apr 2024).

By enabling the systematic exploitation of global structure and information in training while preserving strict decentralization at execution, CTDE continues to shape the state of the art in multi-agent sequential decision-making.
