CTDE: Centralized Training, Decentralized Execution
- CTDE is a learning paradigm that uses centralized training with full global information to mitigate nonstationarity and credit assignment issues in multi-agent reinforcement learning.
- It employs methodologies such as value decomposition, centralized critics, and imitation-based strategies to align decentralized policies with the solutions found during centralized training.
- CTDE enhances sample efficiency and coordination in complex environments by addressing challenges such as partial observability and communication constraints.
Centralized Training, Decentralized Execution (CTDE) is a dominant learning paradigm in cooperative and mixed multi-agent reinforcement learning (MARL). The core idea is to leverage additional global or joint information during training—when communication, full observability, or centralized computation may be feasible—to produce policies capable of decentralized execution, where each agent acts solely on its local observation. This framework is designed to mitigate problems of nonstationarity, credit assignment, and partial observability that naturally arise in multi-agent settings. CTDE encompasses a spectrum of algorithmic architectures (e.g., value decomposition, centralized critic methods, imitation-based approaches), and is foundational in recent MARL research and application domains.
1. Principles and Mathematical Foundations
The CTDE framework decouples the training and execution regimes. During centralized training, policies or value functions are optimized with access to global state, joint actions, or inter-agent communication. At execution time, all decisions are computed by individual agents based exclusively on local observations and, when permitted, local message exchanges or reconstructed information (Amato, 4 Sep 2024, Amato, 10 May 2024).
Mathematically, many CTDE algorithms can be formalized in terms of decentralized partially observable Markov decision processes (Dec-POMDPs). Let $o_i$ denote agent $i$'s local observation (with action-observation history $\tau_i$) and $a_i$ its action. A joint policy is $\boldsymbol{\pi} = (\pi_1, \dots, \pi_n)$ with decentralized components $\pi_i(a_i \mid \tau_i)$, and a global Q-value $Q_{tot}(s, \mathbf{a})$ is trained using joint trajectories:

$$\mathcal{L}(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{\mathbf{a}'} Q_{tot}(s', \mathbf{a}'; \theta^{-}) - Q_{tot}(s, \mathbf{a}; \theta)\big)^{2}\Big].$$

In value function factorization (e.g., VDN, QMIX) the global value decomposes via:

$$Q_{tot}(s, \mathbf{a}) = f\big(Q_1(\tau_1, a_1), \dots, Q_n(\tau_n, a_n); s\big),$$

where $f$ is chosen under monotonicity or other constraints so that maximizing each $Q_i$ locally is compatible with coordinated, but decentralized, action choices (the "Individual-Global-Max" or IGM property) (Amato, 4 Sep 2024, Hu et al., 2023). Actor-critic formulations often use a centralized critic for policy improvement steps, for example:

$$\nabla_{\theta_i} J = \mathbb{E}\big[\nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i)\,\big(Q_{\boldsymbol{\pi}}(s, \mathbf{a}) - b(s)\big)\big],$$

where $b(s)$ is a baseline.
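For concreteness, the following minimal NumPy sketch (placeholder utilities, not taken from any cited implementation) illustrates the IGM property under an additive, VDN-style decomposition: each agent's locally greedy action coincides with the jointly greedy action of $Q_{tot} = \sum_i Q_i$.

```python
# Minimal IGM illustration under additive (VDN-style) factorization.
# Utilities are random placeholders; real Q_i would come from per-agent networks.
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4
q_i = rng.normal(size=(n_agents, n_actions))  # Q_i(tau_i, a_i) for one timestep

# Decentralized execution: each agent maximizes its own utility locally.
greedy_local = tuple(q_i.argmax(axis=1))

# Centralized view: enumerate Q_tot over all joint actions and take the argmax.
joint_actions = list(product(range(n_actions), repeat=n_agents))
q_tot = [sum(q_i[i, a[i]] for i in range(n_agents)) for a in joint_actions]
greedy_joint = joint_actions[int(np.argmax(q_tot))]

# Under additive (and, more generally, monotonic) mixing the two coincide (IGM).
assert greedy_local == greedy_joint
```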
2. Canonical Algorithmic Structures
CTDE supports several canonical methodologies:
- Value Decomposition: Additive mixing ($Q_{tot} = \sum_i Q_i$) as in Value-Decomposition Networks (VDN), and monotonic mixing as in QMIX ($Q_{tot} = f_{mix}(Q_1, \dots, Q_n; s)$ with $\partial Q_{tot}/\partial Q_i \ge 0$, i.e., monotonic in each argument) (Amato, 4 Sep 2024, Amato, 10 May 2024, Hu et al., 2023); a minimal mixing-network sketch follows this list. TVDO introduces Tchebycheff-based nonlinear aggregation satisfying the IGM condition exactly (Hu et al., 2023).
- Centralized Critic Approaches: Each agent’s actor is decentralized and executed locally, while a centralized critic, available only during training, estimates the Q-function on the joint observation-action space (e.g., MADDPG, COMA, MAPPO, MASAC) (Amato, 10 May 2024, Xu et al., 2023, Saifullah et al., 23 Jan 2024, Shojaeighadikolaei et al., 18 Apr 2024).
- Imitation-Based CTDE: Methods such as CESMA (Lin et al., 2019) train a centralized expert on the full joint observation and use DAgger-style imitation learning to produce decentralized agents, with theoretical regret bounds connecting supervised loss to return gap.
- Teacher–Student and Knowledge Distillation Schemes: Centralized “teacher” models (with full observability) train “student” models using only local information via loss coupling (as in CTDS, PTDE) (Zhao et al., 2022, Chen et al., 2022).
- Communication Reconstruction Paradigms: Recent methods (e.g., TACO, SICA) train with explicit inter-agent communication but phase it out during training via progressive reconstruction, yielding “tacit” cooperation implementable without communication at execution (Li et al., 2023, Liu et al., 20 Dec 2024).
- Model-Based CTDE: Approaches such as MAMBA use centralized model-based rollouts (e.g., RSSM, communication Transformer blocks) for sample-efficient training, supporting global information flow and decentralized deployment (Egorov et al., 2022).
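As a concrete instance of the mixing-network idea above, the sketch below outlines a QMIX-style monotonic mixer in PyTorch. Layer sizes and module names are illustrative assumptions rather than the published architecture; the essential point is that state-conditioned hypernetworks with non-negative weights keep $\partial Q_{tot}/\partial Q_i \ge 0$.

```python
# Hedged sketch of a QMIX-style monotonic mixer (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent utilities Q_i into Q_tot, monotonic in each Q_i."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks condition the mixing weights on the global state (training only).
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        # Non-negative weights keep dQ_tot/dQ_i >= 0, preserving the IGM property.
        return (torch.bmm(hidden, w2) + b2).view(-1)
```

During training the mixer consumes the global state; at execution time each agent simply acts greedily with respect to its own $Q_i$, which the monotonicity constraint keeps consistent with maximizing $Q_{tot}$.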
3. Performance, Scalability, and Theoretical Guarantees
CTDE algorithms demonstrate improved sample efficiency and coordination compared to fully decentralized learning, especially in environments with partial observability or nonstationarity (Amato, 4 Sep 2024, Amato, 10 May 2024). Centralized critics and mixing networks facilitate credit assignment by relating the shared return signal to individual agents' contributions, providing richer feedback during policy updates.
Monotonic factorization (as in QMIX) and its variants guarantee that independent local greedy policies jointly optimize $Q_{tot}$, but representational constraints may result in underestimation of optimal actions ("relative overgeneralization") (Zhang et al., 5 Feb 2025). TVDO addresses this by constructing a bias term using Tchebycheff aggregation so that global optimality and local greedy actions exactly coincide (Hu et al., 2023).
Empirically, CTDE methods achieve state-of-the-art performance on benchmark environments such as SMAC, GRF, MPE, and complex domains including transportation infrastructure management (e.g., DDMAC-CTDE achieves up to 31% lower costs than practical baselines) (Saifullah et al., 23 Jan 2024), voltage control with >100 agents (Xu et al., 2023), and EV charging control with 36% reduction in total variation and 9% reduced cost (Shojaeighadikolaei et al., 18 Apr 2024). Knowledge distillation approaches (CTDS, PTDE) and information selection frameworks (SICA) exhibit robust learning even as local information diminishes (Zhao et al., 2022, Chen et al., 2022, Liu et al., 20 Dec 2024).
Recent frameworks provide monotonic policy improvement guarantees (e.g., MAGPO), with performance-difference lemmas guaranteeing that each update increases expected return under mild conditions (Li et al., 24 Jul 2025). Many works analyze the trade-off between scalability, sample efficiency, and the practical constraints of partial observability or communication (Marchesini et al., 2021, Egorov et al., 2022).
4. Enhancements, Limitations, and Scalability
Despite many advances, several technical challenges remain:
- Information Bottlenecks: CTDE critics may be hampered by the curse of dimensionality as the number of agents increases. Scalable network-aware frameworks (SNA) address this by truncating critic inputs to local neighborhoods, exploiting the decay of interactions in physical networks to preserve performance guarantees with bounded error (Xu et al., 2023); a minimal truncation sketch follows this list.
- Partial Observability and Communication Constraints: Progressive distillation and tacit learning strategies (e.g., TACO, SICA) diminish dependence on explicit communication by reconstructing global signals from local histories. These mechanisms provide competitive or better performance compared to communication-based and conventional CTDE methods, particularly under strict communication or bandwidth limits (Li et al., 2023, Liu et al., 20 Dec 2024).
- Personalization vs. Homogeneity: Applying identical global information to all agents is suboptimal. PTDE introduces agent-personalized global information generation and distillation, greatly improving performance retention when agents must execute locally (Chen et al., 2022).
- Flexible Integration: Frameworks such as CADP and CTDS generalize CTDE to incorporate explicit message advice channels and smooth transitions to decentralized execution, demonstrating applicability with diverse backbone algorithms (Zhou et al., 2023, Zhao et al., 2022).
- Competitive and Mixed Settings: CTDE, though naturally suited for cooperation, can generalize to symmetric two-team Markov games or competitive situations by strategically conditioning mixing networks or critics; population training and opponent diversity significantly improve robustness in team competition (Leroy et al., 2022, Amato, 4 Sep 2024).
- Fully Decentralized Alternatives: Fully decentralized and consensus-based methods challenge the necessity of centralized training in certain applications, particularly in distributed resource allocation or random-access network optimization, offering similar performance with far lower communication overhead by exchanging only scalar rewards (Oh et al., 9 Aug 2025).
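As an illustration of the network-aware truncation idea in the first bullet above, the following NumPy sketch (hypothetical function names and shapes) builds a critic input from only an agent's k-hop neighborhood rather than the full joint observation-action vector.

```python
# Illustrative sketch: truncating a centralized critic's input to an agent's
# k-hop neighborhood on a physical network (all names and shapes hypothetical).
import numpy as np

def khop_neighbors(adj: np.ndarray, agent: int, k: int) -> np.ndarray:
    """Indices of agents within k hops of `agent` (including itself)."""
    reach = np.eye(adj.shape[0], dtype=bool)[agent]
    for _ in range(k):
        reach = reach | adj[:, reach].any(axis=1)
    return np.flatnonzero(reach)

def truncated_critic_input(obs: np.ndarray, acts: np.ndarray,
                           adj: np.ndarray, agent: int, k: int) -> np.ndarray:
    """Concatenate only the neighborhood's observations and actions,
    instead of the full joint observation-action vector."""
    nbrs = khop_neighbors(adj, agent, k)
    return np.concatenate([obs[nbrs].ravel(), acts[nbrs].ravel()])

# Example: 5 agents on a line graph; agent 2's 1-hop critic input covers agents 1-3.
adj = (np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)).astype(bool)
rng = np.random.default_rng(0)
obs, acts = rng.random((5, 3)), rng.random((5, 1))
x = truncated_critic_input(obs, acts, adj, agent=2, k=1)  # 12-dimensional input
```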
5. Recent Algorithmic Innovations and Hybrid Directions
Recent research pushes CTDE toward more robust and flexible architectures:
- Centralized Advising and Pruning (CADP): Direct message-based cross-attention among agents during training, followed by explicit pruning that collapses the attention mechanism for independent execution (Zhou et al., 2023), mitigates CTDE's independence assumption and enables richer joint-policy exploration (a minimal attention-and-pruning sketch follows this list).
- MAGPO: Centralized autoregressive guider policies train via Policy Mirror Descent and are subsequently projected onto decentralized learners, guaranteeing monotonic improvement while explicitly controlling the alignment between centralized and decentralized policies (Li et al., 24 Jul 2025).
- Centralized Permutation Equivariant (CPE) Policies: CPE architectures use lightweight, scalable Global-Local Permutation Equivariant (GLPE) networks to jointly process per-agent data and global context, delivering scalable centralized execution that substantially outperforms standard CTDE counterparts of both value-decomposition and actor-critic methods (Xu et al., 13 Aug 2025).
- LLM-Powered Decentralized Generative Agents: By moving from CTDE to decentralized frameworks with adaptive hierarchical knowledge graphs and structured communication (DAMCS), new research demonstrates superior scaling and flexibility, particularly in open-world, long-horizon cooperative planning with language-enabled agents (Yang et al., 8 Feb 2025). These approaches avoid centralized long-term planning bottlenecks and fixed cooperation protocols, favoring distributed, memory-rich, language-driven policies.
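To make the advise-then-prune pattern in the first bullet concrete, the sketch below (PyTorch; module names and dimensions are hypothetical, not CADP's published architecture) attends over teammate embeddings through a gate that is annealed toward zero, so the execution-time forward pass depends only on the agent's own local embedding.

```python
# Hedged sketch: training-time cross-attention over teammate embeddings with a
# prune gate annealed toward zero for communication-free decentralized execution.
import torch
import torch.nn as nn

class PrunableAdvice(nn.Module):
    def __init__(self, dim: int = 64, n_heads: int = 4, n_actions: int = 5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.policy_head = nn.Linear(dim, n_actions)

    def forward(self, own: torch.Tensor, others: torch.Tensor, gate: float):
        # own: (batch, 1, dim) local embedding; others: (batch, n-1, dim) teammates.
        advice, _ = self.attn(own, others, others)  # cross-agent "advice" (training only)
        h = own + gate * advice                      # gate -> 0 prunes the advice path
        return self.policy_head(h.squeeze(1))

# During training, gate is annealed from 1.0 toward 0.0; at execution, gate = 0.0
# and the policy depends only on the agent's own local embedding.
```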
6. Applications, Empirical Findings, and Research Directions
CTDE methods are integral to applications in swarm robotics, multi-vehicle control, power networks, infrastructure asset management, real-time traffic control, and large-scale cooperative games. Empirical findings repeatedly demonstrate:
- Strong performance advantage for CTDE versus decentralized (DTE) or centralized (CTE) training, especially as problems scale in agent count or complexity (Amato, 4 Sep 2024, Amato, 10 May 2024).
- Best-in-class sample efficiency and robustness in challenging tasks (SMAC, Flatland, GRF).
- Minimal performance loss when distillation or regeneration from centralized training to decentralized execution is appropriately scheduled and architected (Zhao et al., 2022, Chen et al., 2022, Li et al., 2023, Liu et al., 20 Dec 2024).
- The need for advanced exploration strategies (e.g., optimism-driven or curriculum-based) to overcome the underestimation induced by structural constraints on value decomposition in CTDE (Zhang et al., 5 Feb 2025).
Future research aims at further improving scalability, efficient factorization, automatic selection and distillation of centralized information, enhanced credit assignment, extending CTDE to competitive and mixed environments, and leveraging LLMs for agent policy construction or communication schema (Amato, 4 Sep 2024, Yang et al., 8 Feb 2025, Li et al., 24 Jul 2025). There remains a rich set of questions regarding optimally balancing centralization and decentralization to achieve efficient, generalizable, and robust multi-agent policies.