CTDE: Centralized Training & Decentralized Execution
- CTDE is a paradigm in cooperative MARL that trains agents with global information but executes decisions using local observations.
- It mitigates challenges like nonstationarity and credit assignment through methods such as value function factorization and centralized critics.
- CTDE enables scalable, robust multi-agent systems with applications in smart grids, transportation, and networked control.
Centralized Training with Decentralized Execution (CTDE) is a foundational paradigm in cooperative multi-agent reinforcement learning (MARL), where agents are jointly trained using additional global or shared information but are later deployed to act independently using only local observations. CTDE addresses the core challenges of nonstationarity, credit assignment, and sample inefficiency that arise when multiple adaptive agents interact, while enabling practical scalability and robustness in real-world multi-agent systems. This paradigm underlies a broad array of value function factorization, centralized critic, and policy distillation methods, and continues to inspire substantial research seeking to bridge the gap between joint policy learning and decentralized autonomy.
1. Core Principles and Formal Structure
CTDE separates the learning and execution phases: during centralized training, each agent has access to global signals (joint state, joint actions, other agents' policies, or communication) that are unavailable or restricted during decentralized execution, where each agent conditions its decisions strictly on individual observations or limited local context (Amato, 4 Sep 2024, Amato, 10 May 2024).
The canonical mathematical structure of CTDE is as follows. For a set of $n$ agents in a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), the joint policy during training factors as

$$\pi(a \mid h) = \prod_{i=1}^{n} \pi_i(a_i \mid h_i),$$

where $h_i$ is agent $i$'s local (possibly recurrent) action-observation history. Centralized critics—or mixing networks for value-based approaches—can leverage access to the joint state $s$, joint action $a = (a_1, \ldots, a_n)$, and joint history $h$. During execution, every $\pi_i$ is constrained to rely solely on $h_i$.
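The training/execution split above can be made concrete with a minimal interface sketch. This is an illustrative toy (class and function names are our own, not from any library): each agent's policy sees only its local history, while the value signal used during training may condition on the joint state and joint action.

```python
import random

class DecentralizedAgent:
    """Executes using local information only."""
    def __init__(self, n_actions, seed):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, local_history):
        # Execution: conditions strictly on the agent's own history h_i.
        # (A toy random policy stands in for a learned pi_i.)
        return self.rng.randrange(self.n_actions)

def centralized_value(joint_state, joint_action):
    # Training-only signal: may depend on the global state s and all
    # agents' actions a = (a_1, ..., a_n). Toy scoring function.
    return float(sum(joint_action)) + 0.1 * len(joint_state)

agents = [DecentralizedAgent(n_actions=3, seed=i) for i in range(2)]
joint_action = [ag.act(local_history=[obs]) for ag, obs in zip(agents, ["o0", "o1"])]
value = centralized_value(joint_state=["o0", "o1"], joint_action=joint_action)
```

The key structural point is that `centralized_value` is never called at execution time; only `act`, with its local argument, survives deployment.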
Key advantages of CTDE include:
- Improved coordination through better exploration and more accurate credit assignment (via centralized critics or mixing networks).
- Enhanced sample efficiency by exploiting global feedback channels, reducing the impact of nonstationarity prevalent in independent learning schemes.
- Scalability and practical deployment, as execution does not require synchronized global communication or central controllers (Amato, 4 Sep 2024, Amato, 10 May 2024).
2. Representative Algorithmic Families
Several influential algorithmic frameworks formalize the CTDE principle. These can be grouped into two principal categories:
A. Value Function Factorization
These methods use a joint action-value function $Q_{tot}$, factored into agent-level contributions, typically satisfying a monotonicity or individual-global-max (IGM) condition for decentralized optimality:

$$Q_{tot}(h, a, s) = f\big(Q_1(h_1, a_1), \ldots, Q_n(h_n, a_n);\, s\big),$$

with the constraint $\partial Q_{tot} / \partial Q_i \ge 0$ for all $i$ (as in QMIX), guaranteeing that the joint greedy action can be assembled from per-agent argmax choices (Amato, 4 Sep 2024, Amato, 10 May 2024).
Notable methods include:
- VDN (Value Decomposition Networks): additive factorization $Q_{tot}(h, a) = \sum_{i=1}^{n} Q_i(h_i, a_i)$.
- QMIX: monotonic mixing network leveraging global state for more expressive, but still factorizable, joint Q-values.
- QTRAN, QPLEX, and TVDO: relaxations or nonlinear generalizations of the monotonicity/IGM condition, the last using Tchebycheff-based aggregation for provably tight IGM satisfaction (Hu et al., 2023).
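The decentralizability guarantee of additive (VDN-style) factorization can be checked numerically. The sketch below, using toy random utilities, verifies that the joint greedy action over the summed value table coincides with the tuple of per-agent greedy actions, i.e., the IGM property holds for additive mixing:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4
Q_i = rng.normal(size=(n_agents, n_actions))   # per-agent utilities Q_i(h_i, a_i)

# Joint value table: Q_tot(a_1, a_2, a_3) = Q_1(a_1) + Q_2(a_2) + Q_3(a_3)
grids = np.meshgrid(*Q_i, indexing="ij")
Q_tot = sum(grids)                             # shape (4, 4, 4)

# Greedy joint action vs. independently assembled per-agent argmax choices
joint_greedy = np.unravel_index(np.argmax(Q_tot), Q_tot.shape)
per_agent_greedy = tuple(int(np.argmax(q)) for q in Q_i)
assert joint_greedy == per_agent_greedy        # IGM holds under additive mixing
```

For QMIX-style monotonic mixing the same check goes through as long as the mixing function is nondecreasing in each $Q_i$; a non-monotonic mixer can break the equality, which is exactly the representational limitation QTRAN, QPLEX, and TVDO aim to relax.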
B. Centralized Critic Methods
Actor-critic approaches (MADDPG, MAPPO, COMA) use centralized critics $Q(s, a)$ or $V(s)$ during training—evaluating joint state-action pairs to provide low-variance policy gradients—while directly learning decentralized actors $\pi_i(a_i \mid h_i)$. For example, in MADDPG, the deterministic policy gradient for agent $i$ is:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\Big[ \nabla_{\theta_i} \mu_i(h_i) \, \nabla_{a_i} Q_i(s, a_1, \ldots, a_n) \big|_{a_i = \mu_i(h_i)} \Big].$$

Only the critics require joint information during training; actors are fully decentralized at execution (Shojaeighadikolaei et al., 18 Apr 2024, Amato, 4 Sep 2024).
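The chain-rule structure of this gradient can be illustrated with a one-step toy (the policy and critic below are our own toy choices, not the paper's setup): each agent has a scalar deterministic policy $a_i = \theta_i h_i$, and a centralized critic scores the joint action against the global state.

```python
import numpy as np

def critic(s, a):
    # Centralized critic Q(s, a): rewards joint actions close to the state.
    return -np.sum((a - s) ** 2)

def critic_grad_wrt_actions(s, a):
    # dQ/da_i -- available during centralized training.
    return -2.0 * (a - s)

theta = np.array([0.5, -0.3])   # per-agent policy parameters
h = np.array([1.0, 2.0])        # local observations (all each actor sees)
s = np.array([1.0, 1.0])        # global state, seen only by the critic
a = theta * h                   # decentralized actors: a_i = mu_i(h_i)

# Per-agent policy gradient via the chain rule:
# dJ/dtheta_i = (da_i/dtheta_i) * (dQ/da_i), with da_i/dtheta_i = h_i
grad_theta = h * critic_grad_wrt_actions(s, a)
theta_new = theta + 0.1 * grad_theta   # ascend the critic's value estimate
```

One gradient step improves the critic's evaluation of the joint action, even though each $\theta_i$ is updated using only its own observation and the shared critic signal.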
A table summarizing these approaches:
| Approach | Centralized Signal Used | Decentralization at Execution |
|---|---|---|
| VDN/QMIX | Mixing network, global state | Fully decentralized |
| MADDPG/COMA/MAPPO | Centralized critic(s) | Fully decentralized |
| QPLEX/TVDO | Duplex/flexibly mixed Q-values | Fully decentralized |
3. Techniques for Bridging Global and Local Policies
CTDE research has developed specialized techniques to ensure that coordination and information learned using global context can be faithfully transferred to decentralized policies:
- Policy Distillation: Centralized policies are distilled into decentralized policies by minimizing the divergence between full-state policies and local policy networks, as in CTEDD (Chen, 2019).
- Centralized Teacher / Decentralized Student: A centralized teacher learns with global observations; decentralized students are optimized by matching their Q-values to the teacher's outputs via knowledge distillation, as in CTDS (Zhao et al., 2022).
- Personalization of Global Information: Personalizing global state embeddings for each agent before distillation, as in PTDE (Chen et al., 2022), addresses the issue of irrelevant or redundant global features for locally specialized agents.
- Centralized Advice & Self-Pruning: CADP (Zhou et al., 2023) employs explicit agent-to-agent advice exchange (via cross-attention) during training, then prunes communication so as to produce truly decentralized executing policies.
These techniques are mathematically characterized by dual-loss schemes, where the standard TD or policy loss is augmented by a knowledge distillation, imitation, or attention-based auxiliary loss.
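A minimal numeric sketch of such a dual-loss scheme (toy values throughout; the specific weighting and loss shapes are illustrative assumptions, not any one paper's recipe): a standard TD error on the student's Q-values is augmented with a distillation term pulling them toward a centralized teacher's outputs.

```python
import numpy as np

def td_loss(q_pred, reward, q_next_max, gamma=0.99):
    # Standard one-step TD objective on the student's own estimate.
    target = reward + gamma * q_next_max
    return (q_pred - target) ** 2

def distill_loss(q_student, q_teacher):
    # Knowledge-distillation term: match the teacher's Q-values
    # (an MSE stand-in; KL on softened policies is another common choice).
    return np.mean((q_student - q_teacher) ** 2)

q_student = np.array([1.0, 0.2, -0.5])   # student Q(h_i, .) from local history
q_teacher = np.array([0.8, 0.4, -0.3])   # teacher Q from global information
lam = 0.5                                 # distillation weight (hyperparameter)

total = td_loss(q_pred=q_student[0], reward=1.0, q_next_max=0.9) \
        + lam * distill_loss(q_student, q_teacher)
```

The weight `lam` controls the trade-off the section describes: too small and global knowledge fails to transfer; too large and the student overfits to teacher outputs it cannot reproduce from local observations alone.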
4. Theoretical Guarantees and Consistency Conditions
A major challenge in CTDE is ensuring that the decentralized execution of learned policies can recover the optimal joint policy. The key consistency requirement is the Individual-Global-Max (IGM) condition, formalized as:

$$\arg\max_{a} Q_{tot}(h, a) = \Big( \arg\max_{a_1} Q_1(h_1, a_1), \ldots, \arg\max_{a_n} Q_n(h_n, a_n) \Big).$$
TVDO (Hu et al., 2023) provides a nonlinear Tchebycheff aggregation guaranteeing both necessity and sufficiency for IGM without artificial affine or monotonicity constraints. Theoretical analyses further show that under certain factorization choices and network architectures, the learned decentralized policies can provably reproduce the optimal joint policy.
For centralized critics, convergence and variance properties of policy gradient estimators have been rigorously analyzed, with results indicating equivalence in expected gradient direction between centralized and independent critics when convergence is reached, but generally higher variance in centralized settings (Shojaeighadikolaei et al., 18 Apr 2024).
5. Scalability, Partial Observability, and Practical Limitations
CTDE techniques are motivated by their ability to mitigate the "curse of dimensionality" encountered in fully centralized methods:
- Scalability: By factorizing value functions or critics and limiting input dimensionality (for example, using local neighborhoods in a network-aware critic (Xu et al., 2023)), CTDE scales to tasks with dozens or even hundreds of agents.
- Partial Observability and Robustness: Approaches such as generative inference (Corder et al., 2019) and intrinsic reward shaping (Zhang et al., 26 Jun 2024) extend CTDE to environments with limited or unreliable local observations and sparse rewards, by equipping agents with models for inferring other agents’ states or for aligning action tendencies.
- Communication Constraints: CTDE does not require online communication for execution, but communication during training or via compact message passing can be leveraged for credit assignment or coordination in more complex settings (Egorov et al., 2022, Xu et al., 2023).
Limitations persist. Performance gaps between teacher and student exist in extreme partial observability (Zhao et al., 2022). Large-scale centralized critics or mixing networks can still impose significant computational or bandwidth burdens during training. There remains a risk of overfitting to global features that are absent at execution time, and research has shown that naively using identical global information for all agents can even impair learning (Chen et al., 2022). Curriculum learning, smart initialization of policy structures, and progressive distillation of central information are areas of ongoing research (Liu et al., 20 Dec 2024).
6. Extensions, Applications, and Future Directions
CTDE has been adapted and extended for a broad range of domains:
- Infrastructure Management: DDMAC-CTDE enables multi-agent scalable decision-making in large, partially observable transportation infrastructure environments, outperforming traditional heuristics and rule-based optimization (Saifullah et al., 23 Jan 2024).
- Smart Grids and Electric Vehicle Charging: CTDE-DDPG frameworks yield improved cost efficiency, fairness, and demand-side management in large-scale, privacy-sensitive power grid scenarios. Notably, the local actor networks are deployable without communication at runtime (Shojaeighadikolaei et al., 18 Apr 2024).
- Networked Control: Network-aware truncation and localized aggregation of information in critics allow CTDE to scale to energy grid regulation tasks with hundreds of agents (Xu et al., 2023).
Recent innovations include model-based extensions using imaginary rollouts to improve sample efficiency (as in MAMBA (Egorov et al., 2022)), advanced communication protocols for latent state sharing, and structured memory/knowledge graph systems for scalable, fully decentralized generative agents in open-world environments—the last of which critiques CTDE's rigidity and centralization (Yang et al., 8 Feb 2025).
Open research directions noted across the literature include:
- Adaptive selection of centralized information during training.
- Robustness of value factorization in highly partially observable or non-stationary settings.
- The theoretical limits of policy distillation and teacher-student mismatch.
- Unification with consensus-based, fully decentralized, and permutation-equivariant alternatives (Xu et al., 13 Aug 2025), as well as hybrid approaches (MAGPO (Li et al., 24 Jul 2025)) that seek to systematically transfer centralized advances to decentralized execution.
7. Summary and Outlook
Centralized Training with Decentralized Execution is established as a central paradigm for enabling high-performance, scalable coordination in cooperative MARL. The research landscape demonstrates rapid algorithmic evolution—from factorized Q-learning, actor-critic models, and policy distillation, to advanced intrinsic reward shaping, tacit learning frameworks, and LLM-powered decentralized generative agents—each seeking to overcome the limitations of partial observability, scalability, and rigid centralization. The continued integration of theoretical guarantees (such as IGM satisfaction and policy improvement monotonicity) with algorithmic innovation positions CTDE as both the practical and conceptual fulcrum for advancing distributed intelligence in multi-agent systems (Amato, 4 Sep 2024, Amato, 10 May 2024, Zhao et al., 2022, Chen et al., 2022, Zhou et al., 2023, Hu et al., 2023, Xu et al., 2023, Saifullah et al., 23 Jan 2024, Shojaeighadikolaei et al., 18 Apr 2024, Chen, 2019, Corder et al., 2019, Egorov et al., 2022, Zhang et al., 26 Jun 2024, Liu et al., 20 Dec 2024, Yang et al., 8 Feb 2025, Bozkus et al., 7 Mar 2025, Li et al., 24 Jul 2025, Xu et al., 13 Aug 2025).