
CTDE: Centralized Training for Decentralized Execution

Updated 28 December 2025
  • CTDE is a paradigm that uses centralized training with global information and decentralized execution with local observations to improve multi-agent coordination.
  • Techniques like value decomposition (e.g., QMIX, VDN) and actor-critic methods leverage centralized critics to address credit assignment and ensure policy consistency.
  • Recent extensions such as CADP, intrinsic rewards, and teacher-student frameworks enhance CTDE's scalability, stability, and practical performance in complex control tasks.

Centralized Training for Decentralized Execution (CTDE) is a dominant architectural paradigm in cooperative multi-agent reinforcement learning (MARL). It separates the learning and deployment phases: full global information (e.g., global state, joint actions, or rewards) is exploited during training to facilitate coordination, credit assignment, and stability, while decentralized execution mandates that each agent acts solely on its own local observations or action-observation history. CTDE is formalized in the framework of Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs), and has driven algorithmic innovation across value-based, actor-critic, intrinsic motivation, and credit-assignment approaches. This article provides a comprehensive technical overview of CTDE, recent algorithmic enhancements, its formal properties, representative empirical results, and current theoretical and practical challenges.

1. Formal Framework and Core Principles

A cooperative CTDE system is defined on a Dec-POMDP

$$\langle \mathcal{A}, \mathcal{S}, \mathcal{U}, P, r, \Omega, O, \gamma \rangle$$

where $\mathcal{A} = \{1, \dots, N\}$ denotes the $N$ agents, $s \in \mathcal{S}$ is the joint state, $\mathbf{a} = (a_1, \dots, a_N) \in \mathcal{U}$ is the joint action, $P$ is the state-transition function, $o_i \sim O(s, i)$ is the local observation for agent $i$ drawn from $\Omega$, $r(s, \mathbf{a})$ is a shared team reward, and $\gamma$ is the discount factor.

  • Training phase: Agents (and possibly a centralized critic) can access global information: full state, joint actions, joint observations, etc.
  • Execution phase: Policies $\pi_i(a_i \mid o_i)$ must depend only on the local observation $o_i$ (or action-observation history $\tau_i$); no centralized coordination or inter-agent communication is available. A minimal code sketch of this separation follows the list.
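To make the phase separation concrete, here is a minimal sketch under assumed names (`LocalPolicy`, `CentralizedCritic`, and all layer sizes are illustrative, not taken from any cited work): decentralized actors condition only on local observations, while a training-only critic may additionally consume the global state and the joint action.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalPolicy(nn.Module):
    """Decentralized actor: conditions only on the agent's own observation."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs_i):
        return torch.distributions.Categorical(logits=self.net(obs_i))


class CentralizedCritic(nn.Module):
    """Training-only critic: may see the global state and the joint action."""
    def __init__(self, state_dim, n_agents, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, state, joint_action_onehot):
        return self.net(torch.cat([state, joint_action_onehot], dim=-1))


n_agents, obs_dim, state_dim, n_actions = 3, 8, 16, 4

# Execution phase: each agent samples an action from its local observation only.
policies = [LocalPolicy(obs_dim, n_actions) for _ in range(n_agents)]
local_obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
actions = [pi(o).sample() for pi, o in zip(policies, local_obs)]

# Training phase: the critic additionally consumes global state and joint action.
critic = CentralizedCritic(state_dim, n_agents, n_actions)
state = torch.randn(1, state_dim)
joint_onehot = torch.cat([F.one_hot(a, n_actions).float() for a in actions], dim=-1)
q_joint = critic(state, joint_onehot)  # used only to shape training losses
```

At deployment the critic is simply discarded; only the per-agent policies remain.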

The primary architectural distinction is between value-based methods with function factorization (e.g., VDN, QMIX, QTRAN, QPLEX) and actor-critic methods with centralized critics (e.g., COMA, MADDPG, MAPPO), but both employ centralized information in training and enforce decentralized constraints in execution (Amato, 4 Sep 2024, Amato, 10 May 2024).

2. Value-Decomposition and Policy Consistency

Central to CTDE is the challenge of mapping centralized value functions to decentralized greedy execution. The key principle is the Individual-Global-Max (IGM) condition: the independent maximization of local utility functions yields the same joint action as maximizing the centralized joint utility,

$$\underset{\mathbf{a}}{\arg\max}\; Q_{\rm tot}(\boldsymbol{\tau}, \mathbf{a}) = \Bigl(\underset{a_1}{\arg\max}\; Q_1(\tau_1, a_1), \dots, \underset{a_N}{\arg\max}\; Q_N(\tau_N, a_N)\Bigr).$$
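The payoff matrix below (an illustrative toy example in numpy, not drawn from the cited papers) shows why IGM can fail under a naive factorization: independently maximizing each agent's marginal utility selects a different joint action from the true joint argmax.

```python
import numpy as np

# Non-monotonic two-agent matrix game (rows: agent 1's action, cols: agent 2's).
Q_tot = np.array([[  8., -12., -12.],
                  [-12.,   0.,   0.],
                  [-12.,   0.,   0.]])

# Centralized greedy joint action: maximize the joint value directly.
joint_greedy = np.unravel_index(np.argmax(Q_tot), Q_tot.shape)   # (0, 0), value 8

# Per-agent utilities from averaging over the other agent's actions, the kind of
# marginals an additive factorization tends to learn under uniform exploration.
Q1, Q2 = Q_tot.mean(axis=1), Q_tot.mean(axis=0)
decentralized = (int(np.argmax(Q1)), int(np.argmax(Q2)))          # (1, 1), value 0

print(joint_greedy, decentralized)  # IGM is violated: the two argmaxes disagree
```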

Value Decomposition Networks (VDN):

$$Q_{\rm tot}(\boldsymbol{\tau}, \mathbf{a}) = \sum_{i=1}^N Q_i(\tau_i, a_i)$$

QMIX:

$$Q_{\rm tot} = f_{\rm mix}(Q_1, \ldots, Q_N; s)$$

subject to the monotonicity constraint $\frac{\partial Q_{\rm tot}}{\partial Q_i} \geq 0$ for all $i$.
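Below is a compact sketch of a monotonic mixer in the spirit of QMIX (layer sizes and the exact architecture are assumptions, not the reference implementation): state-conditioned hypernetworks emit the mixing weights, and taking their absolute value enforces $\partial Q_{\rm tot}/\partial Q_i \geq 0$. VDN corresponds to the special case where the mixer is a fixed, state-independent sum.

```python
import torch
import torch.nn as nn


class MonotonicMixer(nn.Module):
    """Mixes per-agent utilities Q_i into Q_tot with state-conditioned,
    non-negative weights, so that dQ_tot/dQ_i >= 0 (QMIX-style monotonicity)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks produce mixing weights/biases from the global state s.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        batch = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(batch, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, -1, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        return (torch.bmm(hidden, w2) + b2).squeeze(-1).squeeze(-1)  # Q_tot


mixer = MonotonicMixer(n_agents=3, state_dim=16)
q_tot = mixer(torch.randn(4, 3), torch.randn(4, 16))  # (batch=4,) joint values
```

The per-agent utilities and the mixer are trained end-to-end with a TD loss on $Q_{\rm tot}$; at execution each agent needs only $\arg\max_{a_i} Q_i(\tau_i, a_i)$.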

Both VDN and QMIX guarantee IGM only under constraints that are sufficient but not necessary, restricting the class of representable joint Q-functions and therefore limiting joint-policy expressiveness (Hu et al., 2023, Marchesini et al., 2021). The Tchebycheff Value-Decomposition Optimization (TVDO) framework resolves this by enforcing a max-norm bias regularizer, yielding a condition that is both necessary and sufficient for IGM without affine or monotonicity restrictions (Hu et al., 2023). Nevertheless, many architectures risk policy inconsistency, wherein decentralized execution may be suboptimal even when the joint value is maximized centrally (Hu et al., 2023, Marchesini et al., 2021).

3. Advanced CTDE Extensions and Remedies

Recent research highlights limitations in standard CTDE—most prominently, the independence assumption, which impairs coordinated exploration and restricts the effective use of global information during training (Zhou et al., 2023).

Centralized Advising and Decentralized Pruning (CADP):

CADP relaxes the independence assumption by providing an explicit, train-time, attention-based communication channel. Each agent encodes $o_i$ into $(q_i, k_i, v_i)$ via linear layers, exchanges $k_j, v_j$ with the other agents, and computes cross-attention to form a team intention $z_i$, producing Q-values via

$$Q_i = \mathrm{MLP}([h_i, f(z_i)])$$

where $h_i$ is the agent's GRU hidden state. A KL-divergence-based pruning loss $\mathcal{L}_p$ ensures that, after a warm-up phase, the attention weights are driven to a one-hot vector, fully suppressing inter-agent advice so that execution is purely decentralized (Zhou et al., 2023). CADP yields substantial empirical gains on the SMAC and GRF benchmarks, improves coordinated exploration, and strictly subsumes the expressiveness of standard CTDE.
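A simplified single-head sketch of this advising mechanism follows (the dimensions, the exact pruning-loss form, and the scheduling are assumptions; see (Zhou et al., 2023) for the full architecture). During training each agent attends over all agents' messages; the pruning term pushes each attention row toward one-hot self-attention so the channel can be removed at execution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdvisingLayer(nn.Module):
    """Train-time cross-attention over agents; prunable to self-attention."""
    def __init__(self, obs_dim, d=32):
        super().__init__()
        self.q = nn.Linear(obs_dim, d)
        self.k = nn.Linear(obs_dim, d)
        self.v = nn.Linear(obs_dim, d)

    def forward(self, obs_all):
        # obs_all: (n_agents, obs_dim) -> per-agent team intention z_i
        Q, K, V = self.q(obs_all), self.k(obs_all), self.v(obs_all)
        attn = F.softmax(Q @ K.t() / K.size(-1) ** 0.5, dim=-1)  # (n, n)
        return attn @ V, attn


def pruning_loss(attn):
    """KL(one_hot_i || attn_i): drives each row toward pure self-attention."""
    n, eps = attn.size(0), 1e-8
    target = torch.eye(n)
    return (target * (torch.log(target + eps)
                      - torch.log(attn + eps))).sum(dim=-1).mean()


layer = AdvisingLayer(obs_dim=8)
obs_all = torch.randn(3, 8)       # three agents' local observations
z, attn = layer(obs_all)          # advice-informed intentions during training
loss_p = pruning_loss(attn)       # added to the objective after a warm-up phase
```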

Intrinsic Rewards and Training Factorization:

The RA-CTDE framework (Zhang et al., 26 Jun 2024) factorizes the global TD objective into per-agent losses, formally establishing gradient equivalence between CTDE and RA-CTDE. This provides a natural interface for integrating agent-specific intrinsic rewards. Action-model-based intrinsic rewards encourage agents to match their realized action distributions to neighbors' predictions, substantially accelerating joint policy consensus and improving sample efficiency under sparse extrinsic rewards.
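As a hedged illustration of an action-model intrinsic reward (the `ActionModel` name, its architecture, and the exact reward form are assumptions; the formulation in (Zhang et al., 26 Jun 2024) differs in detail), each agent receives an extra reward equal to the mean log-likelihood that its neighbors' action models assign to the action it actually took.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionModel(nn.Module):
    """Agent j's model of a neighbor i's action distribution, computed
    from agent j's own local observation."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs_j):
        return F.log_softmax(self.net(obs_j), dim=-1)


def intrinsic_reward(action_i, neighbor_models, neighbor_obs):
    """Mean log-likelihood of agent i's realized action under its neighbors'
    predictions; high when the team 'agrees' on agent i's behavior."""
    log_probs = [m(o)[..., action_i] for m, o in zip(neighbor_models, neighbor_obs)]
    return torch.stack(log_probs).mean()


n_actions, obs_dim = 4, 8
models = [ActionModel(obs_dim, n_actions) for _ in range(2)]   # two neighbors of agent i
neighbor_obs = [torch.randn(obs_dim) for _ in range(2)]
r_int = intrinsic_reward(action_i=2, neighbor_models=models, neighbor_obs=neighbor_obs)
# Per-agent training reward: r_team + beta * r_int, with beta a small coefficient.
```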

Centralized Teacher with Decentralized Student (CTDS):

CTDS introduces a teacher model with access to global state during training to estimate ideal Q-values, and a student model trained to mimic these values using only local information. This architecture directly increases the informativeness of local Q-value targets, boosting credit assignment and robustness to partial observability, while discarding the teacher at execution (Zhao et al., 2022).
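A minimal sketch of the distillation step follows (network shapes and the plain regression loss are illustrative assumptions; the exact targets and training schedule in (Zhao et al., 2022) differ). The teacher, which sees the global state, provides regression targets for a student that sees only local observations and is the only model deployed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions, obs_dim, state_dim = 4, 8, 16

# Teacher: training-only, conditions on the global state.
teacher = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_actions))
# Student: deployable, conditions only on the agent's local observation.
student = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_actions))

state, obs_i = torch.randn(32, state_dim), torch.randn(32, obs_dim)
with torch.no_grad():
    q_teacher = teacher(state)          # "ideal" Q-values from global information
q_student = student(obs_i)              # local estimate

distill_loss = F.mse_loss(q_student, q_teacher)
distill_loss.backward()                 # only the student receives gradients here;
                                        # the teacher is trained with the usual TD loss
                                        # and discarded at execution time
```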

Optimistic $\epsilon$-Greedy Exploration:

Conventional CTDE-based value decomposition underestimates optimal actions if the exploration policy under-samples joint optima. Optimistic $\epsilon$-greedy replaces uniform random exploration with a softmax over optimistic upper bounds learned per agent, ensuring higher sampling probability for optimal actions, better coverage of the value landscape, and improved avoidance of suboptimal equilibria (Zhang et al., 5 Feb 2025).
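A hedged sketch of the action-selection rule (the function name and the way upper bounds are obtained are assumptions; (Zhang et al., 5 Feb 2025) learns the optimistic estimates alongside the Q-network): with probability $\epsilon$ the agent samples from a softmax over optimistic upper bounds rather than uniformly.

```python
import torch
import torch.nn.functional as F


def optimistic_epsilon_greedy(q_values, optimistic_upper_bounds,
                              epsilon=0.1, temperature=1.0):
    """Exploit the greedy action w.p. 1 - epsilon; otherwise explore by sampling
    from a softmax over per-action optimistic upper bounds, not uniformly."""
    if torch.rand(()) > epsilon:
        return int(q_values.argmax())
    probs = F.softmax(optimistic_upper_bounds / temperature, dim=-1)
    return int(torch.multinomial(probs, 1))


q = torch.tensor([0.2, 0.9, 0.1, 0.4])   # current Q estimates
ub = torch.tensor([1.5, 1.0, 0.3, 2.0])  # optimistic upper bounds per action
a = optimistic_epsilon_greedy(q, ub)     # actions with large upper bounds are
                                         # explored more often than low-value ones
```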

4. Credit Assignment and Factorization Challenges

A persistent bottleneck in CTDE is multi-agent credit assignment—determining each agent's true contribution to global reward. The Shapley Counterfactual Credits framework (Li et al., 2021) addresses this by computing the marginal contribution of each agent via a centralized critic $Q_{\rm tot}$, using Shapley value theory. Approximation is achieved via Monte Carlo sampling over agent coalitions, dramatically improving credit assignment fidelity and yielding state-of-the-art performance on challenging coordination tasks.
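The estimator below is an illustrative Monte Carlo sketch of Shapley credit assignment (the toy `coalition_value` stands in for evaluations of the centralized critic $Q_{\rm tot}$ with non-coalition agents replaced by counterfactual or default actions; the cited method's approximations differ): random agent orderings are sampled and each agent's marginal contribution is averaged.

```python
import random


def shapley_credits(agents, coalition_value, n_samples=1000, seed=0):
    """Monte Carlo Shapley values: average marginal contribution of each agent
    over random permutations. coalition_value maps a frozenset of agents to a
    team value (here a toy function; in MARL, an evaluation of Q_tot)."""
    rng = random.Random(seed)
    credits = {a: 0.0 for a in agents}
    for _ in range(n_samples):
        order = agents[:]
        rng.shuffle(order)
        prev, coalition = coalition_value(frozenset()), set()
        for a in order:
            coalition.add(a)
            cur = coalition_value(frozenset(coalition))
            credits[a] += (cur - prev) / n_samples
            prev = cur
    return credits


# Toy team value: agents 0 and 1 are only useful together; agent 2 adds 1 alone.
def coalition_value(c):
    return (5.0 if {0, 1} <= c else 0.0) + (1.0 if 2 in c else 0.0)


print(shapley_credits([0, 1, 2], coalition_value))
# ~{0: 2.5, 1: 2.5, 2: 1.0}: the synergy of agents 0 and 1 is split evenly.
```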

Further, the GDQ architecture (Marchesini et al., 2021) centralizes only the state-value portion of dueling networks, letting value estimates exploit the joint state while per-agent advantages remain local, without imposing restrictive additive or monotonic constraints, and yields performance gains in harder navigation environments.
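A minimal sketch of this dueling split follows (hypothetical class name and layer sizes, not the GDQ reference code): a centralized value head consumes the global state during training, while per-agent advantage heads consume only local observations, so execution-time greedy actions depend only on the local advantages.

```python
import torch
import torch.nn as nn


class CentralValueDueling(nn.Module):
    """Dueling decomposition Q_i(s, o_i, a) = V(s) + A_i(o_i, a) - mean_a A_i(o_i, a):
    the value head is centralized (training-time); the advantage heads are local."""
    def __init__(self, n_agents, state_dim, obs_dim, n_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                   nn.Linear(64, 1))
        self.advantages = nn.ModuleList(
            [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))
             for _ in range(n_agents)])

    def forward(self, state, local_obs):
        v = self.value(state)                                   # (batch, 1)
        qs = []
        for i, obs_i in enumerate(local_obs):
            a = self.advantages[i](obs_i)                       # (batch, n_actions)
            qs.append(v + a - a.mean(dim=-1, keepdim=True))
        return qs  # argmax of each Q_i depends only on A_i, so execution stays local


net = CentralValueDueling(n_agents=3, state_dim=16, obs_dim=8, n_actions=4)
qs = net(torch.randn(2, 16), [torch.randn(2, 8) for _ in range(3)])
```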

5. Actor-Critic Methods and Model-Based CTDE

CTDE can be instantiated in actor-critic frameworks, exemplified by MADDPG and MAPPO (Amato, 4 Sep 2024). These architectures:

  • Maintain decentralized actors
  • Employ a centralized critic parameterized by the global state and joint actions during training
  • At execution, actors act independently using only local information

This enables policy-gradient methods to utilize global credit signals for more stable, convergent training, while supporting continuous action spaces and real-world control scenarios (e.g., EV charging (Shojaeighadikolaei et al., 18 Apr 2024), large-scale voltage control (Xu et al., 2023)).
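The sketch below illustrates this wiring with assumed shapes and a simplified, discrete-action update (it is not the MADDPG or MAPPO reference algorithm): the critic loss fits a centralized action-value from the global state plus joint action probabilities, and each actor ascends that centralized signal while conditioning only on its local observation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_agents, obs_dim, state_dim, n_actions, batch = 3, 8, 16, 4, 32

actors = nn.ModuleList([nn.Linear(obs_dim, n_actions) for _ in range(n_agents)])
critic = nn.Sequential(nn.Linear(state_dim + n_agents * n_actions, 128),
                       nn.ReLU(), nn.Linear(128, 1))
actor_opt = torch.optim.Adam(actors.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

# A fake batch of transitions; `returns` stands in for bootstrapped TD targets.
obs = torch.randn(batch, n_agents, obs_dim)
state = torch.randn(batch, state_dim)
returns = torch.randn(batch, 1)

# Decentralized actors: action probabilities from local observations only.
probs = torch.stack([F.softmax(actors[i](obs[:, i]), dim=-1)
                     for i in range(n_agents)], dim=1)          # (batch, n, A)
joint = probs.reshape(batch, -1)

# Critic step: regress the centralized value toward the targets.
q = critic(torch.cat([state, joint.detach()], dim=-1))
critic_loss = F.mse_loss(q, returns)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor step: ascend the centralized critic's evaluation of the joint policy.
actor_loss = -critic(torch.cat([state, joint], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# At execution only the actors are needed: a_i = argmax_a actors[i](o_i).
```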

Model-based extensions such as MAMBA (Egorov et al., 2022) build on CTDE by learning centralized world models and performing imagined rollouts for planning. These architectures exploit communication during training, while decentralized execution operates with strictly localized, lightweight messages, scaling efficiently to large agent populations and yielding significant reductions in required environment interactions.

6. Scalability, Practical Applications, and Limitations

CTDE has been shown to scale to high-dimensional, resource-constrained cooperative control problems:

  • Infrastructure management: DDMAC-CTDE manages stochastic life-cycle decisions for transportation networks, achieving up to 31% cost reductions versus traditional baselines (Saifullah et al., 23 Jan 2024).
  • Energy systems: CTDE-based actor-critic and truncated network-aware critics demonstrably improve stability, scalability, and fairness in both distributed voltage control (Xu et al., 2023) and EV charging networks (Shojaeighadikolaei et al., 18 Apr 2024).
  • Multi-robot and tactical control: Centralized state-value injection (GDQ) outperforms strict value factorization on navigation and micromanagement benchmarks (Marchesini et al., 2021).

Despite its efficiency and empirical dominance, CTDE faces critical bottlenecks:

  • Centralized critics and mixing networks scale poorly: the joint-action space grows exponentially with agent count and critic input dimensions grow with it, although network-aware truncation mitigates this in sparse settings (Xu et al., 2023).
  • Monotonic or additive value-decomposition limits the richness of learned joint behaviors, especially in environments requiring nonmonotonic or highly coupled policies (Hu et al., 2023, Marchesini et al., 2021).
  • Coordination under partial observability and credit assignment in high-dimensional state-action spaces remain ongoing theoretical challenges.

Recent works such as MAGPO (Li et al., 24 Jul 2025) have introduced more scalable coordination via autoregressive joint policies and explicit alignment between centralized “guider” and decentralized “learner” policies, providing monotonic improvement guarantees and strong empirical results across diverse cooperation benchmarks.

7. Emerging Alternatives and Future Directions

Although CTDE remains dominant, its limitations in scalability, reward engineering, and adaptability are highlighted by emerging paradigms:

  • Fully decentralized agents with structured long-term memory and schema-based communication (e.g., DAMCS (Yang et al., 8 Feb 2025)) show that for open-world, multi-modal, or highly dynamic settings, hierarchical knowledge graphs and adaptive communication can replace centralized critics, enabling effective cooperation without top-down value guidance.
  • Hybrid approaches combining lightweight decentralized value estimates with in-context reasoning or memory compression are proposed for environments where centralized critics become intractable.
  • Intrinsic and consensus-based rewards: The injection of team-consensus signals or action-tendency matching has proven effective both in sample efficiency and in overcoming sparse extrinsic reward models (Zhang et al., 26 Jun 2024).

Open questions for the CTDE community include:

  • The automatic discovery of optimal factoring or graph structure for critics to improve scalability
  • Design of mixing networks that admit necessary and sufficient policy-consistency guarantees without restricting representational power
  • Generalization of CTDE to mixed-motive or competitive settings without sacrificing decentralized feasibility
  • Bridging the divide between CTDE and communication-based MARL to accommodate scenarios where limited or adaptive messaging is viable at execution

CTDE continues to be the reference paradigm for cooperative MARL, supporting both theoretical tractability and robust practical deployment. Algorithmic innovations that mitigate its limitations while preserving its core strengths remain at the forefront of the field (Amato, 4 Sep 2024, Li et al., 24 Jul 2025, Yang et al., 8 Feb 2025, Zhou et al., 2023, Hu et al., 2023, Zhang et al., 26 Jun 2024).
