
Centralized Training and Distributed Execution

Updated 25 March 2026
  • CTDE is a paradigm in cooperative multi-agent reinforcement learning that leverages centralized training with global state information and decentralized execution based on local observations to address nonstationarity and credit assignment challenges.
  • Key methodologies include value decomposition and mixing architectures such as VDN and QMIX, along with centralized-critic/decentralized-actor models that stabilize joint policy learning.
  • Practical applications span from StarCraft challenges to EV charging and traffic control, demonstrating CTDE's scalability and robust performance in complex, multi-agent environments.

Centralized Training and Distributed Execution (CTDE) is the principal paradigm in cooperative multi-agent reinforcement learning (MARL) for reconciling coordinated learning under global information with the requirement of decentralized policies at execution time, particularly under partial observability and communication constraints. CTDE methods exploit the centralized, global state or joint observations during the training phase to address nonstationarity and credit assignment, while ensuring that each agent executes a strictly decentralized policy based only on its local information during online deployment. This structure has established CTDE as the foundation of scalable, robust MARL in challenging domains, notably outperforming purely centralized or decentralized alternatives across a spectrum of tasks (Amato, 2024, Amato, 2024).

1. Formal Structure of the CTDE Paradigm

CTDE operates within the Dec-POMDP framework, where $N$ agents interact with an environment evolving in a hidden global state $s_t \in \mathcal{S}$. At each step, agent $i$ receives a local observation $o^i_t \sim O_i(\cdot \mid s_t)$, forms a local history $\tau^i_t$, and selects an action $a^i_t \in \mathcal{A}^i$ according to its policy, yielding a global team reward $r_t = r(s_t, \mathbf{a}_t)$ (Zhao et al., 2022, Amato, 2024).

During the centralized training phase, additional information such as the full state, joint actions, or joint histories is made available:

  • Centralized critics (e.g., $Q_{\text{tot}}(\boldsymbol{\tau}_t, \mathbf{a}_t; \theta, \phi)$), value mixers, or hypernetworks can exploit the global view.
  • Experience is commonly collected in a joint buffer, facilitating stable, off-policy updates that address nonstationarity.

At decentralized execution, each agent's policy relies exclusively on its own local information:

  • $a^i_t = \arg\max_{a} Q^i(\tau^i_t, a; \theta^i)$, or $a^i_t \sim \pi^i(a \mid \tau^i_t)$.
  • No access to global state, joint actions, or other agents' observations is permitted.

This bifurcated information structure is foundational to CTDE and distinguishes it from both fully centralized and fully decentralized MARL approaches (Amato, 2024, Amato, 2024, Leroy et al., 2022).
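This bifurcated information structure can be made concrete with a toy numpy sketch of the execution side: each agent computes greedy actions from its own observation alone. All weights, dimensions, and names here are illustrative placeholders, not taken from any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, N_ACTIONS, OBS_DIM = 3, 4, 5

# Stand-ins for the per-agent utilities Q^i(tau^i, a; theta^i):
# random linear maps from a local observation to action values.
theta = [rng.normal(size=(OBS_DIM, N_ACTIONS)) for _ in range(N_AGENTS)]

def decentralized_act(local_obs):
    """Each agent picks argmax_a Q^i(tau^i, a) from ITS OWN observation only;
    no global state or other agents' observations appear anywhere."""
    return [int(np.argmax(obs @ theta[i])) for i, obs in enumerate(local_obs)]

obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
joint_action = decentralized_act(obs)
```

The absence of any shared argument across agents in `decentralized_act` is exactly the deployment-time constraint CTDE enforces.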

2. Value Decomposition and Mixing Architectures

The dominant methodological instantiation of CTDE uses value decomposition to ensure that decentralized policies can maximize the team objective by solving per-agent subproblems, under structural constraints that preserve consistency with the centralized perspective.

Core CTDE Value Decomposition Methods

  • VDN:

$$Q_{\text{tot}}(\boldsymbol{\tau}, \mathbf{a}) = \sum_{i=1}^{N} Q_i(\tau^i, a^i; \theta_i)$$

Simple additivity admits independent maximization by each agent.
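That additivity claim can be checked numerically: under a sum decomposition, independent per-agent argmax recovers the same joint action as brute-force search over the joint action space. The utility values below are arbitrary toy numbers.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N_AGENTS, N_ACTIONS = 3, 4
# Toy per-agent utilities Q_i(tau^i, a^i) for some fixed histories.
q_i = rng.normal(size=(N_AGENTS, N_ACTIONS))

# Decentralized: each agent independently maximizes its own utility.
decentral = tuple(int(np.argmax(q_i[i])) for i in range(N_AGENTS))

# Centralized: exhaustive search over all N_ACTIONS**N_AGENTS joint actions
# of the additive Q_tot -- exactly what VDN's structure makes unnecessary.
q_tot = lambda a: sum(q_i[i, a[i]] for i in range(N_AGENTS))
central = max(itertools.product(range(N_ACTIONS), repeat=N_AGENTS), key=q_tot)
```

Here `decentral == central`, while the exhaustive search cost grows exponentially in the number of agents.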

  • QMIX:

Uses a monotonic mixing network:

$$Q_{\text{tot}}(\boldsymbol{\tau}, s, \mathbf{a}) = f_{\text{mix}}\left(Q_1, \ldots, Q_N; s\right)$$

with $\frac{\partial Q_{\text{tot}}}{\partial Q_i} \geq 0$, allowing maximization of $Q_{\text{tot}}$ via individual max operations (Zhao et al., 2022, Leroy et al., 2022).
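A minimal sketch of the monotonic mixer, assuming untrained toy hypernetwork weights: taking absolute values of hypernetwork outputs keeps the mixing weights non-negative, which (with a monotone activation) guarantees the partial-derivative condition. The real QMIX uses ELU activations and learned biases; this simplification uses ReLU and omits biases.

```python
import numpy as np

rng = np.random.default_rng(2)
N_AGENTS, STATE_DIM, HIDDEN = 3, 6, 8

# Toy hypernetworks: map the global state to mixing-network weights.
W1_hyper = rng.normal(size=(STATE_DIM, N_AGENTS * HIDDEN))
W2_hyper = rng.normal(size=(STATE_DIM, HIDDEN))

def qmix_mix(q_agents, state):
    """Monotonic mixer: |.| on hypernetwork outputs forces non-negative
    mixing weights, so dQ_tot/dQ_i >= 0 for every agent i."""
    w1 = np.abs(state @ W1_hyper).reshape(N_AGENTS, HIDDEN)  # non-negative
    w2 = np.abs(state @ W2_hyper)                            # non-negative
    hidden = np.maximum(q_agents @ w1, 0.0)  # ReLU here; ELU in the paper
    return float(hidden @ w2)
```

Because every weight is non-negative and ReLU is non-decreasing, raising any single agent's $Q_i$ can never decrease the mixed output.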

  • QTRAN, QPLEX:

Extend the function class for $Q_{\text{tot}}$ via more complex architectural decompositions, ensuring the IGM (Individual–Global–Max) property (Leroy et al., 2022, Amato, 2024).

A representative training loss is
$$\mathcal{L}_{\text{TD}}(\theta, \phi) = \mathbb{E}_{\text{batch}}\Big[\big(r_t + \gamma \max_{\mathbf{a}'} Q_{\text{tot}}(\boldsymbol{\tau}_{t+1}, \mathbf{a}'; \theta^-, \phi^-) - Q_{\text{tot}}(\boldsymbol{\tau}_t, \mathbf{a}_t; \theta, \phi)\big)^2\Big]$$
with delayed target networks $(\theta^-, \phi^-)$, as in DQN-style off-policy value backups (Zhao et al., 2022, Amato, 2024).

During execution, only the decentralized component $Q_i(\tau^i_t, a^i)$ is retained per agent.
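The TD loss can be sketched as follows, with `q_tot` standing in for the online mixer and `q_tot_target` for its frozen copy with parameters $(\theta^-, \phi^-)$. Every name and the toy value functions are illustrative, not a specific library API.

```python
import itertools
import numpy as np

GAMMA, N_AGENTS, N_ACTIONS = 0.99, 2, 3
JOINT_ACTIONS = list(itertools.product(range(N_ACTIONS), repeat=N_AGENTS))

def td_loss(batch, q_tot, q_tot_target):
    """Mean squared TD error with a delayed target network.

    batch: iterable of (tau_t, a_t, r_t, tau_next) transitions;
    q_tot / q_tot_target: callables (history, joint_action) -> float.
    """
    errs = []
    for tau_t, a_t, r_t, tau_next in batch:
        # The max over joint actions is tractable in practice because
        # IGM-style decompositions reduce it to per-agent maxima.
        target = r_t + GAMMA * max(q_tot_target(tau_next, a)
                                   for a in JOINT_ACTIONS)
        errs.append((target - q_tot(tau_t, a_t)) ** 2)
    return float(np.mean(errs))

# Toy usage: trivial value functions keyed only on the joint action.
q = lambda tau, a: 0.1 * sum(a)          # online network stand-in
q_minus = lambda tau, a: 0.1 * sum(a)    # frozen target copy
loss = td_loss([("h0", (0, 1), 1.0, "h1")], q, q_minus)
```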

3. Centralized-Critic and Decentralized-Actor Methods

In continuous action settings and complex partially observable domains, a prevalent CTDE pattern is the centralized-critic, decentralized-actor structure:

  • Actors: Each agent $i$ learns an individual policy $\mu_i(o_i; \theta_i)$.
  • Centralized Critic: A function $Q^\mu(o_1, \ldots, o_N, a_1, \ldots, a_N)$ or $V(\mathbf{o}, \mathbf{a})$ is trained on full joint state-action information and provides value gradients to update the decentralized actors (Shojaeighadikolaei et al., 2023, Shojaeighadikolaei et al., 2024).

This architecture underpins methods such as MADDPG, MAPPO, and LIA_MADDPG, supporting robust coordination and mitigating the nonstationarity induced by concurrently learning teammates (Saifullah et al., 2024, Lv et al., 2024).

During deployment, only the actor is executed, consuming a local observation; the global critic is discarded.
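The train/deploy asymmetry can be sketched with toy, untrained linear networks (all shapes and names are illustrative): the critic consumes the joint observation-action vector, while each actor's forward pass touches only local inputs.

```python
import numpy as np

rng = np.random.default_rng(3)
N_AGENTS, OBS_DIM, ACT_DIM = 2, 4, 2

# Decentralized actors mu_i(o_i; theta_i): one linear map per agent.
actors = [rng.normal(size=(OBS_DIM, ACT_DIM)) for _ in range(N_AGENTS)]
# Centralized critic Q(o_1..o_N, a_1..a_N): exists at TRAINING time only.
critic_w = rng.normal(size=N_AGENTS * (OBS_DIM + ACT_DIM))

def act(i, o_i):
    """Execution path: agent i uses only its own observation."""
    return np.tanh(o_i @ actors[i])

def critic(all_obs, all_acts):
    """Training path: scores the concatenated joint observation-action."""
    return float(np.concatenate(all_obs + all_acts) @ critic_w)

obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
acts = [act(i, o) for i, o in enumerate(obs)]
q_joint = critic(obs, acts)  # drives actor gradients, then is discarded
```

At deployment one would ship only `actors`; `critic_w` never leaves the training loop.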

4. Enhancements and Extensions: Knowledge Distillation, Personalized Information, and Communication

Recent work demonstrates several extensions to the vanilla CTDE framework to address expressiveness, performance retention, or richer training-time coordination.

  • Centralized Teacher/Decentralized Student (CTDS): A centralized “teacher” with access to the full state computes per-agent Q-values, then a decentralized “student” is trained to mimic these via supervised distillation, enabling local policies to internalize global information (Zhao et al., 2022). CTDS outperforms VDN/QMIX baselines by 15–17 percentage points in win rate on SMAC tasks.
  • Personalized Training with Distilled Execution (PTDE): Each agent receives personalized global information during training via an Agent-Hyper Network; these features are distilled into a local-only student model for decentralized execution, yielding up to 89% retention of teacher performance in StarCraft and Football benchmarks (Chen et al., 2022).
  • Centralized Advising and Decentralized Pruning (CADP): Explicit cross-agent advice (attention) channels are allowed during training for enhanced policy exploration but are pruned through a KL penalty to ensure strict decentralization at execution (Zhou et al., 2023). CADP yields up to 70 percentage point win-rate gains over standard QMIX in super-hard SMAC scenarios.
  • Communication-Aware Attentional Models: Extensions encode both centralized critic and message-passing between agents at training (and, conditionally, at execution), using self-attention to aggregate entity and peer information over communication-constrained graphs (Fan et al., 17 Mar 2026).
  • Region-Based Semi-Centralized CTDE: In traffic control, a region-based splitting reduces scope of centralization, sharing parameters only among tightly-coupled agent groups, yielding superior scalability and performance (Yazdani et al., 4 Dec 2025).
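The teacher/student distillation shared by CTDS and PTDE reduces, at its core, to a supervised regression from teacher targets (computed with global information) onto student outputs (computed from local inputs only). A minimal sketch, with MSE as one common choice of distillation loss and all values toy placeholders:

```python
import numpy as np

def distill_loss(teacher_q, student_q):
    """Supervised distillation objective: the local-only student regresses
    onto per-agent Q-value targets produced by a globally informed teacher."""
    teacher = np.asarray(teacher_q, dtype=float)
    student = np.asarray(student_q, dtype=float)
    return float(np.mean((teacher - student) ** 2))

# Toy targets: teacher per-agent Q-values vs. the student's current outputs.
loss = distill_loss([1.0, 0.5, -0.2], [0.8, 0.6, 0.0])
```

Minimizing this loss lets the student internalize globally informed behavior while remaining strictly decentralized at execution.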

Empirical ablations consistently show that principled integration of centralized features with careful design of decentralized execution leads to faster, more stable, and significantly stronger policies (Zhao et al., 2022, Chen et al., 2022, Yazdani et al., 4 Dec 2025, Lv et al., 2024).

5. Theoretical Properties, Bias–Variance Trade-offs, and Practical Considerations

CTDE provides strong theoretical and practical advantages:

  • Nonstationarity Mitigation: Centralized critics see the joint state/action space, rendering each agent's policy change observable and stabilizing value updates (Shojaeighadikolaei et al., 2024).
  • Credit Assignment: Mixing networks and centralized value functions propagate team rewards to individual agents effectively, solving the credit assignment problem in cooperative tasks (Zhao et al., 2022, Saifullah et al., 2024).
  • Bias-Variance Analysis: While policy-gradient estimators using centralized critics are unbiased, they often have higher variance due to conditioning on other agents' exploratory actions; this increases training complexity but improves the stability and final solution quality (Shojaeighadikolaei et al., 2024).
  • Scalability: CTDE scales favorably compared to fully centralized control, whose joint action space grows exponentially in the number of agents, provided mixing architectures or attention modules are designed with scalability in mind (Amato, 2024, Amato, 2024).
  • Privacy and Robustness: Decentralized execution preserves agent privacy in scenarios such as EV charging or infrastructure management while maintaining near-centralized performance (Shojaeighadikolaei et al., 2023, Saifullah et al., 2024).

Table: Representative Gains of CTDE Variants

Scenario               | CTDE Method   | Performance Impact
SMAC (StarCraft)       | CTDS, PTDE    | +15–17 pp (CTDS); 73–89% PRR (PTDE)
EV Charging            | CTDE-DDPG     | –36% total variation, –9% cost
Traffic Control        | SEMI-CTDE     | Lower wait/travel/queue than baselines
Robot Swarm Allocation | LIA_MADDPG    | +10–20% utility, >90% dominance rate
Channel Allocation     | CARLTON-CTDE  | Within 2–3% of centralized optimum

Abbreviations: pp = percentage points; PRR = performance retention ratio (Zhao et al., 2022, Chen et al., 2022, Shojaeighadikolaei et al., 2023, Yazdani et al., 4 Dec 2025, Lv et al., 2024, Cohen et al., 2024).

6. Limitations, Misconceptions, and Domains of Application

Expressivity Constraints: The additivity or monotonicity imposed by mixing networks (e.g., QMIX) limits the class of joint value functions that can be represented, potentially missing some cooperative behaviors (Marchesini et al., 2021, Leroy et al., 2022).

Partial Observability: Critics/mixers must condition on information consistent with what agents use for decision-making to avoid biased value estimates, especially in partially observable settings (Amato, 2024, Amato, 2024).

Communication Assumptions: CTDE does not require communication at execution; any such messaging is only present in some enhanced training paradigms (e.g., CADP attention or communication-aware GNN modules), and pruning/removal assures decentralization in deployed policies (Zhou et al., 2023, Fan et al., 17 Mar 2026).

Scalability: The size and structure of centralized critics present practical limits; scalable variants (local aggregation, regional partitioning) address this for networks of tens to hundreds of agents (Yazdani et al., 4 Dec 2025, Lv et al., 2024).

Application Domains: CTDE has been validated across StarCraft Multi-Agent Challenge, Google Research Football, dynamic channel allocation in wireless networks, EV charging, large-scale infrastructure management, adaptive traffic-signal control, and distributed multi-robot and UAV swarm tasks (Zhao et al., 2022, Shojaeighadikolaei et al., 2023, Saifullah et al., 2024, Lv et al., 2024, Yazdani et al., 4 Dec 2025, Fan et al., 17 Mar 2026, Cohen et al., 2024).

7. Research Directions and Advanced Algorithmic Variants

CTDE has catalyzed architectural innovation:

  • MAGPO integrates a centralized auto-regressive joint policy (“guider”) with explicit KL projection to decentralized learners, ensuring theoretical monotonic policy improvement and state-of-the-art performance, bridging the gap between CTDE and fully centralized methods (Li et al., 24 Jul 2025).
  • Epigraph-Form CTDE (Def-MARL): Formulates multi-robot safety-critical CMDPs using an epigraph transformation to enable distributed solution of global constraints under CTDE, provably preserving centralized-optimality (Zhang et al., 21 Apr 2025).
  • Hybrid Approaches: Semi-centralized, region-based CTDE, communication-aware partial message passing with neural attention, and personalized global feature distillation extend CTDE’s flexibility and empirical reach (Yazdani et al., 4 Dec 2025, Chen et al., 2022, Fan et al., 17 Mar 2026).

Empirical and theoretical advances continue to refine the CTDE toolkit, producing more expressive and robust decentralized multi-agent solutions leveraging the strengths of centralized learning infrastructures.


In summary, CTDE is the cornerstone paradigm for scalable cooperative MARL: it resolves the trade-offs between centralized coordination and decentralized autonomy via staged information architectures, enables expressive and robust value function learning, and provides a generic substrate for state-of-the-art agent design in high-stakes, multi-agent systems (Amato, 2024, Amato, 2024, Zhao et al., 2022, Chen et al., 2022, Li et al., 24 Jul 2025).
