CTDE: Centralized Training & Distributed Execution

Updated 10 January 2026
  • CTDE is a framework that employs centralized training using global state and joint actions while ensuring decentralized execution for scalable multi-agent coordination.
  • It leverages methods such as value function factorization and centralized-critic actor-critic to optimize agent policies and enhance credit assignment during training.
  • Empirical benchmarks demonstrate CTDE's effectiveness in domains like team games, robotics swarms, and distributed energy systems, highlighting its real-world applicability.

Centralized Training with Distributed Execution (CTDE) is a dominant paradigm in cooperative multi-agent reinforcement learning (MARL), enabling agents to exploit global information during training while constraining policies to rely only on decentralized local data at deployment. CTDE has catalyzed advances across value function factorization, centralized-critic actor-critic, safe multi-agent control, heterogeneous information personalization, and robust coordination; it is foundational to state-of-the-art methods for domains ranging from complex team games to robot swarms, infrastructure management, and distributed energy systems (Zhang et al., 2024, Shojaeighadikolaei et al., 2023, Shojaeighadikolaei et al., 2024, Zhao et al., 2022, Zhou et al., 2023, Marchesini et al., 2021, Park et al., 2022, Zhang et al., 21 Apr 2025, Lv et al., 2024, Chen et al., 2022, Egorov et al., 2022, Cohen et al., 2024, Saifullah et al., 2024, Leroy et al., 2022, Amato, 2024, Amato, 2024, Li et al., 24 Jul 2025). The following sections dissect CTDE's mathematical underpinnings, canonical algorithmic instantiations, theoretical properties, recent methodological innovations, practical deployment, and empirical benchmarks.

1. Formal Framework and Mathematical Definition

CTDE operates in cooperative Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs), characterized by an agent set $\mathcal{N}$, global state $s \in \mathcal{S}$, per-agent observations $o_i \in \mathcal{O}_i$, joint action $\bm a = (a_1,\dots,a_N) \in \mathcal{A}$, state transitions $\mathcal{P}(s'|s,\bm a)$, and a shared team reward $r^{ext} = \mathcal{R}(s,\bm a)$. Agents maintain local policies $\pi_i(a_i \mid \tau_i)$, where $\tau_i$ is the action-observation history. Training is centralized: algorithms leverage the global state, joint actions, and possibly other agents' trajectories, typically through a mixing network $\mathcal{F}(Q_1,\dots,Q_N,s;\phi)$ or a centralized critic $Q^\omega(s,\bm a)$; execution is decentralized, with each agent acting solely on local information (Zhang et al., 2024, Amato, 2024, Amato, 2024).
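Concretely, and using only the symbols above plus a discount factor $\gamma$ (an assumed addition not defined in the text), the cooperative objective shared by CTDE methods is to maximize the expected discounted team return,

$\max_{\pi_1,\dots,\pi_N} \; \mathbb{E}\big[\textstyle\sum_{t \ge 0} \gamma^t\, r^{ext}_t\big]$,

subject to each $\pi_i$ conditioning only on its local history $\tau_i$ at execution time.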

Canonical CTDE value factorization is exemplified by QMIX: $Q_{tot}(\bm\tau,\bm a;\theta,\phi) = \mathcal{F}(Q_1(\tau_1,a_1),\dots,Q_N(\tau_N,a_N),s;\phi)$, trained with the global TD-loss $\mathcal{L}^G(\theta,\phi) = \mathbb{E}_{\mathcal{D}}\left[\left(r^{ext} + \gamma \max_{\bm a'} Q_T(\bm\tau',\bm a') - \mathcal{F}(Q_1,\dots,Q_N,s;\phi)\right)^2\right]$, where $Q_T$ is a target network copy (Zhang et al., 2024). The Individual-Global-Max (IGM) principle requires the joint greedy action of $Q_{tot}$ to coincide with each agent's individually greedy action (Marchesini et al., 2021, Leroy et al., 2022). Variants incorporate monotonic mixing, dueling decompositions, and attention-augmented cross-agent value propagation.
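As a minimal sketch of this factorization, the following QMIX-style monotonic mixer (PyTorch assumed) uses hypernetworks to generate state-conditioned, non-negative mixing weights, which keeps $Q_{tot}$ monotone in each $Q_i$ and hence IGM-compatible. Layer widths, the omission of separate target networks, and all variable names are illustrative simplifications, not the exact architecture of the cited work.

# Illustrative QMIX-style monotonic mixing network (PyTorch assumed).
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks map the global state s to the mixer's weights;
        # taking |w| enforces monotonicity of Q_tot in each Q_i.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (b, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)                     # Q_tot: (b,)

# Global TD-loss L^G matching the formula above; agent_qs_next_max holds each agent's
# greedy next-step Q-value (a valid joint max under IGM). Target networks are omitted.
def qmix_td_loss(mixer, agent_qs, state, reward, agent_qs_next_max, state_next, gamma=0.99):
    q_tot = mixer(agent_qs, state)
    with torch.no_grad():
        target = reward + gamma * mixer(agent_qs_next_max, state_next)
    return ((target - q_tot) ** 2).mean()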

2. Principal Algorithmic Classes and Their Formulation

CTDE spans multiple algorithmic families:

  • Value Function Factorization: VDN, QMIX, QPLEX, QTRAN, GDQ; each learns per-agent action-value networks $Q_i(\tau_i,a_i)$, which are composed into a joint $Q_{tot}$ through mixing parameterizations, typically enforcing IGM monotonicity constraints (Amato, 2024, Marchesini et al., 2021, Leroy et al., 2022, Amato, 2024).
  • Centralized-Critic Actor-Critic: MADDPG, COMA, MAPPO, FACMAC, DDMAC-CTDE. A shared critic $Q^\omega(s,\bm a)$ or $V(s)$ conditions on the global state and joint actions, whereas the actors/policies $\pi_i(a_i \mid \tau_i)$ are updated with gradients backpropagated through the centralized value estimate; see the centralized-critic sketch after the training/execution loop below (Shojaeighadikolaei et al., 2023, Shojaeighadikolaei et al., 2024, Saifullah et al., 2024, Amato, 2024).
  • Hybrid, Model-Based, and Safety-Constrained Extensions: MAMBA leverages shared recurrent state-space models for imagined rollouts and world-model-based policy updates (Egorov et al., 2022), while Def-MARL applies epigraph-form constrained optimization to achieve zero-violation safe multi-robot coordination with distributed execution (Zhang et al., 21 Apr 2025).

A typical training/execution loop is:

# Centralized training (access to global state, joint actions, team reward)
for each episode:
    collect joint transitions (s, {o_i}, {a_i}, r, s')
    for each agent i:
        update local Q_i or policy π_i via gradients from the centralized Q_tot or Q^ω
    update the mixing network or centralized critic

# Decentralized execution (local observations/histories only)
for each agent i:
    observe o_i, compute action a_i = argmax_a Q_i(o_i, a) or sample a_i ~ π_i(·|o_i)
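To make the centralized-critic branch concrete, the hedged sketch below shows a MADDPG-flavored setup (PyTorch assumed, continuous actions): a centralized critic conditioned on the global state and all agents' actions, and per-agent actors conditioned only on local observations. Network sizes, the deterministic Tanh actors, the detaching of other agents' current actions, and all names are illustrative assumptions rather than the exact formulation of any cited paper.

# Hedged sketch of a centralized critic with decentralized actors (MADDPG-style).
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: sees only its own observation o_i."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Centralized critic: sees the global state and all agents' actions (training only)."""
    def __init__(self, state_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + joint_act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1)).squeeze(-1)

def actor_loss(critic, actors, state, obs_per_agent, i):
    """Policy objective for agent i, backpropagated through the centralized critic."""
    actions = [actor(obs) for actor, obs in zip(actors, obs_per_agent)]
    # Other agents' actions are treated as fixed (detached) when updating agent i;
    # this is one common simplification of the joint-action input.
    actions = [act if j == i else act.detach() for j, act in enumerate(actions)]
    return -critic(state, torch.cat(actions, dim=-1)).mean()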

3. Intrinsic Credit Assignment, Policy Consistency, and Recent Extensions

The granularity of credit assignment and policy consistency under CTDE has received focused study:

  • Reward-Additive CTDE and Intrinsic Rewards: The RA-CTDE formulation decomposes the global TD-loss into $N$ per-agent TD-losses, each of which can absorb a personalized intrinsic reward; gradient equivalence with the global loss ensures that this decomposition leaves training unaffected. Intrinsic Action Tendency Consistency introduces action models so agents can predict and align with their neighbors' action tendencies, accelerating consensus and mitigating the sample inefficiency caused by divergence among decentralized policies (Zhang et al., 2024).
  • Centralized Advising and Decentralized Pruning (CADP): Cross-attention networks permit rich inter-agent advising, followed by a KL-pruning regularizer that anneals communication back to strict decentralization, enabling efficient centralized exploration and strict independence at execution (Zhou et al., 2023).
  • Teacher-Student Distillation (CTDS, PTDE): Centralized "teacher" networks leverage the global state to learn a strong value decomposition, and this guidance is distilled into decentralized "student" Q-functions or actors. PTDE personalizes the global information supplied to each agent and employs a two-stage offline distillation, yielding practical decentralized policies with minimal performance degradation and outperforming the use of unified, non-personalized global information; a minimal distillation sketch follows this list (Zhao et al., 2022, Chen et al., 2022).
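As an illustration of the teacher-student pattern, the hedged sketch below distills a state-conditioned teacher Q-network into a history-conditioned student that can run fully decentralized. The mean-squared distillation loss, network shapes, and names are simplifying assumptions, not the exact CTDS or PTDE objectives.

# Hedged teacher-student distillation sketch (not the exact CTDS/PTDE losses).
import torch
import torch.nn as nn

class TeacherQ(nn.Module):
    """Teacher: conditions on the agent's history AND the global state (training only)."""
    def __init__(self, hist_dim, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hist_dim + state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, hist, state):
        return self.net(torch.cat([hist, state], dim=-1))

class StudentQ(nn.Module):
    """Student: conditions only on the local history, so it can execute decentralized."""
    def __init__(self, hist_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hist_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, hist):
        return self.net(hist)

def distillation_loss(teacher, student, hist, state):
    """Regress the student's local Q-values onto the (frozen) teacher's targets."""
    with torch.no_grad():
        target = teacher(hist, state)
    return ((student(hist) - target) ** 2).mean()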

4. Theoretical Properties: Gradient Equivalence, Variance-Bias Trade-offs, and Monotonic Improvement

CTDE's theoretical guarantees address gradient consistency, exploration variance, and efficiency:

  • Gradient Equivalence: When agents carry distinct (non-shared) parameters, the gradients of the global CTDE loss $\mathcal{L}^G$ match those of the per-agent RA-CTDE losses $\mathcal{L}^E_i$; integrating intrinsic rewards does not disturb convergence (Zhang et al., 2024).
  • Variance and Bias in Actor-Critic Methods: Centralized critics yield policy gradients that are unbiased and match independent-critic schemes in expectation, but with strictly higher variance due to the inclusion of other agents' action randomness; nevertheless, centralized critics stabilize coordination and mitigate nonstationarity (the two gradient estimators are contrasted after this list) (Shojaeighadikolaei et al., 2024).
  • Monotonic Policy Improvement: MAGPO leverages a centralized auto-regressive guiding policy and a decentralized learner policy regularized through KL constraints, with provable monotonic improvements per iteration under exact projections, attaining guarantees absent in prior CTDE approaches (Li et al., 24 Jul 2025).
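To make the bias/variance point explicit, the two per-agent estimators can be written schematically in the notation above (the expectation identity assumes the decentralized critic approximates the marginal of the centralized one over the other agents' policies):

$g_i^{\text{cen}} = \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i)\, Q^\omega(s, \bm a)$ versus $g_i^{\text{dec}} = \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i)\, Q_i(\tau_i, a_i)$,

with $\mathbb{E}[g_i^{\text{cen}}] = \mathbb{E}[g_i^{\text{dec}}]$ whenever $Q_i(\tau_i, a_i) \approx \mathbb{E}_{\bm a_{-i} \sim \bm\pi_{-i}}[Q^\omega(s, \bm a)]$, while $g_i^{\text{cen}}$ carries the additional variance of the other agents' action randomness, consistent with the claim above.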

5. Practical Implementation and Scalability: Information, Communication, Safety, and Parameterization

CTDE methods vary in the degree and handling of centralized information and communication during training versus deployment:

  • Centralized Training Data Requirements: Algorithms require full joint transition data (states, actions, observations, rewards) to optimize mixing networks or centralized critics; in realistic settings this demands synchronous simulation or secure data-aggregation infrastructure (a schematic joint transition record is sketched after this list) (Shojaeighadikolaei et al., 2023, Amato, 2024).
  • Decentralized Execution: After training, policies and Q-functions are deployed locally, and agents act independently with access only to their local observations/histories, sometimes augmented by purely local communication or personalized distilled information (Zhang et al., 2024, Zhao et al., 2022, Chen et al., 2022, Lv et al., 2024).
  • Sample Efficiency and Scalability: Model-based CTDE (MAMBA) supports scalable imaginary rollouts, reducing environment interaction by an order of magnitude in multi-agent domains (Egorov et al., 2022). Distributed optimal control (Def-MARL) provides scalable safe control via decentralized 1D epigraph optimization, attaining zero-violation under hard constraints with stable learning (Zhang et al., 21 Apr 2025).
  • Safety and Constraints: CTDE has been extended to safe CMDPs by leveraging distributed epigraph reformulations and decentralized constraint solving, yielding high safety rates and near-optimal global costs in robotics and infrastructure management (Zhang et al., 21 Apr 2025, Saifullah et al., 2024).
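As a concrete illustration of this information split, the hedged sketch below defines a joint transition record as a centralized replay buffer might store it, alongside the strictly local view an agent retains at deployment; all field names are illustrative assumptions.

# Hedged sketch: what centralized training stores vs. what execution may use.
from dataclasses import dataclass
from typing import List

@dataclass
class JointTransition:
    """One step of joint experience, available only to the centralized trainer."""
    state: List[float]                 # global state s
    observations: List[List[float]]    # per-agent observations {o_i}
    actions: List[int]                 # joint action {a_i}
    reward: float                      # shared team reward r^ext
    next_state: List[float]            # s'
    next_observations: List[List[float]]

@dataclass
class LocalView:
    """What agent i is allowed to condition on at execution time."""
    observation: List[float]           # o_i only (or the local history τ_i)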

6. Empirical Benchmarks and Applications

CTDE's empirical performance has been validated on diverse domains:

  • Multi-Agent Coordination Benchmarks: StarCraft Multi-Agent Challenge (SMAC), Google Research Football (GRF), Level-Based Foraging, Multi-Agent Particle Environments (MPE) (Zhang et al., 2024, Zhou et al., 2023, Chen et al., 2022, Li et al., 24 Jul 2025).
  • Cyber-Physical Networks: Distributed EV charging—CTDE-DDPG lowers total demand variation and cost compared to fully decentralized baselines, preserves privacy, and scales with agent count (Shojaeighadikolaei et al., 2023, Shojaeighadikolaei et al., 2024).
  • Robotics and Swarms: LIA_MADDPG achieves robust and scalable dynamic task allocation in robot swarms via centralized learning with local aggregation modules, significantly outperforming non-MARL and traditional MARL baselines (Lv et al., 2024).
  • Infrastructure and Safe Control: DDMAC-CTDE delivers strict constraint satisfaction and up to 31% cost reduction on large-scale transportation infrastructure compared to optimized condition-based and rule-based policies (Saifullah et al., 2024).
  • Competitive/Mixed Team Games: Population-based CTDE training in symmetric Markov games fosters robust teams facing dynamic opponent strategies, with value-factorized mixers (QMIX, QVMix) maintaining superior generalization over skill-augmented explorers (MAVEN) (Leroy et al., 2022).

7. Future Directions and Open Challenges

CTDE remains an active field with ongoing innovations:

  • Addressing the independence assumption and discovering optimal forms of centralized information for coordinated exploration and robust credit assignment (Zhou et al., 2023, Chen et al., 2022).
  • Enhancing theoretical guarantees for monotonic improvement and global optimality under partial observability and decentralized information limits (Li et al., 24 Jul 2025).
  • Scaling to extremely large agent populations and heterogeneously equipped teams, including communication-efficient protocols and hierarchical extensions (Egorov et al., 2022, Lv et al., 2024).
  • Extending safety-constrained frameworks to adversarial, non-cooperative, or competitive settings, integrating distributed constraint-solving with robust MARL (Zhang et al., 21 Apr 2025, Amato, 2024).

CTDE provides a mathematically principled compromise between global sample-efficient learning and fully distributed execution, foundational for modern cooperative MARL algorithms across domains. Its recent methodological advances—RA-CTDE, teacher-student distillation, centralized advising, model-based extensions, and monotonic policy improvement—address core limitations of earlier approaches and continue to shape the field’s evolution.
