
An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning (2409.03052v1)

Published 4 Sep 2024 in cs.LG and cs.MA

Abstract: Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. Many approaches have been developed but they can be divided into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and Decentralized training and execution (DTE). CTDE methods are the most common as they can use centralized information during training but execute in a decentralized manner -- using only information available to that agent during execution. CTDE is the only paradigm that requires a separate training phase where any available information (e.g., other agent policies, underlying states) can be used. As a result, they can be more scalable than CTE methods, do not require communication during execution, and can often perform well. CTDE fits most naturally with the cooperative case, but can be potentially applied in competitive or mixed settings depending on what information is assumed to be observed. This text is an introduction to CTDE in cooperative MARL. It is meant to explain the setting, basic concepts, and common methods. It does not cover all work in CTDE MARL as the subarea is quite extensive. I have included work that I believe is important for understanding the main concepts in the subarea and apologize to those that I have omitted.

Summary

  • The paper introduces CTDE as a framework leveraging centralized training with decentralized execution to improve coordination in MARL.
  • It details various value decomposition methods like VDN, QMIX, and QPLEX to optimize joint action-value functions under partial observability.
  • Centralized critic techniques such as MADDPG and MAPPO are examined to enhance policy updates and address the credit assignment challenge.

An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning

The paper provides an extensive overview of approaches within Centralized Training for Decentralized Execution (CTDE) in cooperative multi-agent reinforcement learning (MARL). It delineates the MARL landscape in terms of three primary paradigms: Centralized Training and Execution (CTE), CTDE, and Decentralized Training and Execution (DTE), and focuses on CTDE given its prevalence and its practical advantages in scalability and performance.

Cooperative Problem: Dec-POMDP

The cooperative MARL problem is formally introduced using the decentralized partially observable Markov decision process (Dec-POMDP) framework. A Dec-POMDP extends the single-agent POMDP to multiple agents that act under partial observability using only local observations. Formally, a Dec-POMDP is characterized by a set of agents, a set of states, per-agent action sets, a state transition function, a single shared reward function, per-agent observation sets, and an observation function. Each agent's policy maps its local action-observation history to actions, and the joint policy combines these individual policies to maximize the expected cumulative shared reward under uncertainty.
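To make the tuple concrete, its components can be collected into a small container. The following Python sketch is purely illustrative; the field names and the tiny two-agent instance are hypothetical, not from the paper:

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

# Illustrative container for a Dec-POMDP tuple <I, S, {A_i}, T, R, {Omega_i}, O>.
# All names here are hypothetical; a real implementation would use an env API.
class DecPOMDP(NamedTuple):
    agents: List[str]                    # I: the set of agents
    states: List[str]                    # S: environment states
    actions: Dict[str, List[str]]        # A_i: per-agent action sets
    transition: Callable[[str, Tuple], Dict[str, float]]  # T(s, joint_a)
    reward: Callable[[str, Tuple], float]                 # R(s, joint_a): shared
    observations: Dict[str, List[str]]   # Omega_i: per-agent observation sets
    observe: Callable[[str, Tuple], Dict[Tuple, float]]   # O(s', joint_a)

# A tiny one-state coordination problem: reward 1 only if both agents pick "b".
toy = DecPOMDP(
    agents=["agent1", "agent2"],
    states=["s0"],
    actions={"agent1": ["a", "b"], "agent2": ["a", "b"]},
    transition=lambda s, ja: {"s0": 1.0},
    reward=lambda s, ja: 1.0 if ja == ("b", "b") else 0.0,
    observations={"agent1": ["o"], "agent2": ["o"]},
    observe=lambda s, ja: {("o", "o"): 1.0},
)
```

Note that there is a single shared reward function rather than per-agent rewards, which is what makes the setting fully cooperative.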

CTDE Overview

The concept of CTDE allows agents to leverage centralized information during training to learn efficient policies while ensuring execution can be performed based on decentralized information alone. By utilizing a shared training phase where collective experiences inform policy updates, CTDE methods often achieve a balance between performance and scalability.

Value Function Factorization Methods

Value-based CTDE methods are categorized based on the way they factorize the joint value function (Q-function) into individual agent-specific value functions.

Value Decomposition Networks (VDN) assume an additive decomposition of the joint Q-function: $Q_{\text{jt}}(\mathbf{h}, \mathbf{a}) \approx \sum_{i=1}^{n} Q_i(h_i, a_i)$, where $\mathbf{h}$ is the joint action-observation history, $\mathbf{a}$ is the joint action, and $Q_i$ is agent $i$'s individual utility.

This simple summation allows each agent to select actions greedily from its individual Q-values during execution, while guaranteeing that the collection of local greedy actions maximizes the joint Q-function learned during training.
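The decentralizability of the additive decomposition can be checked directly: each agent's local argmax, taken independently, recovers the brute-force argmax over the summed joint Q-function. A minimal numpy sketch with toy random Q-tables (not from the paper):

```python
import numpy as np

# Toy per-agent Q-tables Q_i(h_i, .) for a fixed joint history.
rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4
local_qs = [rng.normal(size=n_actions) for _ in range(n_agents)]

# Decentralized execution: each agent argmaxes its own table independently.
decentralized = tuple(int(np.argmax(q)) for q in local_qs)

# Centralized check: brute-force argmax of the summed Q over all joint actions.
joint_shape = (n_actions,) * n_agents
joint_q = np.zeros(joint_shape)
for idx in np.ndindex(joint_shape):
    joint_q[idx] = sum(q[a] for q, a in zip(local_qs, idx))
centralized = np.unravel_index(np.argmax(joint_q), joint_shape)

assert decentralized == tuple(int(a) for a in centralized)
```

The brute-force check enumerates $|A|^n$ joint actions, which is exactly the exponential cost that the additive factorization avoids at execution time.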

QMIX extends the factorization to non-linear monotonic mixing functions: $Q_{\text{jt}}(\mathbf{h}, \mathbf{a}) \approx f_{\text{mono}}\big(Q_1(h_1, a_1), \ldots, Q_n(h_n, a_n)\big)$, where $f_{\text{mono}}$ is monotonically non-decreasing in each agent's Q-value.

The monotonicity constraint ensures that the global argmax action is composed of the local argmax actions, while offering greater flexibility and representational power than VDN.
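The effect of the monotonicity constraint can be sketched with a tiny mixing network whose weights are forced non-negative, so $\partial Q_{\text{jt}} / \partial Q_i \geq 0$. In QMIX the mixing weights are produced by hypernetworks conditioned on the state; the fixed toy weights and Q-values below are hypothetical stand-ins:

```python
import numpy as np

# Toy per-agent Q-values (positive so the ReLU stays in its linear regime).
rng = np.random.default_rng(1)
n_agents, n_actions = 2, 3
local_qs = [rng.uniform(0.1, 1.0, size=n_actions) for _ in range(n_agents)]

# Non-negative mixing weights enforce monotonicity of the mixer.
W1 = np.abs(rng.normal(size=(4, n_agents)))
W2 = np.abs(rng.normal(size=(1, 4)))

def mix(qs):
    # One hidden layer; QMIX uses an ELU, ReLU keeps the sketch simple.
    hidden = np.maximum(W1 @ np.asarray(qs), 0.0)
    return float(W2 @ hidden)

# IGM holds under monotonicity: local greedy actions maximize the mixed value.
decentralized = tuple(int(np.argmax(q)) for q in local_qs)
joint_actions = list(np.ndindex(n_actions, n_actions))
best = max(joint_actions, key=lambda ja: mix([q[a] for q, a in zip(local_qs, ja)]))
assert decentralized == best
```

Because the mixer never decreases when any input increases, raising one agent's Q-value can never lower the joint value, which is precisely why per-agent argmaxes remain globally consistent.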

QTRAN and QPLEX offer more sophisticated decompositions to address the limitations of VDN and QMIX by introducing auxiliary functions and optimization constraints. QPLEX, notably, guarantees the representation of any individual-global-max (IGM) function through an advantage-based IGM principle.

Centralized Critic Methods

Policy gradient methods using centralized critics form another prominent class within CTDE. These methods typically involve centralized value estimation (critic) during training, aiding in the update of decentralized policies (actors).

Multi-Agent DDPG (MADDPG) applies a centralized critic with continuous action spaces and deterministic policies, updating each agent's actor along the gradient: $\nabla_{\theta_i} J = (1-\gamma)\, \mathbb{E}_{\mathbf{h}, \mathbf{a}}\left[ \nabla_{\theta_i} \mu_i(h_i)\, \nabla_{a_i} Q^{\boldsymbol{\mu}}(\mathbf{h}, \mathbf{a}) \big|_{a_i = \mu_i(h_i)} \right]$

COMA introduces a counterfactual baseline to refine credit assignment, evaluating each agent's chosen action against the expected value over that agent's alternatives while fixing the other agents' actions. It is tailored toward improving coordination efficiency in policy gradients, but its use of a state-based critic can be biased under partial observability, making it theoretically suboptimal.
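The counterfactual baseline can be sketched for a two-agent case: agent 1's advantage compares the critic's value for the taken joint action against the expectation over agent 1's own alternatives with agent 2's action held fixed. The critic table and the uniform policy below are toy values, not from the paper:

```python
import numpy as np

# Toy centralized critic output Q(s, a1, a2) as a table over both agents' actions.
rng = np.random.default_rng(2)
n_actions = 3
joint_q = rng.normal(size=(n_actions, n_actions))
pi_1 = np.full(n_actions, 1.0 / n_actions)  # agent 1's current (uniform) policy

a1, a2 = 2, 0  # joint action actually taken

# Counterfactual baseline: marginalize out agent 1's action under its policy
# while keeping agent 2's action fixed at a2.
baseline = float(pi_1 @ joint_q[:, a2])
advantage = float(joint_q[a1, a2]) - baseline
```

The baseline depends only on the other agents' actions and agent 1's policy, so subtracting it reduces variance without biasing agent 1's gradient.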

MAPPO extends PPO to multi-agent settings, using clipped surrogate losses to keep policy updates within a trust-region-like bound: $\mathcal{L}^{\text{MAPPO}}_{\text{clip}}(\pi_i) = \min\left( r_i \hat{A}, \ \text{clip}(r_i, 1-\epsilon, 1+\epsilon) \hat{A} \right)$, where $r_i = \pi_i(a_i \mid h_i) / \pi_i^{\text{old}}(a_i \mid h_i)$ is the per-agent probability ratio and $\hat{A}$ is a joint advantage estimate.
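The clipped surrogate itself is a few lines of numpy; the sketch below shows the standard (MA)PPO clipping applied per agent, with hypothetical inputs:

```python
import numpy as np

# Clipped surrogate objective used by (MA)PPO, applied per agent.
# ratio = pi_new(a|h) / pi_old(a|h); adv is a joint advantage estimate.
def clipped_objective(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the elementwise minimum gives a pessimistic bound on the update,
    # preventing large policy steps in either direction.
    return np.minimum(unclipped, clipped)

# A large ratio with positive advantage is clipped, limiting the step size.
print(clipped_objective(np.array([1.5]), np.array([2.0])))  # -> [2.4]
```

In MAPPO the advantage is typically computed from a centralized value function over the joint (or state) information, while each $\pi_i$ conditions only on agent $i$'s local history.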

Critic Types and Practical Considerations

The choice and implementation of critics greatly impact the empirical performance of MARL algorithms. The paper elucidates potential pitfalls of using state-only critics in partially observable environments, advocating history-based or history-state critics to mitigate bias and maintain theoretical correctness.

Combining Approaches and Future Directions

Contemporary methods also hybridize value factorization with centralized critics. FACMAC incorporates QMIX's value factorization into continuous policy gradient updates, offering more flexible value approximations without monotonic constraints.

Conclusion

The insights provided in the paper address both practical implementations and theoretical implications of CTDE strategies in cooperative MARL. By detailing the main value decomposition methods and centralized critic mechanisms, the discussion sets the foundation for further advances, including improved implementations and novel hybrid approaches. Developing a globally optimal model-free MARL method for Dec-POMDPs remains an open research question poised to push the boundaries of MARL efficiency and scalability.
