
An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning (2409.03052v1)

Published 4 Sep 2024 in cs.LG and cs.MA

Abstract: Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. Many approaches have been developed but they can be divided into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and Decentralized training and execution (DTE). CTDE methods are the most common as they can use centralized information during training but execute in a decentralized manner -- using only information available to that agent during execution. CTDE is the only paradigm that requires a separate training phase where any available information (e.g., other agent policies, underlying states) can be used. As a result, they can be more scalable than CTE methods, do not require communication during execution, and can often perform well. CTDE fits most naturally with the cooperative case, but can be potentially applied in competitive or mixed settings depending on what information is assumed to be observed. This text is an introduction to CTDE in cooperative MARL. It is meant to explain the setting, basic concepts, and common methods. It does not cover all work in CTDE MARL as the subarea is quite extensive. I have included work that I believe is important for understanding the main concepts in the subarea and apologize to those that I have omitted.

Summary

  • The paper introduces CTDE as a framework leveraging centralized training with decentralized execution to improve coordination in MARL.
  • It details various value decomposition methods like VDN, QMIX, and QPLEX to optimize joint action-value functions under partial observability.
  • Centralized critic techniques such as MADDPG and MAPPO are examined to enhance policy updates and address the credit assignment challenge.

An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning

The paper provides an extensive overview of approaches within Centralized Training for Decentralized Execution (CTDE) in cooperative multi-agent reinforcement learning (MARL). It delineates the MARL landscape in terms of three primary paradigms: Centralized Training and Execution (CTE), CTDE, and Decentralized Training and Execution (DTE), and focuses on CTDE given its prevalence and its practical advantages in scalability and performance.

Cooperative Problem: Dec-POMDP

The cooperative MARL problem is formally introduced using the decentralized partially observable Markov decision process (Dec-POMDP) framework. A Dec-POMDP extends the single-agent POMDP to multiple agents that act under partial observability using only local observations. Formally, a Dec-POMDP is characterized by a set of agents, a set of states, per-agent action sets, a state transition function, a single shared reward function, per-agent observation sets, and an observation function. Each agent's policy maps its local action-observation history to actions, and the joint policy combines these individual policies to maximize the expected cumulative shared reward under uncertainty.
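To make the tuple concrete, its components can be collected into a small container. The following Python sketch is purely illustrative; the field names and the tiny two-agent instance are hypothetical, not from the paper:

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

# Illustrative container for a Dec-POMDP tuple <I, S, {A_i}, T, R, {Omega_i}, O>.
# All names here are hypothetical; a real implementation would use an env API.
class DecPOMDP(NamedTuple):
    agents: List[str]                    # I: the set of agents
    states: List[str]                    # S: environment states
    actions: Dict[str, List[str]]        # A_i: per-agent action sets
    transition: Callable[[str, Tuple], Dict[str, float]]  # T(s, joint_a)
    reward: Callable[[str, Tuple], float]                 # R(s, joint_a): shared
    observations: Dict[str, List[str]]   # Omega_i: per-agent observation sets
    observe: Callable[[str, Tuple], Dict[Tuple, float]]   # O(s', joint_a)

# A tiny one-state coordination problem: reward 1 only if both agents pick "b".
toy = DecPOMDP(
    agents=["agent1", "agent2"],
    states=["s0"],
    actions={"agent1": ["a", "b"], "agent2": ["a", "b"]},
    transition=lambda s, ja: {"s0": 1.0},
    reward=lambda s, ja: 1.0 if ja == ("b", "b") else 0.0,
    observations={"agent1": ["o"], "agent2": ["o"]},
    observe=lambda s, ja: {("o", "o"): 1.0},
)
```

Note that there is a single shared reward function rather than per-agent rewards, which is what makes the setting fully cooperative.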

CTDE Overview

The concept of CTDE allows agents to leverage centralized information during training to learn efficient policies while ensuring execution can be performed based on decentralized information alone. By utilizing a shared training phase where collective experiences inform policy updates, CTDE methods often achieve a balance between performance and scalability.

Value Function Factorization Methods

Value-based CTDE methods are categorized based on the way they factorize the joint value function (Q-function) into individual agent-specific value functions.

Value Decomposition Networks (VDN) assume an additive decomposition of the joint Q-function: $Q_{\text{jt}}(\mathbf{h}, \mathbf{a}) \approx \sum_{i=1}^{n} Q_i(h_i, a_i)$, where $\mathbf{h}$ is the joint action-observation history, $\mathbf{a}$ is the joint action, and $Q_i$ is agent $i$'s individual utility.

This simple summation allows each agent to select actions greedily from its individual Q-values during execution, while guaranteeing that the collection of local greedy actions maximizes the joint Q-function learned during training.
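The decentralizability of the additive decomposition can be checked directly: each agent's local argmax, taken independently, recovers the brute-force argmax over the summed joint Q-function. A minimal numpy sketch with toy random Q-tables (not from the paper):

```python
import numpy as np

# Toy per-agent Q-tables Q_i(h_i, .) for a fixed joint history.
rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4
local_qs = [rng.normal(size=n_actions) for _ in range(n_agents)]

# Decentralized execution: each agent argmaxes its own table independently.
decentralized = tuple(int(np.argmax(q)) for q in local_qs)

# Centralized check: brute-force argmax of the summed Q over all joint actions.
joint_shape = (n_actions,) * n_agents
joint_q = np.zeros(joint_shape)
for idx in np.ndindex(joint_shape):
    joint_q[idx] = sum(q[a] for q, a in zip(local_qs, idx))
centralized = np.unravel_index(np.argmax(joint_q), joint_shape)

assert decentralized == tuple(int(a) for a in centralized)
```

The brute-force check enumerates $|A|^n$ joint actions, which is exactly the exponential cost that the additive factorization avoids at execution time.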

QMIX extends the factorization to non-linear monotonic mixing functions: $Q_{\text{jt}}(\mathbf{h}, \mathbf{a}) \approx f_{\text{mono}}\big(Q_1(h_1, a_1), \ldots, Q_n(h_n, a_n)\big)$, where $f_{\text{mono}}$ is monotonically non-decreasing in each agent's Q-value.

The monotonicity constraint ensures that the global argmax action is composed of the local argmax actions, while offering greater flexibility and representational power than VDN.
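The effect of the monotonicity constraint can be sketched with a tiny mixing network whose weights are forced non-negative, so $\partial Q_{\text{jt}} / \partial Q_i \geq 0$. In QMIX the mixing weights are produced by hypernetworks conditioned on the state; the fixed toy weights and Q-values below are hypothetical stand-ins:

```python
import numpy as np

# Toy per-agent Q-values (positive so the ReLU stays in its linear regime).
rng = np.random.default_rng(1)
n_agents, n_actions = 2, 3
local_qs = [rng.uniform(0.1, 1.0, size=n_actions) for _ in range(n_agents)]

# Non-negative mixing weights enforce monotonicity of the mixer.
W1 = np.abs(rng.normal(size=(4, n_agents)))
W2 = np.abs(rng.normal(size=(1, 4)))

def mix(qs):
    # One hidden layer; QMIX uses an ELU, ReLU keeps the sketch simple.
    hidden = np.maximum(W1 @ np.asarray(qs), 0.0)
    return float(W2 @ hidden)

# IGM holds under monotonicity: local greedy actions maximize the mixed value.
decentralized = tuple(int(np.argmax(q)) for q in local_qs)
joint_actions = list(np.ndindex(n_actions, n_actions))
best = max(joint_actions, key=lambda ja: mix([q[a] for q, a in zip(local_qs, ja)]))
assert decentralized == best
```

Because the mixer never decreases when any input increases, raising one agent's Q-value can never lower the joint value, which is precisely why per-agent argmaxes remain globally consistent.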

QTRAN and QPLEX offer more sophisticated decompositions to address the limitations of VDN and QMIX by introducing auxiliary functions and optimization constraints. QPLEX, notably, guarantees the representation of any individual-global-max (IGM) function through an advantage-based IGM principle.

Centralized Critic Methods

Policy gradient methods using centralized critics form another prominent class within CTDE. These methods typically involve centralized value estimation (critic) during training, aiding in the update of decentralized policies (actors).

Multi-Agent DDPG (MADDPG) applies a centralized critic with continuous action spaces and deterministic policies, updating each agent's actor along the gradient: $\nabla_{\theta_i} J = (1-\gamma)\, \mathbb{E}_{\mathbf{h}, \mathbf{a}}\left[ \nabla_{\theta_i} \mu_i(h_i)\, \nabla_{a_i} Q^{\boldsymbol{\mu}}(\mathbf{h}, \mathbf{a}) \big|_{a_i = \mu_i(h_i)} \right]$

COMA introduces a counterfactual baseline to refine credit assignment, evaluating each agent's chosen action against the expected value over that agent's alternatives while fixing the other agents' actions. It is tailored toward improving coordination efficiency in policy gradients, but its use of a state-based critic can be biased under partial observability, making it theoretically suboptimal.
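The counterfactual baseline can be sketched for a two-agent case: agent 1's advantage compares the critic's value for the taken joint action against the expectation over agent 1's own alternatives with agent 2's action held fixed. The critic table and the uniform policy below are toy values, not from the paper:

```python
import numpy as np

# Toy centralized critic output Q(s, a1, a2) as a table over both agents' actions.
rng = np.random.default_rng(2)
n_actions = 3
joint_q = rng.normal(size=(n_actions, n_actions))
pi_1 = np.full(n_actions, 1.0 / n_actions)  # agent 1's current (uniform) policy

a1, a2 = 2, 0  # joint action actually taken

# Counterfactual baseline: marginalize out agent 1's action under its policy
# while keeping agent 2's action fixed at a2.
baseline = float(pi_1 @ joint_q[:, a2])
advantage = float(joint_q[a1, a2]) - baseline
```

The baseline depends only on the other agents' actions and agent 1's policy, so subtracting it reduces variance without biasing agent 1's gradient.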

MAPPO extends PPO to multi-agent settings, using clipped surrogate losses to keep policy updates within a trust-region-like bound: $\mathcal{L}^{\text{MAPPO}}_{\text{clip}}(\pi_i) = \min\left( r_i \hat{A}, \ \text{clip}(r_i, 1-\epsilon, 1+\epsilon) \hat{A} \right)$, where $r_i = \pi_i(a_i \mid h_i) / \pi_i^{\text{old}}(a_i \mid h_i)$ is the per-agent probability ratio and $\hat{A}$ is a joint advantage estimate.
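The clipped surrogate itself is a few lines of numpy; the sketch below shows the standard (MA)PPO clipping applied per agent, with hypothetical inputs:

```python
import numpy as np

# Clipped surrogate objective used by (MA)PPO, applied per agent.
# ratio = pi_new(a|h) / pi_old(a|h); adv is a joint advantage estimate.
def clipped_objective(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the elementwise minimum gives a pessimistic bound on the update,
    # preventing large policy steps in either direction.
    return np.minimum(unclipped, clipped)

# A large ratio with positive advantage is clipped, limiting the step size.
print(clipped_objective(np.array([1.5]), np.array([2.0])))  # -> [2.4]
```

In MAPPO the advantage is typically computed from a centralized value function over the joint (or state) information, while each $\pi_i$ conditions only on agent $i$'s local history.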

Critic Types and Practical Considerations

The choice and implementation of critics greatly impact the empirical performance of MARL algorithms. The paper elucidates potential pitfalls of using state-only critics in partially observable environments, advocating history-based or history-state critics to mitigate bias and maintain theoretical correctness.

Combining Approaches and Future Directions

Contemporary methods also hybridize value factorization with centralized critics. FACMAC incorporates QMIX's value factorization into continuous policy gradient updates, offering more flexible value approximations without monotonic constraints.

Conclusion

The insights provided in the paper address both practical implementations and theoretical implications of CTDE strategies in cooperative MARL. By detailing the main value decomposition methods and centralized critic mechanisms, the discussion sets the foundation for further advances, including improved implementations and novel hybrid approaches. Developing a globally optimal model-free MARL method for Dec-POMDPs remains an open research question poised to push the boundaries of MARL efficiency and scalability.
