An Overview of QPLEX: Duplex Dueling Multi-Agent Q-Learning
QPLEX is a value-based multi-agent reinforcement learning (MARL) method that addresses key challenges of the centralized training with decentralized execution (CTDE) paradigm. The core difficulty in cooperative multi-agent systems is enabling effective local decision-making while preserving coordination among agents. Existing methods often either restrict the representation capacity of their value function class or relax the Individual-Global-Max (IGM) principle to achieve scalability, which can lead to instability or suboptimal performance in complex environments.
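For reference, the IGM condition invoked throughout can be stated as follows (using notation common in this literature, not necessarily the paper's exact symbols): a joint action-value function $Q_{tot}$ and per-agent utilities $Q_1, \dots, Q_n$ satisfy IGM when

$$\arg\max_{\boldsymbol{a}} Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}) \;=\; \Big(\arg\max_{a_1} Q_1(\tau_1, a_1),\; \dots,\; \arg\max_{a_n} Q_n(\tau_n, a_n)\Big),$$

where $\boldsymbol{\tau}$ is the joint action-observation history and $\tau_i$ is agent $i$'s local history. This is the property that lets decentralized greedy action selection recover the centralized greedy joint action.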
Core Contributions and Methodology
QPLEX proposes a network architecture that satisfies the IGM principle by construction through a duplex dueling structure, factorizing the joint value function so that the decentralized greedy policies remain consistent with the centralized one during both training and execution. The architecture combines joint and individual dueling decompositions, and its representation capacity is theoretically shown to realize the complete IGM function class.
The paper details how QPLEX transforms individual action-value functions into a joint action-value function via a transformation network followed by a dueling mixing network. This transformation leverages a multi-head attention mechanism that assigns importance weights to the local action-value contributions while preserving consistency with the jointly greedy policy.
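To make the factorization concrete, here is a minimal NumPy sketch of the duplex dueling combination. It is illustrative only: the function name is an invented placeholder, and the hand-picked positive weights w and lam stand in for the state-conditioned, multi-head-attention-produced weights that QPLEX actually learns.

```python
import numpy as np

def duplex_dueling_qtot(individual_qs, actions, w, lam):
    """Combine per-agent Q-values into a joint Q under a duplex dueling form.

    individual_qs: list of 1-D arrays, Q_i(tau_i, .) over each agent's actions
    actions:       chosen action index per agent
    w, lam:        strictly positive mixing weights (stand-ins for attention outputs)
    """
    v_i = [q.max() for q in individual_qs]                            # V_i = max_a Q_i
    a_i = [q[a] - v for q, a, v in zip(individual_qs, actions, v_i)]  # A_i <= 0
    v_tot = sum(wi * vi for wi, vi in zip(w, v_i))                    # joint value
    a_tot = sum(li * ai for li, ai in zip(lam, a_i))                  # joint advantage, <= 0
    return v_tot + a_tot                                              # Q_tot = V_tot + A_tot

# Two agents with three actions each; the weights are arbitrary but positive.
qs = [np.array([1.0, 3.0, 2.0]), np.array([0.5, -1.0, 4.0])]
w, lam = [1.0, 1.0], [0.7, 1.3]

# Because lam > 0 and each A_i <= 0 (zero only at the individual greedy action),
# the joint argmax recovers the per-agent argmaxes -- the IGM consistency that
# the architecture enforces by construction.
joint = {(a1, a2): duplex_dueling_qtot(qs, (a1, a2), w, lam)
         for a1 in range(3) for a2 in range(3)}
print(max(joint, key=joint.get))           # -> (1, 2)
print(tuple(int(q.argmax()) for q in qs))  # -> (1, 2)
```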
Empirical Results and Observations
The empirical evaluation of QPLEX covers both didactic problems and challenging benchmark tasks, most notably the StarCraft II micromanagement suite. The experiments show that QPLEX outperforms state-of-the-art MARL methods such as QMIX, VDN, Qatten, QTRAN, and WQMIX in both online and offline data collection settings.
In the didactic scenarios, particularly the matrix games, QPLEX demonstrates its ability to achieve optimal policies by leveraging the full expressiveness of the IGM class, outperforming other methods that lack this capability. On the StarCraft II tasks, QPLEX shows a marked improvement in performance metrics such as test win rates and learning speed, notably excelling in scenarios demanding high coordination amid partial observability.
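To illustrate why this expressiveness matters, the sketch below fits the best purely additive (VDN-style) decomposition to a QTRAN-style non-monotonic payoff matrix and shows that its greedy joint action misses the coordinated optimum, whereas a full-IGM function class can represent it. The payoff values are illustrative and may differ from the exact matrix used in the paper's experiments.

```python
import numpy as np

# A QTRAN-style non-monotonic matrix game (illustrative payoffs).
payoff = np.array([[  8.0, -12.0, -12.0],
                   [-12.0,   0.0,   0.0],
                   [-12.0,   0.0,   0.0]])

# Best additive fit Q1(a1) + Q2(a2) in the least-squares sense.
row_mean, col_mean, grand = payoff.mean(1), payoff.mean(0), payoff.mean()
additive_fit = row_mean[:, None] + col_mean[None, :] - grand

true_opt = np.unravel_index(payoff.argmax(), payoff.shape)
vdn_opt = np.unravel_index(additive_fit.argmax(), additive_fit.shape)
print(true_opt)  # (0, 0): the coordinated optimum with payoff 8
print(vdn_opt)   # a suboptimal joint action: the additive class cannot
                 # represent this payoff structure, while a full-IGM class can
```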
Theoretical and Practical Implications
The theoretical analysis confirms that QPLEX maintains IGM consistency without the constraint relaxations employed by approaches such as QTRAN. Because the duplex dueling architecture can represent the complete IGM function class, QPLEX achieves stable learning dynamics and efficient scalability, which are particularly important in domains with large state-action spaces and partial observability.
Additionally, QPLEX's demonstrated ability to learn effectively from offline datasets, without additional online exploration, holds significant promise for MARL applications where data collection is costly or prohibitive.
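As a schematic of what "offline" means here, the sketch below runs Q-learning purely from a fixed log of transitions and never queries the environment once the log has been collected. The tiny synthetic MDP and single-agent tabular learner are illustrative stand-ins for the multi-agent StarCraft II setup, not the authors' pipeline; only the offline-versus-online distinction is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 2, 2, 0.9, 0.1

def env_step(s, a):
    # Simple known dynamics, used only to *generate* the log beforehand.
    s_next = (s + a) % n_states
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return reward, s_next

# A behaviour policy logs transitions once; afterwards the environment is unused.
dataset = []
for _ in range(500):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    r, s_next = env_step(s, a)
    dataset.append((s, a, r, s_next))

# Offline Q-learning: repeatedly sweep the fixed dataset, with no new exploration.
Q = np.zeros((n_states, n_actions))
for _ in range(50):
    for s, a, r, s_next in dataset:
        Q[s, a] += lr * (r + gamma * Q[s_next].max() - Q[s, a])

print(Q.argmax(axis=1))  # greedy policy recovered purely from logged data
```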
Future Directions
Extending QPLEX's underlying principles to continuous action spaces is a promising avenue for future research, for example by adapting its scalable architecture to continuous control tasks in robotics or autonomous driving.
Overall, QPLEX represents a significant step forward in MARL, combining solid theoretical underpinnings with strong empirical performance. By fully adhering to the IGM principle within the CTDE framework, QPLEX provides both a practical foundation for applications and a strong baseline for future research in cooperative multi-agent systems.