
Counterfactual Multi-Agent Policy Gradients (1705.08926v3)

Published 24 May 2017 in cs.AI and cs.MA

Abstract: Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.

Authors (5)
  1. Jakob Foerster (101 papers)
  2. Gregory Farquhar (21 papers)
  3. Triantafyllos Afouras (29 papers)
  4. Nantas Nardelli (19 papers)
  5. Shimon Whiteson (122 papers)
Citations (1,898)

Summary

Counterfactual Multi-Agent Policy Gradients: An Overview

The paper "Counterfactual Multi-Agent Policy Gradients" introduces an innovative approach for learning decentralized policies in cooperative multi-agent systems using reinforcement learning (RL). The method, named Counterfactual Multi-Agent (COMA) Policy Gradients, addresses the issues of multi-agent credit assignment and efficient policy learning using a structure that combines centralized training with decentralized execution.

Introduction and Motivation

Multi-agent systems present unique challenges compared to single-agent scenarios, primarily due to the exponential growth of the joint action space as the number of agents increases. Some application domains include network packet routing, autonomous vehicle coordination, and distributed logistics. Traditional actor-critic methods for single-agent RL generally do not scale well to these multi-agent contexts.

The paper emphasizes the necessity for decentralized policies in situations involving partial observability and communication constraints. It proposes COMA as a novel multi-agent actor-critic method leveraging three key ideas: centralized critics, counterfactual baselines, and efficient critic representation.

Methodology

COMA uses a centralized critic to estimate the Q-function and decentralized actors to optimize the agents' policies. The centralized critic is used only during learning; at execution time, each agent's policy (the actor) conditions solely on its local action-observation history.
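
To make the decentralized-execution side concrete, the sketch below shows a recurrent actor that conditions only on its local observation and previous action, with a GRU hidden state carrying the action-observation history. The class name and layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DecentralisedActor(nn.Module):
    """Minimal sketch of a COMA-style decentralised actor.

    Each agent sees only its own observation and previous action; the GRU
    hidden state summarises the local action-observation history tau^a.
    Layer sizes here are illustrative assumptions.
    """

    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, last_action_onehot, hidden):
        x = torch.relu(self.fc_in(torch.cat([obs, last_action_onehot], dim=-1)))
        hidden = self.rnn(x, hidden)
        logits = self.fc_out(hidden)
        # pi^a(. | tau^a): the agent's decentralised policy and updated history state
        return torch.softmax(logits, dim=-1), hidden
```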

Centralized Critic

The centralized critic conditions on the global state and the joint action, unlike independent actor-critic (IAC) methods, in which each agent's critic sees only local information. Because the critic is needed only during training, it can exploit this global information to evaluate actions more accurately without compromising decentralized execution.
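
In the notation of the paper, with global state s, joint action u, and agent a's action-observation history \tau^a, a centralized critic would naively yield a per-agent policy-gradient term of the standard actor-critic form (shown here only as a reference point):

    g_a = \nabla_{\theta} \log \pi^a(u^a \mid \tau^a) \, Q(s, \mathbf{u})

Because Q(s, u) is evaluated on the joint action, this signal does not isolate an individual agent's contribution to the team return; that gap is exactly what the counterfactual baseline below addresses.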

Counterfactual Baseline

To address the multi-agent credit assignment problem, COMA introduces a counterfactual baseline. The baseline marginalizes out a single agent's action under that agent's current policy while holding the other agents' actions fixed; subtracting it from the joint-action Q-value yields an agent-specific advantage. Unlike traditional difference rewards, this requires neither extra simulations nor hand-designed default actions.
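
Concretely, the counterfactual advantage for agent a compares the value of the joint action u with a baseline that marginalizes out agent a's own action under its current policy, keeping the other agents' actions u^{-a} fixed:

    A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid \tau^a) \, Q\big(s, (\mathbf{u}^{-a}, u'^a)\big)

Each actor then follows the gradient \nabla_{\theta} \log \pi^a(u^a \mid \tau^a) \, A^a(s, \mathbf{u}), so an agent is credited only to the extent that its chosen action outperformed what its own policy would have done on average.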

Efficient Critic Representation

COMA employs a critic network architecture that computes the Q-values for all actions of a single agent in one forward pass. This keeps the computational overhead of the counterfactual baseline low and helps the method scale.
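
The sketch below illustrates the idea: the critic conditions on the global state and the other agents' actions, and its output head returns one Q-value per candidate action of the queried agent, so the counterfactual baseline reduces to a dot product with that agent's policy. Class names, input layout, and layer sizes are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class COMACritic(nn.Module):
    """Sketch of a centralised critic with a COMA-style output head.

    The queried agent's own action is excluded from the input; the output
    is a vector of Q-values, one per candidate action, so a single forward
    pass suffices to form the counterfactual baseline. Sizes are assumptions.
    """

    def __init__(self, state_dim, n_agents, n_actions, hidden_dim=128):
        super().__init__()
        # global state + one-hot actions of the other agents + one-hot agent id
        in_dim = state_dim + (n_agents - 1) * n_actions + n_agents
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # Q(s, (u^{-a}, .)) for each action of agent a
        )

    def forward(self, state, other_actions_onehot, agent_id_onehot):
        x = torch.cat([state, other_actions_onehot, agent_id_onehot], dim=-1)
        return self.net(x)  # shape: (batch, n_actions)


def counterfactual_advantage(q_all_actions, policy_probs, taken_action):
    """A^a(s, u) = Q(s, u) - sum_{u'} pi^a(u' | tau^a) Q(s, (u^{-a}, u'))."""
    q_taken = q_all_actions.gather(1, taken_action.unsqueeze(1)).squeeze(1)
    baseline = (policy_probs * q_all_actions).sum(dim=1)
    return q_taken - baseline
```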

Experimental Setup

COMA is evaluated in the challenging context of StarCraft unit micromanagement, particularly under conditions of significant partial observability and limited agent field of view. This benchmark poses a complex environment with high-dimensional state-action spaces and delayed rewards.

Results and Comparisons

COMA demonstrates superior performance compared to several baselines, including independent actor-critic methods (IAC-Q and IAC-V), centralized-V, and centralized-QV critics. The results on four StarCraft scenarios (3m, 5m, 5w, and 2d_3z) highlight COMA's efficacy in improving win rates and learning stability.

Key Results:

  • 3m Scenario: COMA achieved 87% mean win rate, outperforming the nearest baseline.
  • 5m Scenario: COMA reached an 81% mean win rate, significantly higher than IAC-Q and IAC-V.
  • 5w Scenario: COMA led with an 82% win rate, indicating robustness across varied scenarios.
  • 2d_3z Scenario: Although this was the most difficult scenario, COMA still showed substantial improvement, reaching a 47% win rate.

Implications and Future Directions

The COMA method has important theoretical and practical implications. Theoretically, it provides a robust framework for multi-agent RL in decentralized settings, addressing the credit assignment problem effectively. Practically, it shows promise in complex real-world situations like self-driving cars and robotic coordination tasks.

Future work aims to extend COMA to handle large-scale multi-agent scenarios with improved sample efficiency. Architectures for factored critics and enhanced exploration strategies will be critical to scaling the method further.

Conclusion

The paper establishes Counterfactual Multi-Agent Policy Gradients as an effective method for decentralized multi-agent RL, combining centralized training with decentralized execution. By addressing multi-agent credit assignment with a counterfactual baseline that is computed efficiently within the critic, COMA sets a new benchmark in the field.