Trust Region Methods in Multi-Agent Reinforcement Learning
The paper introduces a novel approach to trust region policy optimisation (TRPO) in the context of multi-agent reinforcement learning (MARL). Unlike single-agent RL, where trust region methods come with monotonic improvement guarantees, MARL poses unique challenges: agents' policy updates can pull in conflicting directions, even in cooperative games. To address these complexities, the authors extend the theory of trust region learning to MARL and present two new algorithms, Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO).
Key Contributions
1. Multi-agent Advantage Decomposition Lemma:
A central theoretical contribution is the multi-agent advantage decomposition lemma, which decomposes the joint advantage function into a sum of individual agents' advantage functions. It holds for general cooperative Markov games, without assumptions on the decomposability of the joint value function or on parameter sharing among agents. The decomposition is pivotal to resolving conflicting policy updates in MARL: it guarantees a direction of performance improvement when agent policies are updated sequentially, as stated below.
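Restated in (approximately) the paper's notation, the lemma says that for any ordered subset of agents i_1, ..., i_m, the joint advantage telescopes into per-agent advantages, each conditioned on the actions already chosen by the agents earlier in the ordering:

```latex
% Multi-agent advantage decomposition lemma (restated from the paper; notation approximate).
% A^{i_{1:m}}_{\pi} is the joint advantage of agents i_{1:m} under joint policy \pi;
% each summand conditions on the actions a^{i_{1:j-1}} of the agents ordered before i_j.
\[
  A^{i_{1:m}}_{\pi}\!\left(s,\, \mathbf{a}^{i_{1:m}}\right)
  \;=\;
  \sum_{j=1}^{m} A^{i_j}_{\pi}\!\left(s,\, \mathbf{a}^{i_{1:j-1}},\, a^{i_j}\right).
\]
```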
2. Sequential Policy Update Scheme:
The paper argues against updating all agents simultaneously, as is common in MARL. Instead, it proposes a sequential policy update in which each agent optimises its policy while accounting for the cumulative updates of the agents that preceded it. This both ensures monotonic performance improvement and brings to MARL the kind of theoretical guarantees that trust region methods enjoy in single-agent RL; see the sketch after this item.
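A minimal Python sketch of the scheme follows. The log_prob and update_agent callbacks are hypothetical placeholders (standing in for "evaluate this agent's policy on the batch" and "run one TRPO/PPO-style improvement step"), not the authors' implementation; the key point is that the weight m carries the importance ratios of every agent already updated in the current iteration:

```python
import numpy as np

def sequential_update(n_agents, joint_adv, log_prob, update_agent, rng):
    """Sketch of a sequential policy-update scheme (hypothetical callbacks).

    log_prob(k)              -> (batch,) log-probs of agent k's sampled actions
                                under its current policy.
    update_agent(k, weights) -> one TRPO/PPO-style improvement step for agent k
                                on the weighted advantage signal.
    joint_adv                -> (batch,) estimate of the joint advantage.
    """
    order = rng.permutation(n_agents)               # random update order each iteration
    m = np.asarray(joint_adv, dtype=float).copy()   # weight seen by the first agent
    for k in order:
        old_logp = log_prob(k)        # before this agent's update
        update_agent(k, m)            # agent k improves against weight m
        new_logp = log_prob(k)        # after this agent's update
        # fold agent k's importance ratio into the weight used by the next agent
        m = m * np.exp(new_logp - old_logp)
    return m
```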
3. Novel Algorithms (HATRPO and HAPPO):
The extension of trust region principles to MARL yields the HATRPO and HAPPO algorithms. Unlike many existing MARL algorithms, they require neither parameter sharing among agents nor restrictive assumptions on the joint value function. HATRPO retains a KL-divergence constraint on each agent's update, whereas HAPPO adopts a PPO-style clipping objective, offering a more computationally efficient alternative (a sketch of the per-agent clipped objective follows). Both algorithms come with theoretically justified monotonic improvement guarantees, supported by empirical evaluations on benchmarks such as Multi-Agent MuJoCo and StarCraft II.
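As a rough illustration of the HAPPO side (not the authors' code), the per-agent objective can be written as a standard PPO clipped surrogate in which the advantage has already been multiplied by the importance ratios of the previously updated agents; the argument names below are assumptions of the sketch:

```python
import torch

def happo_agent_loss(new_logp, old_logp, weighted_adv, clip_eps=0.2):
    """PPO-style clipped surrogate for one agent in a HAPPO-like update (sketch).

    new_logp / old_logp: (batch,) log-probabilities of the sampled actions under
    the agent's current and behaviour policies.
    weighted_adv: (batch,) joint-advantage estimate already multiplied by the
    importance ratios of the agents updated earlier in the sequence.
    Returns a loss to minimise (negative of the surrogate to maximise).
    """
    ratio = torch.exp(new_logp - old_logp)                                    # per-sample ratio
    unclipped = ratio * weighted_adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * weighted_adv
    return -torch.min(unclipped, clipped).mean()
```

Minimising this loss for each agent in turn, in the sequential order described above, gives a HAPPO-style update; HATRPO would instead solve a KL-constrained step for each agent.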
Numerical Results
The paper provides thorough experimental validation showing that both HATRPO and HAPPO outperform established baselines such as IPPO, MAPPO, and MADDPG across a range of tasks. On the Multi-Agent MuJoCo and StarCraft II benchmarks, the proposed algorithms set a new state of the art, with consistent improvements over these competitive baselines.
Implications and Future Directions
Theoretical Implications:
The paper's contributions include a deeper understanding of the role of sequential policy updates in MARL, offering a structured approach to achieving monotonic improvement for heterogeneous agents. This fundamentally challenges and expands existing paradigms in MARL where parameter sharing was often a simplifying necessity.
Practical Implications:
The empirical success of HATRPO and HAPPO on diverse tasks shows the potential of these algorithms in complex multi-agent systems beyond the tested benchmarks. It opens avenues for applying MARL in real-world scenarios where agents possess heterogeneous capabilities and require robust coordination.
Future Directions:
The paper lays groundwork for enhancing MARL methods with trust region concepts, suggesting possible extensions to safety-critical environments where guarantees on policy improvement are vital. Further research could explore the integration of safety protocols into HATRPO and HAPPO to address environments involving uncertainty and dynamic adversarial behavior.
In summary, the paper marks a significant step forward for MARL by integrating trust region methods into theoretically backed algorithms capable of handling the intricate dynamics of policy updates in multi-agent settings.