- The paper presents two techniques, multi-agent importance sampling and low-dimensional fingerprints, that stabilize experience replay for independent learners in nonstationary multi-agent environments.
- It demonstrates that the fingerprint method significantly improves performance in decentralized tasks like StarCraft unit micromanagement, especially with feedforward models.
- The study reveals that although both methods individually enhance learning stability, their combined use does not offer additional benefits beyond fingerprints alone.
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning
The paper "Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning" addresses the challenge of adapting experience replay, a fundamental mechanism in single-agent RL, to the multi-agent context. In multi-agent systems, independent Q-learning (IQL) introduces nonstationarity due to other learning agents being part of the environment, making standard experience replay ineffective. This paper proposes two methods to mitigate this issue, focusing on improving multi-agent reinforcement learning (MARL) stability and efficiency.
Proposed Methods
- Multi-Agent Importance Sampling (IS): This method treats data in the replay memory as off-environment data. By recording, alongside each transition, the probabilities with which the other agents chose their part of the joint action, the learner can later reweight each replayed sample by the ratio of those probabilities under the current policies to the recorded values. These importance weights naturally decay the contribution of obsolete data, which is crucial in dynamic environments where the agents' policies keep evolving.
- Multi-Agent Fingerprints (FP): Inspired by hyper Q-learning, this approach conditions each agent's value function on a low-dimensional "fingerprint" of the training process, comprising the training iteration number and the exploration rate. This fingerprint lets the network disambiguate replayed data by its age in the replay memory, which stabilizes learning. Both mechanisms are illustrated in the sketch after this list.
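As a concrete illustration rather than the authors' implementation, the sketch below shows one way a replay buffer could support both ideas: each stored transition carries the other agents' joint-action probability at collection time (for IS) and a (training iteration, exploration rate) fingerprint (for FP). The class and helper names, including the `current_others_prob_fn` callback used to query the other agents' current policies, are assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Transition:
    obs: list            # local observation of this agent
    action: int
    reward: float
    next_obs: list
    done: bool
    others_prob: float   # other agents' joint-action probability at collection time (IS)
    fingerprint: tuple   # (training_iteration, epsilon) at collection time (FP)


@dataclass
class StabilisedReplayBuffer:
    """Illustrative buffer combining the fingerprint (FP) and importance-sampling (IS) ideas."""
    capacity: int
    storage: list = field(default_factory=list)

    def add(self, obs, action, reward, next_obs, done, others_prob, iteration, epsilon):
        # FP: tag the sample with a low-dimensional fingerprint of *when* it was collected.
        # IS: record the other agents' joint-action probability under their policies at that time.
        t = Transition(obs, action, reward, next_obs, done, others_prob, (iteration, epsilon))
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append(t)

    def sample(self, batch_size, current_others_prob_fn):
        """current_others_prob_fn(transition) -> probability of the same joint action of the
        other agents under their *current* policies (an assumed helper for this sketch)."""
        batch = random.sample(self.storage, min(batch_size, len(self.storage)))
        augmented_obs, weights = [], []
        for t in batch:
            # FP: condition the Q-network on observation + fingerprint so it can tell how
            # "old" the sample is, and hence which co-player policies generated it.
            augmented_obs.append(list(t.obs) + list(t.fingerprint))
            # IS: weight the sample by the ratio of current to recorded joint-action
            # probabilities; obsolete data is naturally down-weighted.
            weights.append(current_others_prob_fn(t) / max(t.others_prob, 1e-8))
        return batch, augmented_obs, weights
```

In training, the fingerprint-augmented observations would be fed to the Q-network and the returned weights would scale the per-sample TD losses; appending the fingerprint at collection time instead of at sampling time works equally well.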
Experimental Evaluation
The authors validate these methods on a decentralized StarCraft unit micromanagement task, a complex multi-agent environment where traditional RL methods struggle. Importantly, they evaluate both feedforward and recurrent neural network architectures to assess the robustness of the solutions.
Key Findings
- Effectiveness of FP and IS: The experiments demonstrated that both methods improved the stability and performance of MARL in the StarCraft environment. The FP approach was particularly beneficial, showing substantial improvements over a baseline trained without experience replay.
- Recurrent Models vs. Feedforward Models: In recurrent networks, the benefit of FP was less pronounced, likely because an RNN's ability to capture trajectory information already helps disambiguate the training stage at which replayed data was generated.
- Combination of Methods: While both FP and IS individually contributed to improved learning outcomes, their combined application did not show additional benefits beyond those provided by FP alone.
Implications and Future Work
The proposed techniques directly target the nonstationarity introduced by concurrently learning agents, making MARL more feasible in high-dimensional, complex environments. By enhancing sample efficiency and stability, they pave the way for broader applications in real-world scenarios such as traffic control or network optimization.
Future research could extend these methods to actor-critic frameworks or apply them to other learning problems with nonstationary data beyond reinforcement learning. Further refinements to the fingerprint, or alternative regularization strategies, could also provide deeper insight into how to stabilize multi-agent systems efficiently.