Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction (2209.01054v2)
Abstract: Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms. Despite its popularity, it suffers from a critical drawback: it relies on learning from a single sample of the joint-action at a given state. As agents explore and update their policies during training, these single samples may poorly represent the actual joint-policy of the system of agents, leading to high-variance gradient estimates that hinder learning. To address this problem, we propose an enhancement tool that accommodates any actor-critic MARL method. Our framework, Performance Enhancing Reinforcement Learning Apparatus (PERLA), introduces a sampling technique that injects the agents' joint-policy into the critics while the agents train. This leads to TD updates that closely approximate the true expected value under the current joint-policy, rather than estimates from a single sample of the joint-action at a given state. The result is precise, low-variance estimates of expected returns, minimising the critic-estimator variance that typically hinders learning. Moreover, as we demonstrate, by eliminating much of the critic variance that arises from single sampling of the joint-policy, PERLA enables CT-DE methods to scale more efficiently with the number of agents. Theoretically, we prove that PERLA reduces the variance of value estimates to a level comparable to that of decentralised training, while maintaining the benefits of centralised training. Empirically, we demonstrate PERLA's superior performance and ability to reduce estimator variance on a range of benchmarks, including Multi-Agent MuJoCo and the StarCraft Multi-Agent Challenge.
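To make the core idea concrete, the sketch below is a minimal illustration (not the authors' implementation) contrasting a standard single-sample CT-DE critic target with a PERLA-style target that averages the critic over K joint-actions sampled from the agents' current policies. The `joint_q` critic, the per-agent `sample(state)` method, and the sample count `K` are illustrative assumptions, not names from the paper.

```python
import numpy as np

def standard_td_target(reward, gamma, joint_q, next_state, next_joint_action):
    # Single-sample CT-DE target: bootstraps only from the joint-action
    # actually taken, which is high-variance under stochastic policies.
    return reward + gamma * joint_q(next_state, next_joint_action)

def perla_style_td_target(reward, gamma, joint_q, next_state, policies, K=16):
    # Joint-policy-sampled target: Monte-Carlo estimate of
    # E_{a ~ pi}[Q(s', a)] under the current joint-policy.
    values = []
    for _ in range(K):
        joint_action = tuple(pi.sample(next_state) for pi in policies)
        values.append(joint_q(next_state, joint_action))
    return reward + gamma * float(np.mean(values))

# Toy usage: two agents with uniform binary policies and a placeholder critic.
class UniformBinaryPolicy:
    def sample(self, state):
        return np.random.randint(2)

joint_q = lambda state, joint_action: float(sum(joint_action))
policies = [UniformBinaryPolicy(), UniformBinaryPolicy()]
print(perla_style_td_target(reward=1.0, gamma=0.99, joint_q=joint_q,
                            next_state=None, policies=policies, K=64))
```

Averaging the critic over sampled joint-actions approximates its expectation under the current joint-policy, which is what removes the single-sample variance described in the abstract.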
Authors: Taher Jafferjee, Juliusz Ziomek, Tianpei Yang, Zipeng Dai, Jianhong Wang, Matthew Taylor, Kun Shao, Jun Wang, David Mguni