Efficient Multi-agent Reinforcement Learning by Planning (2405.11778v1)
Abstract: Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms integrating planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges. The expansive action space of multi-agent systems often necessitates leveraging the nearly-independent property of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a centralized model with Monte Carlo Tree Search (MCTS) for policy search. We design a novel network structure to facilitate distributed execution and parameter sharing. To enhance search efficiency in deterministic environments with sizable action spaces, we introduce two novel techniques: Optimistic Search Lambda (OS($\lambda$)) and Advantage-Weighted Policy Optimization (AWPO). Extensive experiments on the SMAC benchmark demonstrate that MAZero outperforms model-free approaches in terms of sample efficiency and provides comparable or better performance than existing model-based methods in terms of both sample and computational efficiency. Our code is available at https://github.com/liuqh16/MAZero.
Authors: Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, Chongjie Zhang