MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure (2405.00902v1)
Abstract: Multi-agent reinforcement learning (MARL) algorithms often struggle to find strategies close to a Pareto-optimal Nash equilibrium, owing largely to a lack of efficient exploration. The problem is exacerbated in sparse-reward settings, where policy learning exhibits high variance. This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning. It learns to explore by first identifying the agents' high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to "cover" that subspace. The trained exploration policies can be integrated with any off-policy MARL algorithm at test time. We first showcase MESA's advantage in a multi-step matrix game. Experiments further show that, with the learned exploration policies, MESA achieves significantly better performance on sparse-reward tasks in several multi-agent particle and multi-agent MuJoCo environments, and generalizes to more challenging tasks at test time.
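To make the two-phase recipe in the abstract concrete, below is a minimal Python sketch of the idea under simplified assumptions: the k-means clustering, the distance-based exploration bonus, and every name in the code (`kmeans`, `exploration_bonus`, the fabricated rollout data) are illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of the MESA idea from the abstract: (1) identify the
# high-rewarding joint state-action subspace, (2) cover it with a set of
# diverse exploration policies, (3) mix those policies into an off-policy
# MARL learner at test time. All names and shaping choices are assumptions.
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: partition the high-rewarding subspace into k regions."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster goes empty.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# --- Phase 1: identify the high-rewarding joint state-action subspace. ---
# In the real method these pairs would come from rollouts on training tasks;
# here we fabricate stand-in data purely for illustration.
rng = np.random.default_rng(1)
joint_sa_pairs = rng.normal(size=(500, 6))   # (state ++ joint action) vectors
rewards = rng.uniform(size=500)
high_reward = joint_sa_pairs[rewards > 0.8]  # keep only rewarding pairs

# --- Phase 2: "cover" the subspace with diverse exploration policies, ---
# one per cluster (each policy would be trained to reach its own region).
k = 4
centers, _ = kmeans(high_reward, k)

def exploration_bonus(sa, center):
    """Dense surrogate reward pulling agents toward one region of the
    high-rewarding subspace (an assumed shaping choice, not the paper's)."""
    return -np.linalg.norm(sa - center)

# --- Phase 3 (test time): integrate with any off-policy MARL algorithm, ---
# e.g. run some rollouts under a sampled exploration policy and store the
# transitions in the learner's shared replay buffer.
for episode in range(3):
    policy_id = rng.integers(k)              # pick one exploration policy
    sa = rng.normal(size=6)                  # placeholder rollout step
    print(f"episode {episode}: policy {policy_id}, "
          f"bonus {exploration_bonus(sa, centers[policy_id]):.3f}")
```

In this reading, diversity across exploration policies comes from assigning each policy its own cluster of the high-rewarding subspace, so together they cover regions a single exploration strategy would be unlikely to visit under sparse rewards.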
Authors: Zhicheng Zhang, Yancheng Liang, Yi Wu, Fei Fang