Automated, generalizable policy augmentation for partner diversity in multiagent RL

Develop an automated and generalizable procedure for generating a diverse set of partner policies to use during training in cooperative multiagent reinforcement learning, analogous to data augmentation in supervised learning, so that agents experience a wide variety of partner behaviors and thereby generalize to unseen partners.

Background

The paper demonstrates that agents trained jointly with MADDPG co-adapt to one another: they develop landmark-specific preferences that hinder generalization to unseen partners, such as deterministic "Sheldon" agents that always head to the same fixed landmark.
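
For concreteness, such a fixed-landmark partner can be written in a few lines. The sketch below is hypothetical: the paper describes the behavior (always move toward one fixed landmark) but not this code, and the observation layout assumed here (the agent's own state followed by 2-D relative landmark positions) follows a common particle-environment convention rather than anything specified in the source.

```python
import numpy as np

class SheldonPolicy:
    """Deterministic evaluation partner that always heads for one fixed landmark."""

    def __init__(self, landmark_index: int, obs_offset: int = 4):
        # obs_offset encodes a hypothetical layout assumption: the first four
        # observation entries hold the agent's own velocity and position,
        # followed by the 2-D relative position of each landmark.
        self.landmark_index = landmark_index
        self.obs_offset = obs_offset

    def act(self, obs: np.ndarray) -> np.ndarray:
        start = self.obs_offset + 2 * self.landmark_index
        delta = obs[start:start + 2]  # landmark position relative to agent
        norm = np.linalg.norm(delta)
        # Unit-step action straight toward the fixed landmark.
        return delta / norm if norm > 1e-8 else np.zeros(2)
```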

To mitigate this, the authors argue that agents should encounter a wide variety of partner policies during training, akin to data augmentation in supervised learning. However, they note that it is not currently clear how to implement such policy diversity in an automated and generalizable way.

This motivates an open methodological challenge: constructing principled, automated mechanisms to generate diverse partner policies during training in multiagent reinforcement learning, with the aim of improving generalization beyond the training population.
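
The paper proposes no specific mechanism, but the data-augmentation analogy suggests generic transformations of a learned policy. The Python sketch below illustrates three such transformations drawn from the broader population-training literature: sampling stale checkpoints, parameter-space noise, and action-space noise. All names here (`augment_partners`, `NoisyActor`) and the choice of augmentations are assumptions for illustration, not the authors' method.

```python
import copy
import random
import torch
import torch.nn as nn

class NoisyActor(nn.Module):
    """Wraps a policy and perturbs its actions with Gaussian noise."""

    def __init__(self, policy: nn.Module, sigma: float):
        super().__init__()
        self.policy = policy
        self.sigma = sigma

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        action = self.policy(obs)
        return action + self.sigma * torch.randn_like(action)

def augment_partners(base_policy: nn.Module,
                     checkpoints: list[nn.Module],
                     n_partners: int = 16,
                     param_noise: float = 0.1,
                     action_noise: float = 0.2) -> list[nn.Module]:
    """Build a diverse partner population from one learned policy.

    Combines three generic augmentations; none of these is prescribed
    by the paper, and the noise scales are arbitrary placeholders.
    """
    partners = []
    for _ in range(n_partners):
        # Start from the current policy or a randomly chosen old checkpoint.
        source = random.choice([base_policy] + checkpoints)
        partner = copy.deepcopy(source)
        with torch.no_grad():
            for p in partner.parameters():
                # Parameter-space perturbation: a cheap source of
                # behavioral diversity.
                p.add_(param_noise * torch.randn_like(p))
        partners.append(NoisyActor(partner, action_noise))
    return partners
```

One plausible use is to resample this population periodically during training, so that each learning agent faces a shifting mix of partners instead of a single co-adapted one; whether such generic perturbations yield sufficiently diverse behavior is exactly the open question the section raises.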

References

Similar to data augmentation used in supervised training we would need to "augment" our policies in various ways to produce the widest variety of training partners. Unfortunately it is not clear how to achieve this in an automated and generalizable way.

Do deep reinforcement learning agents model intentions? (Matiisen et al., 2018, arXiv:1805.06020), in Discussion: Generalization in multiagent setups