Overview
The research presented in this manuscript explores the scalability of transformer-based models for building generalist reinforcement learning (RL) agents that perform across multiple gaming environments. The investigation builds on the success of such models in language and vision tasks, aiming to pair large, diverse datasets with transformers to achieve generalized performance in RL. The authors introduce the Multi-Game Decision Transformer, a model with a single set of parameters that demonstrates a high-performing generalist agent can be trained to act across diverse tasks from offline data alone.
Methods and Contributions
The proposed Multi-Game Decision Transformer addresses the challenge of training on a set of 41 distinct Atari games with varying dynamics, visuals, and agent embodiments, leveraging previously collected trajectories. This approach seeks to identify whether learning from an extensive range of video game experience allows models to capture something universally beneficial. Critically, the researchers deviate from standard decision transformers by incorporating a guided generation technique that produces expert-level actions at inference time, even though the training data mixes expert and non-expert trajectories.
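To make the guided-generation idea concrete, the following is a minimal sketch of expert-conditioned return sampling. The interface is hypothetical: the `return_logits` vector over discretized return-to-go bins, the linear bin normalization, and the inverse-temperature `kappa` are assumptions for illustration, not the authors' implementation. The sketch shows the core idea the review describes: the model's predicted return distribution is reweighted toward high returns, so expert-like behaviour is elicited at inference without discarding low-return training data.

```python
import numpy as np

def sample_expert_return(return_logits, kappa=10.0):
    """Sketch of expert-guided return sampling (hypothetical interface).

    return_logits: predicted logits over discretized return-to-go bins
    for the current timestep. A likelihood that grows with the return is
    multiplied into the model's return distribution, skewing sampling
    toward high-return (expert-like) bins.
    """
    num_bins = len(return_logits)
    p_return = np.exp(return_logits - return_logits.max())
    p_return /= p_return.sum()                       # P(R | history)
    bin_positions = np.linspace(0.0, 1.0, num_bins)  # normalized return level per bin (assumed)
    log_p_expert = kappa * bin_positions             # log-likelihood of "expert" given R, up to a constant
    posterior = p_return * np.exp(log_p_expert)      # proportional to P(R | expert, history)
    posterior /= posterior.sum()
    return np.random.choice(num_bins, p=posterior)

# Toy usage: the model slightly prefers mid-range returns, but the guided
# sampler shifts probability mass toward the highest-return bins.
logits = np.array([0.5, 1.0, 1.2, 0.8, 0.3])
print("sampled return bin:", sample_expert_return(logits, kappa=10.0))
```

The sampled return token then conditions the subsequent action prediction, which is what lets a model trained on mixed-quality data act at an expert level.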
The work contrasts multiple methods in the multi-game domain, reviewing online reinforcement learning, offline temporal-difference methods, contrastive representations, and behavior cloning. Among these, when the task is framed as offline sequence modeling, decision-transformer-based models emerge as the strongest choice for scalability and generalist agent performance.
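The offline sequence-modeling framing can be illustrated with a short sketch. The function names and the toy trajectory below are illustrative assumptions (the actual model also tokenizes rewards and image patches): a trajectory is flattened into an interleaved sequence of return-to-go, observation, and action tokens, which a transformer can then be trained on with ordinary next-token prediction.

```python
import numpy as np

def returns_to_go(rewards):
    """Compute the undiscounted return-to-go R_t = sum of rewards from t onward."""
    return np.cumsum(rewards[::-1])[::-1]

def trajectory_to_sequence(rtgs, observations, actions):
    """Flatten one trajectory into the interleaved token layout used by
    decision-transformer-style models: (R_1, o_1, a_1, R_2, o_2, a_2, ...).
    Hypothetical helper for illustration only.
    """
    sequence = []
    for rtg, obs, act in zip(rtgs, observations, actions):
        sequence.append(("return", rtg))   # scalar return-to-go token
        sequence.append(("obs", obs))      # observation (e.g., frame patches)
        sequence.append(("action", act))   # discrete Atari action id
    return sequence

# Toy 4-step trajectory
rewards = np.array([0.0, 1.0, 0.0, 1.0])
obs = [f"frame_{t}" for t in range(4)]     # stand-ins for image observations
acts = [2, 3, 3, 0]                        # discrete action ids
print(trajectory_to_sequence(returns_to_go(rewards), obs, acts)[:6])
```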
Findings
A key finding is that the Multi-Game Decision Transformer achieves aggregate performance above human level across the evaluated games. It fine-tunes rapidly to unfamiliar games from limited data and exhibits scaling behavior akin to that seen in language and vision: larger models consistently perform better. Not all multi-environment training techniques yield positive outcomes, and the manuscript carefully delineates those that fell short, such as offline non-transformer models and online multi-game methods.
Future Research Implications
This paper sets a precedent for further exploration of generalist agents, offering fertile ground for future research in this domain. The authors provide the models and code to the community, facilitating ongoing work. Beyond the results, there is an allusion to an important open question: whether online learning algorithms can be made to absorb large, diverse datasets as effectively as offline methods like Decision Transformers. This marks an avenue along which the RL field could evolve further.
Limitations and Societal Impacts
The researchers recognize that the work's generalizability may be limited by the specificity of the Atari suite and by the scale of available online and offline RL datasets. Moreover, they caution against extending their algorithms and methodologies to scenarios involving human interaction without thorough consideration of safety and ethical implications. While the current models are applied only to game-playing, decision-making driven by reward feedback remains an area that requires careful alignment with human values and objectives.