Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction

Published Feb 6, 2024 in cs.AI and cs.LG


Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts utilizing extensive offline datasets from various tasks demonstrate remarkable performance in multitasking scenarios within Reinforcement Learning. However, these works encounter challenges in extending their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectory into decision networks to provide task-specific contextual cues, representing a promising direction. However, it is observed that relying solely on textual guidance or visual trajectory is insufficient for accurately conveying the contextual information of tasks. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions, thereby facilitating a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.

Diagram of the Decision Transformer with Game Instruction model, showcasing multimodal instruction representation.


  • Integrates multimodal game instructions (MGI) into Decision Transformers (DTs) to improve multitasking and generalization in Reinforcement Learning (RL).

  • Describes the Decision Transformer with Game Instruction (DTGI) model, which combines textual and visual instructions and adds a component called SHyperGenerator for better adaptability.

  • Presents empirical evidence that DTs with MGI outperform those given single-modality instructions, particularly in unseen gaming environments.

  • Suggests that multimodal instructions could improve the performance of AI systems and LLMs on tasks beyond games.

Integrating Multimodal Game Instructions into Decision Transformers for Enhanced Multitasking and Generalization in Reinforcement Learning


Building generalist agents that perform well across diverse tasks is a long-standing objective in AI. Reinforcement Learning (RL) approaches trained on large offline datasets have shown strong multitask performance, but they often fail to adapt to unfamiliar tasks because they lack task-specific knowledge and contextual information. Recent work tries to supply that context through textual guidance or visual trajectories, yet either modality alone does not convey the task fully. The paper argues that multimodal game instructions give Decision Transformers (DTs) richer contextual cues and thereby improve both multitasking and generalization.
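To make the idea of "contextual cues" concrete, here is a minimal sketch of how a Decision Transformer input sequence could be conditioned on an instruction. The return-to-go, state, action interleaving follows the standard Decision Transformer formulation. Prepending instruction tokens is an assumption made only for illustration; this summary does not say how the paper feeds instructions into the model. The names `build_sequence`, `returns_to_go`, and the string placeholders are hypothetical.

```python
from typing import List, Sequence, Tuple

# (kind, payload) pair standing in for an embedded token.
Token = Tuple[str, object]


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Return, for each timestep, the sum of rewards from that step onward."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def build_sequence(
    instruction: Sequence[object],
    states: Sequence[object],
    actions: Sequence[object],
    rewards: Sequence[float],
) -> List[Token]:
    """Prefix instruction tokens, then interleave (return-to-go, state, action).

    Assumption: the instruction is a flat list of text and image tokens
    placed before the trajectory. This is an illustrative choice, not the
    paper's confirmed design.
    """
    if not (len(states) == len(actions) == len(rewards)):
        raise ValueError("states, actions and rewards must have equal length")
    seq: List[Token] = [("instr", tok) for tok in instruction]
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        seq.extend([("rtg", g), ("state", s), ("action", a)])
    return seq


if __name__ == "__main__":
    demo = build_sequence(
        instruction=["text:avoid enemies", "frame:demo0"],
        states=["s0", "s1"],
        actions=["a0", "a1"],
        rewards=[1.0, 2.0],
    )
    for token in demo:
        print(token)
```

With a prefix like this, the same trajectory tokens can be paired with different instructions, which is one simple way a single model could be told which game it is playing.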

Multimodal Game Instruction: A New Frontier

Multimodal game instructions (MGI) are the paper's main proposal for building more versatile and adaptable RL agents. Drawing on the success of multimodal instruction tuning in visual tasks, the authors integrate these instructions into the Decision Transformer framework, producing the Decision Transformer with Game Instruction (DTGI). The model combines textual and visual instructions and introduces a component called SHyperGenerator that shares knowledge between tasks, which further improves adaptability to unseen games.
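This summary does not describe the internals of SHyperGenerator. The name suggests a hypernetwork, so the sketch below shows the general hypernetwork pattern rather than the paper's actual module: one generator, shared across all tasks, maps an instruction embedding to the weights of a small task-specific layer. Every name here (`SharedHyperGenerator`, `generate`, `adapt`) and every dimension is hypothetical, and the weights are random rather than trained.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class SharedHyperGenerator:
    """Hypothetical shared generator: instruction embedding -> adapter weights.

    Assumption: knowledge is shared between tasks because every task uses
    the same projection; only the instruction embedding differs per task.
    """

    def __init__(self, instr_dim: int, in_dim: int, out_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.in_dim = in_dim
        self.out_dim = out_dim
        # One row per generated adapter weight, one column per instruction feature.
        self.proj: Matrix = [
            [rng.gauss(0.0, 0.1) for _ in range(instr_dim)]
            for _ in range(in_dim * out_dim)
        ]

    def generate(self, instr_emb: Vector) -> Matrix:
        """Produce an (out_dim x in_dim) adapter matrix for one instruction."""
        flat = matvec(self.proj, instr_emb)
        return [
            flat[r * self.in_dim:(r + 1) * self.in_dim]
            for r in range(self.out_dim)
        ]


def adapt(hidden: Vector, adapter: Matrix) -> Vector:
    """Apply the generated adapter to a hidden state."""
    return matvec(adapter, hidden)


if __name__ == "__main__":
    gen = SharedHyperGenerator(instr_dim=3, in_dim=4, out_dim=2)
    adapter_a = gen.generate([1.0, 0.0, 0.0])  # stand-in embedding for game A
    adapter_b = gen.generate([0.0, 1.0, 0.0])  # stand-in embedding for game B
    hidden = [1.0, 1.0, 1.0, 1.0]
    print(adapt(hidden, adapter_a))
    print(adapt(hidden, adapter_b))
```

The design point is that an unseen game needs no new parameters under this pattern: its instruction embedding alone selects the adapter, which is one plausible reason a shared generator could help generalization.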

Compelling Experimental Insights

The empirical evaluation shows that adding MGI to DTs yields clear improvements. The main findings are:

  • DTs conditioned on MGI markedly outperform those conditioned on a single modality, indicating that multimodal context carries more complete task information.
  • Performance on unseen games improves noticeably when MGI is integrated, indicating better generalization.
  • Larger datasets and a more diverse set of training games improve both in-distribution (ID) and out-of-distribution (OOD) performance, suggesting that the benefit of MGI holds across different amounts of training data.

The Future Trajectory of AI and LLMs

This research shows that multimodal instructions can improve decision-making in RL models. Integrating MGI into DTs is a step toward generalist agents and suggests directions for further work on AI systems and LLMs. A generalized multimodal instruction framework is a possible next step, with potential gains not only on vision-based tasks but also on a broader range of AI problems.

In summary, integrating MGI into DTs combines multimodal cues with sequential decision-making to improve learning, adaptability, and task execution. The approach could be extended to other domains in pursuit of more capable and versatile AI agents.


