Scaling Instructable Agents Across Many Simulated Worlds
The paper "Scaling Instructable Agents Across Many Simulated Worlds" details the development and initial results of the Scalable, Instructable, Multiworld Agent (SIMA) project. SIMA aims to create embodied AI systems capable of executing arbitrary language instructions within diverse 3D environments, bridging the gap between symbolic language and embodied perception/action.
Introduction and Motivation
The ability of AI systems to follow complex language instructions and perform tasks in realistic 3D environments has long been a challenge. While modern AI demonstrates proficiency in abstract domains such as chess and programming, interacting with the physical world through grounded perception and action remains significantly underdeveloped. The SIMA project seeks to overcome this limitation by training agents to follow diverse free-form instructions across various 3D settings, from curated research environments to open-ended commercial video games.
A key design choice in SIMA is its reliance on a generic, human-like interface: agents receive image observations and language instructions and emit keyboard-and-mouse actions, so any environment playable by a human is, in principle, playable by the agent. This universal interface mirrors how humans interact with games, allowing direct imitation learning from human behavior data and zero-shot transfer to new tasks and environments.
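To make the interface concrete, the sketch below shows one way the observation and action types could be written down; all field names here are illustrative assumptions, not the project's actual API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """What the agent perceives at each timestep (fields are illustrative)."""
    rgb: np.ndarray    # rendered frame, e.g. shape (H, W, 3), dtype uint8
    instruction: str   # free-form command, e.g. "chop down the tree"


@dataclass
class Action:
    """Human-like keyboard-and-mouse output (fields are illustrative)."""
    keys_pressed: list[str]   # e.g. ["w", "shift"]
    mouse_dx: float           # horizontal mouse motion
    mouse_dy: float           # vertical mouse motion
    left_click: bool = False
    right_click: bool = False
```

Because the same `Observation -> Action` signature applies everywhere, adding a new game requires no environment-specific action space or API integration.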
Environments
SIMA spans both commercial video games and research-specific environments, selected for their visual richness and diverse interaction possibilities. Commercial titles such as Goat Simulator 3, Hydroneer, No Man's Sky, Satisfactory, Teardown, Valheim, and Wobbly Life offer a high degree of complexity and visual fidelity. In contrast, research environments like Construction Lab, Playhouse, ProcTHOR, and WorldLab provide controlled settings for assessing specific skills essential to grounded AI.
Data Collection and Processing
The project collects large datasets of human expert gameplay, capturing videos, language instructions, and actions within these environments; recordings are filtered and preprocessed to remove low-quality episodes. The instructions span domains such as resource gathering, combat, navigation, and object management. Data is gathered both from single-player sessions and from two-player "setter-solver" interactions, in which one participant issues instructions that the other carries out.
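An episode in such a collection can be pictured as a synchronized record of frames, actions, and the governing instruction. The record type and quality filter below are a hypothetical sketch of this kind of preprocessing, not the project's actual pipeline.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Episode:
    """One recorded play session (fields are illustrative)."""
    frames: np.ndarray    # (T, H, W, 3) video of the session
    actions: np.ndarray   # (T, A) encoded keyboard-and-mouse actions
    instruction: str      # the language instruction in force


def keep_episode(ep: Episode, min_steps: int = 30) -> bool:
    """Illustrative quality filter: drop episodes that are too short
    or where the player never acted (all-zero action rows)."""
    long_enough = ep.frames.shape[0] >= min_steps
    player_acted = bool(np.any(ep.actions))
    return long_enough and player_acted
```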
Agent Architecture
The SIMA agent's architecture integrates inputs from pretrained models and fine-tunes them using behavioral cloning on the collected human data; a minimal sketch of how the pieces compose appears after the list. Key components include:
- Vision models like SPARC for fine-grained image-text alignment,
- Video prediction models like Phenaki,
- Transformers for processing visual observations, language instructions, and memory states,
- A policy network generating keyboard-and-mouse actions.
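To make the data flow concrete, below is a minimal, self-contained sketch of how such components might compose. The layer sizes are placeholders, and plain linear projections stand in for the pretrained SPARC/Phenaki encoders, so this illustrates the wiring rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SIMAStylePolicy(nn.Module):
    """Illustrative composition: encoders -> transformer core -> policy head."""

    def __init__(self, d_model: int = 512, n_actions: int = 128):
        super().__init__()
        # Placeholders for pretrained visual and text encoders.
        self.vision_proj = nn.Linear(768, d_model)
        self.text_proj = nn.Linear(768, d_model)
        # Transformer core over visual and language tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=4)
        # Policy head producing keyboard-and-mouse action logits.
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, T_v, 768) frame features; txt_feats: (B, T_t, 768) token features.
        tokens = torch.cat([self.vision_proj(vis_feats), self.text_proj(txt_feats)], dim=1)
        hidden = self.core(tokens)
        return self.action_head(hidden[:, -1])  # logits for the next action


# Behavioral cloning: cross-entropy between predicted logits and the expert's action.
policy = SIMAStylePolicy()
logits = policy(torch.randn(2, 8, 768), torch.randn(2, 6, 768))
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 17]))
```

The final lines show the behavioral-cloning objective: the policy is trained to reproduce the action a human expert took given the same observations and instruction.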
This architectural setup lets the agent draw on extensive prior knowledge from pretraining while adapting to specific tasks within varied environments. At inference time, classifier-free guidance is applied to strengthen the policy's conditioning on language, improving the agent's responsiveness to instructions.
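Classifier-free guidance here mirrors its use in generative models: the policy is evaluated twice, with and without the instruction, and the output is pushed toward the language-conditioned result. A minimal sketch, assuming logit-space guidance with a blank instruction as the unconditioned input (function and argument names are hypothetical):

```python
import torch


def guided_action_logits(policy, vis_feats, txt_feats, blank_txt_feats,
                         scale: float = 1.5) -> torch.Tensor:
    """Classifier-free guidance over action logits (illustrative).

    scale = 0 recovers the unconditioned policy, scale = 1 the conditioned one,
    and scale > 1 amplifies the effect of the language instruction.
    """
    cond = policy(vis_feats, txt_feats)            # language-conditioned logits
    uncond = policy(vis_feats, blank_txt_feats)    # logits with a blank instruction
    return uncond + scale * (cond - uncond)
```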
Evaluation Methods
The diverse and complex environments necessitate varied evaluation methodologies:
- Ground-truth evaluations in research environments for precise task success metrics.
- Optical Character Recognition (OCR) for detecting in-game text that signals task completion (a minimal sketch appears below).
- Human evaluations for assessing agent performance on tasks where automatic metrics are infeasible.
These strategies ensure robust, scalable assessment across environments while remaining sensitive to whether agents actually follow the given language instructions.
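As a concrete example of the OCR route, success detection can amount to reading text off a captured frame. The snippet below uses pytesseract as a stand-in OCR engine; this is an assumption, since the paper does not name its OCR tooling.

```python
import pytesseract          # requires the Tesseract OCR engine to be installed
from PIL import Image


def task_succeeded(frame: Image.Image, success_phrase: str) -> bool:
    """Return True if on-screen text contains the game's completion message."""
    screen_text = pytesseract.image_to_string(frame)
    return success_phrase.lower() in screen_text.lower()


# Example: detect a hypothetical "Quest complete" banner in the final frame.
# done = task_succeeded(Image.open("final_frame.png"), "Quest complete")
```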
Initial Results
Agents demonstrate varied success rates across environments, performing better in controlled research settings than in the more open-ended commercial games. Performance is particularly strong in simpler environments like Playhouse and WorldLab, indicating that the agents generalize basic skills. Success also varies across skill categories: movement and basic interactions score higher, while more intricate tasks such as combat and resource management score lower.
Comparisons with several baselines highlight the benefits of the integrated approach, showing significant improvements over agents trained without pretraining or language input. Zero-shot evaluation results are promising, with agents transferring basic skills to held-out environments, underlining the generalization capabilities of the approach.
Implications and Future Directions
The implications of SIMA are twofold: practical and theoretical. Practically, SIMA's approach offers an efficient, scalable method for training embodied AI, circumventing the prohibitive costs and risks associated with real-world robotics testing. Theoretically, it advances the understanding of language grounding in rich, embodied settings, contributing to the development of general AI.
Future developments will focus on scaling to more environments, enhancing agent robustness, leveraging more sophisticated pretrained models, and refining evaluation protocols. The ultimate goal is to create a general instructable agent capable of complex, language-driven behavior across any simulated 3D environment, potentially extending to real-world applications.
Conclusion
The SIMA project represents a significant step toward achieving general AI capable of understanding and executing language instructions in rich, interactive environments. By leveraging large-scale data collection, robust agent architectures, and diverse evaluation techniques, SIMA not only advances AI capabilities but also provides a critical platform for future research in grounded language understanding and general AI.