Scaling Instructable Agents Across Many Simulated Worlds
The paper "Scaling Instructable Agents Across Many Simulated Worlds" details the development and initial results of the Scalable, Instructable, Multiworld Agent (SIMA) project. SIMA aims to create embodied AI systems capable of executing arbitrary language instructions within diverse 3D environments, bridging the gap between symbolic language and embodied perception/action.
Introduction and Motivation
The ability of AI systems to follow complex language instructions and perform tasks in realistic 3D environments has long been a challenge. While modern AI demonstrates proficiency in abstract domains such as chess and programming, interacting with the physical world through grounded perception and action remains significantly underdeveloped. The SIMA project seeks to overcome this limitation by training agents to follow diverse free-form instructions across various 3D settings, from curated research environments to open-ended commercial video games.
A key design choice in SIMA is its reliance on a generic, human-like interface: agents receive image observations and language instructions and emit keyboard-and-mouse actions, so any environment playable by a human is, in principle, playable by the agent. This universal interface mirrors how humans interact with games, allowing direct imitation learning from human behavior data and zero-shot transfer to new tasks and environments.
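To make the interface concrete, the sketch below shows one way the observation and action types could be written down; all field names here are illustrative assumptions, not the project's actual API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """What the agent perceives at each timestep (fields are illustrative)."""
    rgb: np.ndarray    # rendered frame, e.g. shape (H, W, 3), dtype uint8
    instruction: str   # free-form command, e.g. "chop down the tree"


@dataclass
class Action:
    """Human-like keyboard-and-mouse output (fields are illustrative)."""
    keys_pressed: list[str]   # e.g. ["w", "shift"]
    mouse_dx: float           # horizontal mouse motion
    mouse_dy: float           # vertical mouse motion
    left_click: bool = False
    right_click: bool = False
```

Because the same `Observation -> Action` signature applies everywhere, adding a new game requires no environment-specific action space or API integration.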
Environments
SIMA spans both commercial video games and research-specific environments, selected for their visual richness and diverse interaction possibilities. Commercial titles such as Goat Simulator 3, Hydroneer, No Man's Sky, Satisfactory, Teardown, Valheim, and Wobbly Life offer a high degree of complexity and visual fidelity. In contrast, research environments like Construction Lab, Playhouse, ProcTHOR, and WorldLab provide controlled settings for assessing specific skills essential to grounded AI.
Data Collection and Processing
The project collects large datasets of human expert gameplay, capturing videos, language instructions, and actions within these environments; recordings are filtered and preprocessed to remove low-quality episodes. The instructions span domains such as resource gathering, combat, navigation, and object management. Data is gathered both from single-player sessions and from two-player "setter-solver" interactions, in which one participant issues instructions that the other carries out.
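An episode in such a collection can be pictured as a synchronized record of frames, actions, and the governing instruction. The record type and quality filter below are a hypothetical sketch of this kind of preprocessing, not the project's actual pipeline.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Episode:
    """One recorded play session (fields are illustrative)."""
    frames: np.ndarray    # (T, H, W, 3) video of the session
    actions: np.ndarray   # (T, A) encoded keyboard-and-mouse actions
    instruction: str      # the language instruction in force


def keep_episode(ep: Episode, min_steps: int = 30) -> bool:
    """Illustrative quality filter: drop episodes that are too short
    or where the player never acted (all-zero action rows)."""
    long_enough = ep.frames.shape[0] >= min_steps
    player_acted = bool(np.any(ep.actions))
    return long_enough and player_acted
```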
Agent Architecture
The SIMA agent's architecture integrates inputs from pretrained models and fine-tunes them using behavioral cloning on the collected human data; a minimal sketch of how the pieces compose appears after the list. Key components include:
- Vision models like SPARC for fine-grained image-text alignment,
- Video prediction models like Phenaki,
- Transformers for processing visual observations, language instructions, and memory states,
- A policy network generating keyboard-and-mouse actions.
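To make the data flow concrete, below is a minimal, self-contained sketch of how such components might compose. The layer sizes are placeholders, and plain linear projections stand in for the pretrained SPARC/Phenaki encoders, so this illustrates the wiring rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SIMAStylePolicy(nn.Module):
    """Illustrative composition: encoders -> transformer core -> policy head."""

    def __init__(self, d_model: int = 512, n_actions: int = 128):
        super().__init__()
        # Placeholders for pretrained visual and text encoders.
        self.vision_proj = nn.Linear(768, d_model)
        self.text_proj = nn.Linear(768, d_model)
        # Transformer core over visual and language tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=4)
        # Policy head producing keyboard-and-mouse action logits.
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, T_v, 768) frame features; txt_feats: (B, T_t, 768) token features.
        tokens = torch.cat([self.vision_proj(vis_feats), self.text_proj(txt_feats)], dim=1)
        hidden = self.core(tokens)
        return self.action_head(hidden[:, -1])  # logits for the next action


# Behavioral cloning: cross-entropy between predicted logits and the expert's action.
policy = SIMAStylePolicy()
logits = policy(torch.randn(2, 8, 768), torch.randn(2, 6, 768))
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 17]))
```

The final lines show the behavioral-cloning objective: the policy is trained to reproduce the action a human expert took given the same observations and instruction.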
This architectural setup lets the agent draw on extensive prior knowledge from pretraining while adapting to specific tasks within varied environments. At inference time, classifier-free guidance is applied to strengthen the policy's conditioning on language, improving the agent's responsiveness to instructions.
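Classifier-free guidance here mirrors its use in generative models: the policy is evaluated twice, with and without the instruction, and the output is pushed toward the language-conditioned result. A minimal sketch, assuming logit-space guidance with a blank instruction as the unconditioned input (function and argument names are hypothetical):

```python
import torch


def guided_action_logits(policy, vis_feats, txt_feats, blank_txt_feats,
                         scale: float = 1.5) -> torch.Tensor:
    """Classifier-free guidance over action logits (illustrative).

    scale = 0 recovers the unconditioned policy, scale = 1 the conditioned one,
    and scale > 1 amplifies the effect of the language instruction.
    """
    cond = policy(vis_feats, txt_feats)            # language-conditioned logits
    uncond = policy(vis_feats, blank_txt_feats)    # logits with a blank instruction
    return uncond + scale * (cond - uncond)
```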
Evaluation Methods
The diverse and complex environments necessitate varied evaluation methodologies:
- Ground-truth evaluations in research environments for precise task success metrics.
- Optical Character Recognition (OCR) for detecting in-game text that signals task completion (a minimal sketch appears below).
- Human evaluations for assessing agent performance on tasks where automatic metrics are infeasible.
These strategies ensure robust, scalable assessment across environments while remaining sensitive to whether agents actually follow the given language instructions.
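As a concrete example of the OCR route, success detection can amount to reading text off a captured frame. The snippet below uses pytesseract as a stand-in OCR engine; this is an assumption, since the paper does not name its OCR tooling.

```python
import pytesseract          # requires the Tesseract OCR engine to be installed
from PIL import Image


def task_succeeded(frame: Image.Image, success_phrase: str) -> bool:
    """Return True if on-screen text contains the game's completion message."""
    screen_text = pytesseract.image_to_string(frame)
    return success_phrase.lower() in screen_text.lower()


# Example: detect a hypothetical "Quest complete" banner in the final frame.
# done = task_succeeded(Image.open("final_frame.png"), "Quest complete")
```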
Initial Results
Agents demonstrate varied success rates across environments, performing better in controlled research settings than in the more open-ended commercial games. Performance is particularly strong in simpler environments like Playhouse and WorldLab, indicating that the agents generalize basic skills. Success also varies across skill categories: movement and basic interactions score higher, while more intricate tasks such as combat and resource management score lower.
Comparisons with several baselines highlight the benefits of the integrated approach, showing significant improvements over agents trained without pretraining or language input. Zero-shot evaluation results are promising, with agents transferring basic skills to held-out environments, underlining the generalization capabilities of the approach.
Implications and Future Directions
The implications of SIMA are twofold: practical and theoretical. Practically, SIMA's approach offers an efficient, scalable method for training embodied AI, circumventing the prohibitive costs and risks associated with real-world robotics testing. Theoretically, it advances the understanding of language grounding in rich, embodied settings, contributing to the development of general AI.
Future developments will focus on scaling to more environments, enhancing agent robustness, leveraging more sophisticated pretrained models, and refining evaluation protocols. The ultimate goal is to create a general instructable agent capable of complex, language-driven behavior across any simulated 3D environment, potentially extending to real-world applications.
Conclusion
The SIMA project represents a significant step toward achieving general AI capable of understanding and executing language instructions in rich, interactive environments. By leveraging large-scale data collection, robust agent architectures, and diverse evaluation techniques, SIMA not only advances AI capabilities but also provides a critical platform for future research in grounded language understanding and general AI.