
Scaling Instructable Agents Across Many Simulated Worlds (2404.10179v3)

Published 13 Mar 2024 in cs.RO, cs.AI, cs.HC, and cs.LG
Abstract: Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.

The paper "Scaling Instructable Agents Across Many Simulated Worlds" details the development and initial results of the Scalable, Instructable, Multiworld Agent (SIMA) project. SIMA aims to create embodied AI systems capable of executing arbitrary language instructions within diverse 3D environments, bridging the gap between symbolic language and embodied perception/action.

Introduction and Motivation

The ability of AI systems to follow complex language instructions and perform tasks in realistic 3D environments has long been a challenge. While modern AI demonstrates proficiency in abstract domains such as chess and programming, interacting with the physical world through grounded perception and action remains significantly underdeveloped. The SIMA project seeks to overcome this limitation by training agents to follow diverse free-form instructions across various 3D settings, from curated research environments to open-ended commercial video games.

A central element of SIMA's approach is its generic, human-like interface. Agents receive image observations and language instructions and produce keyboard-and-mouse actions, which lets them run in new environments without environment-specific APIs. Because the interface mirrors how humans play, it supports direct imitation learning from human behavior data and enables zero-shot transfer to new tasks and environments.
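The interaction loop implied by this interface can be sketched minimally as follows. The names here (`KeyboardMouseAction`, `run_episode`, the `agent`/`env` methods) are illustrative assumptions, not the project's actual API; the point is only that the agent's contract is images and text in, keyboard-and-mouse events out.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KeyboardMouseAction:
    """Hypothetical action type mirroring the paper's human-like interface."""
    keys_pressed: List[str]   # e.g. ["w", "shift"]
    mouse_dx: float           # relative cursor movement, in pixels
    mouse_dy: float
    left_click: bool = False
    right_click: bool = False

def run_episode(agent, env, instruction: str, max_steps: int = 100) -> None:
    """Generic real-time loop: the agent sees only an image observation and a
    language instruction, and emits a keyboard-and-mouse action each step."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(image=obs, instruction=instruction)
        obs, done = env.step(action)
        if done:
            break
```

Because the loop touches no game internals, the same code runs unchanged against any environment that renders frames and accepts input events.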

Environments

SIMA spans a range of commercial video games and research-specific environments, selected for their visual richness and diverse interaction possibilities. Commercial games such as Goat Simulator 3, Hydroneer, No Man's Sky, Satisfactory, Teardown, Valheim, and Wobbly Life offer a high degree of complexity and visual fidelity. In contrast, research environments like Construction Lab, Playhouse, ProcTHOR, and WorldLab provide controlled settings for assessing specific skills essential to grounded AI.

Data Collection and Processing

The project collects large datasets of human expert gameplay, capturing videos, language instructions, and actions within these environments. Data quality is ensured through filtering and preprocessing measures. The instructions span a variety of domains such as resource gathering, combat, navigation, and object management, with data collection methodologies including single-player sessions and two-player "setter-solver" interactions.
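A demonstration record of this kind pairs an instruction with time-aligned frames and actions. The schema below is a toy illustration of that pairing, with a trivial quality filter; the field names and filter criteria are assumptions, not the project's actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Demonstration:
    """One human gameplay trajectory: video, instruction, and actions."""
    environment: str                  # e.g. "Valheim" or "Playhouse"
    instruction: str                  # free-form text, e.g. "chop down the tree"
    frames: List[bytes] = field(default_factory=list)  # encoded video frames
    actions: List[Dict] = field(default_factory=list)  # keyboard/mouse events, one per frame

def passes_quality_filter(demo: Demonstration) -> bool:
    """Toy filter: drop records with no instruction or with frame/action
    streams that are not aligned one-to-one."""
    return bool(demo.instruction) and len(demo.frames) == len(demo.actions)
```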

Agent Architecture

The SIMA agent's architecture integrates inputs from pretrained models and fine-tunes them using behavioral cloning. Key components include:

  • Vision models like SPARC for fine-grained image-text alignment,
  • Video prediction models like Phenaki,
  • Transformers for processing visual observations, language instructions, and memory states,
  • A policy network generating keyboard-and-mouse actions.

This architectural setup allows the agent to assimilate extensive prior knowledge while adapting to specific tasks within varied environments. Classifier-Free Guidance is employed at inference time to strengthen the policy's language conditioning, improving the agent's responsiveness to instructions.
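The core of Classifier-Free Guidance applied to a policy is a simple extrapolation between instruction-conditioned and unconditioned action logits. A minimal sketch, assuming the policy can be run both with and without the language input (the function name and list-of-floats representation are illustrative):

```python
from typing import List

def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Classifier-Free Guidance on action logits:

        logits = uncond + scale * (cond - uncond)

    scale == 1 recovers the instruction-conditioned policy; scale > 1
    amplifies the difference the instruction makes, sharpening the
    policy's responsiveness to language."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

In practice the guided logits would then be passed through a softmax to sample the next keyboard-and-mouse action.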

Evaluation Methods

The diverse and complex environments necessitate varied evaluation methodologies:

  • Ground-truth evaluations in research environments for precise task success metrics.
  • Optical Character Recognition (OCR) for detecting in-game text denoting task completion.
  • Human evaluations for assessing agent performance on tasks where automatic metrics are infeasible.

These strategies ensure robust, scalable assessment across multiple environments while maintaining a high sensitivity to language instruction adherence.
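Whatever the success signal (ground truth, OCR, or human judgment), per-episode outcomes ultimately aggregate into per-environment, per-skill success rates. A small sketch of that aggregation step, assuming episode records of the form `(environment, skill, succeeded)` (the record shape is an assumption for illustration):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def success_rates(results: List[Tuple[str, str, bool]]) -> Dict[str, Dict[str, float]]:
    """Aggregate (environment, skill, succeeded) episode records into a
    nested table of success rates: table[environment][skill] -> fraction."""
    counts = defaultdict(lambda: [0, 0])  # (env, skill) -> [successes, total]
    for env, skill, ok in results:
        counts[(env, skill)][0] += int(ok)
        counts[(env, skill)][1] += 1
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (env, skill), (succ, total) in counts.items():
        table[env][skill] = succ / total
    return dict(table)
```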

Initial Results

Agents demonstrate varied success rates across environments, performing better in controlled research settings than in open-ended commercial games. Performance is particularly strong in simpler environments like Playhouse and WorldLab, indicating that the agents generalize basic skills. Success also varies by skill category: movement and basic interactions score highest, while more intricate tasks such as combat and resource management remain difficult.

Comparisons with several baselines highlight the benefits of the integrated approach, showing significant improvements over agents trained without pretraining or language input. Zero-shot evaluation results are promising, with agents transferring basic skills to held-out environments, underlining the generalization capabilities of the approach.

Implications and Future Directions

The implications of SIMA are twofold: practical and theoretical. Practically, SIMA's approach offers an efficient, scalable method for training embodied AI, circumventing the prohibitive costs and risks associated with real-world robotics testing. Theoretically, it advances the understanding of language grounding in rich, embodied settings, contributing to the development of general AI.

Future developments will focus on scaling to more environments, enhancing agent robustness, leveraging more sophisticated pretrained models, and refining evaluation protocols. The ultimate goal is to create a general instructable agent capable of complex, language-driven behavior across any simulated 3D environment, potentially extending to real-world applications.

Conclusion

The SIMA project represents a significant step toward achieving general AI capable of understanding and executing language instructions in rich, interactive environments. By leveraging large-scale data collection, robust agent architectures, and diverse evaluation techniques, SIMA not only advances AI capabilities but also provides a critical platform for future research in grounded language understanding and general AI.
