MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (2206.08853v2)

Published 17 Jun 2022 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite, knowledge bases, algorithm implementation, and pretrained models (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.

Overview of the 'Framework for Building Open-Ended Embodied Agents'

This paper presents MineDojo, a framework for developing generalist embodied AI agents in open-ended environments by leveraging internet-scale multimodal knowledge. The framework, built on Minecraft, aims to overcome the limitations of specialist AI models, which typically operate within narrowly defined environments and objectives.

Key Contributions

The work makes contributions in three primary areas:

Diverse Open-Ended Environment: The authors introduce a comprehensive simulation environment that utilizes the Minecraft platform. This environment includes thousands of tasks specified in natural language, supporting a diverse range of interactions. This differs from typical closed-ended environments, allowing for broader exploration without predefined rewards.
Internet-Scale Knowledge Base: The authors compile a large, multimodal knowledge base sourced from Minecraft's ecosystem. This includes over 730,000 YouTube videos, along with tutorials, wiki pages, and forum discussions. Such data allows agents to draw on the community's collective experience and learn from existing demonstrations rather than from isolated trial and error.
Novel Algorithmic Approach: A new learning algorithm built on pre-trained video-language models is proposed. These models serve a dual role: providing a learned reward function and supporting automated evaluation metrics. Trained on the collected video data, they let agents tackle both programmatic and creative tasks in the environment without manually designed dense shaping rewards.
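
The learned-reward idea in the third contribution can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a contrastive video-language model that embeds a short clip of agent behavior and a free-form task prompt into a shared vector space, and it scores the clip by cosine similarity to the prompt. The function names, the baseline parameter, and the toy three-dimensional vectors are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def language_reward(
    video_embedding: Sequence[float],
    text_embedding: Sequence[float],
    baseline: float = 0.0,
) -> float:
    """Hypothetical reward: similarity above a baseline, clipped at zero.

    In a real system both embeddings would come from a pre-trained
    video-language model; here they are plain lists of floats.
    """
    similarity = cosine_similarity(video_embedding, text_embedding)
    return max(0.0, similarity - baseline)


# Toy embeddings standing in for model outputs (assumed values).
task_text = [1.0, 0.0, 0.0]      # embedding of the task prompt
on_task_clip = [0.9, 0.1, 0.0]   # clip that mostly matches the prompt
off_task_clip = [0.0, 1.0, 0.0]  # clip unrelated to the prompt

reward_on = language_reward(on_task_clip, task_text)
reward_off = language_reward(off_task_clip, task_text)
print(reward_on > reward_off)
```

The point of the sketch is that the reward comes from how well recent behavior matches the language prompt, so no task-specific shaping terms need to be written by hand.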

Technical Highlights

Simulation Platform and Task Benchmarking: The work extends the functionality of existing platforms such as MineRL, significantly increasing the number and variety of tasks available within Minecraft. The tasks include long-horizon ones that take many steps to solve and cannot easily be automated, such as architectural building challenges.
Leveraging Large-Scale Pre-Training: By adopting Transformer-based architectures and integrating knowledge from large pre-trained models, the proposed algorithm can efficiently translate the collective information into actionable insights for task accomplishment.
Evaluation Metrics: The introduction of automated evaluation metrics replaces human judgment in some tasks, particularly creative ones. These metrics are aligned with human evaluations, effectively converting subjective task assessments into quantifiable measures.
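
One common way to check that an automated metric is aligned with human evaluation is to treat the human judgments as reference labels and compute an agreement score such as F1. The sketch below illustrates that calculation only; the helper name and the toy success labels are assumptions, not data or code from the paper.

```python
from typing import Sequence


def agreement_f1(automated: Sequence[bool], human: Sequence[bool]) -> float:
    """F1 of automated success labels, using human labels as the reference."""
    if len(automated) != len(human):
        raise ValueError("label sequences must have the same length")
    true_pos = sum(1 for a, h in zip(automated, human) if a and h)
    false_pos = sum(1 for a, h in zip(automated, human) if a and not h)
    false_neg = sum(1 for a, h in zip(automated, human) if not a and h)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy per-episode success labels (assumed values for illustration).
automated_labels = [True, True, False, True, False]
human_labels = [True, False, False, True, True]

score = agreement_f1(automated_labels, human_labels)
print(round(score, 3))
```

A score near 1.0 would indicate that the automated metric marks the same episodes as successful that human raters do.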

Experimental Outcomes

The empirical results suggest that agents trained with the learned reward are competitive with agents trained on manually engineered reward functions, and that the learned models are particularly useful for evaluating creative tasks. For instance, tasks such as "Combat Spider" demonstrate the framework's capability to sustain robust performance. Additionally, the paper illustrates the framework's capacity for cross-task knowledge transfer and reports success-rate improvements of up to 73% over traditional reward engineering.

Implications and Future Directions

The paper's implications extend into several domains, primarily AI research on generalized learning agents in open-world settings. This framework encourages the development of algorithms that can scale across various domains by utilizing comprehensive, multi-modal datasets. It posits that open-ended problems benefit significantly from models informed by extensive shared human experience encoded through digital media.

Given the depth and breadth of the Minecraft data collected, future work could explore integrating more structured data formats from this knowledge base. Additionally, the proposed methodologies could be applied to other similar open-ended systems, prompting breakthroughs in more general AI learning strategies.

In conclusion, this framework marks a promising step towards realizing AI agents capable of adapting seamlessly across diverse domains, enriched by internet-scale datasets. It opens avenues for further research into scalable, efficient multi-task learning, potentially revolutionizing the methodology behind AI development.