Odyssey: Empowering Minecraft Agents with Open-World Skills (2407.15325v2)

Published 22 Jul 2024 in cs.AI

Abstract: Recent studies have delved into constructing generalist agents for open-world environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce Odyssey, a new framework that empowers LLM-based agents with open-world skills to explore the vast Minecraft world. Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new agent capability benchmark includes the long-term planning task, the dynamic-immediate planning task, and the autonomous exploration task. Extensive experiments demonstrate that the proposed Odyssey framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.


Empowering Agents with Open-World Skills

This paper introduces a framework for enhancing the capabilities of LLM-based agents in the open-world environment of Minecraft. It targets the challenge of building autonomous embodied agents that can carry out complex tasks in dynamic settings, a goal often framed as a step towards artificial general intelligence.

Framework Overview

The framework, termed Odyssey, comprises three core components:

Interactive Agent and Skill Library: A library of 40 primitive skills and 183 compositional skills that lets the agent interact and plan within the open world. Skills are categorized as operational or spatial, extending the agent's range of interactions beyond simple tasks.
Fine-Tuned LLaMA-3 Model: This model is fine-tuned on a question-answering dataset derived from the Minecraft Wiki, with over 390k instruction entries. The fine-tuning adapts LLaMA-3 to the Minecraft domain, with the aim of improving planning and decision-making in open-world environments.
Open-World Benchmark: A novel benchmark is introduced, consisting of long-term planning tasks, dynamic-immediate planning tasks, and autonomous exploration tasks. This benchmark evaluates the planning and exploration prowess of agents, offering a platform for measuring effectiveness in complex open-world challenges.
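
As a rough illustration of the first component, the sketch below models a skill library in which primitive skills carry an operational or spatial tag and compositional skills are built from skills that are already registered. The `Skill` and `SkillLibrary` classes, the skill names, and the choice to tag only primitive skills are assumptions made for illustration; none of it is taken from the Odyssey codebase.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    kind: str            # "primitive" or "compositional"
    category: str = ""   # "operational" or "spatial" (assumed to apply to primitives)
    uses: tuple = ()     # names of the skills a compositional skill calls


class SkillLibrary:
    def __init__(self):
        self._skills = {}

    def register(self, skill):
        # A compositional skill may only refer to skills already in the library.
        missing = [name for name in skill.uses if name not in self._skills]
        if missing:
            raise KeyError(f"{skill.name} uses unregistered skills: {missing}")
        self._skills[skill.name] = skill

    def get(self, name):
        return self._skills[name]

    def summarize(self):
        # Count registered skills per kind.
        return Counter(skill.kind for skill in self._skills.values())


# Hypothetical skill names, for illustration only.
library = SkillLibrary()
library.register(Skill("mineBlock", "primitive", "operational"))
library.register(Skill("craftItem", "primitive", "operational"))
library.register(Skill("exploreUntil", "primitive", "spatial"))
library.register(
    Skill("craftWoodenPickaxe", "compositional",
          uses=("exploreUntil", "mineBlock", "craftItem"))
)

summary = library.summarize()
print(dict(summary))
```

Registering compositional skills only after their building blocks keeps the library free of dangling references, which matters once higher-level skills are composed from other compositional skills.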

Experimental Insights

The experiments demonstrate the efficacy of the Odyssey framework across several dimensions. Key results include:

Enhanced success rates and efficiency in Minecraft-specific tasks compared to existing studies, including superior performance in basic programmatic tasks such as tool crafting and resource collection.
The fine-tuned MineMA models outperform the baseline LLaMA-3 models on both the multi-theme and Wiki-based multiple-choice question datasets, indicating the effectiveness of the domain-specific fine-tuning process.
In long-term planning tasks, especially those involving combat scenarios, the framework shows significant improvements in execution time and success rates. The introduction of iterative planning further enhances efficiency, as evidenced by the substantial reduction in execution time across multiple task completions.
In dynamic-immediate tasks, such as farming and animal husbandry, the MineMA-70B model outperforms its smaller counterparts, reflecting the influence of model size on adaptive task execution.
The framework demonstrates strong performance in autonomous exploration tasks, rivaling state-of-the-art methods that leverage more expensive resources like GPT-4.
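
Success rate and execution time, the two quantities most of these results are stated in, can be aggregated from per-trial records roughly as follows. This is only a sketch: the `Trial` record, the `summarize_trials` helper, the task name, and the sample values are made up for illustration and are not results or code from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    task: str
    succeeded: bool
    seconds: float


def summarize_trials(trials):
    """Return (success_rate, mean_seconds_over_successful_trials)."""
    if not trials:
        raise ValueError("at least one trial is required")
    successes = [t for t in trials if t.succeeded]
    rate = len(successes) / len(trials)
    # Average time is only meaningful over trials that finished the task.
    avg_seconds = mean([t.seconds for t in successes]) if successes else None
    return rate, avg_seconds


# Illustrative values only.
trials = [
    Trial("combat_zombie", True, 60.0),
    Trial("combat_zombie", True, 90.0),
    Trial("combat_zombie", False, 120.0),
    Trial("combat_zombie", True, 30.0),
]
rate, avg_seconds = summarize_trials(trials)
print(rate, avg_seconds)
```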

Theoretical and Practical Implications

The Odyssey framework has both theoretical and practical implications:

Theoretical Contribution: By integrating a library of diverse, open-world skills with a fine-tuned LLM, the framework addresses the bottleneck that narrow task and action definitions impose on agent training. The recursive skill execution method streamlines task decomposition and advances hierarchical planning and execution strategies.
Practical Impact: The public availability of the dataset, model weights, and code establishes a platform for future research, enabling advancements in the development of autonomous agents. The benchmarking suite sets a new standard for evaluating AI agents in open-world environments, promoting broader adoption and further refinement of LLM-based frameworks.
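
One way to picture recursive skill execution: before running a skill, check its prerequisites against the inventory and recursively run whichever skills produce the missing items. The following is a minimal sketch under that reading. The `REQUIRES` table, the item names, and the `obtain` function are hypothetical and heavily simplified (items are never consumed and quantities are ignored); this is not the authors' implementation.

```python
# Hypothetical prerequisite table: item -> items that must be held first.
REQUIRES = {
    "log": [],
    "planks": ["log"],
    "stick": ["planks"],
    "crafting_table": ["planks"],
    "wooden_pickaxe": ["planks", "stick", "crafting_table"],
    "cobblestone": ["wooden_pickaxe"],
}


def obtain(item, inventory, trace, _in_progress=None):
    """Ensure `item` is in `inventory`, recursively obtaining prerequisites."""
    in_progress = _in_progress if _in_progress is not None else set()
    if item in inventory:
        return  # already held, nothing to do
    if item not in REQUIRES:
        raise KeyError(f"no skill known for {item!r}")
    if item in in_progress:
        raise ValueError(f"cyclic prerequisite on {item!r}")
    in_progress.add(item)
    for needed in REQUIRES[item]:
        obtain(needed, inventory, trace, in_progress)
    in_progress.discard(item)
    inventory.add(item)   # stands in for actually executing the skill
    trace.append(item)    # record the order in which skills ran


inventory, trace = set(), []
obtain("cobblestone", inventory, trace)
print(trace)
```

Starting from an empty inventory, the trace lists every intermediate item in dependency order; starting with a pickaxe already in hand, only the final step runs, which is the sense in which recursion handles task decomposition without a hand-written plan.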

Future Directions

The authors acknowledge limitations around the potential for hallucinations in open-source LLMs that affect performance consistency. Future research might focus on enhancing the retrieval-augmented generation strategies to mitigate this issue. Additionally, integrating visual processing capabilities to expand the framework's applicability to tasks requiring visual input could be a promising avenue.

Overall, this paper presents a comprehensive approach to equipping LLM-based agents with the skills necessary to thrive in open-world environments. It is a foundational step towards developing more robust and versatile autonomous agents in AI research.