STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft (2406.11247v1)

Published 17 Jun 2024 in cs.CV

Abstract: Building an embodied agent system with an LLM as its core is a promising direction. Because deploying and training such agents in the real world carries significant costs and uncontrollable factors, we begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment as well as more challenging tasks such as navigation and even creative tasks, with an efficiency exceeding previous state-of-the-art methods by a factor of $2.5\times$ to $7.3\times$. We start with a vanilla LLM and augment it with a vision encoder and an action codebase trained on our collected high-quality dataset, STEVE-21K. We then enhance it with a Critic and memory to turn it into a complex system. Finally, we construct a hierarchical multi-agent system. Our recent work explores how to prune the agent system through knowledge distillation. In the future, we will explore more potential applications of STEVE agents in the real world.


Overview of the STEVE Series: Multi-Modal Agent Systems in Minecraft

The paper "STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft" describes the development of embodied agent systems built around LLMs and evaluated in Minecraft. The authors argue for designing such agent architectures in a controlled virtual environment before moving to real-world applications, because real-world deployment and training are costly and subject to uncontrollable factors.

Research Methods and Structures

The authors build the STEVE series step by step, adding capabilities with each iteration. At the core is STEVE-13B, a fine-tuned multi-modal LLM paired with a visual encoder and trained on the purpose-built STEVE-21K dataset, which contains vision-environment pairs and skill-code mappings. Successive iterations extend the agents from basic command execution to navigation and creative tasks in Minecraft's open world.
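The components named in the abstract (a planner LLM, an action codebase of skills, a Critic, and memory) can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class `ToyAgent`, the skill functions, and the fixed plan lookup are all hypothetical stand-ins, and no model or Minecraft API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]


def collect_log(state: State) -> State:
    # Hypothetical skill: add one log to the inventory.
    state["log"] = state.get("log", 0) + 1
    return state


def craft_planks(state: State) -> State:
    # Hypothetical skill: turn one log into four planks, if a log is available.
    if state.get("log", 0) > 0:
        state["log"] -= 1
        state["planks"] = state.get("planks", 0) + 4
    return state


@dataclass
class ToyAgent:
    # Stand-in for the action codebase: skill name -> executable function.
    skills: Dict[str, Callable[[State], State]]
    # Stand-in for memory: a log of step outcomes.
    memory: List[str] = field(default_factory=list)

    def plan(self, task: str) -> List[str]:
        # Stand-in for the LLM planner: a fixed lookup instead of a model call.
        plans = {"craft_planks": ["collect_log", "craft_planks"]}
        return plans.get(task, [])

    def critic(self, before: State, after: State) -> bool:
        # Stand-in for the Critic: a step counts as successful if the state changed.
        return before != after

    def run(self, task: str, state: State) -> State:
        for step in self.plan(task):
            before = dict(state)
            state = self.skills[step](dict(state))
            if not self.critic(before, state):
                self.memory.append(f"failed:{step}")
                break
            self.memory.append(f"ok:{step}")
        return state


agent = ToyAgent(skills={"collect_log": collect_log, "craft_planks": craft_planks})
final_state = agent.run("craft_planks", {})
print(final_state, agent.memory)
```

Keeping the planner, skill execution, and critic as separate methods mirrors the staged construction the summary describes, where each component is added to the system in a later iteration.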

Key Findings and Numerical Results

The paper reports quantitative gains throughout. STEVE-1.5 introduces a hierarchical multi-agent framework that improves task coordination and supports distributed problem-solving. Compared with existing agent systems such as AutoGPT and Voyager, the STEVE series is 2.5× to 7.3× more efficient on tasks ranging from basic skills to complex navigation and creation. On knowledge question answering and technical mastery, STEVE-13B is reported to outperform other models, including LLaMA2 and GPT-4.
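An efficiency factor such as "2.5×" is typically a ratio of baseline cost to agent cost. The sketch below assumes cost is measured in iterations needed to finish a task; that choice of unit, the function name `efficiency_factor`, and the input numbers are illustrative assumptions, not figures or definitions taken from the paper.

```python
def efficiency_factor(baseline_cost: float, agent_cost: float) -> float:
    """Return how many times cheaper the agent is than the baseline."""
    if baseline_cost <= 0 or agent_cost <= 0:
        raise ValueError("costs must be positive")
    return baseline_cost / agent_cost


# Made-up costs, chosen only to show the arithmetic.
low = efficiency_factor(baseline_cost=50, agent_cost=20)
high = efficiency_factor(baseline_cost=73, agent_cost=10)
print(f"{low}x to {high}x")
```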

Advancements in Agent Systems

STEVE-2 adds stronger multi-modal abilities, allowing a wider range of interactions with the environment. It records a 91% success rate on the "Goal Search" navigation metric, outperforming its predecessors and competitors, along with a 19× improvement in material collection. These gains come from refined planning and execution strategies and from a knowledge distillation process that compresses multi-agent strategies into a single model, which simplifies operation while improving performance.
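The summary does not spell out how the distillation is performed. As general background only, a common form of knowledge distillation trains a student to match a teacher's temperature-softened output distribution by minimizing a KL divergence. The sketch below shows that generic loss in plain Python; it is not claimed to be the procedure used for STEVE-2, and the function names and logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    # Numerically stable softmax over temperature-scaled logits.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over three candidate actions.
matched = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
mismatched = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
print(matched, mismatched)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a single-model student would be trained to reduce.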

Implications and Future Directions

The STEVE series lays groundwork for applying agent systems in dynamic, real-world settings. Its progression from basic task execution to more autonomous decision-making offers insight into how multi-modal agent architectures can be built up incrementally. The authors plan to adapt these agents to environments beyond Minecraft, where embodied systems must handle less controlled, real-world conditions.

Conclusion

The paper demonstrates a systematic, data-driven approach to building LLM-based agent systems in a virtual environment. The STEVE series shows the gains available from a well-curated dataset combined with staged integration of vision, memory, critique, and multi-agent coordination. Its empirical results suggest that multi-modal agents could be extended beyond games to real-world applications in which autonomous agents must operate reliably under complexity and variability.


Authors (10)

Zhonghan Zhao (11 papers)
Wenhao Chai (50 papers)
Xuan Wang (205 papers)
Ke Ma (75 papers)
Kewei Chen (13 papers)
Dongxu Guo (5 papers)
Tian Ye (65 papers)
Yanting Zhang (26 papers)
Hongwei Wang (150 papers)
Gaoang Wang (68 papers)
