
Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning (2504.17950v1)

Published 24 Apr 2025 in cs.MA and cs.CL

Abstract: Collaboration is ubiquitous and essential in day-to-day life -- from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open-world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state-of-the-art agents is efficient natural language communication, with agent performance dropping as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill-optimized for multi-agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in-context and imitation learning. Our website can be found here: https://mindcraft-minecollab.github.io/

Summary

  • The paper introduces the open-source MINDcraft platform and MineCollab benchmark for studying multi-agent collaborative embodied reasoning with LLMs.
  • It details a structured framework with 47 high-level commands that enable LLM agents to execute coordinated actions and communicate effectively in Minecraft.
  • Experimental findings reveal that increasing task and collaboration complexity challenges LLMs in long-term planning and resource management.

This paper introduces MINDcraft, a versatile open-source platform built on Minecraft for researching embodied AI, specifically focusing on multi-agent collaboration. It also presents MineCollab, a benchmark dataset and task suite designed to evaluate the collaborative and embodied reasoning abilities of LLM agents in this environment.

The core idea is to study how LLMs can coordinate actions, communicate effectively, and reason about embodied tasks in a shared, dynamic world.

The MINDcraft Platform: A Foundation for Embodied Collaboration

MINDcraft provides a structured environment for deploying and testing LLM agents controlling characters within the open-world game of Minecraft. Its design facilitates plug-and-play experimentation with different agent architectures and communication strategies.

Environment and Interaction:

Minecraft is chosen for its complexity, open-ended nature, and rich possibilities for interaction, making it suitable for long-horizon tasks and embodied reasoning. Agents operate in a partially observable world, requiring them to actively query the environment for information using specific commands. This tool-calling approach reduces the need to process noisy, high-dimensional observations (like raw pixels, though vision inputs are partially supported) and keeps context lengths manageable.

Action Space:

Instead of low-level mouse/keyboard actions or even intermediate Mineflayer API calls, MINDcraft provides a set of 47 high-level, parameterized tools that LLMs can directly invoke. This abstraction allows agents to reason over more complex sequences of actions relevant to task goals. Examples of these high-level commands include:

  • !givePlayer("playername", "item name", quantity): Transfer items to another player.
  • !craftRecipe("recipe name", quantity): Attempt to craft an item.
  • !searchForBlock("block type", search range): Find and navigate to a block.
  • !goToCoordinates(x, y, z, closeness): Navigate to a specific location.
  • !checkBlueprintLevel(levelNum): Check the completion status of a specific level in a building blueprint.
  • !startConversation("playername", "message"): Initiate or send a message in a conversation with another agent.

For tasks requiring custom behavior, a dedicated tool allows agents to output raw Mineflayer JavaScript code.
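To make the tool-calling interface above concrete, the following is a minimal, hypothetical sketch (in Python, for illustration; the platform itself is JavaScript-based) of how `!command(args)` invocations might be extracted from an LLM's free-form text output. The regex and argument handling are assumptions, not the platform's actual parser:

```python
import re

# Hypothetical parser for MINDcraft-style tool calls such as
# !craftRecipe("oak_planks", 4) embedded in an LLM's text output.
TOOL_CALL = re.compile(r'!(\w+)\(([^)]*)\)')

def parse_tool_calls(llm_output: str):
    """Extract (command, args) pairs from free-form model text."""
    calls = []
    for name, raw_args in TOOL_CALL.findall(llm_output):
        args = []
        for token in raw_args.split(','):
            token = token.strip()
            if not token:
                continue
            if token.startswith('"') and token.endswith('"'):
                args.append(token[1:-1])  # quoted string argument
            else:
                # numeric argument (coordinates, quantities, ranges)
                args.append(float(token) if '.' in token else int(token))
        calls.append((name, args))
    return calls

print(parse_tool_calls('I will make planks first. !craftRecipe("oak_planks", 4)'))
# [('craftRecipe', ['oak_planks', 4])]
```

A real implementation would also need to handle quoted strings containing commas and to reject malformed calls; the sketch keeps only the core idea of mapping model text onto a fixed command vocabulary.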

Agent Architecture:

The platform consists of four main components:

  1. A server to manage agent instances and the Minecraft world.
  2. A main agent loop to process messages, environment feedback, and agent decisions.
  3. A library implementing the high-level action commands and observation queries.
  4. A layer for interfacing with various LLMs via prompting and API calls.

Additional modules support features like custom code generation, default behaviors, self-guided play, and inter-agent dialogue management. The architecture is designed to support agents by providing robust tools, allowing researchers to focus on the LLM's reasoning and collaboration capabilities. Retrieval Augmented Generation (RAG) is used with few-shot examples via embedding similarity to the current conversation to improve agent performance.
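The retrieval step described above, selecting few-shot examples by embedding similarity to the current conversation, can be sketched as follows. This is an illustrative toy version: a bag-of-words counter stands in for a real embedding model, and the function names are assumptions:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(conversation: str, example_pool: list, k: int = 2):
    """Return the k pool examples most similar to the current conversation."""
    q = embed(conversation)
    ranked = sorted(example_pool, key=lambda ex: cosine(q, embed(ex)), reverse=True)
    return ranked[:k]

pool = [
    "user asks to craft a cake; agent gathers wheat, eggs, milk",
    "user asks to build a stone wall; agent places cobblestone",
    "user asks to smelt iron; agent uses the furnace",
]
print(retrieve_examples("please craft me a cake", pool, k=1))
```

In the actual platform a learned embedding model would replace the bag-of-words counter, but the ranking-by-similarity structure is the same.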

Multi-agent Collaboration Implementation:

Collaboration is managed through a conversation manager. Agents can initiate (!startConversation) and end (!endConversation) conversations with other agents. Communication is pairwise, but the framework can scale to multiple agents by switching between conversations. The manager also controls conversation pace, pausing or slowing down dialogue when agents are busy executing actions in the environment.
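A minimal sketch of this pairwise conversation management might look like the following. The class and field names (`ConversationManager`, `busy`, `queued`) are illustrative assumptions, not the platform's actual API:

```python
class ConversationManager:
    """Toy pairwise conversation manager: one partner per agent at a time,
    with messages held back while the recipient is busy acting."""

    def __init__(self):
        self.active = {}   # agent -> current conversation partner
        self.busy = set()  # agents currently executing an environment action
        self.queued = []   # (recipient, message) pairs held until free

    def start_conversation(self, a: str, b: str):
        self.active[a] = b
        self.active[b] = a

    def end_conversation(self, a: str):
        b = self.active.pop(a, None)
        if b is not None:
            self.active.pop(b, None)

    def send(self, sender: str, message: str):
        recipient = self.active.get(sender)
        if recipient is None:
            raise ValueError(f"{sender} has no active conversation")
        if recipient in self.busy:
            # Pause the dialogue: hold the message until the recipient is free.
            self.queued.append((recipient, message))
            return None
        return (recipient, message)

mgr = ConversationManager()
mgr.start_conversation("alice", "bob")
mgr.busy.add("bob")
assert mgr.send("alice", "I found the wheat") is None  # bob is busy -> queued
```

Scaling beyond two agents then amounts to each agent switching its `active` partner between turns, as the text describes.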

MineCollab: A Benchmark for Collaborative Embodied Tasks

MineCollab is a benchmark designed to test multi-agent collaboration in embodied scenarios. It comprises three task domains requiring agents to coordinate, communicate, and plan:

  1. Cooking Tasks: Agents must collect ingredients and prepare meals (e.g., cake, rabbit stew) by coordinating resource collection, sharing ingredients, and using crafting stations (furnace, smoker, crafting table). A "Hell's Kitchen" variant requires agents to share incomplete recipe knowledge, forcing explicit communication of multi-step plans.
  2. Crafting Tasks: Agents craft items (tools, furniture) from mined or collected materials. Tasks require agents to share inventory information, exchange resources, and potentially communicate complex crafting recipes, especially when recipe knowledge is not universally available. An example task requires crafting a bookshelf, where agents may need to coordinate obtaining and sharing wood, leather, and paper [Appendix \ref{appendix : example_task}].
  3. Construction Tasks: Agents build structures based on procedurally generated blueprints. Complexity can be varied by changing the number of unique materials or rooms. Collaboration is essential as agents may specialize in handling certain materials or sections of the blueprint and must coordinate placement to avoid undoing each other's work. Task success is measured by an edit distance metric comparing the constructed building to the blueprint.
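The construction metric is described as an edit distance between the built structure and the blueprint. One simple instantiation, assumed here for illustration rather than taken from the paper, scores the fraction of blueprint cells whose block type matches what was actually placed:

```python
def blueprint_score(blueprint, built):
    """Fraction of blueprint cells whose block type matches the built
    structure (a toy stand-in for the paper's edit-distance metric)."""
    cells = [(i, j) for i, row in enumerate(blueprint) for j, _ in enumerate(row)]
    matches = sum(1 for i, j in cells if built[i][j] == blueprint[i][j])
    return matches / len(cells)

blueprint = [["stone", "stone"], ["oak_planks", "air"]]
built     = [["stone", "dirt"],  ["oak_planks", "air"]]
print(blueprint_score(blueprint, built))  # 0.75
```

A full 3D implementation would iterate over a voxel grid and could weight insertions and deletions separately, but the comparison-against-blueprint structure is the same.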

Tasks are procedurally generated and split into distinct train and test sets to avoid data leakage. Item divisions for cooking tasks between train and test sets are detailed in Appendix \ref{appendix:train-test-divide}. Task validation involves checking inventory for crafted/cooked items or assessing blueprint completion percentage.

SFT Dataset Creation:

The platform includes tools to generate Supervised Fine-Tuning (SFT) data. An oracle agent (in the paper, LLaMA-3.3-70B-Instruct) performs tasks on the training set, and successful trajectories are collected as training examples. This dataset is used to fine-tune smaller models, demonstrating that performance comparable to larger models can be achieved, improving benchmark accessibility [Table \ref{tab: full_results}, \ref{tab:dataset-table}].
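The collection step described above can be sketched as a simple filter: run the oracle on training tasks, discard failed episodes, and flatten the successful ones into (prompt, completion) pairs. The trajectory field names (`success`, `turns`, `context`, `action`) are illustrative assumptions:

```python
def build_sft_dataset(trajectories):
    """Keep only successful oracle trajectories and flatten them into
    (prompt, completion) training pairs. Field names are hypothetical."""
    examples = []
    for traj in trajectories:
        if not traj["success"]:
            continue  # discard failed episodes
        for turn in traj["turns"]:
            examples.append({"prompt": turn["context"],
                             "completion": turn["action"]})
    return examples

trajs = [
    {"success": True,
     "turns": [{"context": "craft cake", "action": '!craftRecipe("cake", 1)'}]},
    {"success": False,
     "turns": [{"context": "build wall", "action": '!newAction("...")'}]},
]
print(len(build_sft_dataset(trajs)))  # 1
```

Filtering on success is the key design choice: it turns imitation learning on an imperfect oracle into imitation of only its successful behavior.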

Experiments and Findings

The paper evaluates several state-of-the-art LLMs (GPT-4o, Claude 3.5 Sonnet, LLaMA3-70B/8B-Instruct) on MineCollab.

Impact of Embodied Task Complexity:

Performance generally decreases as task complexity increases, whether measured by the number of unique materials or rooms in construction tasks [Figure \ref{fig:construction-materials}, \ref{fig:construction-room}, Table \ref{tab:construction-success-rates}]. Qualitative analysis revealed that agents often struggle with longer horizons, sometimes undoing previous progress, highlighting limitations in long-term planning and memory in embodied contexts [Appendix \ref{appendix: construction-task-fail}].

Impact of Collaborative Complexity:

Increasing the number of collaborating agents from two to five significantly decreases performance across all tested LLMs on cooking and crafting tasks [Figure \ref{fig:num-agents-cooking}, \ref{fig:num-agents-crafting}]. While theoretically more agents should parallelize work, the added coordination load and communication overhead overwhelm current models. Agents struggle with avoiding redundant work and effectively managing shared resources or spaces.

Requiring agents to explicitly communicate detailed multi-step plans (e.g., how to craft an item when the recipe is not directly available to them) drastically reduced task success rates, by as much as 15% [Figure \ref{fig:plan-cooking}, \ref{fig:plan-crafting}, Table \ref{tab:model-performance}, \ref{tab:model-success-rates}]. This is attributed to LLMs not being well-optimized for the complex natural language communication necessary for effective coordination and knowledge transfer. Failure modes include misinterpreting plans, not requesting necessary information, and failing to act on communicated plans [Appendix \ref{appendix: partial-plan-failure}].

Ablation studies on prompting demonstrated the criticality of components like summarized memory and few-shot examples for agent performance, particularly in enabling progression over interaction turns [Table \ref{tab:agent-ablations}].

Practical Implications

The research highlights that while LLMs show promise in embodied tasks, significant challenges remain in multi-agent settings. Current models struggle with:

  • Robust Communication: Effectively communicating complex plans and status updates via natural language is a major bottleneck. Agents fail to efficiently share information, leading to misunderstandings and coordination failures.
  • Coordination and Resource Management: As the number of agents increases, the complexity of coordinating actions, managing shared resources, and avoiding interference becomes overwhelming.
  • Long-Term Embodied Planning: Agents exhibit difficulties in maintaining consistent progress over long horizons, sometimes destroying work already done by collaborators.

These findings suggest that improving LLM agents for collaborative embodied tasks requires methods beyond standard prompting and fine-tuning, potentially involving novel architectures or training techniques specifically designed for multi-agent interaction and embodied reasoning in dynamic environments. The MINDcraft platform and MineCollab benchmark provide a valuable testbed for developing and evaluating such methods.
