Experience Transfer for Multimodal LLM Agents in Minecraft Game

Published 7 Apr 2026 in cs.AI | (2604.05533v1)

Abstract: Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five dimensions: structure, attribute, process, function, and interaction. This formulation allows the agent to identify recurring patterns shared across different tasks and infer what prior experience remains applicable in new situations. Building on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to retrieve relevant experiences and adapt them to unseen tasks through contextual examples. Experiments in Minecraft show that, under a from-scratch learning setting, Echo achieves a 1.3x to 1.7x speed-up on object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval after acquiring transferable experience. These results suggest that experience transfer is a promising direction for improving the efficiency and adaptability of multimodal LLM agents in complex interactive environments.


Authors (10)

Summary

The paper introduces Echo, a novel framework that decomposes transferable experience into five explicit, interpretable axes for efficient analogical reasoning.
It leverages a Contextual State Descriptor (CSD) schema to structure episodic memory, enabling fast retrieval and verification of multi-dimensional task patterns.
Experimental results demonstrate a 1.3×–1.7× speed-up in task unlocking and superior continual learning compared to baseline agents.

Introduction and Motivation

The paper "Experience Transfer for Multimodal LLM Agents in Minecraft Game" (2604.05533) addresses the challenges of efficient knowledge reuse and experience transfer for multimodal, planning-oriented LLM agents operating in open-ended, interactive environments like Minecraft. Unlike previous agents that treat episodic memory as a static repository, the proposed Echo framework treats memory as an active resource for analogical learning—enabling agents to proactively discover, adapt, and apply transferable knowledge across structurally diverse and causally complex tasks.

A primary challenge in this domain is the semantic and causal diversity of real-world state transitions, which limits the generalization of agents that lack explicit memory structuring. Additionally, existing multimodal LLMs (MLLMs) are prone to hallucinations, which degrade control stability and verification in open domains (Figure 1).

Figure 1: Motivation and problem framework demonstrating the Structured In-Context Learning approach and the limitations of conventional MLLM agents.

To address these challenges, Echo introduces five explicit transfer axes (Structural, Attribute, Procedural, Functional, and Interaction) for systematic organization and analogical retrieval of experiences, all operationalized through a Contextual State Descriptor (CSD) schema (Figure 2). This enables interpretable, semantically aligned transfer that extends beyond index-based or passive recall mechanisms.

Figure 2: Overview of the CSD schema, defining the five semantic axes for structured memory representation.

Explicit Transfer Axes and CSD Schema

Echo’s central contribution is the decomposition of transferable experience into five explicit and interpretable axes:

Structural: Encodes spatial and hierarchical organization.
Attribute: Captures physical and observable properties of entities.
Procedural: Represents causality and transformation rules.
Functional: Encodes affordances and task-related capabilities.
Interaction: Models perception-action feedback and agent-environment coupling.

These axes are unified within the CSD schema: each CSD instance compactly encodes semantic content along the five axes together with global embeddings. This design supports fast multi-dimensional retrieval and facilitates analogical reasoning for new tasks.

CSDs are maintained in a continuously updated memory bank, supporting both symbolic and vectorized representations for joint interpretability and computational efficiency.
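To make the schema concrete, the following is a minimal sketch of how a CSD entry and its memory bank might be represented. The summary does not give the actual field layout, so the class names `CSD` and `MemoryBank`, the field names, and the choice to key the bank by task name are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five transfer axes named in the text.
AXES = ("structural", "attribute", "procedural", "functional", "interaction")


@dataclass
class CSD:
    """Hypothetical Contextual State Descriptor (field layout is assumed)."""
    task: str
    axes: Dict[str, str]  # symbolic description for each of the five axes
    embedding: List[float] = field(default_factory=list)  # global vector for retrieval

    def __post_init__(self) -> None:
        # Reject descriptors that do not cover every axis.
        missing = [axis for axis in AXES if axis not in self.axes]
        if missing:
            raise ValueError(f"missing axes: {missing}")


class MemoryBank:
    """Hypothetical store holding one CSD per task, updated in place."""

    def __init__(self) -> None:
        self._items: Dict[str, CSD] = {}

    def upsert(self, csd: CSD) -> None:
        # A newer descriptor for the same task replaces the older one.
        self._items[csd.task] = csd

    def get(self, task: str) -> CSD:
        return self._items[task]

    def __len__(self) -> int:
        return len(self._items)
```

Keeping the symbolic axis descriptions next to the vector mirrors the dual symbolic and vectorized representation described above: the text fields stay human-readable, while the embedding is what a retrieval step would compare.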

Structured In-Context Analogy Learning (ICAL)

To leverage the structured memory, Echo employs an analogical learning framework based on in-context learning (ICL). The framework proceeds in four steps (Figure 3):

Task Selection: Identify a representative, previously solved task and extract its CSD.
Example Retrieval: Using multi-axis similarity, retrieve the top-$K$ semantically relevant experiences from memory.
Context Construction: Compose an ICL prompt with exemplars, embedding explicit structural information via CSDs.
Induction and Execution: The MLLM induces an action sequence for the new task from the exemplars; the sequence is verified and, if successful, consolidated into long-term memory.

This approach promotes autonomous discovery of transferable patterns and supports rapid adaptation, as exemplified in case studies on analogical transfer between wooden and stone pickaxe crafting (Figure 4).
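The retrieval and context-construction steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes each stored experience carries one vector per axis, scores candidates by a weighted mean of per-axis cosine similarities, and formats exemplars as plain text. The function names, the weighting scheme, and the prompt layout are all hypothetical.

```python
import math

AXES = ("structural", "attribute", "procedural", "functional", "interaction")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def multi_axis_similarity(query, candidate, weights=None):
    """Weighted mean of per-axis cosine similarities (the weighting is assumed)."""
    weights = weights or {axis: 1.0 for axis in AXES}
    total = sum(weights[axis] for axis in AXES)
    score = sum(weights[axis] * cosine(query[axis], candidate[axis]) for axis in AXES)
    return score / total


def retrieve_top_k(query, memory, k=2):
    """Return the names of the k stored experiences most similar to the query."""
    ranked = sorted(
        memory,
        key=lambda name: multi_axis_similarity(query, memory[name]["vectors"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(new_task, exemplar_names, memory):
    """Compose a plain-text in-context prompt from the retrieved exemplars."""
    lines = []
    for name in exemplar_names:
        lines.append(f"Example task: {name}")
        lines.append(f"Plan: {memory[name]['plan']}")
    lines.append(f"New task: {new_task}")
    lines.append("Plan:")
    return "\n".join(lines)
```

In this sketch, `memory` is a dictionary mapping a task name to a record with a `"vectors"` entry (one vector per axis) and a `"plan"` string. Setting one axis weight to zero removes that axis from the score, which is one simple way to probe how much each axis contributes to retrieval.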

Figure 3: ICL-based analogical learning workflow utilizing the CSD memory bank for pattern-driven transfer.

Figure 4: Transferring procedural knowledge from crafting a wooden pickaxe to a stone pickaxe by leveraging functional and procedural axes.

Iterative Transfer and System Architecture

Echo’s iterative reasoning loop comprises perception, memory retrieval, hierarchical planning, self-verification, execution, and structured memory updating (Figure 5). The framework keeps both symbolic task graphs and vectorized semantic descriptors synchronized for robust case-based transfer.

Figure 5: Overall agent framework integrating perception, retrieval, planning, verification, and memory update phases with multi-axis CSD representations.
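The loop described above can be outlined in a few lines. The sketch below is only a skeleton under assumptions: each phase is passed in as a caller-supplied function, hierarchical planning is collapsed into a single `plan` call, memory is a plain list, and the retry limit `max_attempts` is invented for illustration. None of these names come from the paper.

```python
def run_episode(task, perceive, retrieve, plan, verify, execute, memory, max_attempts=3):
    """Run perceive -> retrieve -> plan -> verify -> execute -> memory update.

    Returns the 1-based attempt number on which the task succeeded,
    or None if every attempt failed. All callables are hypothetical stand-ins.
    """
    for attempt in range(1, max_attempts + 1):
        observation = perceive()                          # perception
        exemplars = retrieve(task, observation, memory)   # memory retrieval
        candidate = plan(task, observation, exemplars)    # planning
        if not verify(candidate, observation):            # self-verification
            continue                                      # rejected plan: try again
        if execute(candidate):                            # execution
            # Structured memory update: store only verified, successful plans.
            memory.append({"task": task, "plan": candidate})
            return attempt
    return None
```

Only plans that pass verification and succeed on execution are written back, which matches the consolidation condition described for ICAL; a rejected or failed plan simply triggers another pass through the loop.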

Experimental Results

Rapid Experience Transfer and Chain-Unlocking

Empirical evaluation in learning-from-scratch scenarios in Minecraft shows that Echo achieves a 1.3×–1.7× speed-up in item unlocking compared to advanced baselines (MP5, Voyager, JARVIS-1, MrSteve). After a brief cold start, Echo exhibits a "burst-like" unlocking phenomenon, unlocking groups of structurally or functionally analogous items in rapid succession (Figure 6).

Figure 6: Echo exhibits a rapid unlocking phase, consistently outpacing all baselines following cold start.
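As a reading aid, the reported range can be translated into relative time, assuming speed-up is defined as the ratio of the time (or number of environment steps) a baseline needs to reach an unlocking milestone to the time Echo needs. The summary does not state the exact definition, and the symbols $S$, $T_{\text{base}}$, and $T_{\text{Echo}}$ are introduced here for illustration only.

$$
S = \frac{T_{\text{base}}}{T_{\text{Echo}}}, \qquad S \in [1.3,\ 1.7] \;\Longrightarrow\; \frac{T_{\text{Echo}}}{T_{\text{base}}} = \frac{1}{S} \in [0.59,\ 0.77] \ \text{(approximately)}
$$

Under this assumption, Echo would need roughly 59% to 77% of a baseline's time to reach the same milestone.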

Component Analysis and Ablation

The ablation study (Figure 7) indicates that each transfer axis is critical: removing any single axis causes significant drops in the associated task families, with the procedural and attribute axes exerting the strongest influence on long-horizon and recipe-based tasks, respectively. Multi-axis alignment outperforms holistic similarity approaches, which supports the importance of explicit representation.

Figure 7: Task performance sensitivity to individual transfer axes, showing critical contributions in multi-step and functionally diverse scenarios.

Continual and Few-Shot Learning

In continual learning (Figure 8), Echo exhibits rapid mid-phase adaptation and ultimately surpasses all baselines in success rate at convergence. Few-shot experiments show that even 1–2 in-context exemplars yield competitive generalization, with diminishing returns beyond 4 exemplars in most scenarios.

Figure 8: Continual learning curves demonstrate Echo’s accelerated adaptation and higher final performance compared to the baselines.

Theoretical and Practical Implications

Echo demonstrates that explicit, interpretable memory structuring along semantic axes enables efficient analogical reasoning and experience transfer in complex interactive environments. The use of structured ICAL provides both performance gains and fine-grained interpretability, clarifying the causal factors behind transfer success and failure.

Practically, this framework suggests pathways for embodied agents capable of robust cross-task and cross-domain generalization without requiring frequent retraining or manual curation of skill libraries. Theoretically, the explicit decomposition provides a model for analyzing the compositional and analogical reasoning capabilities of large multimodal models.

However, current limitations include reduced exploratory capabilities, initial learning latency, and potential difficulty in carrying the transfer mechanism over from the idealized Minecraft setting to real-world physical environments, which exhibit greater causal ambiguity and sensory noise.

Conclusion

This work establishes a methodology for explicit, structured experience transfer in multimodal LLM agents, validated in the Minecraft environment. By decomposing transferable knowledge into interpretable semantic axes and leveraging ICL-based analogy, Echo demonstrates significant improvements in learning efficiency, generalization, and interpretability over established baselines. While real-world applicability remains an open challenge, the transfer-oriented paradigm introduced here provides a robust foundation for future research into explainable, continually learning, embodied AI systems.