
EchoVLA: Memory-Based Mobile Manipulation

Updated 29 November 2025
  • EchoVLA is a robotic vision-language-action model that uses declarative scene and episodic memory to coordinate navigation and manipulation over extended horizons.
  • Its architecture employs independent memory modules and a two-stage attention-based fusion to integrate spatial, temporal, and semantic cues for precise policy control.
  • Experimental evaluations, including on the MoMani benchmark, demonstrate enhanced success rates and longer trajectory horizons, validating its effectiveness in complex mobile tasks.

EchoVLA is a robotic Vision-Language-Action (VLA) model that introduces synergistic declarative memory mechanisms to address the challenges of long-horizon mobile manipulation. Unlike previous VLA agents, which are restricted to short-horizon and table-top scenarios, EchoVLA enables embodied agents to coordinate navigation and manipulation over extended temporal and spatial contexts through specialized memory architectures and attention-based fusion for policy control (Lin et al., 22 Nov 2025).

1. Architectural Overview

EchoVLA is distinguished by its memory-supported framework for long-horizon mobile manipulation. The core architecture integrates two principal components of declarative memory:

  • Scene Memory: Retains collections of spatial-semantic maps that encode the evolving physical layout and semantic labels of the environment.
  • Episodic Memory: Stores temporally indexed task-level experiences, capturing multimodal contextual features derived from the agent's interaction history, observations, and instructions.

These memories are independently stored, updated, and queried according to the agent's current observations, recent task history, and high-level language instructions. The representations retrieved from both scene and episodic memory are then fused through a hierarchy of coarse- and fine-grained attention mechanisms. This fusion generates the context encoding that informs the agent's mobile-arm diffusion policy—a stochastic control policy that governs navigation and manipulation across the embodied platform.
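As a rough illustration of this data flow, the following Python sketch shows one plausible control loop built from the components named above. All module names and interfaces (SceneMemory, EpisodicMemory, TwoStageFusion, MobileArmDiffusionPolicy) are hypothetical stand-ins for exposition, not the authors' released code.

```python
# Hypothetical control loop mirroring the data flow described above.
# The four injected modules are placeholders for the components named in the text.

class EchoVLAStyleAgent:
    def __init__(self, scene_memory, episodic_memory, fusion, policy):
        self.scene_memory = scene_memory        # spatial-semantic maps
        self.episodic_memory = episodic_memory  # temporally indexed task experiences
        self.fusion = fusion                    # coarse- and fine-grained attention
        self.policy = policy                    # mobile-arm diffusion policy

    def act(self, observation, instruction):
        # 1. Update both declarative memories with the newest context.
        self.scene_memory.update(observation)
        self.episodic_memory.update(observation, instruction)

        # 2. Query each memory independently, conditioned on the current
        #    observation and the high-level language instruction.
        scene_feats = self.scene_memory.retrieve(observation, instruction)
        episodic_feats = self.episodic_memory.retrieve(observation, instruction)

        # 3. Fuse the retrieved representations into a single context encoding.
        context = self.fusion(scene_feats, episodic_feats, observation, instruction)

        # 4. Sample base + arm actions from the conditional diffusion policy.
        return self.policy.sample(context)
```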

2. Memory Mechanisms: Scene and Episodic Memory

EchoVLA implements two synergistic forms of declarative memory inspired by neurocognitive models of human navigation and planning:

  • Scene Memory: Dynamically constructs and updates spatial-semantic maps, capturing not only the geometric configuration of the workspace but also object-level semantics relevant to the current task. The memory is accessible throughout both training and inference.
  • Episodic Memory: Aggregates a running buffer of multimodal contextual representations spanning previous instructions, observations, and decisions at the temporal granularity of entire episodes. This supports long-horizon reasoning, such as re-identifying previously observed objects or recalling temporal dependencies beyond the reach of recurrent policies.

During inference, the agent attends to relevant elements across both memories, integrating cross-modal cues from current sensory input, task language, and prior experiences to disambiguate state and plan effective policies for temporally extended behaviors.
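To make the two memory roles concrete, the sketch below uses deliberately simple data structures: a dictionary-backed spatial-semantic map and a bounded episode buffer. The actual EchoVLA representations (map resolution, feature dimensionality, retrieval scoring) are not specified here, so everything in this snippet is an illustrative assumption.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SceneMemory:
    """Toy spatial-semantic map: grid cell -> semantic feature vector."""
    cells: dict = field(default_factory=dict)

    def update(self, cell, feature):
        # Insert or overwrite the semantic feature observed at a map cell.
        self.cells[cell] = feature

    def retrieve(self, query, top_k=8):
        # Rank cells by dot-product similarity to the query feature and
        # return the top-k most relevant (cell, feature) pairs.
        def score(feature):
            return sum(q * f for q, f in zip(query, feature))
        ranked = sorted(self.cells.items(), key=lambda kv: score(kv[1]), reverse=True)
        return ranked[:top_k]

@dataclass
class EpisodicMemory:
    """Toy bounded buffer of episode-level multimodal context records."""
    buffer: deque = field(default_factory=lambda: deque(maxlen=256))

    def update(self, record):
        # A record could bundle instruction, observation, and action features.
        self.buffer.append(record)

    def retrieve(self, n=16):
        # Return the n most recent records, e.g. to re-identify an object
        # observed earlier in the episode.
        return list(self.buffer)[-n:]
```

In the actual system, both memories would hold learned embeddings and be queried with attention rather than dot-product ranking over a Python dict; the sketch only fixes the interface.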

3. Memory Retrieval and Attention-Based Fusion

The retrieval process for scene and episodic memory is governed by current contextual cues, leveraging both the agent's partial observations and the explicit instruction stream. The two memory streams are processed separately, each maintaining its own update and retrieval cycle.

To integrate the retrieved memory states, EchoVLA employs a two-stage fusion strategy (see the code sketch after this list):

  • Coarse-Grained Attention: Aggregates broad spatial or temporal features from each memory stream, flagging salient contexts relevant to the task at hand.
  • Fine-Grained Attention: Refines this fusion by focusing on specific subsets of the retrieved features, enabling nuanced integration of spatial, semantic, and episodic cues.
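
A minimal sketch of such a coarse-then-fine scheme, assuming standard multi-head attention and arbitrary feature dimensions (the paper's actual fusion design may differ):

```python
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    """Coarse attention over both memory streams, then fine re-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.coarse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fine = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, scene_tokens, episodic_tokens):
        # Stage 1: coarse attention over the concatenated memory tokens
        # picks out broadly relevant spatial/temporal context.
        memory = torch.cat([scene_tokens, episodic_tokens], dim=1)
        coarse_ctx, _ = self.coarse(query, memory, memory)

        # Stage 2: fine attention re-attends to the memory, conditioned on
        # the coarse summary, refining which features reach the policy.
        fine_ctx, _ = self.fine(self.norm(coarse_ctx), memory, memory)
        return fine_ctx

# Example: batch of 2, one query token, 32 scene tokens, 16 episodic tokens.
fusion = TwoStageFusion(dim=256)
ctx = fusion(torch.randn(2, 1, 256), torch.randn(2, 32, 256), torch.randn(2, 16, 256))
print(ctx.shape)  # torch.Size([2, 1, 256])
```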

The output of the fusion pipeline parameterizes the agent’s policy, which is realized as a conditional diffusion process over the robot’s joint- and base-space action trajectories. This design enables EchoVLA to synthesize navigation and manipulation actions that are adaptive to changing spatial layouts and shifting high-level goals (Lin et al., 22 Nov 2025).
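For completeness, the following is a hedged sketch of how a context-conditioned diffusion policy can generate an action trajectory via DDPM-style ancestral sampling. The horizon, action dimensionality, noise schedule, and MLP denoiser are all illustrative assumptions rather than EchoVLA's implementation.

```python
import torch
import torch.nn as nn

class MobileArmDiffusionPolicy(nn.Module):
    """Samples a (horizon x action_dim) trajectory conditioned on the fused context."""

    def __init__(self, action_dim=10, horizon=16, ctx_dim=256, steps=50):
        super().__init__()
        self.action_dim, self.horizon, self.steps = action_dim, horizon, steps
        # Toy denoiser: predicts noise from (noisy actions, timestep, context).
        self.denoiser = nn.Sequential(
            nn.Linear(horizon * action_dim + 1 + ctx_dim, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )
        betas = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    @torch.no_grad()
    def sample(self, context):
        b = context.shape[0]
        x = torch.randn(b, self.horizon * self.action_dim, device=context.device)
        for t in reversed(range(self.steps)):
            t_emb = torch.full((b, 1), t / self.steps, device=context.device)
            eps = self.denoiser(torch.cat([x, t_emb, context], dim=-1))
            alpha = 1.0 - self.betas[t]
            # DDPM posterior mean; add noise at every step except the final one.
            x = (x - self.betas[t] / torch.sqrt(1.0 - self.alphas_cumprod[t]) * eps) / torch.sqrt(alpha)
            if t > 0:
                x = x + torch.sqrt(self.betas[t]) * torch.randn_like(x)
        # Reshape into per-step base + arm commands.
        return x.view(b, self.horizon, self.action_dim)

# Example: sample an action chunk for a batch of 2 fused-context vectors.
policy = MobileArmDiffusionPolicy()
actions = policy.sample(torch.randn(2, 256))
print(actions.shape)  # torch.Size([2, 16, 10])
```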

4. Experimental Evaluation and Benchmarking

Comprehensive evaluation of EchoVLA is performed using both simulated environments and real-world robotic platforms. The model demonstrates improvements in long-horizon mobile manipulation, as quantified by average success rate (SR) and step-count metrics.

EchoVLA achieves the following performance:

  • Manipulation/Navigation SR: 0.52
  • Mobile Manipulation SR: 0.31
  • Both metrics exceed the π0.5 baseline by margins of +0.08 and +0.11, respectively.

Step-count analyses are summarized for various benchmarks, emphasizing the increase in trajectory horizon relative to prior datasets:

Dataset             Avg. Steps (Simulation)
Nav-Only Baseline   103.62
Robocasa (orig.)    278.00
TOF                 630.99
PnPS2C              755.56

The PnPS2C tasks, in particular, exhibit the longest execution horizons (755.56 steps), roughly a 2.7× increase over the original Robocasa dataset. The navigation-only baseline (103.62 steps) is described as a lower bound, representing pure navigation without manipulation. For real-world experiments, step counts are reported per manipulation task, with door-opening tasks requiring the longest execution sequences (e.g., Open Refrigerator at 191.0 steps) (Lin et al., 22 Nov 2025).

5. MoMani Benchmark: Statistical Analysis

EchoVLA introduces the MoMani automated benchmark to facilitate large-scale training and evaluation of long-horizon VLA agents. However, only limited information is provided about the benchmark itself. The available statistical data are summarized as follows:

  • No details are reported regarding overarching design, taxonomy, task composition, or data-generation pipeline.
  • The only explicit statistics are average trajectory lengths (in steps) for simulation and hardware experiments as shown above.
  • No formal metric definitions (e.g., SR, SPL, MMR), dataset schemas, or code release information is provided.
  • No illustrative example trajectories or step-by-step discussions are included.
  • It is noted that PnPS2C achieves the longest horizon, and that tasks with door operations require more execution steps in the real world.

A plausible implication is that MoMani’s primary contribution in this context lies in enabling the quantitative assessment of policy performance across substantially longer manipulation horizons than previous benchmarks. All other operational details (pipeline, sampling, LLM selection) are not documented (Lin et al., 22 Nov 2025).

6. Significance and Research Context

EchoVLA addresses key limitations of previous embodied VLA models, most notably the inability to maintain or leverage temporally deep task context in mobile manipulation scenarios. The integration of synergistic declarative memory modules enables the agent to interpret language-derived task goals, update structured environment representations, and recall relevant episodic information across extended temporal horizons.

Measured improvements in both average trajectory length and success rates provide quantitative evidence for EchoVLA’s efficacy in complex, spatially and temporally extended tasks. The use of attention-based fusion across heterogeneous memory representations establishes a modular paradigm for reasoning in multimodal, partially observable domains.

The introduction of the MoMani benchmark provides early evidence of the scalability and practical impact of EchoVLA’s design. However, further research and documentation will be required to establish MoMani as a standard community resource, given the absence of publicly documented task designs, metric specifications, and access protocols (Lin et al., 22 Nov 2025).

References

  • Lin et al., "EchoVLA," 22 November 2025.