
MemoryVLA: Dual-Memory for Robotic Manipulation

Updated 30 August 2025
  • MemoryVLA is a dual-memory architecture that fuses short-term working memory with long-term episodic and semantic memory, enabling temporally coherent decision-making in robotic manipulation.
  • It employs parallel visual token extraction via DINOv2 and SigLIP, combined with language processing by LLaMA-7B, to create context-rich perceptual-cognitive tokens fused with temporal encodings.
  • Empirical evaluations demonstrate enhanced performance, achieving success rates up to 96.5% in simulation and 84.0% in real-world tasks under non-Markovian conditions.

MemoryVLA refers to a perceptual-cognitive memory architecture tailored for Vision-Language-Action (VLA) models in long-horizon robotic manipulation. The framework is motivated by cognitive neuroscience, specifically the distinction between working memory and long-term episodic/semantic memory, and introduces hybrid short-term and long-term memory mechanisms for temporally aware decision making in robotics. MemoryVLA systematically fuses immediate perceptual and cognitive state representations with context-aware retrieval from a dedicated memory bank to optimize task performance, particularly under the non-Markovian conditions common in manipulation tasks.

1. Cognition–Memory–Action Framework

MemoryVLA instantiates a dual-memory paradigm for robotic vision-language-action models. At each episode timestep, a pretrained vision-language model (VLM) encodes RGB observations through parallel backbones (DINOv2, SigLIP) to produce a high-dimensional set of perceptual tokens $p \in \mathbb{R}^{N_p \times d_p}$, with $N_p = 256$. These are projected into language space and concatenated with the tokenized instruction $L$, yielding a semantic cognitive token $c \in \mathbb{R}^{1 \times d_c}$ via LLaMA-7B.

The system maintains two memory subsystems:

  • Working Memory: Transient, holding the current perceptual and cognitive tokens $(p, c)$ for immediate control context.
  • Perceptual–Cognitive Memory Bank (PCMB): Long-term, storing consolidated episodic perceptual memories $m^p$ and high-level cognitive memories $m^c$, each equipped with temporal encodings $\text{TE}(t)$.

Temporal context is integrated by retrieving decision-relevant entries from PCMB, then adaptively fusing them with working memory via learned gating functions. The consolidated memory bank is periodically updated, merging redundant tokens based on cosine similarity.
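
A minimal sketch of one such control step is shown below. The module names (`encode_observation`, `control_step`, `retrieve_and_fuse`, `action_expert`) and the tiny convolutional/linear stubs are illustrative assumptions standing in for the actual DINOv2, SigLIP, and LLaMA-7B backbones, so the snippet runs standalone rather than reproducing the released model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real backbones (DINOv2 + SigLIP for vision, LLaMA-7B for
# language), shrunk so the sketch runs standalone. The paper's N_p = 256 tokens are kept.
N_P, D_P, D_C = 256, 256, 512
vision_stub = nn.Conv2d(3, D_P, kernel_size=14, stride=14)   # 224x224 RGB -> 16x16 = 256 token grid
language_stub = nn.Linear(D_P, D_C)                          # placeholder for the cognitive-token head

def encode_observation(rgb: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Return perceptual tokens p (N_p x d_p) and a cognitive token c (1 x d_c) for one RGB frame."""
    feats = vision_stub(rgb.unsqueeze(0))             # rgb: (3, 224, 224) -> (1, D_P, 16, 16)
    p = feats.flatten(2).squeeze(0).transpose(0, 1)   # (N_P, D_P) perceptual tokens
    c = language_stub(p.mean(dim=0, keepdim=True))    # (1, D_C); instruction conditioning omitted
    return p, c

def control_step(rgb, t, pcmb, retrieve_and_fuse, action_expert):
    """One cognition-memory-action step: encode, retrieve and fuse history, act, write back."""
    p, c = encode_observation(rgb)                    # working memory (p, c)
    p_tilde = retrieve_and_fuse(p, pcmb["per"], t)    # gate in retrieved perceptual history
    c_tilde = retrieve_and_fuse(c, pcmb["cog"], t)    # gate in retrieved cognitive history
    actions = action_expert(p_tilde, c_tilde)         # memory-conditioned action prediction
    pcmb["per"].append((p.detach(), t))               # consolidate current tokens into the PCMB
    pcmb["cog"].append((c.detach(), t))
    return actions
```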

2. Tokenization and Memory Encoding

During each step, MemoryVLA encodes the image observation and instruction as follows:

  • Visual Token Extraction: Parallel vision models (DINOv2, SigLIP) yield $N_p$ perceptual features. These are compressed into a token matrix $p$.
  • Cognitive Tokenization: Projected visual tokens plus language inputs serve as input to LLaMA-7B, producing $c$.

Both current tokens and historical memory entries incorporate temporal information using sinusoidal position encodings $\text{TE}(t)$, enabling temporally-aware retrieval.

The PCMB holds historical token sequences:

$$m^x = \{ m^x_i \in \mathbb{R}^{N_x \times d_x} \}_{i=1,\ldots,L}, \qquad x \in \{\text{per}, \text{cog}\}$$

Each entry is tagged with its episode step.
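
For concreteness, the sketch below implements a standard sinusoidal temporal encoding and a small memory-bank container tagged by episode step; the `PCMB` class and its fields are illustrative assumptions, not the authors' implementation.

```python
import math
import torch

def temporal_encoding(t: int, dim: int) -> torch.Tensor:
    """Sinusoidal temporal encoding TE(t) of episode step t (dim assumed even)."""
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)
    te = torch.zeros(dim)
    te[0::2] = torch.sin(t * freqs)
    te[1::2] = torch.cos(t * freqs)
    return te

class PCMB:
    """Illustrative perceptual-cognitive memory bank: entries m^x_i tagged with step t_i."""
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.entries = {"per": [], "cog": []}            # lists of (token tensor, timestep)

    def write(self, kind: str, tokens: torch.Tensor, t: int) -> None:
        self.entries[kind].append((tokens.detach(), t))  # consolidation (Section 3) caps the length
```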

3. Memory Retrieval and Adaptive Fusion

Decision-relevant history is retrieved from the PCMB by scaled dot-product attention:

  • Compute keys: $K^x = [m^x_1 + \text{TE}(t_1); \ldots; m^x_L + \text{TE}(t_L)]$
  • Compute values: $V^x = [m^x_1; \ldots; m^x_L]$
  • Given query $q^x$ (either the latest $p$ or $c$), produce the retrieved representation: $\hat{H}^x = \operatorname{softmax}\left(\frac{q^x K^{x\top}}{\sqrt{d_x}}\right) V^x$

A learned gating mechanism (MLP followed by sigmoid activation) fuses retrieved and working memory tokens:

$$g^x = \sigma\left(\operatorname{MLP}([\text{concat}(x, \hat{H}^x)])\right)$$

$$\tilde{x} = g^x \odot \hat{H}^x + (1 - g^x) \odot x$$

where $\odot$ denotes elementwise multiplication.
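
The retrieval and gated fusion described above can be written compactly as follows. This sketch assumes a single attention head, a one-layer gating MLP, and memory entries stacked into a single matrix, and it reuses the hypothetical `temporal_encoding` helper from the Section 2 sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def retrieve(query: torch.Tensor, mem: torch.Tensor, mem_te: torch.Tensor) -> torch.Tensor:
    """H_hat = softmax(q K^T / sqrt(d)) V, with keys = memory entries plus their temporal encodings."""
    keys = mem + mem_te                                   # K^x
    values = mem                                          # V^x
    attn = F.softmax(query @ keys.transpose(0, 1) / keys.size(-1) ** 0.5, dim=-1)
    return attn @ values                                  # same shape as the query

class GatedFusion(nn.Module):
    """g^x = sigmoid(MLP([x; H_hat])); x_tilde = g * H_hat + (1 - g) * x."""
    def __init__(self, d: int):
        super().__init__()
        self.mlp = nn.Linear(2 * d, d)                    # single-layer stand-in for the gating MLP

    def forward(self, x: torch.Tensor, h_hat: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.mlp(torch.cat([x, h_hat], dim=-1)))
        return g * h_hat + (1 - g) * x

# Example: a cognitive query (1, d) against L stacked memory entries (L, d).
d, L = 512, 8
fuse = GatedFusion(d)
c = torch.randn(1, d)
mem = torch.randn(L, d)
te = torch.stack([temporal_encoding(t, d) for t in range(L)])
c_tilde = fuse(c, retrieve(c, mem, te))
```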

PCMB capacity is controlled; when full, similarity between adjacent memory entries is measured and the most similar pair $(\tilde{x}_i, \tilde{x}_{i+1})$ is merged:

$$i^* = \arg\max_{i=1}^{L-1} \cos(\tilde{x}_i, \tilde{x}_{i+1})$$

$$m^x_{i^*} \leftarrow \frac{1}{2} \left(\tilde{x}_{i^*} + \tilde{x}_{i^*+1}\right)$$

This process adaptively reduces redundancy while maintaining episodic and semantic content.
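
A direct reading of this consolidation rule, under the simplifying assumption that each entry is summarized by a mean-pooled vector for the similarity test:

```python
import torch
import torch.nn.functional as F

def consolidate(entries: list[torch.Tensor]) -> list[torch.Tensor]:
    """Merge the most similar adjacent pair of memory entries (requires at least two entries)."""
    pooled = torch.stack([e.mean(dim=0) for e in entries])       # one vector per entry (assumption)
    sims = F.cosine_similarity(pooled[:-1], pooled[1:], dim=-1)  # cos(x_i, x_{i+1}), i = 1..L-1
    i = int(torch.argmax(sims))                                   # i* = most redundant adjacent pair
    merged = 0.5 * (entries[i] + entries[i + 1])                  # average the two entries
    return entries[:i] + [merged] + entries[i + 2:]
```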

4. Memory-Conditioned Diffusion Action Expert

The temporally enriched, memory-fused tokens $(\tilde{p}, \tilde{c})$ are used to condition a diffusion-based policy for continuous control:

  • A transformer-based diffusion expert (DiT) denoises a sequence of noisy action tokens over $T$ steps.
  • At each denoising step, the action vector is concatenated with a sinusoidal step encoding. Memory tokens condition the model via distinct attention heads for perceptual and cognitive features.
  • The output is a multi-step trajectory of 7-DoF actions (translations, rotations, gripper state); the model is trained with an MSE loss against demonstration trajectories.

This enables temporally coherent action sequences that fully exploit accumulated, memory-conditioned representations.
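
The sketch below illustrates the shape of such a memory-conditioned denoising loop. The `TinyActionDenoiser` network, the concatenation-based conditioning (replacing the paper's separate attention over perceptual and cognitive tokens), and the simplified update rule are all assumptions for illustration; it also reuses the hypothetical `temporal_encoding` helper from the Section 2 sketch.

```python
import torch
import torch.nn as nn

class TinyActionDenoiser(nn.Module):
    """Toy stand-in for the DiT action expert: predicts noise from noisy actions plus conditions."""
    def __init__(self, horizon: int = 8, act_dim: int = 7, d: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + 3 * d, 512), nn.GELU(),
            nn.Linear(512, horizon * act_dim),
        )

    def forward(self, noisy_actions, step_emb, p_tilde, c_tilde):
        # The paper's attention-based conditioning is replaced here by simple concatenation.
        x = torch.cat(
            [noisy_actions.flatten(1), step_emb, p_tilde.mean(1), c_tilde.squeeze(1)], dim=-1
        )
        return self.net(x).view_as(noisy_actions)

@torch.no_grad()
def sample_actions(model, p_tilde, c_tilde, horizon=8, act_dim=7, steps=10, d=256):
    """Illustrative reverse-diffusion loop conditioned on memory-fused tokens (not the paper's sampler)."""
    a = torch.randn(p_tilde.size(0), horizon, act_dim)            # start from Gaussian noise
    for k in reversed(range(steps)):
        step_emb = temporal_encoding(k, d).expand(a.size(0), d)   # sinusoidal step encoding
        eps = model(a, step_emb, p_tilde, c_tilde)                # predicted noise
        a = a - eps / steps                                       # crude denoising update (placeholder)
    return a                                                      # chunk of 7-DoF actions
```

In training, the expert would see demonstration action chunks corrupted with noise and regress the noise, consistent with the MSE objective noted above.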

5. Empirical Evaluation and Task Performance

MemoryVLA was validated in simulation and real-world manipulation tasks:

  • Simulation: On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites with WidowX and Franka robots:
    • Achieved 71.9%, 72.7%, and 96.5% success rates respectively.
    • Outperformed strong baselines (CogACT, π₀), e.g., +14.6 points on Bridge.
  • Real-World Manipulation: Across 12 tasks, achieved an overall success rate of 84.0% and a +26-point improvement on long-horizon tasks compared to baselines.
  • Utilizes only third-person RGB visual input; does not rely on proprioceptive sensing.

These results demonstrate robust generalization under challenging settings, including temporal dependencies, distractors, and out-of-distribution visual conditions.

6. Technical Significance and Extensions

MemoryVLA offers:

  • An explicit computational analog of dual-memory (working, episodic-semantic) systems from cognitive science, adaptable to VLA models.
  • Mechanisms for controlled memory consolidation, balancing history retention and redundancy.
  • End-to-end differentiable fusion of temporally aware context via attention and gating.
  • A practical foundation for long-horizon manipulation, supporting action planning where previous context (beyond immediate observations) influences task outcomes.

The framework establishes a template for integrating memory banks and memory-conditioned policies in robotics, potentially extensible to reflection or chain-of-thought generation in future research. A plausible implication is the adaptation of similar dual-memory retrieval and consolidation strategies in multimodal agents and continual learning models.

7. Applications and Future Directions

MemoryVLA directly addresses limitations in existing VLA models for temporally extended, non-Markovian tasks. Applications include:

  • Sequential robotic manipulation (e.g., multi-step assembly, cooking, cleaning) where context over many timesteps affects success.
  • Service robots operating in visually static or ambiguous scenes.
  • Lifelong learning agents requiring consolidation and retrieval of temporally distant yet decision-relevant episodes.

Future developments may explore integrating dynamic memory reflection, more efficient consolidation (potentially biologically inspired mechanisms), and scaling to larger, more diverse multimodal environments. This suggests MemoryVLA can serve as a bridge between short-term, token-based reasoning and global semantic-episodic storage, further advancing generalizable and robust robotics.