- The paper introduces a dual-memory approach that integrates perceptual and cognitive streams to tackle non-Markovian challenges in robotic manipulation.
- It employs a Perceptual–Cognitive Memory Bank (PCMB) with retrieval, gate fusion, and token consolidation to merge historical and current observations effectively.
- Experiments across simulation and real-robot benchmarks show consistent gains over strong baselines, including 96.5% average success on LIBERO, with robust performance under out-of-distribution conditions.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Motivation and Problem Statement
Robotic manipulation tasks are fundamentally non-Markovian, often requiring temporally extended reasoning and memory of past states and actions. Mainstream Vision-Language-Action (VLA) models typically operate on single-frame observations, neglecting temporal dependencies and thus underperforming on long-horizon tasks. The MemoryVLA framework addresses this limitation by introducing a dual-memory system inspired by cognitive science, explicitly modeling both short-term and long-term temporal context for robust robotic control.
Figure 1: (a) Push Buttons tasks require temporal modeling due to visually indistinguishable pre- and post-action states. (b) Human dual-memory system. (c) MemoryVLA's Perceptual–Cognitive Memory Bank. (d) MemoryVLA outperforms baselines.
Architecture and Methodology
Vision-Language Cognition Module
MemoryVLA leverages a 7B-parameter Prismatic VLM, pretrained on Open X-Embodiment, to encode RGB observations and language instructions into two streams:
- Perceptual tokens: Extracted via DINOv2 and SigLIP backbones, compressed to Np = 256 channels via an SE bottleneck.
- Cognitive token: Projected into the LLM embedding space and processed by LLaMA-7B, yielding a compact high-level semantic representation.
These tokens constitute the working memory, analogous to transient neural activity in the cortex.
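To make the wiring concrete, below is a minimal PyTorch sketch of the two-stream encoding. The backbone and LLM are stand-in placeholders (nn.Identity), and all dimensions, including the bottleneck shape, are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class WorkingMemoryEncoder(nn.Module):
    """Illustrative two-stream encoder; every submodule is a placeholder."""
    def __init__(self, vis_dim=2048, np_channels=256, llm_dim=4096):
        super().__init__()
        self.visual_backbone = nn.Identity()              # stands in for DINOv2 + SigLIP features
        self.se_bottleneck = nn.Sequential(               # stands in for the SE bottleneck
            nn.Linear(vis_dim, vis_dim // 4), nn.GELU(),
            nn.Linear(vis_dim // 4, np_channels))         # compress to Np = 256 channels
        self.projector = nn.Linear(np_channels, llm_dim)  # into the LLM embedding space
        self.llm = nn.Identity()                          # stands in for LLaMA-7B

    def forward(self, image_feats, text_embeds):
        # Perceptual tokens: fine-grained patch features with compressed channels.
        perceptual = self.se_bottleneck(self.visual_backbone(image_feats))   # (B, N, 256)
        # Cognitive token: a single high-level semantic summary from the language model.
        hidden = self.llm(torch.cat([self.projector(perceptual), text_embeds], dim=1))
        cognitive = hidden[:, -1:, :]                     # e.g. final hidden state
        return perceptual, cognitive
```

Downstream, `perceptual` and `cognitive` play the roles of the working memory described above.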
Perceptual–Cognitive Memory Bank (PCMB)
The PCMB stores temporally indexed entries for both perceptual and cognitive streams, maintaining up to L entries per stream. At each timestep, the working memory queries the PCMB via cross-attention with sinusoidal positional encoding, retrieving decision-relevant historical context.
Figure 2: MemoryVLA architecture: VLM encodes inputs, working memory queries PCMB, fuses retrieved context, and conditions a diffusion transformer for action prediction.
Memory Retrieval, Fusion, and Consolidation
At each step, the working memory retrieves decision-relevant entries from the PCMB via cross-attention; a learned gate then fuses the retrieved historical context with the current tokens, and a token-merge consolidation step compresses redundant entries so that each stream stays within its capacity while retaining salient information.
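A minimal PyTorch sketch of these three mechanisms is given below. The module names, dimensions, and the merge-the-most-similar-adjacent-entries rule are assumptions about how "token-merge" consolidation could work, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_encoding(timesteps, dim):
    """Standard sinusoidal encoding over integer timesteps -> (len, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = timesteps.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class MemoryStream(nn.Module):
    """One PCMB stream (perceptual or cognitive), capped at max_len entries."""
    def __init__(self, dim, max_len=16):
        super().__init__()
        self.max_len = max_len
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)
        self.entries, self.steps = [], []    # stored features and their timesteps

    def retrieve_and_fuse(self, query):
        """query: (B, Nq, dim) working-memory tokens for the current step."""
        if not self.entries:
            return query
        mem = torch.stack(self.entries, dim=1)                        # (B, L, dim)
        mem = mem + sinusoidal_encoding(torch.tensor(self.steps), mem.size(-1)).to(mem.device)
        retrieved, _ = self.attn(query, mem, mem)                     # cross-attention retrieval
        g = torch.sigmoid(self.gate(torch.cat([query, retrieved], dim=-1)))
        return g * retrieved + (1.0 - g) * query                      # gated fusion

    def consolidate(self, tokens, step):
        """Write the current step into the bank; merge the two most similar
        adjacent entries when capacity is exceeded (assumed token-merge rule)."""
        self.entries.append(tokens.mean(dim=1))   # one summary vector per step (assumption)
        self.steps.append(step)
        if len(self.entries) > self.max_len:
            sims = [F.cosine_similarity(self.entries[i], self.entries[i + 1], dim=-1).mean()
                    for i in range(len(self.entries) - 1)]
            i = int(torch.stack(sims).argmax())
            self.entries[i:i + 2] = [0.5 * (self.entries[i] + self.entries[i + 1])]
            self.steps[i:i + 2] = [self.steps[i + 1]]
```

Two such streams, one for perceptual tokens and one for the cognitive token, would be queried independently, and their fused outputs passed on to the action expert.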
Memory-Conditioned Diffusion Action Expert
The memory-augmented tokens condition a diffusion-based Transformer (DiT) policy, implemented with DDIM and 10 denoising steps, to predict a sequence of T=16 future 7-DoF actions. The architecture supports foresight and multi-step planning, with cognition-attention and perception-attention layers guiding semantic and fine-grained control, respectively.
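As a concrete illustration, the following is a hedged sketch of DDIM-style chunk sampling (eta = 0, 10 steps) over a 16-step horizon of 7-DoF actions. Here `policy` stands for any noise-prediction DiT that attends to the fused cognitive and perceptual tokens; the beta schedule and call signature are assumptions, not the paper's configuration.

```python
import torch

@torch.no_grad()
def sample_action_chunk(policy, cond_tokens, horizon=16, action_dim=7,
                        num_steps=10, train_steps=1000, device="cpu"):
    """Deterministic DDIM sampling of a (1, horizon, action_dim) action chunk."""
    betas = torch.linspace(1e-4, 2e-2, train_steps, device=device)   # illustrative schedule
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    ts = torch.linspace(train_steps - 1, 0, num_steps, device=device).long()

    x = torch.randn(1, horizon, action_dim, device=device)           # start from pure noise
    for i, t in enumerate(ts):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[ts[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)
        eps = policy(x, t.expand(1), cond_tokens)                    # predicted noise (assumed signature)
        x0 = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()             # implied clean action chunk
        x = a_prev.sqrt() * x0 + (1.0 - a_prev).sqrt() * eps         # eta = 0 DDIM update
    return x
```

In the full model, the conditioning would be supplied through the cognition-attention and perception-attention layers rather than a single `cond_tokens` argument.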
Experimental Evaluation
Benchmarks and Setup
MemoryVLA is evaluated on three robots (Franka, WidowX, Google Robot) across 10 suites, 150+ tasks, and 500+ variations in both simulation and real-world settings.
Figure 4: Experimental setup: three simulation benchmarks and two real-world suites, spanning 150+ tasks and 500+ variations.
SimplerEnv Results
On SimplerEnv-Bridge (WidowX), MemoryVLA achieves 71.9% average success, a +14.6-point gain over CogACT-Large, and also surpasses π0. On SimplerEnv-Fractal (Google Robot), it attains 72.7% overall, outperforming CogACT by +4.6 points, with notable improvements on the Open/Close Drawer and Put in Drawer tasks.
LIBERO Results
On LIBERO (Franka), MemoryVLA reaches 96.5% average success across five suites, exceeding CogACT by +3.3 points and outperforming π0, despite using only third-person RGB (no wrist-camera or proprioceptive inputs).
Real-World Results
On 12 real-world tasks (six general, six long-horizon temporal), MemoryVLA achieves 85% and 83% success, respectively, outperforming CogACT by +9 and +26 points. Gains are especially pronounced on long-horizon tasks requiring temporal reasoning, such as Seq. Push Buttons (+43) and Change Food (+38).
Robustness and Generalization
MemoryVLA demonstrates strong robustness under out-of-distribution (OOD) conditions, maintaining high success rates across unseen backgrounds, distractors, lighting, objects, containers, and occlusions in both real-world and simulation. The largest performance degradation is observed under severe camera view changes.
Figure 5: Real-world OOD robustness: MemoryVLA maintains high success rates across diverse variants.
Figure 6: Simulation OOD robustness: strong generalization except for unseen camera views.
Figure 7: Simulation OOD robustness for hinge-like manipulation: moderate shifts are handled well, camera view changes cause notable drops.
Ablation Studies
Ablations on SimplerEnv-Bridge reveal:
- Combining perceptual and cognitive memory yields the highest success (71.9%).
- Optimal memory length is 16; shorter or longer lengths degrade performance.
- Timestep positional encoding, gate fusion, and token-merge consolidation each contribute positively to performance.
Qualitative Results
MemoryVLA produces coherent, temporally aware action sequences in both general and long-horizon tasks, as visualized in qualitative rollouts.
Figure 8: Real-world long-horizon temporal tasks: Seq Push Buttons, Change Food, Guess Where.
Figure 9: Real-world long-horizon temporal tasks: Clean Table Count, Pick Place Order, Clean Restaurant Table.
Figure 10: Real-world general tasks: Insert Circle, Egg in Pan, Egg in Oven, Stack Cups, Stack Blocks, Pick Diverse Fruits.
Figure 11: SimplerEnv-Bridge tasks: Spoon on Tower, Carrot on Plate, Stack Cube, Eggplant in Basket.
Figure 12: SimplerEnv-Fractal tasks: Pick Coke Can, Move Near, Open/Close Drawer, Put in Drawer.
Figure 13: LIBERO tasks: representative trajectories from all five suites.
Implementation Considerations
- Computational Requirements: Training uses 8 NVIDIA A100 GPUs with a batch size of 256 on top of the 7B VLM backbone. Inference uses DDIM with 10 denoising steps for efficient trajectory generation.
- Data Modalities: Only third-person RGB and language instructions are used, simplifying deployment and reducing hardware dependencies.
- Memory Management: PCMB consolidation is critical for scalability, preventing memory bloat and maintaining salient information.
- Deployment: The architecture is compatible with ROS and standard RGB camera setups, facilitating real-world integration (a sketch of such a control loop is given below).
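Putting the pieces together, here is an illustrative closed-loop deployment sketch that reuses the hypothetical `MemoryStream` and `sample_action_chunk` from the earlier snippets; `camera`, `robot`, and `encoder` are stand-ins for whatever camera driver, controller interface, and trained encoder are actually deployed.

```python
def control_loop(camera, robot, encoder, policy, perceptual_mem, cognitive_mem, instruction):
    """Single-camera, language-conditioned control loop with a bounded memory bank."""
    step = 0
    while not robot.task_done():
        rgb = camera.read()                                    # third-person RGB only
        p_tok, c_tok = encoder(rgb, instruction)               # current working memory
        p_ctx = perceptual_mem.retrieve_and_fuse(p_tok)        # fine-grained history
        c_ctx = cognitive_mem.retrieve_and_fuse(c_tok)         # high-level semantic history
        actions = sample_action_chunk(policy, (c_ctx, p_ctx))  # (1, 16, 7) future actions
        robot.execute(actions[0])                              # run the predicted chunk
        perceptual_mem.consolidate(p_tok, step)                # keep the bank bounded
        cognitive_mem.consolidate(c_tok, step)
        step += 1
```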
Implications and Future Directions
MemoryVLA establishes the importance of explicit temporal memory modeling in VLA frameworks, yielding substantial gains on long-horizon and OOD tasks. The dual-memory design, inspired by cognitive science, enables robust, generalizable manipulation policies without reliance on proprioceptive or wrist-camera inputs.
Future research directions include:
- Memory Reflection: Aligning long-term memory with the LLM input space to enable embedding-space chain-of-thought reasoning.
- Lifelong Memory: Biologically inspired consolidation to distill frequently reused experiences into permanent representations, supporting scalable generalization across scenes, tasks, and embodiments.
Conclusion
MemoryVLA introduces a principled Cognition-Memory-Action framework for robotic manipulation, integrating perceptual and cognitive memory with diffusion-based action prediction. The approach achieves state-of-the-art performance across diverse benchmarks, demonstrating strong robustness, generalization, and real-world applicability. The explicit modeling of temporal dependencies is shown to be essential for long-horizon robotic control, and the architecture provides a foundation for future advances in memory-augmented embodied AI.