SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

Published 24 Apr 2026 in cs.CV | (2604.22409v1)

Abstract: Multimodal LLMs (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.


Authors (9)

Summary

The paper introduces a hierarchical diagnostic benchmark that reveals significant limitations in current VLMs for dynamic spatial reasoning.
It employs a physically-grounded, multi-modal dataset to evaluate spatial grounding, temporal memory, and episodic recall.
Results highlight the need for persistent 3D state representations and innovative architectural designs to overcome visual-memory deficits.

SpaMEM: Hierarchical Benchmarking for Dynamic Spatial Reasoning in Embodied Environments

Introduction and Motivation

The SpaMEM framework introduces a hierarchical diagnostic benchmark for evaluating dynamic spatial reasoning in embodied AI environments. Existing multimodal LLMs (MLLMs), while successful on static visual-spatial tasks, are systematically limited when deployed in embodied settings that require persistent spatial memory and belief revision. The paper isolates key failure modes in current vision-language model (VLM) architectures: reliance on statistical co-occurrence, conflation of perceptual and memory failures, and a lack of longitudinal spatial belief maintenance in environments undergoing dynamic scene changes. SpaMEM is built on a physically grounded dataset of over 10 million images covering RGB, depth, instance segmentation, and semantic segmentation, spanning action-conditioned transformations in procedurally generated indoor environments. This enables granular assessment across atomic grounding, temporal reasoning, and episodic memory.

Hierarchical Evaluation Protocol

SpaMEM formalizes embodied spatial reasoning as a three-level hierarchy:

Level 1 (Atomic Perception): Evaluates single-frame spatial grounding, including semantic object recognition (SOR), visual grounding and localization (VGL), depth/proximity estimation (DPE), relative spatial relationship reasoning (RSR), and counting tasks (CC/IC). Results reveal that semantic recognition is moderately robust, but coordinate-consistent localization, measured by mean IoU, is close to zero across model families, indicating a foundational bottleneck in spatial competence.
Level 2 (Text-Conditioned Temporal Memory): Provides ground-truth symbolic history alongside visual input, isolating pure temporal reasoning and belief update mechanisms. Models exhibit a dramatic improvement in semantic and inventory-based tasks (SOR-M F1 ~ 0.90), but spatial grounding and trajectory reconstruction remain constrained, showing that symbolic scaffolding, not visual perception, underpins successful memory maintenance.
Level 3 (Visual-Conditioned Episodic Memory): Removes textual history, requiring models to maintain a persistent world state from raw visual evidence. Performance collapses in semantic recall, spatial tracking, and cumulative state reconstruction. For example, InternVL3's F1 drops from 0.36 on static frames to 0.13 on dynamic episodic streams, and migration-path tracking (STT) fails across all models, demonstrating a categorical breakdown of visual memory.
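The two metric families named in the hierarchy above, IoU for localization and F1 for recognition and recall, can be sketched as follows. This is a minimal illustration using the standard textbook definitions; the summary does not state the paper's exact scoring rules, so the box format, the set-based F1, and the function names `box_iou` and `set_f1` are assumptions.

```python
from typing import Iterable, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def set_f1(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """F1 between a predicted and a ground-truth set of object labels."""
    pred, gold = set(predicted), set(truth)
    if not pred and not gold:
        return 1.0  # nothing to find and nothing predicted
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# A half-overlapping box pair and a half-correct label set.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))          # 1/7
print(set_f1(["mug", "bowl"], ["mug", "plate"]))    # 0.5
```

Under these definitions a model can score well on `set_f1` (it names the right objects) while scoring near zero on `box_iou` (it cannot place them), which is the pattern reported for Level 1.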

Temporal probing, both short-term (step-wise) and long-term (episodic), further exposes rapid memory decay and an inability to aggregate local event detections into coherent global state representations.

Experimental Results and Diagnostic Findings

Benchmarking diverse state-of-the-art VLMs (InternVL2/2.5/3, LLaVA-NeXT/OneVision, Qwen2/2.5/3) in both RGB and RGB-D configurations uncovered several fundamental architectural limitations:

Static-to-Dynamic Degradation: Models geared for static visual recognition cannot generalize to continuous embodied streams. Dynamic changes (motion blur, viewpoint variance, occlusions) overwhelm pretraining biases, collapsing spatial consistency and semantic anchoring.
Logic-Perception Paradox and Symbolic Scaffolding Dependency: Text-based history turns spatial memory tasks into bookkeeping over logical descriptions, masking deficiencies in visual memory. When the symbolic anchors are removed (Level 3), integration scores (CSR) decline by over 70%, revealing reliance on LLM backbones for reasoning rather than autonomous world modeling.
Space-Time Dissonance: Models can sequence temporal events reliably (temporal IoU up to 0.65), but spatial metric mapping (spatial IoU) remains near zero. This shows a decoupling between temporal sequencing and actionable spatial grounding, attributed to transformer-based architectures’ lack of geometric inductive biases.
Identity Continuity and Memory Integration: Maintaining object identity, especially across occlusions and container interactions, is challenging for every model tested. Trajectory tracking, inventory maintenance, and cumulative belief revision fail without symbolic hints, underscoring the absence of long-horizon spatial memory mechanisms.
Modality Channel (RGB vs. RGB-D): Depth cues marginally improve geometric signals but do not resolve episodic integration or semantic recall bottlenecks. The primary deficit lies in temporal fusion, not sensory channel sufficiency.
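To make the temporal IoU and relative-decline figures above concrete, here is a minimal sketch. It assumes temporal IoU is the overlap of two step intervals, which is the common definition but is not confirmed by this summary; the half-open interval convention and the names `temporal_iou` and `relative_drop` are illustrative.

```python
from typing import Tuple

# Assumed convention: half-open step interval [start, end).
Interval = Tuple[int, int]


def temporal_iou(pred: Interval, truth: Interval) -> float:
    """Overlap of two step intervals divided by the length of their union."""
    inter = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def relative_drop(before: float, after: float) -> float:
    """Fractional decline from `before` to `after` (0.7 means a 70% drop)."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# A predicted event window that is shifted by two steps.
print(temporal_iou((2, 8), (4, 10)))          # 0.5
# The InternVL3 F1 figures quoted earlier (0.36 -> 0.13).
print(round(relative_drop(0.36, 0.13), 2))    # 0.64
```

Read this way, a CSR decline of over 70% means the Level 3 score is less than 30% of its Level 2 value, not that it fell by 70 absolute points.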

Results consistently highlight that short-term event perception does not compose into robust long-term spatial integration. Counting and inventory tasks exhibit some resilience, but spatial manipulation, reasoning, and recall collapse in dynamic and occluding contexts, particularly for thin or background objects due to tokenization bottlenecks.

Dataset Design and Diagnostic Coverage

SpaMEM uses a large-scale, procedurally generated dataset produced by automated LLM-driven agents interacting in the ProcTHOR-10K environment. A structured world graph representation guides action-conditioned causal reasoning (spawn, place, remove) and ensures both semantic and geometric scene diversity. Multi-view snapshots, rich annotations, and structured state updates provide the basis for high-fidelity spatial reasoning evaluation.
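The bookkeeping implied by the three actions can be sketched as a small state tracker. This is a hypothetical illustration, not the paper's world graph: the `WorldState` class, the `(verb, object, receptacle)` action tuple, and the flat object-to-receptacle mapping are all assumptions made to show how step-wise events accumulate into an episodic state and a per-object migration path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical action encoding: (verb, object_id, receptacle_id or None).
Action = Tuple[str, str, Optional[str]]


@dataclass
class WorldState:
    # object -> receptacle it currently sits on or in
    location: Dict[str, str] = field(default_factory=dict)
    # object -> ordered list of receptacles it has occupied
    history: Dict[str, List[str]] = field(default_factory=dict)

    def apply(self, action: Action) -> None:
        """Update the state with one spawn, place, or remove action."""
        verb, obj, target = action
        if verb == "remove":
            if obj not in self.location:
                raise ValueError(f"cannot remove {obj!r}: not in scene")
            del self.location[obj]
            return
        if verb == "spawn":
            if obj in self.location:
                raise ValueError(f"cannot spawn {obj!r}: already in scene")
        elif verb == "place":
            if obj not in self.location:
                raise ValueError(f"cannot place {obj!r}: not in scene")
        else:
            raise ValueError(f"unknown action {verb!r}")
        if target is None:
            raise ValueError(f"{verb} needs a target receptacle")
        self.location[obj] = target
        self.history.setdefault(obj, []).append(target)


episode: List[Action] = [
    ("spawn", "mug", "table"),
    ("place", "mug", "shelf"),
    ("spawn", "book", "sofa"),
    ("remove", "book", None),
]

state = WorldState()
for step in episode:
    state.apply(step)

print(state.location)        # {'mug': 'shelf'}
print(state.history["mug"])  # ['table', 'shelf']
```

In this framing, a step-wise question asks about a single `apply` call, while an episodic question asks for `location` or `history` after the whole sequence, so one missed event early in an episode corrupts every later answer.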

The dataset contains challenging occlusion scenarios, constrained-volume receptacles, and fine-grained object categories. These are designed to break static semantic priors and to expose failure modes such as spatial-semantic aliasing, resolution bottlenecks, and background bias.

Implications and Future Directions

The SpaMEM benchmark exposes a stacked bottleneck in embodied spatial reasoning, which is not overcome by current multimodal fusion recipes or architectural scale alone. The findings strongly indicate the necessity for:

Explicit persistent 3D state representations, decoupled from next-token prediction pipelines.
Egocentric inductive biases and structured spatial buffers linking episodic timelines to geometric representations.
Architectural innovations targeting identity continuity, robust geometric alignment, and error-correcting visual memory mechanisms.

From a theoretical perspective, SpaMEM motivates the development of fusion modules capable of integrating sequential visual evidence with causal action logs, and the adoption of map-like memory systems for persistent state updates.
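One possible reading of a map-like memory is a store keyed by world coordinates, so that observations from different viewpoints refer to the same location. The sketch below is not from the paper. It assumes a 2D floor plane, an agent pose of (x, z, yaw) with yaw 0 facing +z and positive yaw turning toward +x, and hypothetical names `ego_to_world` and `SpatialMemory`.

```python
import math
from typing import Dict, Tuple

# Assumed agent pose: (x, z, yaw in radians) in the world frame.
Pose = Tuple[float, float, float]


def ego_to_world(pose: Pose, forward: float, right: float) -> Tuple[float, float]:
    """Convert an egocentric offset (forward, right) to world (x, z)."""
    x, z, yaw = pose
    # The forward axis is (sin yaw, cos yaw); the right axis is (cos yaw, -sin yaw).
    wx = x + forward * math.sin(yaw) + right * math.cos(yaw)
    wz = z + forward * math.cos(yaw) - right * math.sin(yaw)
    return wx, wz


class SpatialMemory:
    """Keeps the latest world-frame position observed for each object."""

    def __init__(self) -> None:
        self.positions: Dict[str, Tuple[float, float]] = {}

    def observe(self, pose: Pose, obj: str, forward: float, right: float) -> None:
        self.positions[obj] = ego_to_world(pose, forward, right)


memory = SpatialMemory()
# From the origin facing +z, the mug is 2 m ahead and 1 m to the right.
memory.observe((0.0, 0.0, 0.0), "mug", forward=2.0, right=1.0)
first = memory.positions["mug"]
# From (3, 0) facing +x, the same mug is 2 m behind and 2 m to the left.
memory.observe((3.0, 0.0, math.pi / 2), "mug", forward=-2.0, right=-2.0)
second = memory.positions["mug"]
print(first, second)  # both approximately (1.0, 2.0)
```

The two egocentric readings differ, yet both resolve to the same world position. That agreement across viewpoints is what coordinate-consistent grounding requires of a persistent memory.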

Practically, advances in this domain are essential for embodied agents operating in non-static, real-world environments, with implications for robotics, AR navigation, and interactive scene manipulation.

Conclusion

SpaMEM establishes a diagnostic hierarchy for spatial memory in dynamic embodied environments, advancing the granularity and rigor of benchmarking for spatial reasoning. The results show that the evaluated VLMs are constrained by their dependence on symbolic scaffolding and by deficits in episodic integration. Overcoming these limitations demands architectural and algorithmic progress toward persistent, coordinate-consistent spatial memory with explicit reasoning over dynamic scene evolution. Future research should prioritize causal integration of perceptual updates, robust geometric state tracking, and scalable episodic memory architectures for embodied AI.