Mindmap: 3D Action Policies with Spatial Memory

Updated 1 October 2025
  • The paper introduces a spatial memory mechanism that aggregates metric-semantic 3D scene information using dual-stream encoders and attention-based diffusion policies to enhance task performance.
  • It leverages a continuously updated TSDF reconstruction and cross-attention between current and past observations to address occlusion and out-of-view reasoning in dynamic settings.
  • Empirical results show an average success rate of 79% on memory-dependent tasks such as cube stacking and object placement, compared to 20–46% for memory-less baselines.

Spatial memory in deep feature maps for 3D action policies refers to the mechanism by which a robot or agent aggregates and retains metric-semantic information about its environment within structured neural representations that persist across time. The approach described in mindmap directly addresses the fundamental limitation of memory-less policies, enabling 3D manipulation and navigation tasks that depend on context spanning multiple non-overlapping observations. Mindmap achieves this by integrating a continuously updated 3D scene reconstruction into a deep diffusion policy, leveraging attention over both instantaneous and accumulated feature maps to inform sequential decision making.

1. Motivation and Foundational Concepts

The problem addressed by mindmap is the inability of standard end-to-end neural control policies to reason about out-of-view objects in dynamic 3D environments. Without an integrated memory mechanism, such policies can only react to the robot’s current RGB-D observation, severely limiting performance on tasks that require the agent to remember scene details not present in the current frame—for example, the spatial location of previously observed objects or the configuration of occluded regions.

Mindmap introduces a spatial memory mechanism by (a) progressively aggregating visual features into a metric-semantic reconstruction (the “memory”), using data from sequences of robot observations, and (b) conditioning the policy (a 3D diffusion transformer) on both the current observation and this persistent memory. This enables planning and execution that jointly leverage instantaneous and cumulative environmental knowledge.

2. Architecture and Feature Aggregation

At the architectural level, mindmap incorporates two parallel encoders for memory integration:

  • The first encoder extracts features from the current RGB-D observation using a frozen Vision Foundation Model (VFM), such as AM-RADIO. Formally, for the RGB input $I_i$:

$$F_i = \phi(I_i), \quad F_i \in \mathbb{R}^{h \times w \times f}$$

Here $\phi$ is the VFM and $f$ is the feature depth; a combined code sketch of both encoder streams follows this list.

  • The second encoder processes tokens derived from the global 3D reconstruction (a mesh constructed from fused past observations), extracting features associated with 3D mesh vertices by projecting them into the current camera’s feature map:

$$f_i = F_i[\Pi(p)]$$

where $\Pi$ denotes the camera projection and $F_i[\cdot]$ is a nearest-neighbor lookup in the feature map.
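
To make the two streams concrete, the following Python sketch shows one way to implement them under simplified assumptions: a toy frozen convolutional backbone stands in for the VFM $\phi$ (AM-RADIO in the paper), a pinhole model with made-up intrinsics plays the role of $\Pi$, and the feature-map stride is an arbitrary choice. None of these names or values are taken from the mindmap implementation.

```python
import torch
import torch.nn as nn

# --- Stream 1: frozen VFM features from the current RGB frame ----------------
# A toy convolutional stack stands in for phi; it is an illustrative
# placeholder, not the actual AM-RADIO model.
class FrozenEncoder(nn.Module):
    def __init__(self, feature_depth: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feature_depth, kernel_size=3, stride=2, padding=1),
        )
        for p in self.parameters():
            p.requires_grad_(False)          # phi is kept frozen

    @torch.no_grad()
    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        """rgb: (3, H, W) -> feature map F_i: (h, w, f)."""
        feats = self.backbone(rgb.unsqueeze(0))[0]   # (f, h, w)
        return feats.permute(1, 2, 0)                # (h, w, f)

# --- Stream 2: features for 3D mesh vertices via projection into F_i ---------
def project_vertices(points_cam: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Pinhole projection Pi(p): (N, 3) camera-frame points -> (N, 2) pixels."""
    uv_h = (K @ points_cam.T).T              # homogeneous pixel coordinates
    return uv_h[:, :2] / uv_h[:, 2:3]        # divide by depth

def lookup_features(F_i: torch.Tensor, uv: torch.Tensor,
                    stride: int = 4) -> torch.Tensor:
    """Nearest-neighbor lookup f_i = F_i[Pi(p)]; `stride` maps image pixels
    to feature-map cells (an assumed downsampling factor)."""
    h, w, _ = F_i.shape
    cols = (uv[:, 0] / stride).round().long().clamp(0, w - 1)
    rows = (uv[:, 1] / stride).round().long().clamp(0, h - 1)
    return F_i[rows, cols]                   # (N, f)

# toy usage: hypothetical intrinsics and two mesh vertices in the camera frame
phi = FrozenEncoder()
F_i = phi(torch.rand(3, 256, 256))           # current observation -> (64, 64, 64)
K = torch.tensor([[200.0, 0.0, 128.0],
                  [0.0, 200.0, 128.0],
                  [0.0, 0.0, 1.0]])
vertices = torch.tensor([[0.1, 0.0, 1.0], [-0.2, 0.1, 2.0]])
f_i = lookup_features(F_i, project_vertices(vertices, K))
print(F_i.shape, f_i.shape)                  # (64, 64, 64) and (2, 64)
```

In practice the lookup could also use bilinear interpolation rather than rounding to the nearest cell; the nearest-neighbor variant above simply mirrors the lookup described in the text.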

The 3D reconstruction is constructed using an extended version of nvblox, which fuses incoming RGB-D frames into a Truncated Signed Distance Field (TSDF) and computes mesh vertices using marching cubes. For semantic integration, the feature vector for each voxel is updated as:

$$f_\text{voxel}(p) \leftarrow \alpha \cdot F[\Pi(p)] + (1 - \alpha) \cdot f_\text{voxel}(p)$$

where $\alpha \in [0, 1]$ controls the momentum of the update. This metric-semantic TSDF representation forms the spatial memory, storing both observed geometry and the semantic encoding of the scene over time.
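
As a small illustration of this momentum update, the snippet below applies the rule to a single voxel's feature vector; the value of $\alpha$ and the feature dimension are arbitrary choices for the example, not values from the paper.

```python
import torch

def update_voxel_feature(f_voxel: torch.Tensor, f_observed: torch.Tensor,
                         alpha: float = 0.2) -> torch.Tensor:
    """Momentum update: f_voxel <- alpha * F[Pi(p)] + (1 - alpha) * f_voxel."""
    return alpha * f_observed + (1.0 - alpha) * f_voxel

# repeated observations of the same voxel pull its stored feature toward the
# newly observed one while retaining a trace of earlier views
stored, observed = torch.zeros(64), torch.ones(64)
for _ in range(5):
    stored = update_voxel_feature(stored, observed)
print(round(stored[0].item(), 3))   # 0.672 after five updates with alpha = 0.2
```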

3. 3D Diffusion Policy and Attention Mechanisms

At decision time, mindmap leverages its dual-stream memory via a denoising transformer-based diffusion policy. The policy predicts the sequence of robot end-effector positions by iteratively refining a noisy initial trajectory. Both current-observation features and memory-derived features are provided as tokens:

  • The two streams are concatenated and serve as input to cross-attention and self-attention layers, allowing the model to attend dynamically to both real-time visual information and enriched spatial memory.
  • The policy architecture is based on, and extends, the 3D Diffuser Actor baseline by enabling memory access to tokens representing past observations now outside the field of view.

This attention-based conditioning allows the robot to plan complex manipulation trajectories using scene details that may not be directly visible, but were previously committed to memory through the reconstruction pathway.
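
The sketch below illustrates this conditioning pattern in PyTorch: noisy trajectory tokens first self-attend, then cross-attend to the concatenated current-observation and memory token streams. The block structure, dimensions, and names are illustrative assumptions, not the 3D Diffuser Actor or mindmap architecture.

```python
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    """One denoising block: trajectory tokens self-attend, then cross-attend
    to the concatenated [current-observation ; memory] token streams."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, traj_tokens, current_tokens, memory_tokens):
        # both conditioning streams are concatenated along the token dimension
        scene_tokens = torch.cat([current_tokens, memory_tokens], dim=1)
        x = traj_tokens
        x = x + self.self_attn(x, x, x)[0]                         # self-attention
        x = x + self.cross_attn(x, scene_tokens, scene_tokens)[0]  # cross-attention
        return x + self.ff(x)

# toy usage: 16 noisy end-effector waypoint tokens conditioned on both streams
block = DenoiserBlock()
traj = torch.randn(1, 16, 64)       # noisy trajectory tokens being refined
current = torch.randn(1, 256, 64)   # tokens from the current RGB-D observation
memory = torch.randn(1, 512, 64)    # tokens from the metric-semantic reconstruction
print(block(traj, current, memory).shape)   # torch.Size([1, 16, 64])
```

In a full denoising policy, several such blocks would be stacked and applied over multiple diffusion steps, with the refined trajectory tokens decoded into end-effector waypoints.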

4. Simulation Experiments and Quantitative Performance

Mindmap’s effectiveness is demonstrated in a suite of egocentric manipulation and navigation tasks designed to stress spatial memory:

  • Cube Stacking: The robot stacks cubes whose positions may only come into view intermittently.
  • Mug in Drawer: Success depends on returning a mug to the correct drawer, requiring the robot to remember which drawer that is.
  • Drill in Box and Stick in Bin: These require humanoid robots to scan and recall the state of containers before appropriately placing objects.

In these scenarios, purely reactive policies that use only the current observation were unable to reliably complete the memory-dependent tasks. Mindmap achieved an average success rate of 79%, compared to 20–46% for baselines without memory mechanisms. Its performance approached that of a “privileged” baseline with access to an external camera, underscoring the practical efficacy of integrated metric-semantic memory.

5. Significance for Generalist 3D Action Policies

The results show that persistent spatial memory encoded in deep feature maps enables robust, end-to-end learning of controllers that can solve tasks involving occlusion, out-of-view reasoning, and complex spatial relationships between scene elements. Unlike approaches that rely exclusively on recurrent state or observation histories, directly aggregating and querying a semantic 3D reconstruction allows for more precise spatial reasoning, especially in manipulation contexts.

Importantly, mindmap’s structure accommodates both instantaneous and accumulated visual information, permitting seamless integration of updates as new views are observed. The attention mechanism ensures that the diffusion policy can “focus” on relevant objects and features regardless of their current visibility, improving both success rates and sample efficiency across a variety of 3D task domains.

6. Released Tools and Future Research Directions

Mindmap’s accompanying open-source release includes:

  • The nvblox-based 3D reconstruction system
  • Training code for the memory-augmented diffusion policy
  • Simulation tasks explicitly structured to evaluate spatial memory

These resources provide a standardized foundation for benchmarking and further methodological innovation in spatial memory and deep action policies.

Potential future directions highlighted include scaling to more complex and diverse scenarios, realizing fully differentiable reconstruction and memory systems, and unifying spatial memory integration with trajectory chunking or action hierarchy frameworks for improved generalization and planning efficiency.


In summary, mindmap establishes that deep feature map-based spatial memory, realized through metric-semantic 3D reconstruction and attention-based integration in policy networks, is a key enabler of performant and general-purpose 3D robot action policies. This architecture substantially improves the agent's capacity for out-of-view reasoning and task completion in previously intractable manipulation scenarios (Steiner et al., 24 Sep 2025).
