- The paper introduces a novel framework that integrates episodic spatial memory with an adaptive execution policy to overcome catastrophic forgetting in long-horizon mobile manipulation.
- It employs spatio-temporal fusion mapping and memory-driven target grounding for precise localization and robust planning, significantly boosting success rates on the ALFRED benchmark.
- The framework balances global navigation with local opportunism, reducing redundant actions and achieving up to +2.97% improvement over previous state-of-the-art methods.
ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation
Motivation and Problem Statement
Long-horizon mobile manipulation tasks in embodied AI require integrated navigation and robust manipulation abilities within complex, dynamic indoor environments. State-of-the-art approaches face persistent issues: catastrophic forgetting due to lack of persistent spatial memory, spatial inconsistency from inadequate 2D/3D semantic reasoning, and rigid execution policies incapable of adapting to opportunistic targets. The ALFRED benchmark highlights these deficits, demanding both semantic comprehension and efficient action planning over extended episodic sequences.
Figure 1: ESCAPE addresses catastrophic forgetting, spatial inconsistency, and rigid execution through persistent spatial memory and an adaptive execution policy.
ESCAPE Framework Overview
The ESCAPE framework directly targets these limitations with two principal modules: Episodic Spatial Memory and Adaptive Execution Policy, tightly coupled in a perception–grounding–execution paradigm. Episodic Spatial Memory comprises a Spatio-Temporal Fusion Mapping module that autoregressively constructs a depth-free, persistent 3D memory, and a Memory-Driven Target Grounding module for precise interaction mask generation. The Adaptive Execution Policy orchestrates proactive global navigation with concurrent reactive local manipulation, dynamically shifting between long-term planning and opportunistic actions.
Figure 2: ESCAPE integrates spatial memory, target grounding, global planning, and local monitoring for efficient task completion in ALFRED environments.
Architecture and Methodology
Episodic Spatial Memory
The Spatio-Temporal Fusion Mapping module applies spatio-temporal cross-attention directly in the 3D domain. Rather than depending on depth estimation, it projects 3D points onto the 2D image plane and extracts features at the projected locations. Observation-to-Memory Encoding (OME) uses deformable attention to fuse local regions of the visual observation into the spatial memory grid. Memory Retrieval and Update (MRU) maintains temporal consistency by retrieving features from previous memory states, which prevents information decay and supports spatial reasoning across timesteps.
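The depth-free lifting step can be illustrated with a minimal sketch. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in the camera frame, and nearest-neighbor lookup in place of the deformable attention used by OME; none of these details are taken from the paper.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v).

    Returns None when the point lies at or behind the camera plane.
    """
    x, y, z = point
    if z <= 1e-6:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def sample_feature(feature_map, pixel):
    """Nearest-neighbor lookup in a row-major H x W grid of feature vectors."""
    if pixel is None:
        return None
    col, row = round(pixel[0]), round(pixel[1])
    height, width = len(feature_map), len(feature_map[0])
    if 0 <= row < height and 0 <= col < width:
        return feature_map[row][col]
    return None  # the point projects outside the image


def lift_features(grid_points, feature_map, fx, fy, cx, cy):
    """For each 3D memory cell center, fetch the image feature it projects onto."""
    return [
        sample_feature(feature_map, project_point(p, fx, fy, cx, cy))
        for p in grid_points
    ]
```

Cells that project outside the image or behind the camera receive no feature from the current frame, which is where retrieval from earlier memory states (the role MRU plays in the text) becomes necessary.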
Semantic map segmentation is executed via a 3D map semantic segmentation head trained using grid-wise binary cross-entropy, incorporating multi-hot object category vectors to capture spatial distribution.
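As a rough illustration of the grid-wise objective, the sketch below averages binary cross-entropy over every cell and category, with each cell's target given as a multi-hot vector. The nested-list data layout and the function names are assumptions made for this example, not the paper's implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gridwise_bce(logits, targets, eps=1e-7):
    """Mean binary cross-entropy over all grid cells and object categories.

    logits:  list of cells, each a list of per-category logits
    targets: same shape, multi-hot (several categories may be 1 in one cell)
    """
    total, count = 0.0, 0
    for cell_logits, cell_targets in zip(logits, targets):
        for logit, target in zip(cell_logits, cell_targets):
            p = min(max(sigmoid(logit), eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
            count += 1
    return total / count
```

Because each category is scored independently, one cell can be labeled with several objects at once, for example a countertop cell that also holds a mug.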
Memory-Driven Target Grounding
This module enables fine-grained 2D localization by aligning 3D memory-derived object queries with current 2D observations. Object queries are generated dynamically from pooled memory features and mapped through MLPs; the resulting queries are then compared against 2D image features to yield pixel-wise interaction masks.
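One common way to turn a query into a mask is a per-pixel dot product followed by a sigmoid and a threshold. The sketch below assumes that form of comparison, which the summary does not specify; the function name and the 0.5 threshold are hypothetical.

```python
import math


def mask_from_query(query, pixel_features, threshold=0.5):
    """Binary interaction mask from one object query.

    query:          list of floats (query embedding)
    pixel_features: H x W grid of feature vectors with the same length
    A pixel is set to 1 when sigmoid(<query, feature>) exceeds the threshold.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            score = sum(q * f for q, f in zip(query, feat))
            if score >= 0:
                prob = 1.0 / (1.0 + math.exp(-score))
            else:
                z = math.exp(score)
                prob = z / (1.0 + z)
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask
```

Since the query is derived from memory rather than fixed per category, the same image features can yield different masks depending on which remembered object is being grounded.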
Joint supervision combines map segmentation and image segmentation losses, optimizing the memory for both spatial relationships and precise localization. Training is conducted on ALFRED expert trajectories using ResNet50, deformable attention layers, and grid sizes matching agent step granularity.
Adaptive Execution Policy
The Adaptive Execution Policy resolves the trade-off between global planning and local opportunism. An initial exploration phase constructs a navigable map; the proactive global planner uses BFS on high-probability target regions for waypoint generation, while the reactive local monitor scans for immediate manipulation opportunities. The monitor employs a binary classifier on scene maps and visual features to output manipulable object vectors, preemptively interrupting global plans for expedited task fulfillment.
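The interplay between the BFS planner and the reactive monitor can be sketched as follows. This is a simplified illustration on a 4-connected occupancy grid; the `monitor` callback stands in for the learned binary classifier, and all names are hypothetical.

```python
from collections import deque


def bfs_path(grid, start, goal):
    """Shortest 4-connected path on a grid where grid[r][c] == 0 means free."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and nxt not in parents):
                parents[nxt] = cell
                queue.append(nxt)
    return None  # goal not reachable


def execute(grid, start, goal, monitor):
    """Follow the global plan, stopping early when the monitor fires.

    monitor(cell) returns True if a manipulable target is within reach there.
    """
    path = bfs_path(grid, start, goal)
    if path is None:
        return [], "unreachable"
    visited = []
    for cell in path:
        visited.append(cell)
        if monitor(cell):
            return visited, "interrupted"
    return visited, "reached_goal"
```

When the monitor fires early, the agent skips the remaining waypoints, which is the source of the step savings attributed to opportunistic execution.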
For multi-instance targets, the reactive module improves efficiency by opportunistically discovering new goals; for unique targets, spatial memory ensures accurate localization. A hierarchical fallback maintains robustness when occlusions or pose failures occur.
Figure 3: Persistent spatial memory enables robust task completion, avoiding failures due to missed memory updates (left) and demonstrating success with memory-centric reasoning (right).
Experimental Results and Numerical Analysis
ESCAPE outperforms prior methods across all metrics on the ALFRED benchmark. With step-by-step instructions, it achieves a 65.09% success rate (SR) on test-seen environments and 60.79% on test-unseen, surpassing the previous best results by up to +2.97%. Without detailed instructions, ESCAPE reaches 61.24% and 56.04% SR (seen/unseen), with strong PLWSR/PLWGC scores (up to 52.42%/58.29% on seen). The seen-to-unseen SR gap is 4.30 points with instructions and 5.20 points without, indicating limited degradation in unseen settings.
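For readers unfamiliar with the path-length-weighted metrics, the sketch below assumes the weighting used by the ALFRED benchmark, where an episode's score is scaled by the ratio of expert steps to the larger of expert and agent steps. The episode values in the usage note are made up for illustration and are not results from the paper.

```python
def path_length_weighted(score, expert_steps, agent_steps):
    """Scale a score in [0, 1] by how efficiently the agent acted.

    The weight is expert_steps / max(expert_steps, agent_steps), so taking
    fewer steps than the expert never yields a bonus above the raw score.
    """
    return score * expert_steps / max(expert_steps, agent_steps)


def mean_path_length_weighted(episodes):
    """Average weighted score over (score, expert_steps, agent_steps) tuples."""
    weighted = [path_length_weighted(s, e, a) for s, e, a in episodes]
    return sum(weighted) / len(weighted)
```

For example, a successful episode that takes 40 steps where the expert needed 20 contributes 0.5 rather than 1.0, which is why reducing redundant actions raises PLWSR even when SR is unchanged.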
The ablation study quantifies each component's contribution: removing MRU causes an SR drop of up to 34%, removing OME causes an 8% decrease, and removing AEP reduces SR by ~4% while severely degrading the efficiency metrics. Map segmentation achieves 0.758 mIoU (seen) and image segmentation attains 0.869 mIoU, indicating high-fidelity object localization. Dynamic queries outperform static ones, and a comparison against ground-truth masks indicates that interaction mask generation is not a performance bottleneck.
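The mIoU figures can be read with the following minimal sketch, which computes mean intersection-over-union for flat lists of per-pixel (or per-cell) class labels. The paper's exact averaging protocol is not given in this summary, so skipping classes absent from both prediction and target is an assumption.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```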
Figure 4: Adaptive Execution Policy greatly reduces action steps, completing tasks in fewer moves compared to rigid, sequential execution.
Qualitative and Behavioral Analysis
Qualitative visualizations illustrate ESCAPE's memory-driven spatial reasoning and reactive planning. The Spatio-Temporal Fusion Mapping module enables reliable retrieval of objects stored inside containers, and the Adaptive Execution Policy quickly exploits local manipulation opportunities, yielding significant reductions in redundant actions.
Practical and Theoretical Implications
Practically, ESCAPE's robust spatial memory and adaptive policy support efficient embodied agents in complex, dynamic environments without explicit, granular human instruction. Continued scaling and integration with more capable language interpretation models could further improve language-vision-action grounding. Theoretically, the tight coupling of episodic spatial memory with adaptive hierarchical planning marks a shift toward persistent situational awareness and opportunistic execution, traits that are critical for real-world mobile manipulation.
Conclusion
ESCAPE advances the state-of-the-art in long-horizon mobile manipulation via persistent spatial memory and a hierarchical execution policy. By overcoming catastrophic forgetting, spatial inconsistency, and rigid action strategies, ESCAPE delivers superior success rates, substantial efficiency improvements, and robust generalization. These innovations underscore the importance of dynamic memory reasoning and opportunistic policy design for next-generation embodied agents (2604.13633).