AstraNav-Memory: Lifelong Visual Navigation Memory
- AstraNav-Memory is an image-centric memory architecture designed for lifelong embodied navigation using visual context compression.
- It employs a frozen DINOv3 ViT backbone, PixelUnshuffle+Conv tokenization, and a transformer-based navigation policy to efficiently compress and integrate hundreds of visual observations.
- Empirical results demonstrate more than 15 percentage points improvement in Success Rate and SPL on benchmark navigation tasks compared to previous approaches.
AstraNav-Memory is an image-centric, end-to-end memory architecture designed for lifelong embodied navigation. It utilizes visual context compression to enable agents to accumulate, retain, and exploit spatial-semantic experience over long time horizons without explicit object-centric detection or mapping. By leveraging transformer-based visual tokenization and efficient context management, AstraNav-Memory supports scalable memory over hundreds of visual observations, facilitating improved exploration and rapid goal-reaching in diverse navigation environments (Ren et al., 25 Dec 2025).
1. Architecture Overview
AstraNav-Memory is structured around three principal components: a ViT backbone with frozen DINOv3 feature extraction, a PixelUnshuffle+Conv-based visual tokenizer for configurable compression, and a Qwen2.5-VL navigation policy. The input to the system consists of RGB frames, which are patch-embedded and processed through a frozen DINOv3-ViT-Base network, resulting in mid-level feature maps. These are subsequently compressed by a sequential stack of PixelUnshuffle₂+Conv blocks, each of which halves both spatial dimensions and remaps channels.
Each compression block thus achieves a $4\times$ reduction in token count, for a cumulative ratio of $4^N$ after $N$ blocks. After the $N$ blocks, a final patch-merger converts each remaining patch into one token:
For input resolution $H \times W$, patch size $p$, and $N = 2$ blocks, each frame is compressed from $HW/p^2$ tokens to $HW/(4^N p^2)$ tokens (a $16\times$ reduction). These tokens are injected into the first transformer block of Qwen2.5-VL-3B, together with serialized camera pose and instruction, while leaving downstream blocks unchanged (Ren et al., 25 Dec 2025).
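As a sanity check on the token arithmetic, the following sketch computes per-frame token counts under assumed sizes; the resolution and patch size are illustrative (only the $4^N$ ratio and the $16\times$ total come from the description above):

```python
# Illustrative token counts for the compression formula above.
H = W = 384   # hypothetical input resolution
p = 16        # hypothetical ViT patch size
N = 2         # number of PixelUnshuffle+Conv blocks (16x total)

tokens_in = (H // p) * (W // p)      # 24 * 24 = 576 patch tokens per frame
tokens_out = tokens_in // (4 ** N)   # each block quarters the token count
print(tokens_in, tokens_out)         # 576 36 -> a 16x reduction
```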
2. Visual Context Compression Mechanism
The visual compression module is central to AstraNav-Memory. It consists of:
- DINOv3-ViT feature extraction: Patch-embedded frames generate feature maps of shape $(H/p) \times (W/p) \times C$.
- PixelUnshuffle+Conv blocks: Each block reduces each spatial dimension by $2\times$ and increases channels by $4\times$; Conv (3×3)→BatchNorm→SiLU enables learning of spatially-remapped features critical for navigation.
- Task-aligned training: The compressor is trained end-to-end with the navigation policy via sequence-to-sequence cross-entropy (and behavioral-cloning) loss on navigation actions. No reconstruction or bottleneck losses are used, ensuring retention of only spatial-semantic cues relevant to navigation tasks.
Configurable compression rates ($4^N\times$: e.g., $4\times$, $16\times$, $64\times$) allow the agent to trade memory length against computational efficiency by setting the number of blocks $N$, empirically optimizing the trade-off between context length and semantic fidelity (Ren et al., 25 Dec 2025).
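A minimal PyTorch sketch of such a compressor, assuming hypothetical channel widths and feature-grid sizes (the block structure and $4^N$ rate follow the description above; everything else is illustrative):

```python
import torch
import torch.nn as nn

class CompressBlock(nn.Module):
    """One PixelUnshuffle(2)+Conv stage: halves H and W, remaps channels."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.PixelUnshuffle(2),                        # (B,C,H,W) -> (B,4C,H/2,W/2)
            nn.Conv2d(4 * in_ch, out_ch, 3, padding=1),  # learned spatial-channel remap
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

def build_compressor(dims: list[int]) -> nn.Sequential:
    """len(dims)-1 blocks give a 4**(len(dims)-1) token compression rate."""
    return nn.Sequential(*(CompressBlock(a, b) for a, b in zip(dims, dims[1:])))

# Hypothetical 24x24 grid of 768-dim frozen DINOv3-ViT-Base features per frame.
feats = torch.randn(1, 768, 24, 24)             # 576 tokens before compression
z = build_compressor([768, 1024, 2048])(feats)  # N = 2 blocks -> 16x
print(z.shape)                                  # torch.Size([1, 2048, 6, 6]) -> 36 tokens
```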
3. Memory Read/Write Protocol
AstraNav-Memory implements implicit, token-integrated memory by concatenating compressed tokens from all past frames with associated pose and instruction into a single sequence for the Qwen2.5-VL transformer:
```python
memory_tokens = []
for tau in range(1, t + 1):
    Z_tau = compress_image(I[tau])             # PixelUnshuffle+Conv ×N + patch merge
    memory_tokens += [pose_tokens(P[tau]), Z_tau]
input_sequence = [SYS_PROMPT] + memory_tokens + [INSTR_t]
action_output = qwen25_vl(input_sequence)      # full compressed history in one context
execute(action_output)
```
This protocol incurs the transformer's quadratic attention cost $O(L^2)$ in total sequence length $L$; at a $128$-token-per-frame footprint under $16\times$ compression, a fixed context window admits $16\times$ as many frames as the uncompressed encoding. This suggests that compressed token memory is a key enabler for long-horizon policy learning in transformer architectures without architectural changes downstream (Ren et al., 25 Dec 2025).
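A back-of-envelope calculation of the frame budget, using assumed numbers (the context length is hypothetical; only the $128$-token compressed footprint and the $16\times$ ratio come from the text):

```python
# Frame budget under a fixed context window (assumed sizes).
CONTEXT = 32_768                # hypothetical context window, in tokens
COMPRESSED = 128                # compressed tokens per frame at the 16x setting
RAW = COMPRESSED * 16           # implied uncompressed tokens per frame

print(CONTEXT // RAW)           # 16 frames fit uncompressed
print(CONTEXT // COMPRESSED)    # 256 frames fit at 16x compression
# Attention scales as O(L^2): for the same number of frames, the compressed
# sequence is 16**2 = 256x cheaper in attention FLOPs.
```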
4. Empirical Performance and Ablations
Multi-Benchmark Results
AstraNav-Memory achieves state-of-the-art performance on GOAT-Bench (lifelong navigation) and HM3D-OVON (open-vocabulary object navigation). On GOAT-Bench:
| Method | Val-Seen (SR/SPL) | Val-Seen-Syn (SR/SPL) | Val-Unseen (SR/SPL) |
|---|---|---|---|
| Modular GOAT | 26.3 / 17.5 | 33.8 / 24.4 | 24.9 / 17.2 |
| MTU3D | 52.2 / 30.5 | 48.4 / 30.3 | 47.2 / 27.7 |
| AstraNav-Memory | 65.5 / 49.0 | 66.8 / 54.7 | 62.7 / 56.9 |
For open-vocabulary navigation (HM3D-OVON):
| Method | Val-Seen (SR/SPL) | Val-Seen-Syn (SR/SPL) | Val-Unseen (SR/SPL) |
|---|---|---|---|
| MTU3D | 55.0 / 23.6 | 45.0 / 14.7 | 40.8 / 12.1 |
| AstraNav-Memory | 65.6 / 35.4 | 57.5 / 33.0 | 62.5 / 34.9 |
AstraNav-Memory improves Success Rate (SR) and Success weighted by Path Length (SPL) by >15 pp on Val-Unseen splits compared to previous best approaches (Ren et al., 25 Dec 2025).
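For reference, Success weighted by Path Length follows the standard embodied-navigation definition:

$$\mathrm{SPL} = \frac{1}{N_{\mathrm{ep}}} \sum_{i=1}^{N_{\mathrm{ep}}} S_i \, \frac{\ell_i}{\max(p_i,\, \ell_i)}$$

where $S_i \in \{0, 1\}$ indicates success on episode $i$, $\ell_i$ is the shortest-path distance to the goal, and $p_i$ is the length of the path the agent actually took.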
Ablation Analysis
Ablations over compression rate and history length show that moderate compression ($4\times$ or $16\times$) yields the best balance, while overly aggressive $64\times$ compression degrades spatial-semantic fidelity. With $16\times$-compressed frames, the best SR is achieved at substantially lower computational and memory cost than storing raw image tokens.
5. Comparison with Alternative Memory Systems
Contrasted with object-centric modular pipelines and hierarchical spatial-cognition systems (cf. Mem4Nav (He et al., 24 Jun 2025)), AstraNav-Memory dispenses with explicit map-building, object reconstruction, and retrieval indices. Its design is characterized by:
- No external memory lookup or graph construction
- Unified transformer context containing compacted visual tokens
- No dependence on segmentation or landmark annotation
A plausible implication is that for domains with unreliable detection pipelines or high object diversity, image-centric, token-compressing memory offers robustness and scalability not achievable with sparse octree or semantic graph indexing.
6. Practical Implications and Significance
AstraNav-Memory's image-centric, transformer-compatible compression paradigm enables agents to:
- Efficiently reason over lengthy visual histories, supporting exploration in unseen environments.
- Accurately recall and exploit prior observations for rapid goal-reaching in familiar scenes.
- Operate via a single unified memory interface, obviating the need for task-specific retrieval, graph building, or explicit map maintenance.
This approach positions compressed, task-aligned visual token memory as a scalable solution for lifelong embodied agents, enabling navigation efficiency reminiscent of human spatial memory. A plausible implication is that lifelong navigation agents operating in large, dynamic environments stand to benefit substantially from such memory architectures, which could generalize across tasks and settings (Ren et al., 25 Dec 2025).
7. Contextualization and Future Directions
AstraNav-Memory provides an alternative to hierarchical spatial-cognition long-short memory approaches exemplified by Mem4Nav (He et al., 24 Jun 2025), which use explicit 3D mapping and reversible memory tokens. The key distinction is the purely image-centric, task-driven compression, which forgoes landmark indexing and prunes architectural dependencies.
Future work may further investigate the efficacy of combining task-aligned compression with hierarchical topological representations, potentially enabling hybrid systems balancing robustness and interpretability. A plausible implication is that adaptive compression mechanisms—potentially modulated by semantic content or downstream uncertainty—could further optimize accuracy and efficiency in expansive or visually repetitive environments.