
AstraNav-Memory: Lifelong Visual Navigation Memory

Updated 30 December 2025
  • AstraNav-Memory is an image-centric memory architecture designed for lifelong embodied navigation using visual context compression.
  • It employs a ViT backbone with frozen DINOv3, PixelUnshuffle+Conv tokenization, and transformer-based policy to efficiently compress and integrate hundreds of visual observations.
  • Empirical results demonstrate more than 15 percentage points improvement in Success Rate and SPL on benchmark navigation tasks compared to previous approaches.

AstraNav-Memory is an image-centric, end-to-end memory architecture designed for lifelong embodied navigation. It utilizes visual context compression to enable agents to accumulate, retain, and exploit spatial-semantic experience over long time horizons without explicit object-centric detection or mapping. By leveraging transformer-based visual tokenization and efficient context management, AstraNav-Memory supports scalable memory over hundreds of visual observations, facilitating improved exploration and rapid goal-reaching in diverse navigation environments (Ren et al., 25 Dec 2025).

1. Architecture Overview

AstraNav-Memory is structured around three principal components: a ViT backbone with frozen DINOv3 feature extraction, a PixelUnshuffle+Conv-based visual tokenizer for configurable compression, and a Qwen2.5-VL navigation policy. The input to the system consists of RGB frames, which are patch-embedded and processed through a frozen DINOv3-ViT-Base network, resulting in mid-level feature maps. These are subsequently compressed by a sequential stack of PixelUnshuffle(2)+Conv blocks, which halve spatial dimensions and remap channels.

A stack of N compression blocks achieves a spatial compression ratio r = 2^{2N} = 4^N. After the N blocks, a final 2×2 patch-merger converts each patch into one token:

L_t = \frac{H_N W_N}{4} = \frac{H W}{p^2 \, 4^{N+1}}

For input resolution H = 720, W = 640, patch size p = 16, and N = 2, each frame is compressed from ≈598 tokens to ≈30 tokens (a 16× reduction). These tokens are injected into the first transformer block of Qwen2.5-VL-3B, together with the serialized camera pose and instruction, while downstream blocks are left unchanged (Ren et al., 25 Dec 2025).
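As a worked check of the formula (not code from the paper), the snippet below evaluates L_t for the configuration quoted above:

# Worked instance of the per-frame token-count formula, using the quoted values.
H, W, p, N = 720, 640, 16, 2
L_t = (H * W) // (p ** 2 * 4 ** (N + 1))
print(L_t)  # 28, i.e. roughly the ~30 tokens per frame reported after compression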

2. Visual Context Compression Mechanism

The visual compression module is central to AstraNav-Memory. It consists of:

  • DINOv3-ViT feature extraction: Patch-embedded frames generate features of shape X^{(0)} \in \mathbb{R}^{H_0 \times W_0 \times C_0}.
  • PixelUnshuffle+Conv blocks: Each block reduces H, W by $2$ and increases channels by $4$; a Conv(3×3) → BatchNorm → SiLU stack learns the spatially remapped features critical for navigation (see the block sketch after this list).
  • Task-aligned training: The compressor is trained end-to-end with the navigation policy via a sequence-to-sequence cross-entropy (behavioral-cloning) loss on navigation actions. No reconstruction or bottleneck losses are used, so only the spatial-semantic cues relevant to navigation are retained (a minimal training-step sketch appears at the end of this section).
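A minimal PyTorch sketch of one such block; the output channel width (mapped back to the input width) and layer hyperparameters are assumptions, not the paper's released configuration:

import torch.nn as nn

class CompressBlock(nn.Module):
    # One PixelUnshuffle(2) + Conv(3x3) + BatchNorm + SiLU block: halves H and W,
    # expands channels 4x via the unshuffle, then remaps them with the convolution.
    def __init__(self, channels):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(2)                        # (C, H, W) -> (4C, H/2, W/2)
        self.conv = nn.Conv2d(4 * channels, channels, 3, padding=1)  # channel remap (width assumed)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(self.unshuffle(x))))

Stacking N such blocks, followed by the final 2×2 patch merger, yields the total reduction r_total = 4^{N+1} discussed below.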

Configurable compression rates (r_{total} = 4^{N+1}) allow the agent to trade memory length against computational efficiency by choosing N, empirically optimizing the balance between context length and semantic fidelity (Ren et al., 25 Dec 2025).
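Expanding on the task-aligned training bullet above, a joint training step might look like the sketch below; the batch fields, policy call signature, and optimizer usage are illustrative assumptions rather than the paper's implementation:

import torch.nn.functional as F

def training_step(compressor, policy, batch, optimizer):
    # Compress every frame of the episode history into memory tokens.
    visual_tokens = [compressor(frame) for frame in batch["frames"]]
    # The policy consumes compressed visual tokens, poses, and the instruction,
    # and predicts action tokens over the whole sequence.
    logits = policy(visual_tokens, batch["poses"], batch["instruction"])  # (B, T, vocab)
    # Behavior cloning: sequence-to-sequence cross-entropy against expert actions;
    # no reconstruction or bottleneck loss terms are added.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch["expert_actions"].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()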

3. Memory Read/Write Protocol

AstraNav-Memory implements implicit, token-integrated memory by concatenating compressed tokens from all past frames with associated pose and instruction into a single sequence for the Qwen2.5-VL transformer:

memory_tokens = []
for τ in range(1, t + 1):                        # iterate over all past frames 1..t
    Z_τ = compress_image(I[τ])                   # PixelUnshuffle+Conv ×N + 2×2 patch merge
    memory_tokens += [pose_tokens(P[τ]), Z_τ]    # interleave pose tokens with compressed visual tokens
input_sequence = [SYS_prompt] + memory_tokens + [INSTR_t]   # single unified transformer context
action_output = qwen25_vl(input_sequence)        # Qwen2.5-VL-3B navigation policy forward pass
execute(action_output)

This protocol incurs quadratic attention cost O((T \cdot L_t)^2), allowing T ≈ 200–300 frames for a 128-token context under 16× compression, whereas the uncompressed window admits T ≈ 20 frames. This suggests that compressed token memory is a key enabler for long-horizon policy learning in transformer architectures without downstream architectural changes (Ren et al., 25 Dec 2025).
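A back-of-the-envelope illustration of why compression extends the usable history; the context budget and pose-token overhead below are assumed values, while the per-frame token counts come from Section 1:

def max_frames(context_budget, tokens_per_frame, pose_tokens_per_frame=4):
    # How many (pose + visual) frame chunks fit into a fixed transformer context.
    return context_budget // (tokens_per_frame + pose_tokens_per_frame)

print(max_frames(8192, 30))   # compressed frames (~30 tokens each): a few hundred fit
print(max_frames(8192, 598))  # uncompressed frames (~598 tokens each): only about a dozen fit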

4. Empirical Performance and Ablations

Multi-Benchmark Results

AstraNav-Memory achieves state-of-the-art performance on GOAT-Bench (lifelong navigation) and HM3D-OVON (open-vocabulary object navigation):

Method             Val-Seen (SR / SPL)   Val-Seen-Syn (SR / SPL)   Val-Unseen (SR / SPL)
Modular GOAT       26.3 / 17.5           33.8 / 24.4               24.9 / 17.2
MTU3D              52.2 / 30.5           48.4 / 30.3               47.2 / 27.7
AstraNav-Memory    65.5 / 49.0           66.8 / 54.7               62.7 / 56.9

For open-vocabulary navigation (HM3D-OVON):

Method             Val-Seen (SR / SPL)   Val-Seen-Syn (SR / SPL)   Val-Unseen (SR / SPL)
MTU3D              55.0 / 23.6           45.0 / 14.7               40.8 / 12.1
AstraNav-Memory    65.6 / 35.4           57.5 / 33.0               62.5 / 34.9

AstraNav-Memory improves Success Rate (SR) and Success weighted by Path Length (SPL) by >15 pp on Val-Unseen splits compared to previous best approaches (Ren et al., 25 Dec 2025).

Ablation Analysis

Compression and history-length ablations show that moderate compression (16× or 4× with T = 100 frames) yields the best balance; overly aggressive (64×) compression degrades spatial-semantic fidelity. With T = 100 compressed frames at 16×, the best SR is achieved at lower computational and memory cost than storing raw images.

5. Comparison with Alternative Memory Systems

Contrasted with object-centric modular pipelines and hierarchical spatial-cognition systems (cf. Mem4Nav (He et al., 24 Jun 2025)), AstraNav-Memory dispenses with explicit map-building, object reconstruction, and retrieval indices. Its design is characterized by:

  • No external memory lookup or graph construction
  • Unified transformer context containing compacted visual tokens
  • No dependence on segmentation or landmark annotation

A plausible implication is that for domains with unreliable detection pipelines or high object diversity, image-centric, token-compressing memory offers robustness and scalability not achievable with sparse octree or semantic graph indexing.

6. Practical Implications and Significance

AstraNav-Memory's image-centric, transformer-compatible compression paradigm enables agents to:

  • Efficiently reason over lengthy visual histories, supporting exploration in unseen environments.
  • Accurately recall and exploit prior observations for rapid goal-reaching in familiar scenes.
  • Operate via a single unified memory interface, obviating the need for task-specific retrieval, graph building, or explicit map maintenance.

This approach positions compressed, task-aligned visual token memory as a scalable solution for lifelong embodied agents, enabling navigation efficiency reminiscent of human spatial memory. A plausible implication is that lifelong navigation agents operating in large, dynamic environments stand to benefit substantially from such memory architectures, which could generalize across tasks and settings (Ren et al., 25 Dec 2025).

7. Contextualization and Future Directions

AstraNav-Memory provides an alternative to hierarchical spatial-cognition long-short memory approaches exemplified by Mem4Nav (He et al., 24 Jun 2025), which use explicit 3D mapping and reversible memory tokens. The key distinction is the purely image-centric, task-driven compression, which forgoes landmark indexing and prunes architectural dependencies.

Future work may further investigate the efficacy of combining task-aligned compression with hierarchical topological representations, potentially enabling hybrid systems balancing robustness and interpretability. A plausible implication is that adaptive compression mechanisms—potentially modulated by semantic content or downstream uncertainty—could further optimize accuracy and efficiency in expansive or visually repetitive environments.
