
AstraNav-Memory: Lifelong Visual Navigation Memory

Updated 30 December 2025
  • AstraNav-Memory is an image-centric memory architecture designed for lifelong embodied navigation using visual context compression.
  • It employs a ViT backbone with frozen DINOv3, PixelUnshuffle+Conv tokenization, and transformer-based policy to efficiently compress and integrate hundreds of visual observations.
  • Empirical results demonstrate more than 15 percentage points improvement in Success Rate and SPL on benchmark navigation tasks compared to previous approaches.

AstraNav-Memory is an image-centric, end-to-end memory architecture designed for lifelong embodied navigation. It utilizes visual context compression to enable agents to accumulate, retain, and exploit spatial-semantic experience over long time horizons without explicit object-centric detection or mapping. By leveraging transformer-based visual tokenization and efficient context management, AstraNav-Memory supports scalable memory over hundreds of visual observations, facilitating improved exploration and rapid goal-reaching in diverse navigation environments (Ren et al., 25 Dec 2025).

1. Architecture Overview

AstraNav-Memory is structured around three principal components: a ViT backbone with frozen DINOv3 feature extraction, a PixelUnshuffle+Conv-based visual tokenizer for configurable compression, and a Qwen2.5-VL navigation policy. The input to the system consists of RGB frames, which are patch-embedded and processed through a frozen DINOv3-ViT-Base network, resulting in mid-level feature maps. These are subsequently compressed by a sequential stack of PixelUnshuffle(2)+Conv blocks, which halve spatial dimensions and remap channels.

A stack of N compression blocks achieves a spatial compression ratio r = 2^{2N} = 4^N. After the N blocks, a final 2×2 patch-merger converts each patch into one token:

L_t = \frac{H_N W_N}{4} = \frac{H W}{p^2 \, 4^{N+1}}

For input resolution H = 720, W = 640, patch size p = 16, and N = 2, each frame is compressed from ≈598 tokens to ≈30 tokens (a 16× reduction). These tokens are injected into the first transformer block of Qwen2.5-VL-3B, together with the serialized camera pose and instruction, while downstream blocks are left unchanged (Ren et al., 25 Dec 2025).
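As a worked check of the formula (not code from the paper), the snippet below evaluates L_t for the configuration quoted above:

# Worked instance of the per-frame token-count formula, using the quoted values.
H, W, p, N = 720, 640, 16, 2
L_t = (H * W) // (p ** 2 * 4 ** (N + 1))
print(L_t)  # 28, i.e. roughly the ~30 tokens per frame reported after compression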

2. Visual Context Compression Mechanism

The visual compression module is central to AstraNav-Memory. It consists of:

  • DINOv3-ViT feature extraction: Patch-embedded frames generate features of shape X^{(0)} \in \mathbb{R}^{H_0 \times W_0 \times C_0}.
  • PixelUnshuffle+Conv blocks: Each block reduces H, W by $2$ and increases channels by $4$; a Conv(3×3) → BatchNorm → SiLU stack learns the spatially remapped features critical for navigation (see the block sketch after this list).
  • Task-aligned training: The compressor is trained end-to-end with the navigation policy via a sequence-to-sequence cross-entropy (behavioral-cloning) loss on navigation actions. No reconstruction or bottleneck losses are used, so only the spatial-semantic cues relevant to navigation are retained (a minimal training-step sketch appears at the end of this section).
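A minimal PyTorch sketch of one such block; the output channel width (mapped back to the input width) and layer hyperparameters are assumptions, not the paper's released configuration:

import torch.nn as nn

class CompressBlock(nn.Module):
    # One PixelUnshuffle(2) + Conv(3x3) + BatchNorm + SiLU block: halves H and W,
    # expands channels 4x via the unshuffle, then remaps them with the convolution.
    def __init__(self, channels):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(2)                        # (C, H, W) -> (4C, H/2, W/2)
        self.conv = nn.Conv2d(4 * channels, channels, 3, padding=1)  # channel remap (width assumed)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(self.unshuffle(x))))

Stacking N such blocks, followed by the final 2×2 patch merger, yields the total reduction r_total = 4^{N+1} discussed below.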

Configurable compression rates (r_{total} = 4^{N+1}) allow the agent to trade memory length against computational efficiency by choosing N, empirically optimizing the balance between context length and semantic fidelity (Ren et al., 25 Dec 2025).
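Expanding on the task-aligned training bullet above, a joint training step might look like the sketch below; the batch fields, policy call signature, and optimizer usage are illustrative assumptions rather than the paper's implementation:

import torch.nn.functional as F

def training_step(compressor, policy, batch, optimizer):
    # Compress every frame of the episode history into memory tokens.
    visual_tokens = [compressor(frame) for frame in batch["frames"]]
    # The policy consumes compressed visual tokens, poses, and the instruction,
    # and predicts action tokens over the whole sequence.
    logits = policy(visual_tokens, batch["poses"], batch["instruction"])  # (B, T, vocab)
    # Behavior cloning: sequence-to-sequence cross-entropy against expert actions;
    # no reconstruction or bottleneck loss terms are added.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch["expert_actions"].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()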

3. Memory Read/Write Protocol

AstraNav-Memory implements implicit, token-integrated memory by concatenating compressed tokens from all past frames with associated pose and instruction into a single sequence for the Qwen2.5-VL transformer:

memory_tokens = []
for τ in range(1, t + 1):                        # iterate over all past frames 1..t
    Z_τ = compress_image(I[τ])                   # PixelUnshuffle+Conv ×N + 2×2 patch merge
    memory_tokens += [pose_tokens(P[τ]), Z_τ]    # interleave pose tokens with compressed visual tokens
input_sequence = [SYS_prompt] + memory_tokens + [INSTR_t]   # single unified transformer context
action_output = qwen25_vl(input_sequence)        # Qwen2.5-VL-3B navigation policy forward pass
execute(action_output)

This protocol incurs quadratic attention cost O((T \cdot L_t)^2), allowing T ≈ 200–300 frames for a 128-token context under 16× compression, whereas the uncompressed window admits T ≈ 20 frames. This suggests that compressed token memory is a key enabler for long-horizon policy learning in transformer architectures without downstream architectural changes (Ren et al., 25 Dec 2025).
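A back-of-the-envelope illustration of why compression extends the usable history; the context budget and pose-token overhead below are assumed values, while the per-frame token counts come from Section 1:

def max_frames(context_budget, tokens_per_frame, pose_tokens_per_frame=4):
    # How many (pose + visual) frame chunks fit into a fixed transformer context.
    return context_budget // (tokens_per_frame + pose_tokens_per_frame)

print(max_frames(8192, 30))   # compressed frames (~30 tokens each): a few hundred fit
print(max_frames(8192, 598))  # uncompressed frames (~598 tokens each): only about a dozen fit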

4. Empirical Performance and Ablations

Multi-Benchmark Results

AstraNav-Memory achieves state-of-the-art performance on GOAT-Bench (lifelong navigation) and HM3D-OVON (open-vocabulary object navigation):

Method             Val-Seen (SR / SPL)   Val-Seen-Syn (SR / SPL)   Val-Unseen (SR / SPL)
Modular GOAT       26.3 / 17.5           33.8 / 24.4               24.9 / 17.2
MTU3D              52.2 / 30.5           48.4 / 30.3               47.2 / 27.7
AstraNav-Memory    65.5 / 49.0           66.8 / 54.7               62.7 / 56.9

For open-vocabulary navigation (HM3D-OVON):

Method             Val-Seen (SR / SPL)   Val-Seen-Syn (SR / SPL)   Val-Unseen (SR / SPL)
MTU3D              55.0 / 23.6           45.0 / 14.7               40.8 / 12.1
AstraNav-Memory    65.6 / 35.4           57.5 / 33.0               62.5 / 34.9

AstraNav-Memory improves Success Rate (SR) and Success weighted by Path Length (SPL) by >15 pp on Val-Unseen splits compared to previous best approaches (Ren et al., 25 Dec 2025).

Ablation Analysis

Compression and history-length ablations show that moderate compression (16× or 4× with T = 100 frames) yields the best balance; overly aggressive (64×) compression degrades spatial-semantic fidelity. With T = 100 compressed frames at 16×, the best SR is achieved at lower computational and memory cost than storing raw images.

5. Comparison with Alternative Memory Systems

Contrasted with object-centric modular pipelines and hierarchical spatial-cognition systems (cf. Mem4Nav (He et al., 24 Jun 2025)), AstraNav-Memory dispenses with explicit map-building, object reconstruction, and retrieval indices. Its design is characterized by:

  • No external memory lookup or graph construction
  • Unified transformer context containing compacted visual tokens
  • No dependence on segmentation or landmark annotation

A plausible implication is that for domains with unreliable detection pipelines or high object diversity, image-centric, token-compressing memory offers robustness and scalability not achievable with sparse octree or semantic graph indexing.

6. Practical Implications and Significance

AstraNav-Memory's image-centric, transformer-compatible compression paradigm enables agents to:

  • Efficiently reason over lengthy visual histories, supporting exploration in unseen environments.
  • Accurately recall and exploit prior observations for rapid goal-reaching in familiar scenes.
  • Operate via a single unified memory interface, obviating the need for task-specific retrieval, graph building, or explicit map maintenance.

This approach positions compressed, task-aligned visual token memory as a scalable solution for lifelong embodied agents, enabling navigation efficiency reminiscent of human spatial memory. A plausible implication is that lifelong navigation agents operating in large, dynamic environments stand to benefit substantially from such memory architectures, which could generalize across tasks and settings (Ren et al., 25 Dec 2025).

7. Contextualization and Future Directions

AstraNav-Memory provides an alternative to hierarchical spatial-cognition long-short memory approaches exemplified by Mem4Nav (He et al., 24 Jun 2025), which use explicit 3D mapping and reversible memory tokens. The key distinction is the purely image-centric, task-driven compression, which forgoes landmark indexing and prunes architectural dependencies.

Future work may further investigate the efficacy of combining task-aligned compression with hierarchical topological representations, potentially enabling hybrid systems balancing robustness and interpretability. A plausible implication is that adaptive compression mechanisms—potentially modulated by semantic content or downstream uncertainty—could further optimize accuracy and efficiency in expansive or visually repetitive environments.
