
Zero-Shot Object Navigation (ZSON)

Updated 30 December 2025
  • Zero-Shot Object Navigation (ZSON) is a task where agents locate and navigate to target object categories in unfamiliar 3D environments using only descriptive labels.
  • Recent approaches leverage large vision-language models for panoramic scene parsing and dynamic memory, enabling robust mapless spatial reasoning.
  • Experimental results, such as those from the PanoNav framework, show that integrating memory and panoramic perception significantly boosts success rates in zero-shot settings.

Zero-Shot Object Navigation (ZSON) is the task of directing an embodied agent to locate and navigate to an instance of a specified object category in a previously unseen 3D environment, using only the object description (typically a category label) as input, and with no access to environment-specific pretraining, reward engineering, or in-domain fine-tuning. The principal technical challenge lies in robust perceptual grounding and long-horizon planning under partial observability and strict generalization requirements. The emergence of large vision–language models (VLMs) and multimodal large language models (MLLMs) has enabled recent systems to approach ZSON in sophisticated ways, including panorama-level reasoning, dynamic memory management, and fully mapless policies.

1. Formal Problem Setting and Distinguishing Constraints

In ZSON, the agent is evaluated on new environments without any prior exposure to their data or layouts, nor to new object instances or categories at test time. At each timestep $t$, the agent receives an observation, commonly an RGB or RGB-D image $I_t$ or panoramic frame $I_{pan,t}$, and selects an action $a_t$ from a discrete action set $\mathcal{A}$. The objective is to reach within distance $\tau$ of any instance of the target object as efficiently as possible. No explicit training on these test environments or task-specific fine-tuning of perception, memory, or policy modules is permitted (Jin et al., 10 Nov 2025).
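
To make the protocol concrete, the interaction can be written as a plain episode loop in which a frozen agent consumes observations and emits discrete actions until it issues Stop or exhausts a step budget. The sketch below is illustrative only; the `env` and `agent` interfaces (`reset`, `step`, `act`, `distance_to_nearest`) are hypothetical stand-ins, not the APIs of any cited system.

```python
# Minimal sketch of a ZSON evaluation episode (hypothetical interfaces).
# `env` yields observations and geodesic distances; `agent` is frozen
# (no environment-specific fine-tuning) and maps observations to actions.

ACTIONS = ["MoveForward", "TurnLeft", "TurnRight", "Stop"]

def run_episode(env, agent, target_category, tau=1.0, max_steps=500):
    """Roll out one zero-shot episode and report success within distance tau."""
    obs = env.reset(target_category)               # RGB / panoramic frame, no map given
    agent.reset()                                  # clears episodic memory only
    for _ in range(max_steps):
        action = agent.act(obs, target_category)   # frozen policy, discrete action
        if action == "Stop":
            break
        obs = env.step(action)
    # Success: the agent ends within tau meters of any target-category instance.
    return env.distance_to_nearest(target_category) < tau
```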

Key constraints of ZSON:

  • Open-set object categories: The target category may be absent from the training data and is not restricted to a predefined taxonomy.
  • No prior map: The agent has no metric/topological map or prior spatial knowledge of the scene.
  • Zero environmental fine-tuning: All model weights and modules are frozen before evaluation on the target environment.
  • Partial observability: Only instantaneous local vantage points are available unless explicitly memory-augmented.
  • Sparse rewards/no RL: Navigation is assessed by end-state proximity/success rather than dense or supervised reward shaping.

This distinguishes ZSON sharply from supervised ObjectNav and ImageNav, which leverage environment-specific rewards or annotated training episodes (Majumdar et al., 2022).

2. Architectural Foundations: Panoramic Perception and Mapless Spatial Reasoning

The PanoNav framework exemplifies advanced ZSON techniques through RGB-only, mapless navigation tightly coupled to a panoramic scene parsing and memory-augmented MLLM (Jin et al., 10 Nov 2025). The system is structured as follows:

  • Panoramic Scene Parser $\phi_{parse}$: Ingests a 360° RGB observation $I_{pan}$, processes it through a ResNet50 + Swin-Transformer backbone with spherical positional encodings, and projects it into spatial-semantic embeddings $F_{pan} \in \mathbb{R}^{C \times H' \times W'}$ and MLLM-compatible visual tokens $\{v_i\}$. Optionally, an explicit graph-relational submodule computes adjacency matrices $R$ over proposed object detections for relational context.
  • Dynamic Memory Queue $M_t$: Maintains a history of past parser features and actions, encoded via a GRU into “history tokens” $h_t$, capped at capacity $K$. A gating mechanism $g_j = \sigma(w_g^\top m_j + b_g)$ applies attention decay to older slots.
  • Decision Module (MLLM-based): At each timestep, the MLLM performs joint self-attention over current features $\{v_i\}$ and memory $\{m_j\}$, producing attention-weighted action logits for discrete action selection. The final policy is

$$\pi(a_t \mid s_t; M_t) = \mathrm{softmax}(\ell_t)$$

where $\ell_t$ is constructed from concatenated flattened parser features and memory-attended features; a schematic sketch of this decision step follows the list below.

  • Action Set: The agent chooses among MoveForward, TurnLeft, TurnRight, and Stop.
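
The decision step can be summarized schematically as follows. This is a minimal sketch under simplifying assumptions: a single multi-head self-attention layer stands in for the MLLM's joint self-attention, the pooling and linear action head are illustrative choices, and none of the dimensions are taken from the paper.

```python
import torch
import torch.nn as nn

class DecisionSketch(nn.Module):
    """Schematic stand-in for the MLLM decision module: attends jointly over
    current visual tokens and memory slots, then scores the four discrete actions."""

    def __init__(self, d_model=256, n_actions=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.action_head = nn.Linear(2 * d_model, n_actions)

    def forward(self, visual_tokens, memory_slots):
        # visual_tokens: (B, Nv, d) from the panoramic parser
        # memory_slots:  (B, K, d) gated history tokens m_j
        context = torch.cat([visual_tokens, memory_slots], dim=1)
        attended, _ = self.attn(context, context, context)       # joint self-attention
        pooled_vis = attended[:, : visual_tokens.size(1)].mean(dim=1)
        pooled_mem = attended[:, visual_tokens.size(1):].mean(dim=1)
        logits = self.action_head(torch.cat([pooled_vis, pooled_mem], dim=-1))
        return torch.softmax(logits, dim=-1)                     # pi(a_t | s_t; M_t)
```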

This integrated, mapless design contrasts with approaches that rely on explicit 2D/3D mapping or topometric graph construction (Hou et al., 9 May 2025).

3. Memory and Long-Horizon Exploration

Local-decision mapless agents have historically risked short-sightedness and local deadlock. PanoNav’s dynamic memory queue enables the agent to systematically encode and summarize its exploration history. The GRU-based update,

$$h_t = \mathrm{GRU}\bigl(\mathrm{Flatten}(F_{pan}),\ h_{t-1}\bigr)$$

and memory gating distinguish recency and saliency of information, allowing long-term spatial context to inform every action.
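
A minimal sketch of such a bounded, gated memory is given below. The queue capacity, hidden size, and use of a single GRUCell are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DynamicMemoryQueue(nn.Module):
    """Sketch of a bounded memory queue: GRU summarization of panoramic
    features plus a learned gate that down-weights older slots."""

    def __init__(self, feat_dim, hidden_dim=256, capacity=16):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim, 1)       # g_j = sigmoid(w_g^T m_j + b_g)
        self.capacity = capacity                   # queue capacity K (assumed value)
        self.h = None
        self.slots = []

    def update(self, f_pan):
        # f_pan: (B, C, H', W') parser features; flattened size must equal feat_dim
        x = f_pan.flatten(start_dim=1)
        if self.h is None:
            self.h = x.new_zeros(x.size(0), self.gru.hidden_size)
        self.h = self.gru(x, self.h)               # h_t = GRU(Flatten(F_pan), h_{t-1})
        self.slots.append(self.h)
        self.slots = self.slots[-self.capacity:]   # keep at most K history slots

    def read(self):
        m = torch.stack(self.slots, dim=1)         # (B, K', hidden_dim)
        g = torch.sigmoid(self.gate(m))            # per-slot gate in (0, 1)
        return g * m                               # gated memory for the decision module
```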

Ablation on the memory module confirms its critical role: removing dynamic memory (“–Mem”) drops SR from 45.2% to 37.0% and SPL from 29.8% to 24.1%, an 8.2-percentage-point loss in success rate (Jin et al., 10 Nov 2025).

4. Experimental Methodology and Comparative Evaluation

PanoNav is evaluated under rigorous zero-shot protocols, with architecture pretraining limited to generic panoramic reconstruction tasks and internet-scale panoramic imagery (no navigation-specific exposure or environment-specific fine-tuning).

  • Benchmarks: Habitat-SIM with Gibson and Matterport3D (MP3D) scenes, holding out 20% of scenes for zero-shot testing.
  • Metrics: Success Rate (SR) and SPL (Success weighted by Path Length),

$$\mathrm{SR} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{d_i^{\mathrm{final}} < \tau\}, \quad \mathrm{SPL} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{d_i^{\mathrm{final}} < \tau\}\, \frac{\ell_i}{\max(\ell_i, p_i)}$$

where $d_i^{\mathrm{final}}$ is the final distance to the goal, $\ell_i$ is the shortest-path length, $p_i$ is the actual path length, and $\tau = 1.0$ m (a computation sketch follows this list).

  • Results: On unseen scenes, PanoNav achieves SR = 45.2% and SPL = 29.8%, versus SemExp (SR = 32.1%, SPL = 23.0%) and ORGMap (SR = 28.4%, SPL = 19.5%). All differences are statistically significant ($p < 0.01$, paired $t$-test).
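
Both metrics follow directly from per-episode logs; the short sketch below computes them under the definitions above (the record field names are hypothetical).

```python
def compute_sr_spl(episodes, tau=1.0):
    """Success Rate and SPL from per-episode records.

    Each record is assumed to carry: final_dist (meters to the nearest target
    instance at episode end), shortest_path (geodesic optimum l_i), and
    path_length (distance actually traveled, p_i)."""
    n = len(episodes)
    successes = [1.0 if ep["final_dist"] < tau else 0.0 for ep in episodes]
    sr = sum(successes) / n
    spl = sum(
        s * ep["shortest_path"] / max(ep["shortest_path"], ep["path_length"])
        for s, ep in zip(successes, episodes)
    ) / n
    return sr, spl
```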

Ablation also shows that removing the panoramic parser (“–Pan”) costs roughly 7 percentage points of SR relative to the full model (Jin et al., 10 Nov 2025).

| Method | SR (%) | SPL (%) |
| --- | --- | --- |
| Random | 3.2 ± 0.7 | 2.1 ± 0.5 |
| Heuristic Frontier (HF) | 15.8 ± 1.1 | 11.2 ± 0.8 |
| ORGMap (zero-shot) | 28.4 ± 1.5 | 19.5 ± 1.2 |
| SemExp (zero-shot) | 32.1 ± 1.3 | 23.0 ± 1.0 |
| PanoNav | 45.2 ± 1.0 | 29.8 ± 0.9 |

5. Complementarity and Comparative Insights

PanoNav’s principal advances rest on two axes:

  • Wide-field spatial grounding: Panoramic parsing produces a holistic spatial-semantic embedding from a single RGB input, enabling long-range cues and object–object/layout relationships to be exploited downstream in the decision process.
  • Structured episodic memory: The dynamic GRU queue prevents local cycles and deadlock by enabling context-dependent attention to historical exploration, effectively bridging short-term perception and strategic long-term planning within the MLLM decision module.

Ablation establishes that both modules are key: dynamic memory confers roughly 8 percentage points of SR improvement and panoramic modeling adds a further ~7 points, with the two effects largely additive and complementary.

6. Practical Significance and Research Impact

PanoNav represents the first demonstration of panoramic scene parsing and bounded dynamic memory integration within an MLLM-driven mapless navigation policy for ZSON. It outperforms classical frontier-based and semantic-map-based zero-shot baselines by a wide margin, entirely without reliance on explicit metric/topological mapping, depth sensors, or environment-adapted reward tuning—a significant leap in generalizable embodied spatial reasoning.

By demonstrating that a panoramic transformer + memory-guided decision architecture can match or exceed structured-map-based approaches, it inaugurates a new direction for mapless, multimodal generalization in zero-shot object navigation under realistic constraints (Jin et al., 10 Nov 2025).
