MSGNav: Zero-Shot Navigation
- MSGNav is a zero-shot embodied navigation system leveraging a multi-modal 3D scene graph for open-vocabulary goal specification without task-specific training.
- It integrates modules like Key Subgraph Selection, Adaptive Vocabulary Update, and Closed-Loop Reasoning to iteratively improve navigation accuracy using visual cues.
- The system achieves state-of-the-art performance on benchmarks by enhancing decision reliability with visibility-based viewpoint selection for robust last-mile navigation.
MSGNav is a zero-shot embodied navigation system leveraging a Multi-modal 3D Scene Graph (M3DSG) representation to support open-vocabulary goal specification and state-of-the-art performance with no task-specific reinforcement learning. By explicitly preserving visual evidence within scene graphs and employing a modular pipeline—including Key Subgraph Selection (KSS), Adaptive Vocabulary Update (AVU), Closed-Loop Reasoning (CLR), and a Visibility-based Viewpoint Decision (VVD) for the “last mile”—MSGNav addresses scalability, generalization, and decision reliability in realistic robotic navigation.
1. Multi-modal 3D Scene Graph Construction
The core data structure in MSGNav is the M3DSG, which encodes the agent’s spatial experience as an explicit graph $\mathcal{G}_t = (\mathbf{O}_t, \mathbf{E}_t)$ at time $t$.
1.1 Formal Specification
Nodes ($\mathbf{O}_t$): Each object node $o_i$ includes the following fields:
- $\ID_i \in \mathbb{N}$: Unique object identifier
- $C_i$: Category label (supports open vocabulary)
- $P_i \in \mathbb{R}^3$: 3D centroid
- 2D bounding box in the current frame
- Segmentation mask
- $\mathcal{PC}_i$: Object point cloud
- Visual embedding
- Room assignment
Edges ($\mathbf{E}_t$): Each unordered pair $(\ID_x, \ID_y)$ forms an edge if objects $o_x$ and $o_y$ are spatially adjacent, i.e., $\|P_x - P_y\| < \theta$. Edge features are not textual relation labels but the set of RGB-D frames in which the two objects co-occur within the adjacency threshold $\theta$. The total number of edges at time $t$ is
$N_e = \sum_{1 \leq x < y \leq N_o} \mathds{1}(\|P_x - P_y\| < \theta)$
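For concreteness, here is a minimal Python sketch of the M3DSG containers; the field names and types are illustrative rather than the paper’s exact identifiers.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    """One node o_i of the M3DSG (field names are illustrative)."""
    obj_id: int              # ID_i: unique object identifier
    category: str            # C_i: open-vocabulary category label
    centroid: np.ndarray     # P_i: 3D centroid, shape (3,)
    bbox_2d: tuple           # 2D bounding box in the current frame (x1, y1, x2, y2)
    mask: np.ndarray         # segmentation mask, shape (H, W), bool
    point_cloud: np.ndarray  # PC_i: object points, shape (N, 3)
    embedding: np.ndarray    # visual embedding of the object crop
    room: str = "unknown"    # room assignment

@dataclass
class M3DSG:
    """Multi-modal 3D scene graph: object nodes plus image-valued edges."""
    nodes: dict = field(default_factory=dict)   # ID_i -> ObjectNode
    edges: dict = field(default_factory=dict)   # (ID_x, ID_y), x < y -> list of frame ids

    def adjacent(self, x: int, y: int, theta: float) -> bool:
        """Edge criterion: centroids closer than the adjacency threshold theta."""
        px, py = self.nodes[x].centroid, self.nodes[y].centroid
        return float(np.linalg.norm(px - py)) < theta
```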
1.2 Edge Visual Cue Embedding
To preserve full visual context, each edge collects its co-occurrence images. For downstream reasoning, an image encoder (e.g., the CLIP image tower) computes a visual embedding for each stored image.
Rather than using all images, a small subset is greedily selected (see KSS), providing compact and informative edge features.
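A minimal sketch of this embedding step, assuming a Hugging Face CLIP checkpoint (`openai/clip-vit-base-patch32`) as the image tower; the actual system may use a different encoder or preprocessing.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_edge_images(image_paths):
    """Return one visual embedding per co-occurrence image stored on an edge."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)        # (num_images, dim)
    return feats / feats.norm(dim=-1, keepdim=True)       # L2-normalize for cosine similarity
```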
1.3 Real-time Data Structures and Update Algorithms
Data Structures:
- $\mathbf{O}_t$: Hash map $\ID_i \to o_i$
- $\mathbf{E}_t$: Dictionary mapping $(\ID_x, \ID_y) \to \{\mathcal{I}\}$, the set of co-occurrence images
- Inverse index mapping each image $\mathcal{I} \to \{(\ID_x, \ID_y)\}$, the edges it supports
Update Procedure:
Upon receiving a new RGB-D frame $\mathcal{I}_t$ (see the sketch after this list):
- Detect and segment objects via YOLO-W, SAM, and CLIP (category label + visual embedding).
- Match new detections to $\mathbf{O}_t$ by spatial and visual similarity; merge matched detections into existing nodes or append new objects as needed.
- Enumerate unordered pairs of co-observed objects within the adjacency threshold $\theta$ and update the edge co-occurrence sets:
- For each adjacent pair $(\ID_x, \ID_y)$, append $\mathcal{I}_t$ to $\mathbf{E}_t[(\ID_x, \ID_y)]$ and update the inverse image-to-edge index. The per-frame cost is quadratic in the number of objects considered for pairing, but this remains feasible given the modest number of detections per frame.
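The sketch below illustrates one possible per-frame update under these rules, reusing the `M3DSG`/`ObjectNode` containers from the earlier sketch; the thresholds (`theta`, `match_dist`, `match_sim`) are illustrative placeholders, not the paper’s values.

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def update_scene_graph(graph, detections, frame_id, theta=1.5,
                       match_dist=0.3, match_sim=0.8):
    """One M3DSG update step (sketch). `detections` are ObjectNode-like
    candidates extracted from the current frame."""
    seen = set()
    # 1) Match each detection to an existing node by spatial + visual similarity.
    for det in detections:
        best_id, best_sim = None, match_sim
        for obj_id, node in graph.nodes.items():
            if np.linalg.norm(node.centroid - det.centroid) > match_dist:
                continue
            sim = cosine(node.embedding, det.embedding)
            if sim > best_sim:
                best_id, best_sim = obj_id, sim
        if best_id is None:                       # unmatched -> append as a new object
            graph.nodes[det.obj_id] = det
            seen.add(det.obj_id)
        else:                                     # matched -> merge observations
            node = graph.nodes[best_id]
            node.centroid = 0.5 * (node.centroid + det.centroid)
            node.point_cloud = np.vstack([node.point_cloud, det.point_cloud])
            seen.add(best_id)

    # 2) Among objects observed in this frame, record the frame on every adjacent pair.
    for x, y in combinations(sorted(seen), 2):
        if graph.adjacent(x, y, theta):
            graph.edges.setdefault((x, y), []).append(frame_id)
    return graph
```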
2. Core MSGNav Modules
MSGNav integrates four modules in its navigation loop: KSS, AVU, and CLR run at each decision step, while VVD is invoked for the last mile. Together they keep reasoning scalable and adaptive.
2.1 Key Subgraph Selection (KSS)
This module selects a minimal subgraph containing the objects most relevant to the current target $g$, as judged by a VLM (e.g., GPT-4o).
Compress-Focus-Prune Algorithm:
- Compress: Generate a compact adjacency list where $\hat o_i=\{\ID_i, C_i\}$.
- Focus: Query the VLM with the compact graph and the target $g$ to rank objects and select the top-$k$ most relevant ones.
- Prune: Greedily choose a minimal set of images covering all edges among the selected objects via set cover (see the sketch after this list), yielding only a handful of images on average for use in VLM prompts.
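The prune step is a standard greedy set cover; the sketch below shows one way to implement it over the image-valued edges, independent of the paper’s exact code.

```python
def prune_edge_images(edges, key_ids):
    """Greedy set cover: pick the fewest co-occurrence images that together
    cover every edge among the selected key objects.

    edges:   dict mapping (ID_x, ID_y) -> list of frame ids (image-valued edges)
    key_ids: set of object IDs returned by the focus step
    """
    # Edges that must be covered and, for each image, the edges it can cover.
    uncovered = {e for e in edges if e[0] in key_ids and e[1] in key_ids}
    image_to_edges = {}
    for e in uncovered:
        for img in edges[e]:
            image_to_edges.setdefault(img, set()).add(e)

    chosen = []
    while uncovered and image_to_edges:
        # Pick the image covering the most still-uncovered edges.
        img = max(image_to_edges, key=lambda i: len(image_to_edges[i] & uncovered))
        gained = image_to_edges[img] & uncovered
        if not gained:          # remaining edges have no stored image; stop
            break
        chosen.append(img)
        uncovered -= gained
    return chosen
```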
2.2 Adaptive Vocabulary Update (AVU)
AVU enables open-vocabulary object recognition and scene understanding:
- Maintain an active label set $V_t$, initialized from the detector’s base vocabulary (e.g., ScanNet200).
- Each VLM query returns a result $R_t$ together with any newly proposed object classes $V_{\text{new}}$.
- Update $V_{t+1} = V_t \cup V_{\text{new}}$.
- Detection and graph updates in future steps then accommodate the expanded object vocabulary (a minimal sketch follows this list).
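A minimal AVU sketch, assuming the VLM is prompted to return proposed classes under a `new_classes` key (the response format is an assumption, not the paper’s interface):

```python
def adaptive_vocab_update(vocab, vlm_response):
    """AVU sketch: merge VLM-proposed classes into the active detector vocabulary.
    Assumes the VLM response is a dict with a "new_classes" list (illustrative)."""
    proposed = {c.strip().lower() for c in vlm_response.get("new_classes", [])}
    return sorted(set(vocab) | proposed)    # pass this list to the open-vocabulary detector

# Example: starting from a truncated initial vocabulary, the VLM proposes a new class.
vocab = ["chair", "table", "sofa"]
vocab = adaptive_vocab_update(vocab, {"new_classes": ["espresso machine"]})
```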
2.3 Closed-Loop Reasoning (CLR)
CLR explicitly accumulates all past VLM feedback for robust iterative reasoning. At time $t$, the agent’s decision memory is $M_t = \{R_1, R_2, \ldots, R_{t-1}\}$. Each new query incorporates $M_t$, yielding improved next-step accuracy (from 43.8% to 64.8%).
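As an illustration of closed-loop prompting, the sketch below serializes the decision memory into each VLM query; the prompt format and the `frontier`/`outcome` record fields are hypothetical.

```python
def build_vlm_prompt(key_subgraph_text, memory, goal, step):
    """CLR sketch: every query carries the full decision memory so the VLM can
    revise earlier choices instead of repeating them (format is illustrative)."""
    history = "\n".join(
        f"step {i + 1}: explored {r['frontier']} -> {r['outcome']}"
        for i, r in enumerate(memory)
    )
    return (
        f"Goal: {goal}\n"
        f"Current step: {step}\n"
        f"Key subgraph:\n{key_subgraph_text}\n"
        f"Previous decisions and outcomes:\n{history or 'none'}\n"
        "Choose the next frontier, or report the target location if found."
    )
```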
2.4 Visibility-based Viewpoint Decision (VVD)
The “last-mile” problem refers to correctly selecting the agent’s final viewpoint upon reaching the target. Direct navigation to the nearest traversable point often yields suboptimal or occluded views.
VVD samples candidate viewpoints $v$ around the predicted target object $\bar o$, scores each by the fraction of the target point cloud $\mathcal{PC}_{\bar o}$ visible without occlusion, and selects the maximally visible position:
$S(v) = \frac{1}{|\mathcal{PC}_{\bar o}|} \sum_{p \in \mathcal{PC}_{\bar o}} \mathds{1}[\text{ray}(v \rightarrow p) \text{ is occlusion-free}]$
Candidates are sampled over several radii and azimuths around the target, only traversable positions are kept, and $S(v)$ is computed for each; this improves success in the close-range (last-mile) regime from 33.9% to 52.0% (see the sketch below).
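A simplified VVD sketch, assuming occlusion is tested by ray marching through a voxelized occupancy set and that candidates are sampled on circles around the target; the radii, azimuth count, and camera height below are illustrative.

```python
import numpy as np

def visibility_score(viewpoint, target_pc, occupied_voxels, voxel=0.1, step=0.05):
    """Fraction of target points reachable from `viewpoint` by an occlusion-free ray.
    `occupied_voxels` is a set of integer voxel indices (a simplification of the
    occlusion test used by the full system)."""
    visible = 0
    for p in target_pc:
        direction = p - viewpoint
        dist = np.linalg.norm(direction)
        direction = direction / (dist + 1e-8)
        blocked = False
        for s in np.arange(step, dist - voxel, step):   # march along the ray
            q = viewpoint + s * direction
            if tuple((q // voxel).astype(int)) in occupied_voxels:
                blocked = True
                break
        visible += not blocked
    return visible / max(len(target_pc), 1)

def choose_viewpoint(target_centroid, target_pc, occupied_voxels, is_traversable,
                     radii=(0.6, 1.0, 1.5), n_azimuths=12, eye_height=1.2):
    """Sample candidates on circles around the target and keep the most visible one."""
    best_v, best_s = None, -1.0
    for r in radii:
        for a in np.linspace(0, 2 * np.pi, n_azimuths, endpoint=False):
            v = target_centroid + np.array([r * np.cos(a), r * np.sin(a), 0.0])
            v[2] = eye_height                           # assumed camera height
            if not is_traversable(v):
                continue
            s = visibility_score(v, target_pc, occupied_voxels)
            if s > best_s:
                best_v, best_s = v, s
    return best_v, best_s
```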
3. Zero-Shot Navigation Pipeline
The MSGNav pipeline is defined by the following pseudocode:
```
Input:
    target g          # category name, text description, or reference image
    T                 # max steps
Initialize:
    S ← (∅, ∅)        # empty M3DSG
    M ← ∅             # empty decision memory
    V ← V0            # initial detector vocabulary (e.g., ScanNet200)

for t in 1..T:
    I_t ← agent.get_RGBD()

    # 1) Update multi-modal 3D scene graph
    S ← update_scene_graph(S, I_t; detectors with vocabulary V)

    # 2) Key Subgraph Selection
    S^k ← KSS(S, g)

    # 3) Closed-loop & adaptive-vocabulary VLM query
    (R_t, V_new) ← VLM_prompt(S^k, M, frontier_images(S), g, t)
    V ← V ∪ V_new
    M ← M ∪ {R_t}

    if R_t is a target location:
        # 4) Last-mile viewpoint decision
        v_goal ← VVD(R_t.target_object, full_pointcloud(S))
        return low_level_planner.navigate_to(v_goal)
    else:  # R_t is the next exploration frontier
        low_level_planner.navigate_to(R_t)

fail  # target not reached within T steps
```
Key subroutines directly implement the algorithms defined in preceding sections, invoking update, selection, VLM prompt, and viewpoint decision logic as specified.
4. Experimental Results and Evaluation
4.1 Datasets and Metrics
Evaluation is reported on:
- GOAT-Bench (“Val-Unseen”): Multi-modal lifelong navigation (category, text, image goals).
- HM3D-OVON (“Val-Unseen”): Open-vocabulary object navigation.
Metrics:
- Success Rate (SR): Fraction of episodes in which the agent reaches the goal.
- SPL (Success weighted by Path Length): $\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i, \ell_i)}$, where $S_i$ indicates success, $\ell_i$ is the shortest-path length, and $p_i$ is the length of the path actually taken in episode $i$ (a small computation example follows this list).
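SPL can be computed directly from per-episode successes and path lengths; a small, self-contained example:

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length over N episodes:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    n = len(successes)
    return sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, actual_lengths)
    ) / n

# Example: two episodes, one success with a detour, one failure.
print(spl([1, 0], [5.0, 4.0], [7.5, 6.0]))   # -> 0.333...
```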
4.2 Quantitative Results
MSGNav achieves state-of-the-art results among training-free methods:
HM3D-OVON (Val-Unseen)
| Method | training-free | SR (%) | SPL (%) |
|---|---|---|---|
| VLFM | ✓ | 35.2 | 19.6 |
| TANGO | ✓ | 35.5 | 19.5 |
| Uni-NaVid | ✗ | 39.5 | 19.8 |
| MTU3D | ✗ | 40.8 | 12.1 |
| MSGNav | ✓ | 48.3 | 27.0 |
GOAT-Bench (Val-Unseen)
| Method | training-free | SR (%) | SPL (%) |
|---|---|---|---|
| TANGO | ✓ | 32.1 | 16.5 |
| 3D-Mem | ✓ | 28.8 | 15.8 |
| MTU3D | ✗ | 47.2 | 27.7 |
| MSGNav | ✓ | 52.0 | 29.6 |
4.3 Qualitative Findings
- Robustness to Perception Error: The visual edge storage in M3DSG allows visual verification of spatial relations, handling ambiguous or low-contrast cases.
- Open-vocabulary Generalization: AVU enables discovery of previously unseen classes (e.g., “espresso machine,” “vase”).
- “Last-mile” Success: VVD-determined viewpoints consistently provide clear, frontal, and unoccluded target views, outperforming naïve nearest-point strategies.
5. Significance and Research Implications
MSGNav’s explicit, visual-centric scene graph overcomes limitations of prior text-only relational models. By tightly integrating scene modeling, adaptive reasoning, incremental vocabulary enrichment, and geometric viewpoint optimization, MSGNav demonstrates the essential role of multi-modal representations for scalable, generalizable, and reliable embodied navigation. Its strong empirical gains, achieved as a training-free approach, highlight the efficacy of modular architectures that exploit high-level vision-language models and explicit geometric reasoning, and they suggest promising directions for open-world robotic interaction.