
MSGNav: Zero-Shot Navigation

Updated 16 November 2025
  • MSGNav is a zero-shot embodied navigation system leveraging a multi-modal 3D scene graph for open-vocabulary goal specification without task-specific training.
  • It integrates modules like Key Subgraph Selection, Adaptive Vocabulary Update, and Closed-Loop Reasoning to iteratively improve navigation accuracy using visual cues.
  • The system achieves state-of-the-art performance on benchmarks by enhancing decision reliability with visibility-based viewpoint selection for robust last-mile navigation.

MSGNav is a zero-shot embodied navigation system that leverages a Multi-modal 3D Scene Graph (M3DSG) representation to support open-vocabulary goal specification and achieve state-of-the-art performance with no task-specific reinforcement learning. By explicitly preserving visual evidence within the scene graph and employing a modular pipeline of Key Subgraph Selection (KSS), Adaptive Vocabulary Update (AVU), Closed-Loop Reasoning (CLR), and a Visibility-based Viewpoint Decision (VVD) for the “last mile”, MSGNav addresses scalability, generalization, and decision reliability in realistic robotic navigation.

1. Multi-modal 3D Scene Graph Construction

The core data structure in MSGNav is the M3DSG, which represents the agent’s spatial experience as an explicit graph $\mathbf{S}_t = (\mathbf{O}_t, \mathbf{E}_t)$ at time $t$.

1.1 Formal Specification

Nodes ($\mathbf{O}_t = \{o_i\}_{i=1}^{N_o}$): Each object $o_i$ includes the following fields:

  • $\mathrm{ID}_i \in \mathbb{N}$: Unique object identifier
  • $C_i$: Category label (supports open vocabulary)
  • $P_i \in \mathbb{R}^3$: 3D centroid
  • $B_i$: 2D bounding box (in current frame)
  • $M_i$: Segmentation mask
  • $PC_i \subset \mathbb{R}^3$: Object point cloud
  • $V_i \in \mathbb{R}^d$: Visual embedding
  • $R_i$: Room assignment

Edges ($\mathbf{E}_t$): Each unordered pair $(\mathrm{ID}_x, \mathrm{ID}_y)$ forms an edge if objects $x$ and $y$ are spatially adjacent ($\|P_x - P_y\| < \theta$). Edge features are not textual but a set of RGB-D frames $\mathbf{I}_{xy} = \{\mathcal{I}_k\}_{k=1}^{|\mathbf{I}_{xy}|}$ in which those objects co-occur within threshold $\theta$. The total number of edges at time $t$ is

$N_e = \sum_{1 \leq x < y \leq N_o} \mathds{1}(\|P_x - P_y\| < \theta)$
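As a concrete illustration, the node and edge records can be held in plain Python containers. This is a minimal sketch under assumed field names and types; the paper does not publish this exact layout:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectNode:
    obj_id: int              # ID_i: unique object identifier
    category: str            # C_i: open-vocabulary category label
    centroid: np.ndarray     # P_i: 3D centroid, shape (3,)
    bbox_2d: tuple           # B_i: 2D bounding box in the current frame
    mask: np.ndarray         # M_i: segmentation mask
    point_cloud: np.ndarray  # PC_i: object point cloud, shape (N, 3)
    embedding: np.ndarray    # V_i: visual embedding, shape (d,)
    room: str                # R_i: room assignment

# O_t: hash map from object ID to node; E_t: unordered ID pairs mapped to the
# RGB-D frames (here, frame indices) in which the two objects co-occur.
ObjectMap = dict[int, ObjectNode]
EdgeMap = dict[tuple[int, int], list[int]]
```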

1.2 Edge Visual Cue Embedding

To preserve full visual context, each edge collects co-occurrence images. For downstream reasoning, an image encoder $\phi: \mathcal{I} \rightarrow \mathbb{R}^d$ (e.g., the CLIP image tower) computes visual embeddings:

$E_{xy} = \frac{1}{|\mathbf{I}_{xy}|} \sum_{\mathcal{I} \in \mathbf{I}_{xy}} \phi(\mathcal{I})$

Rather than using all images, a small subset $\mathbf{I}_{xy}^s \subset \mathbf{I}_{xy}$ is greedily selected (see KSS), providing compact and informative edge features.
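A minimal sketch of this edge embedding, assuming `phi` is any image encoder (e.g., a CLIP image-tower wrapper) that returns a d-dimensional vector; the function name and interface are illustrative:

```python
import numpy as np

def edge_embedding(cooccurrence_images, phi):
    """Average the encoder features of an edge's (selected) co-occurrence images.

    cooccurrence_images: the subset I^s_xy of frames depicting both objects
    phi:                 callable mapping one image to a (d,) embedding
    Returns E_xy as a (d,) numpy array.
    """
    feats = np.stack([phi(img) for img in cooccurrence_images])
    return feats.mean(axis=0)
```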

1.3 Real-time Data Structures and Update Algorithms

Data Structures:

  • $\mathbf{O}_t$: Hash map $\mathrm{ID}_i \to o_i$
  • $\mathbf{E}_t$: Dictionary mapping $(\mathrm{ID}_x, \mathrm{ID}_y) \to \{\mathcal{I}\}$
  • $\mathbf{H}$: Inverse map $\mathcal{I} \to \{(\mathrm{ID}_x, \mathrm{ID}_y)\}$

Update Procedure:

Upon receiving a new RGB-D frame $\mathcal{I}_t$:

  1. Detect objects via YOLO-W, SAM, and CLIP (category + visual embedding).
  2. Match new detections to $\mathbf{O}_{t-1}$ by spatial and visual similarity; merge or append new objects as needed:

$\mathbf{O}_t = \Phi_{\mathrm{merge}}\left(\Phi_{\mathrm{match}}(\mathbf{O}^{\text{frame}_t}, \mathbf{O}_{t-1}) \cup \mathbf{O}^{\text{frame}_t}\right)$

  3. Enumerate unordered pairs within the adjacency threshold and update edge co-occurrence sets, as sketched after this list:
    • For each $(o_x, o_y)$ with $\|P_x - P_y\| \leq \theta$, append $\mathcal{I}_t$ to $\mathbf{E}_t[(\mathrm{ID}_x, \mathrm{ID}_y)]$ and update $\mathbf{H}$. Overall complexity per frame is $O(n_\text{frame}^2)$ but remains feasible given the modest number of per-frame detections.
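The edge-update step can be sketched as follows; the threshold value and the use of frame indices as keys for $\mathbf{H}$ are illustrative assumptions:

```python
import itertools
import numpy as np

def update_edges(objects, edges, inv_map, frame_idx, theta=1.5):
    """Enumerate unordered object pairs within the adjacency threshold and
    record the current frame as a visual co-occurrence cue for that edge.

    objects:   dict mapping object ID -> node with a .centroid attribute
    edges:     dict mapping (ID_x, ID_y) -> list of frame indices (E_t)
    inv_map:   dict mapping frame index -> set of edges it supports (H)
    frame_idx: index of the current RGB-D frame I_t
    theta:     spatial adjacency threshold in metres (illustrative value)
    """
    for id_x, id_y in itertools.combinations(sorted(objects), 2):
        dist = np.linalg.norm(objects[id_x].centroid - objects[id_y].centroid)
        if dist <= theta:
            edges.setdefault((id_x, id_y), []).append(frame_idx)
            inv_map.setdefault(frame_idx, set()).add((id_x, id_y))
    return edges, inv_map
```

In practice only pairs involving an object detected in the current frame need to be rechecked, which keeps the per-frame cost at the $O(n_\text{frame}^2)$ noted above.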

2. Core MSGNav Modules

MSGNav integrates four sequential modules at every step, ensuring scalable and adaptive reasoning in the navigation loop.

2.1 Key Subgraph Selection (KSS)

This module selects a minimal subgraph $S^k_t = (O^k_t, E^k_t)$ containing the $k$ objects most relevant to the current target $g$, as determined by a VLM (e.g., GPT-4o).

Compress-Focus-Prune Algorithm:

  1. Compress: Generate a compact adjacency list $\hat S_t = (\hat O, \hat E)$ where $\hat o_i = \{\mathrm{ID}_i, C_i\}$.
  2. Focus: Query the VLM with $\hat S_t$ and $g$ to rank and select the top-$k$ objects $O^\mathrm{rel}$.
  3. Prune: Greedily choose a minimal set of images covering all edges among $O^\mathrm{rel}$ via set cover (sketched below), yielding $\sim 4$ images on average for use in VLM prompts.
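The Prune step is a standard greedy set cover. The sketch below assumes the relevant edges and an image-to-edges index have already been gathered; names and data shapes are illustrative:

```python
def greedy_image_cover(edges_to_cover, image_to_edges):
    """Greedily select images until every relevant edge is depicted at least once.

    edges_to_cover: set of (ID_x, ID_y) edges among the top-k relevant objects
    image_to_edges: dict mapping a candidate frame index to the set of
                    relevant edges visible in that frame
    Returns the selected frame indices (on the order of ~4 in the paper).
    """
    uncovered = set(edges_to_cover)
    selected = []
    while uncovered and image_to_edges:
        # Pick the image that covers the most still-uncovered edges.
        best = max(image_to_edges, key=lambda im: len(image_to_edges[im] & uncovered))
        gained = image_to_edges[best] & uncovered
        if not gained:          # no remaining image depicts a missing edge
            break
        selected.append(best)
        uncovered -= gained
    return selected
```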

2.2 Adaptive Vocabulary Update (AVU)

AVU enables open-vocabulary object recognition and scene understanding:

  • Maintain an active label set $V_{t-1}$.
  • Each VLM query returns a result $\mathcal{R}_t$ and any newly proposed object classes $\hat V_t$.
  • Update $V_t \gets V_{t-1} \cup \hat V_t$.
  • Detection and graph updates in future steps will accommodate the expanded object vocabulary.

Formally:

$(\mathcal{R}_t, \hat V_t) = \mathrm{VLM}(S^k_t, \mathbf{F}_t, g, t), \qquad V_t = V_{t-1} \cup \hat V_t$
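In code, one AVU step reduces to a set union around the VLM call; `query_vlm` is an assumed wrapper function, not an API from the paper:

```python
def avu_step(vocab, key_subgraph, frontier_images, goal, t, query_vlm):
    """One Adaptive Vocabulary Update step.

    query_vlm: assumed callable returning (result R_t, newly proposed classes)
    Returns the VLM result and the expanded vocabulary V_t = V_{t-1} ∪ V̂_t.
    """
    result, new_classes = query_vlm(key_subgraph, frontier_images, goal, t)
    return result, vocab | set(new_classes)
```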

2.3 Closed-Loop Reasoning (CLR)

CLR explicitly accumulates all past VLM feedback for robust iterative reasoning. At time $t$, the agent’s memory $\mathbf{M}_t$ is

$\mathbf{M}_t = \mathbf{M}_{t-1} \cup \{\mathcal{R}_{t-1}\}$

Each new query incorporates $\mathbf{M}_t$, yielding improved next-step accuracy (from 43.8% to 64.8%).
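A minimal sketch of how the accumulated memory might be folded into the next VLM query; the prompt wording is illustrative and not the paper's template:

```python
def build_clr_prompt(memory, key_subgraph_text, goal, t):
    """Prepend the history of earlier VLM decisions so the model can revise them."""
    history = "\n".join(f"step {i}: {decision}" for i, decision in enumerate(memory))
    return (
        f"Goal: {goal}\nCurrent step: {t}\n"
        f"Previous decisions:\n{history or '(none yet)'}\n"
        f"Key subgraph:\n{key_subgraph_text}\n"
        "Decide the next exploration frontier or report the target location."
    )
```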

2.4 Visibility-based Viewpoint Decision (VVD)

The “last-mile” problem refers to correctly selecting the agent’s final viewpoint upon reaching the target. Direct navigation to the nearest traversable point often yields suboptimal or occluded views.

VVD samples candidate viewpoints near the predicted target coordinates $\bar o$, scores each by the fraction of the target point cloud $\mathcal{PC}_{\bar o}$ visible without occlusion, and selects the maximally visible position:

$S(v) = \frac{1}{|\mathcal{PC}_{\bar o}|} \sum_{p \in \mathcal{PC}_{\bar o}} \mathds{1}[\text{ray}(v \rightarrow p) \text{ is occlusion-free}]$

Sampling over radii and azimuths, maintaining only traversable candidates, and computing $S(v)$ for each, VVD improves close-range ($< 0.25$ m) success from 33.9% to 52.0%.
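A sketch of the visibility score, assuming an occupancy query `is_occupied(point)` against a voxel or depth map and simple ray marching; both are implementation choices not specified in the summary:

```python
import numpy as np

def viewpoint_score(view_pos, target_points, is_occupied, step=0.05):
    """Fraction of target points reachable from view_pos by an occlusion-free ray.

    view_pos:      candidate viewpoint, shape (3,)
    target_points: target object's point cloud PC_ō, shape (N, 3)
    is_occupied:   callable returning True if a 3D point lies in occupied space
    step:          ray-marching resolution in metres (illustrative)
    """
    visible = 0
    for p in target_points:
        direction = p - view_pos
        dist = np.linalg.norm(direction)
        if dist < step:
            visible += 1
            continue
        direction = direction / dist
        # March along the ray, stopping just short of the target point itself.
        samples = np.arange(step, dist - step, step)
        if not any(is_occupied(view_pos + t * direction) for t in samples):
            visible += 1
    return visible / len(target_points)
```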

3. Zero-Shot Navigation Pipeline

The MSGNav pipeline is defined by the following pseudocode:

Input:
  target g         # category name, text description, or reference image
  T                # max steps
Initialize:
  S ← (∅, ∅)       # empty M3DSG
  M ← ∅            # empty decision memory
  V ← V0           # initial detector vocabulary (e.g. ScanNet200)
for t in 1..T:
  I_t ← agent.get_RGBD()
  # 1) Update multi-modal 3D scene graph
  S ← update_scene_graph(S, I_t; detectors w/ vocab V)
  # 2) Key Subgraph Selection
  S^k ← KSS(S, g)
  # 3) Closed-Loop & Adaptive Vocabulary VLM query
  (R_t, V_new) ← VLM_prompt(S^k, M, frontier_images(S), g, t)
  V ← V ∪ V_new
  M ← M ∪ {R_t}
  if R_t is a target location:
    # 4) Last-mile viewpoint decision
    v_goal ← VVD(R_t.target_object, full_pointcloud(S))
    return low_level_planner.navigate_to(v_goal)
  else:
    # R_t is the next exploration frontier
    low_level_planner.navigate_to(R_t)
end_for
fail  # did not reach the target within T steps

Key subroutines directly implement the algorithms defined in preceding sections, invoking update, selection, VLM prompt, and viewpoint decision logic as specified.

4. Experimental Results and Evaluation

4.1 Datasets and Metrics

Evaluation is reported on:

  • GOAT-Bench (“Val-Unseen”): Multi-modal lifelong navigation (category, text, image goals).
  • HM3D-OVON (“Val-Unseen”): Open-vocabulary object navigation.

Metrics:

  • Success Rate (SR): Fraction of episodes in which the agent reaches the goal.
  • SPL (Success weighted by Path Length):

$\text{SPL} = \frac{1}{N}\sum_{i} S_i \cdot \frac{\ell_i^{\mathrm{short}}}{\max(\ell_i^{\mathrm{short}}, \ell_i^{\mathrm{agent}})}$
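For reference, SPL can be computed over a batch of episodes as a direct transcription of the formula above (argument names are illustrative):

```python
def compute_spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by Path Length over N episodes.

    successes:        iterable of 0/1 episode outcomes S_i
    shortest_lengths: shortest-path lengths l_i^short
    agent_lengths:    path lengths actually travelled l_i^agent
    """
    terms = [
        s * (l_short / max(l_short, l_agent))
        for s, l_short, l_agent in zip(successes, shortest_lengths, agent_lengths)
    ]
    return sum(terms) / len(terms)
```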

4.2 Quantitative Results

MSGNav achieves state-of-the-art results among training-free methods:

HM3D-OVON (Val-Unseen)

| Method | Training-free | SR (%) | SPL (%) |
|---|---|---|---|
| VLFM | | 35.2 | 19.6 |
| TANGO | | 35.5 | 19.5 |
| Uni-NaVid | | 39.5 | 19.8 |
| MTU3D | | 40.8 | 12.1 |
| MSGNav | | 48.3 | 27.0 |

GOAT-Bench (Val-Unseen)

| Method | Training-free | SR (%) | SPL (%) |
|---|---|---|---|
| TANGO | | 32.1 | 16.5 |
| 3D-Mem | | 28.8 | 15.8 |
| MTU3D | | 47.2 | 27.7 |
| MSGNav | | 52.0 | 29.6 |

4.3 Qualitative Findings

  • Robustness to Perception Error: The visual edge storage in M3DSG allows visual verification of spatial relations, handling ambiguous or low-contrast cases.
  • Open-vocabulary Generalization: AVU enables discovery of previously unseen classes (e.g., “espresso machine,” “vase”).
  • “Last-mile” Success: VVD-determined viewpoints consistently provide clear, frontal, and unoccluded target views, outperforming naïve nearest-point strategies.

5. Significance and Research Implications

MSGNav’s explicit, vision-centric scene graph overcomes limitations of prior text-only relational models. By tightly integrating scene modeling, adaptive reasoning, incremental vocabulary enrichment, and geometric viewpoint optimization, MSGNav demonstrates the essential role of multi-modal representations for scalable, generalizable, and reliable embodied navigation. Strong empirical gains, particularly as a training-free approach, highlight the efficacy of modular architectures that exploit high-level vision-language models and explicit geometric reasoning, suggesting promising directions for open-world robotic interaction.
