MSGNav: Zero-Shot Navigation
- MSGNav is a zero-shot embodied navigation system leveraging a multi-modal 3D scene graph for open-vocabulary goal specification without task-specific training.
- It integrates modules like Key Subgraph Selection, Adaptive Vocabulary Update, and Closed-Loop Reasoning to iteratively improve navigation accuracy using visual cues.
- The system achieves state-of-the-art performance on benchmarks by enhancing decision reliability with visibility-based viewpoint selection for robust last-mile navigation.
MSGNav is a zero-shot embodied navigation system leveraging a Multi-modal 3D Scene Graph (M3DSG) representation to support open-vocabulary goal specification and state-of-the-art performance with no task-specific reinforcement learning. By explicitly preserving visual evidence within scene graphs and employing a modular pipeline—including Key Subgraph Selection (KSS), Adaptive Vocabulary Update (AVU), Closed-Loop Reasoning (CLR), and a Visibility-based Viewpoint Decision (VVD) for the “last mile”—MSGNav addresses scalability, generalization, and decision reliability in realistic robotic navigation.
1. Multi-modal 3D Scene Graph Construction
The core data structure in MSGNav is the M3DSG, which encodes the agent’s spatial experience as an explicit graph $\mathcal{G}_t = (\mathbf{O}_t, \mathbf{E}_t)$ at time $t$.
1.1 Formal Specification
Nodes ($\mathbf{O}_t$): Each object node $o_i$ includes the following fields:
- $\ID_i \in \mathbb{N}$: Unique object identifier
- $C_i$: Category label (supports open vocabulary)
- $P_i \in \mathbb{R}^3$: 3D centroid
- 2D bounding box in the current frame
- Segmentation mask
- $\mathcal{PC}_i$: Object point cloud
- Visual embedding
- Room assignment
Edges ($\mathbf{E}_t$): Each unordered pair $(\ID_x, \ID_y)$ forms an edge if objects $o_x$ and $o_y$ are spatially adjacent, i.e., $\|P_x - P_y\| < \theta$. Edge features are not textual relation labels but the set of RGB-D frames in which the two objects co-occur within the adjacency threshold $\theta$. The total number of edges at time $t$ is
$N_e = \sum_{1 \leq x < y \leq N_o} \mathds{1}(\|P_x - P_y\| < \theta)$
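For concreteness, here is a minimal Python sketch of the M3DSG containers; the field names and types are illustrative rather than the paper’s exact identifiers.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    """One node o_i of the M3DSG (field names are illustrative)."""
    obj_id: int              # ID_i: unique object identifier
    category: str            # C_i: open-vocabulary category label
    centroid: np.ndarray     # P_i: 3D centroid, shape (3,)
    bbox_2d: tuple           # 2D bounding box in the current frame (x1, y1, x2, y2)
    mask: np.ndarray         # segmentation mask, shape (H, W), bool
    point_cloud: np.ndarray  # PC_i: object points, shape (N, 3)
    embedding: np.ndarray    # visual embedding of the object crop
    room: str = "unknown"    # room assignment

@dataclass
class M3DSG:
    """Multi-modal 3D scene graph: object nodes plus image-valued edges."""
    nodes: dict = field(default_factory=dict)   # ID_i -> ObjectNode
    edges: dict = field(default_factory=dict)   # (ID_x, ID_y), x < y -> list of frame ids

    def adjacent(self, x: int, y: int, theta: float) -> bool:
        """Edge criterion: centroids closer than the adjacency threshold theta."""
        px, py = self.nodes[x].centroid, self.nodes[y].centroid
        return float(np.linalg.norm(px - py)) < theta
```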
1.2 Edge Visual Cue Embedding
To preserve full visual context, each edge collects its co-occurrence images. For downstream reasoning, an image encoder (e.g., the CLIP image tower) computes a visual embedding for each stored image.
Rather than using all images, a small subset is greedily selected (see KSS), providing compact and informative edge features.
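A minimal sketch of this embedding step, assuming a Hugging Face CLIP checkpoint (`openai/clip-vit-base-patch32`) as the image tower; the actual system may use a different encoder or preprocessing.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_edge_images(image_paths):
    """Return one visual embedding per co-occurrence image stored on an edge."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)        # (num_images, dim)
    return feats / feats.norm(dim=-1, keepdim=True)       # L2-normalize for cosine similarity
```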
1.3 Real-time Data Structures and Update Algorithms
Data Structures:
- $\mathbf{O}_t$: Hash map $\ID_i \to o_i$
- $\mathbf{E}_t$: Dictionary mapping $(\ID_x, \ID_y) \to \{\mathcal{I}\}$, the set of co-occurrence images
- Inverse index mapping each image $\mathcal{I} \to \{(\ID_x, \ID_y)\}$, the edges it supports
Update Procedure:
Upon receiving a new RGB-D frame $\mathcal{I}_t$ (see the sketch after this list):
- Detect and segment objects via YOLO-W, SAM, and CLIP (category label + visual embedding).
- Match new detections to $\mathbf{O}_t$ by spatial and visual similarity; merge matched detections into existing nodes or append new objects as needed.
- Enumerate unordered pairs of co-observed objects within the adjacency threshold $\theta$ and update the edge co-occurrence sets:
- For each adjacent pair $(\ID_x, \ID_y)$, append $\mathcal{I}_t$ to $\mathbf{E}_t[(\ID_x, \ID_y)]$ and update the inverse image-to-edge index. The per-frame cost is quadratic in the number of objects considered for pairing, but this remains feasible given the modest number of detections per frame.
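The sketch below illustrates one possible per-frame update under these rules, reusing the `M3DSG`/`ObjectNode` containers from the earlier sketch; the thresholds (`theta`, `match_dist`, `match_sim`) are illustrative placeholders, not the paper’s values.

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def update_scene_graph(graph, detections, frame_id, theta=1.5,
                       match_dist=0.3, match_sim=0.8):
    """One M3DSG update step (sketch). `detections` are ObjectNode-like
    candidates extracted from the current frame."""
    seen = set()
    # 1) Match each detection to an existing node by spatial + visual similarity.
    for det in detections:
        best_id, best_sim = None, match_sim
        for obj_id, node in graph.nodes.items():
            if np.linalg.norm(node.centroid - det.centroid) > match_dist:
                continue
            sim = cosine(node.embedding, det.embedding)
            if sim > best_sim:
                best_id, best_sim = obj_id, sim
        if best_id is None:                       # unmatched -> append as a new object
            graph.nodes[det.obj_id] = det
            seen.add(det.obj_id)
        else:                                     # matched -> merge observations
            node = graph.nodes[best_id]
            node.centroid = 0.5 * (node.centroid + det.centroid)
            node.point_cloud = np.vstack([node.point_cloud, det.point_cloud])
            seen.add(best_id)

    # 2) Among objects observed in this frame, record the frame on every adjacent pair.
    for x, y in combinations(sorted(seen), 2):
        if graph.adjacent(x, y, theta):
            graph.edges.setdefault((x, y), []).append(frame_id)
    return graph
```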
2. Core MSGNav Modules
MSGNav integrates four modules in its navigation loop: KSS, AVU, and CLR run at each decision step, while VVD is invoked for the last mile. Together they keep reasoning scalable and adaptive.
2.1 Key Subgraph Selection (KSS)
This module selects a minimal subgraph containing the objects most relevant to the current target $g$, as judged by a VLM (e.g., GPT-4o).
Compress-Focus-Prune Algorithm:
- Compress: Generate a compact adjacency list where $\hat o_i=\{\ID_i, C_i\}$.
- Focus: Query the VLM with the compact graph and the target $g$ to rank objects and select the top-$k$ most relevant ones.
- Prune: Greedily choose a minimal set of images covering all edges among the selected objects via set cover (see the sketch after this list), yielding only a handful of images on average for use in VLM prompts.
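The prune step is a standard greedy set cover; the sketch below shows one way to implement it over the image-valued edges, independent of the paper’s exact code.

```python
def prune_edge_images(edges, key_ids):
    """Greedy set cover: pick the fewest co-occurrence images that together
    cover every edge among the selected key objects.

    edges:   dict mapping (ID_x, ID_y) -> list of frame ids (image-valued edges)
    key_ids: set of object IDs returned by the focus step
    """
    # Edges that must be covered and, for each image, the edges it can cover.
    uncovered = {e for e in edges if e[0] in key_ids and e[1] in key_ids}
    image_to_edges = {}
    for e in uncovered:
        for img in edges[e]:
            image_to_edges.setdefault(img, set()).add(e)

    chosen = []
    while uncovered and image_to_edges:
        # Pick the image covering the most still-uncovered edges.
        img = max(image_to_edges, key=lambda i: len(image_to_edges[i] & uncovered))
        gained = image_to_edges[img] & uncovered
        if not gained:          # remaining edges have no stored image; stop
            break
        chosen.append(img)
        uncovered -= gained
    return chosen
```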
2.2 Adaptive Vocabulary Update (AVU)
AVU enables open-vocabulary object recognition and scene understanding:
- Maintain an active label set $V_t$, initialized from the detector’s base vocabulary (e.g., ScanNet200).
- Each VLM query returns a result $R_t$ together with any newly proposed object classes $V_{\text{new}}$.
- Update $V_{t+1} = V_t \cup V_{\text{new}}$.
- Detection and graph updates in future steps then accommodate the expanded object vocabulary (a minimal sketch follows this list).
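A minimal AVU sketch, assuming the VLM is prompted to return proposed classes under a `new_classes` key (the response format is an assumption, not the paper’s interface):

```python
def adaptive_vocab_update(vocab, vlm_response):
    """AVU sketch: merge VLM-proposed classes into the active detector vocabulary.
    Assumes the VLM response is a dict with a "new_classes" list (illustrative)."""
    proposed = {c.strip().lower() for c in vlm_response.get("new_classes", [])}
    return sorted(set(vocab) | proposed)    # pass this list to the open-vocabulary detector

# Example: starting from a truncated initial vocabulary, the VLM proposes a new class.
vocab = ["chair", "table", "sofa"]
vocab = adaptive_vocab_update(vocab, {"new_classes": ["espresso machine"]})
```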
2.3 Closed-Loop Reasoning (CLR)
CLR explicitly accumulates all past VLM feedback for robust iterative reasoning. At time $t$, the agent’s decision memory is $M_t = \{R_1, R_2, \ldots, R_{t-1}\}$. Each new query incorporates $M_t$, yielding improved next-step accuracy (from 43.8% to 64.8%).
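As an illustration of closed-loop prompting, the sketch below serializes the decision memory into each VLM query; the prompt format and the `frontier`/`outcome` record fields are hypothetical.

```python
def build_vlm_prompt(key_subgraph_text, memory, goal, step):
    """CLR sketch: every query carries the full decision memory so the VLM can
    revise earlier choices instead of repeating them (format is illustrative)."""
    history = "\n".join(
        f"step {i + 1}: explored {r['frontier']} -> {r['outcome']}"
        for i, r in enumerate(memory)
    )
    return (
        f"Goal: {goal}\n"
        f"Current step: {step}\n"
        f"Key subgraph:\n{key_subgraph_text}\n"
        f"Previous decisions and outcomes:\n{history or 'none'}\n"
        "Choose the next frontier, or report the target location if found."
    )
```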
2.4 Visibility-based Viewpoint Decision (VVD)
The “last-mile” problem refers to correctly selecting the agent’s final viewpoint upon reaching the target. Direct navigation to the nearest traversable point often yields suboptimal or occluded views.
VVD samples candidate viewpoints $v$ around the predicted target object $\bar o$, scores each by the fraction of the target point cloud $\mathcal{PC}_{\bar o}$ visible without occlusion, and selects the maximally visible position:
$S(v) = \frac{1}{|\mathcal{PC}_{\bar o}|} \sum_{p \in \mathcal{PC}_{\bar o}} \mathds{1}[\text{ray}(v \rightarrow p) \text{ is occlusion-free}]$
Candidates are sampled over several radii and azimuths around the target, only traversable positions are kept, and $S(v)$ is computed for each; this improves success in the close-range (last-mile) regime from 33.9% to 52.0% (see the sketch below).
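A simplified VVD sketch, assuming occlusion is tested by ray marching through a voxelized occupancy set and that candidates are sampled on circles around the target; the radii, azimuth count, and camera height below are illustrative.

```python
import numpy as np

def visibility_score(viewpoint, target_pc, occupied_voxels, voxel=0.1, step=0.05):
    """Fraction of target points reachable from `viewpoint` by an occlusion-free ray.
    `occupied_voxels` is a set of integer voxel indices (a simplification of the
    occlusion test used by the full system)."""
    visible = 0
    for p in target_pc:
        direction = p - viewpoint
        dist = np.linalg.norm(direction)
        direction = direction / (dist + 1e-8)
        blocked = False
        for s in np.arange(step, dist - voxel, step):   # march along the ray
            q = viewpoint + s * direction
            if tuple((q // voxel).astype(int)) in occupied_voxels:
                blocked = True
                break
        visible += not blocked
    return visible / max(len(target_pc), 1)

def choose_viewpoint(target_centroid, target_pc, occupied_voxels, is_traversable,
                     radii=(0.6, 1.0, 1.5), n_azimuths=12, eye_height=1.2):
    """Sample candidates on circles around the target and keep the most visible one."""
    best_v, best_s = None, -1.0
    for r in radii:
        for a in np.linspace(0, 2 * np.pi, n_azimuths, endpoint=False):
            v = target_centroid + np.array([r * np.cos(a), r * np.sin(a), 0.0])
            v[2] = eye_height                           # assumed camera height
            if not is_traversable(v):
                continue
            s = visibility_score(v, target_pc, occupied_voxels)
            if s > best_s:
                best_v, best_s = v, s
    return best_v, best_s
```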
3. Zero-Shot Navigation Pipeline
The MSGNav pipeline is defined by the following pseudocode:
```
Input:
    target g          # category name, text description, or reference image
    T                 # max steps
Initialize:
    S ← (∅, ∅)        # empty M3DSG
    M ← ∅             # empty decision memory
    V ← V0            # initial detector vocabulary (e.g., ScanNet200)

for t in 1..T:
    I_t ← agent.get_RGBD()

    # 1) Update multi-modal 3D scene graph
    S ← update_scene_graph(S, I_t; detectors with vocabulary V)

    # 2) Key Subgraph Selection
    S^k ← KSS(S, g)

    # 3) Closed-loop & adaptive-vocabulary VLM query
    (R_t, V_new) ← VLM_prompt(S^k, M, frontier_images(S), g, t)
    V ← V ∪ V_new
    M ← M ∪ {R_t}

    if R_t is a target location:
        # 4) Last-mile viewpoint decision
        v_goal ← VVD(R_t.target_object, full_pointcloud(S))
        return low_level_planner.navigate_to(v_goal)
    else:  # R_t is the next exploration frontier
        low_level_planner.navigate_to(R_t)

fail  # target not reached within T steps
```
Key subroutines directly implement the algorithms defined in preceding sections, invoking update, selection, VLM prompt, and viewpoint decision logic as specified.
4. Experimental Results and Evaluation
4.1 Datasets and Metrics
Evaluation is reported on:
- GOAT-Bench (“Val-Unseen”): Multi-modal lifelong navigation (category, text, image goals).
- HM3D-OVON (“Val-Unseen”): Open-vocabulary object navigation.
Metrics:
- Success Rate (SR): Fraction of episodes in which the agent reaches the goal.
- SPL (Success weighted by Path Length): $\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i, \ell_i)}$, where $S_i$ indicates success, $\ell_i$ is the shortest-path length, and $p_i$ is the length of the path actually taken in episode $i$ (a small computation example follows this list).
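SPL can be computed directly from per-episode successes and path lengths; a small, self-contained example:

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length over N episodes:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    n = len(successes)
    return sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, actual_lengths)
    ) / n

# Example: two episodes, one success with a detour, one failure.
print(spl([1, 0], [5.0, 4.0], [7.5, 6.0]))   # -> 0.333...
```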
4.2 Quantitative Results
MSGNav achieves state-of-the-art results among training-free methods:
HM3D-OVON (Val-Unseen)
| Method | training-free | SR (%) | SPL (%) |
|---|---|---|---|
| VLFM | ✓ | 35.2 | 19.6 |
| TANGO | ✓ | 35.5 | 19.5 |
| Uni-NaVid | ✗ | 39.5 | 19.8 |
| MTU3D | ✗ | 40.8 | 12.1 |
| MSGNav | ✓ | 48.3 | 27.0 |
GOAT-Bench (Val-Unseen)
| Method | training-free | SR (%) | SPL (%) |
|---|---|---|---|
| TANGO | ✓ | 32.1 | 16.5 |
| 3D-Mem | ✓ | 28.8 | 15.8 |
| MTU3D | ✗ | 47.2 | 27.7 |
| MSGNav | ✓ | 52.0 | 29.6 |
4.3 Qualitative Findings
- Robustness to Perception Error: The visual edge storage in M3DSG allows visual verification of spatial relations, handling ambiguous or low-contrast cases.
- Open-vocabulary Generalization: AVU enables discovery of previously unseen classes (e.g., “espresso machine,” “vase”).
- “Last-mile” Success: VVD-determined viewpoints consistently provide clear, frontal, and unoccluded target views, outperforming naïve nearest-point strategies.
5. Significance and Research Implications
MSGNav’s explicit, visual-centric scene graph overcomes limitations of prior text-only relational models. By tightly integrating scene modeling, adaptive reasoning, incremental vocabulary enrichment, and geometric viewpoint optimization, MSGNav demonstrates the essential role of multi-modal representations for scalable, generalizable, and reliable embodied navigation. Its strong empirical gains, achieved as a training-free approach, highlight the efficacy of modular architectures that exploit high-level vision-language models and explicit geometric reasoning, and they suggest promising directions for open-world robotic interaction.