SGImagineNav: Zero-shot Scene-Graph Navigation

Updated 22 May 2026

SGImagineNav is a scene-graph-based system that uses hierarchical world modeling and VLM-powered imagination for open-vocabulary, zero-shot navigation in large 3D environments.
It employs a bird’s-eye view projection and speculative inference to augment its scene graph, improving unobserved object recall by approximately 6–7%.
The adaptive navigation strategy balances semantic exploitation and geometric exploration, achieving over 7% higher success rates than previous state-of-the-art methods.

SGImagineNav is a scene-graph-based, imaginative navigation system for embodied agents performing open-vocabulary object-goal navigation in large, previously unseen 3D environments. The framework unifies semantic world modeling, VLM-powered imagination, and adaptive decision-making to achieve robust, zero-shot navigation in both simulation and real-world settings. Its architecture centers on a hierarchical, evolving scene graph that is augmented by speculative inference about unexplored regions, allowing agents to balance semantic exploitation and geometric exploration.

1. Hierarchical Scene Graph World Model

At the core of SGImagineNav is the hierarchical scene graph $g_t = (V_t, E_t)$, with node set $V_t$ and edge set $E_t$, where $t$ denotes the current time step.

Node types:
- Object nodes: individual detected objects (e.g., bed, chair, stove) with 3D location, category label, and learned visual feature.
- Region nodes: clusters of objects within functional regions (kitchen, bedroom, living room), capturing co-occurrences and spatial proximity.
- Floor nodes: root markers for each floor; new floor nodes are instantiated when the agent traverses a staircase.

Edge structure:
- Edges exist only between adjacent semantic levels (object-region, region-floor), so the graph forms a tree rooted at each floor node rather than a general graph.
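
The three node levels and the adjacent-level edge rule can be illustrated with a small data structure. This is a minimal sketch, not the authors' implementation: the names `Node`, `SceneGraph`, `add_node`, and `objects_in` are hypothetical, and the learned visual feature is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Coarse-to-fine semantic levels named in the text.
LEVELS = ("floor", "region", "object")


@dataclass
class Node:
    node_id: str
    level: str                       # one of LEVELS
    label: str                       # e.g. "kitchen" or "stove"
    position: Optional[Tuple[float, float, float]] = None  # 3D location
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


class SceneGraph:
    """Hypothetical container enforcing object-region-floor edges only."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add_node(self, node: Node, parent_id: Optional[str] = None) -> None:
        if node.node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node.node_id}")
        if parent_id is None:
            # Only floor nodes may be roots.
            if node.level != "floor":
                raise ValueError("only floor nodes may be added without a parent")
        else:
            parent = self.nodes[parent_id]
            # An edge must connect two adjacent levels (floor-region, region-object).
            if LEVELS.index(node.level) - LEVELS.index(parent.level) != 1:
                raise ValueError("edges may only link adjacent semantic levels")
            node.parent = parent_id
            parent.children.append(node.node_id)
        self.nodes[node.node_id] = node

    def objects_in(self, node_id: str) -> List[str]:
        """Return the labels of all object nodes at or below node_id."""
        node = self.nodes[node_id]
        if node.level == "object":
            return [node.label]
        labels: List[str] = []
        for child_id in node.children:
            labels.extend(self.objects_in(child_id))
        return labels
```

Because every non-root node has exactly one parent one level up, queries such as "which objects are on this floor" reduce to a simple tree traversal.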

This structure is updated incrementally as the agent observes new RGB-D frames. An explicit graph-generation function builds the new state, $g_t = \varphi(g_{t-1}, o_t)$, from the previous graph and the current observation $o_t$, keeping the agent's internal model synchronized with the observed world. Over time, the graph $g_t$ is optimized to approximate the environment's true semantic structure $\bar{g}$ by minimizing an empirical semantic cost function $c$: $g_t^* = \arg\min_{g_t} c(g_t, \bar{g} \mid o_{1:t})$, where $c(\cdot)$ is typically based on node/edge recall and precision (Hu et al., 9 Aug 2025).
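
To make a recall/precision-based cost concrete, the sketch below scores a predicted list of node labels against a reference list and returns one minus the F1 score. The F1 form is an assumption chosen for illustration; the source only states that $c$ is based on node/edge recall and precision. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def precision_recall(predicted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Multiset precision and recall of predicted labels against reference labels."""
    pred, ref = Counter(predicted), Counter(reference)
    matched = sum((pred & ref).values())  # multiset intersection size
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


def semantic_cost(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Assumed cost: 1 - F1, so a perfect match costs 0 and no overlap costs 1."""
    precision, recall = precision_recall(predicted, reference)
    if precision + recall == 0.0:
        return 1.0
    return 1.0 - 2.0 * precision * recall / (precision + recall)
```

For example, predicting `["bed", "chair", "sofa"]` against a reference of `["bed", "chair", "stove", "sink"]` gives precision 2/3 and recall 1/2.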

2. Imaginative World Modeling and Viewpoint Completion

SGImagineNav augments standard scene graph construction by explicitly modeling unobserved areas through an imaginative inference process. For each step:

The current graph $g_t$ is projected to a bird's-eye view (BEV) image $I_t$, including locations of known nodes and boundaries of unknown regions on the occupancy map.
For each unknown region $u$, a text prompt describing both geometry and semantic context (e.g., "unknown region near bedroom and bathroom, searching for q=bed") is constructed.
The agent queries a vision-language model (VLM), such as GPT-4o-mini, with this prompt and the BEV image. The VLM returns candidate region captions and typical object categories that might exist in the unexplored space.
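
The prompt-construction step might look like the following sketch. It only builds the text prompt and parses a reply; no VLM is called. The prompt wording, the `area_m2` parameter, and the "caption: object, object" reply format are all assumptions made for illustration, not details from the source.

```python
from typing import List, Sequence, Tuple


def build_region_prompt(neighbor_regions: Sequence[str], query: str, area_m2: float) -> str:
    """Compose a text prompt for one unknown region (hypothetical wording)."""
    if neighbor_regions:
        context = "near " + " and ".join(neighbor_regions)
    else:
        context = "with no known neighboring regions"
    return (
        f"Unknown region of about {area_m2:.1f} m^2 {context}, "
        f"searching for q={query}. "
        "Suggest likely room captions and typical object categories, "
        "one per line as 'caption: object, object'."
    )


def parse_reply(reply: str) -> List[Tuple[str, List[str]]]:
    """Parse lines of the assumed form 'caption: object, object'."""
    regions: List[Tuple[str, List[str]]] = []
    for line in reply.splitlines():
        if ":" not in line:
            continue  # skip lines that do not follow the assumed format
        caption, objects = line.split(":", 1)
        names = [name.strip() for name in objects.split(",") if name.strip()]
        regions.append((caption.strip(), names))
    return regions
```

The parsed captions and object categories would then be attached to the graph as imagined region and object nodes for the corresponding unknown region.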

The imaginative completion step yields an augmented graph $\hat{g}_t$, which empirically improves recall of unobserved objects by 6–7% absolute over standard graph-building pipelines (Hu et al., 9 Aug 2025). The augmented graph serves as a proxy for the true semantic structure $\bar{g}$, providing an anticipatory prior for downstream reasoning.

3. Adaptive Navigation Strategy

Action selection in SGImagineNav is governed by a hybrid strategy that dynamically switches between exploitation (goal verification) and exploration (frontier selection). The process is as follows:

A. Goal Verification

If a candidate object node $v$ matching the language query $q$ (e.g., "bed") is observed, its image crop is verified by an LLM using its visual context.
If verification succeeds, the agent uses Fast Marching Method (FMM) planning to reach the corresponding location.
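
Planning to a verified goal on a grid map can be sketched with a distance field followed by greedy descent. Note that the code below uses Dijkstra's algorithm on an 8-connected grid as a simpler stand-in for FMM, which solves a continuous front-propagation problem instead. The function names are hypothetical, and diagonal moves are allowed to cut obstacle corners for brevity.

```python
import heapq
import math
from typing import List, Tuple

Cell = Tuple[int, int]


def distance_field(free: List[List[bool]], goal: Cell) -> List[List[float]]:
    """Shortest 8-connected path length from every free cell to the goal cell."""
    rows, cols = len(free), len(free[0])
    dist = [[math.inf] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0.0
    heap = [(0.0, goal)]
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale heap entry
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and free[nr][nc]:
                nd = d + math.hypot(dr, dc)
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return dist


def next_cell(dist: List[List[float]], current: Cell) -> Cell:
    """Greedy descent: move to the neighbor with the smallest distance to goal."""
    best = current
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = current[0] + dr, current[1] + dc
            if 0 <= nr < len(dist) and 0 <= nc < len(dist[0]):
                if dist[nr][nc] < dist[best[0]][best[1]]:
                    best = (nr, nc)
    return best
```

Calling `next_cell` repeatedly from the agent's cell follows the distance field downhill until the goal is reached; each step would then be translated into low-level motion primitives.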

B. Frontier Selection

When no verified goal is present, the system detects the set of frontier points $\mathcal{F}_t$, i.e., locations at the boundary between known and unknown free space.
For each candidate frontier $f \in \mathcal{F}_t$, two scores are computed:
- Semantic (exploitation) gain $G_{\text{sem}}(f)$: computed from a semantic matching score $s(v, q)$ (CLIP/LLM-based) between the query $q$ and the region/object nodes $v$ near frontier $f$.
- Geometric (exploration) gain $G_{\text{geo}}(f)$: computed from $A(f)$, the area of newly visible unknown cells accessible along a path to $f$, together with a term that normalizes for region size.

A fallback rule selects the frontier $f^*$ with the highest exploitation gain if that gain exceeds a threshold $\tau$; otherwise, it chooses the frontier with the best exploration gain.
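
The fallback rule described above can be sketched as a short selection function. The sketch assumes the two gains have already been computed for each frontier; the names `select_frontier`, `semantic_gain`, `geometric_gain`, and `tau` are hypothetical, and the returned mode string is added only for readability.

```python
from typing import Dict, Hashable, Tuple


def select_frontier(
    semantic_gain: Dict[Hashable, float],
    geometric_gain: Dict[Hashable, float],
    tau: float,
) -> Tuple[Hashable, str]:
    """Pick a frontier: exploit if the best semantic gain exceeds tau, else explore."""
    if not semantic_gain or not geometric_gain:
        raise ValueError("at least one candidate frontier is required")
    best_semantic = max(semantic_gain, key=semantic_gain.get)
    if semantic_gain[best_semantic] > tau:
        return best_semantic, "exploit"
    # No frontier is semantically promising enough: fall back to exploration.
    best_geometric = max(geometric_gain, key=geometric_gain.get)
    return best_geometric, "explore"
```

With a high threshold the agent behaves like a pure geometric explorer, while a low threshold makes it follow semantic cues almost exclusively, which is why the threshold controls the exploitation/exploration balance.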

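In summary, each iteration of the loop takes a new observation $o_t$, updates the scene graph via $\varphi(g_{t-1}, o_t)$, augments it through imagination, and then either navigates to a verified goal or moves to the selected frontier (Hu et al., 9 Aug 2025).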

4. Implementation Details and Pipeline

Key components and implementation specifics include:

Object/region/floor detection: Grounded-SAM detector; region grouping employs a k-d-tree and wall-crossing heuristics.
Vision-language reasoning: GPT-4o-mini/4o prompted with custom BEV images and context-aware natural language queries.
Occupancy mapping: Grid at 0.05 m resolution (size 480×480); up to 500 planning steps per episode.
Navigation controller: FMM for path planning; low-level control acts via MoveAhead, Turn, and LookUp/Down primitives (Hu et al., 9 Aug 2025).
Zero-shot operation: No task-specific RL or imitation learning. All components use frozen detectors and pre-trained VLMs.
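
The occupancy-map figures above imply a map about 24 m on a side (480 cells × 0.05 m). The sketch below shows a world-to-cell conversion and a simple frontier test on such a grid. The cell encoding (`UNKNOWN`, `FREE`, `OCCUPIED`), the map origin at a grid corner, and the 4-neighbor frontier test are assumptions for illustration only.

```python
import math
from typing import List, Tuple

RESOLUTION_M = 0.05   # metres per cell, from the text
GRID_SIZE = 480       # cells per side, from the text

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # hypothetical cell encoding


def map_side_length_m() -> float:
    """Side length of the square map in metres."""
    return GRID_SIZE * RESOLUTION_M


def world_to_cell(x: float, y: float, origin: Tuple[float, float] = (0.0, 0.0)) -> Tuple[int, int]:
    """Convert world coordinates in metres to a (row, col) grid index."""
    col = math.floor((x - origin[0]) / RESOLUTION_M)
    row = math.floor((y - origin[1]) / RESOLUTION_M)
    if not (0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE):
        raise ValueError("point lies outside the occupancy grid")
    return row, col


def find_frontiers(grid: List[List[int]]) -> List[Tuple[int, int]]:
    """Return free cells that have at least one unknown 4-neighbor."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers
```

In practice, adjacent frontier cells are usually clustered into a small number of candidate frontier points before scoring.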

5. Experimental Evaluation and Results

SGImagineNav has been validated in both Habitat-based simulation and real-world deployments:

| Dataset | Success Rate (SR, %) | SPL (%) | SoftSPL |
|---|---|---|---|
| HM3D | 65.4 | 30.0 | -- |
| HSSD | 66.8 | 30.2 | -- |
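
For readers unfamiliar with the metrics, SR is the fraction of successful episodes and SPL (Success weighted by Path Length) weights each success by the ratio of the shortest path length to the path length actually taken. The sketch below follows the standard SPL definition used in embodied navigation benchmarks; it is an assumption that the reported numbers follow exactly this definition, and the function names are hypothetical.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the goal was reached."""
    return sum(successes) / len(successes)


def spl(successes: Sequence[bool], shortest: Sequence[float], taken: Sequence[float]) -> float:
    """Mean over episodes of success * shortest / max(taken, shortest)."""
    total = 0.0
    for success, shortest_len, taken_len in zip(successes, shortest, taken):
        if success:
            total += shortest_len / max(taken_len, shortest_len)
    return total / len(successes)
```

An agent that succeeds often but takes long detours therefore has a high SR and a much lower SPL, which is consistent with the gap between the two columns in the table.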

SGImagineNav outperforms previous state-of-the-art zero-shot methods by a margin of more than 7% SR.
Ablation findings: Semantic imagination accounts for ≈6.75% SR gain; the exploitation/exploration fallback policy adds ≈2.75%, and LLM-based goal verification yields ≈1.75% (Hu et al., 9 Aug 2025).
Real-world tests (Unitree GO1 robot with Intel D435/T265) demonstrate robust, cross-room, and cross-floor navigation despite odometry noise and incomplete mapping.

6. Relation to Prior and Contemporary Methods

SGImagineNav generalizes and extends both connectionist and symbolic approaches:

ImagineNav (Zhao et al., 2024): transforms navigation into a best-view selection problem for a VLM using imagined future views, but does not maintain a persistent global representation or anticipate unobserved structure. ImagineNav achieves SR = 53.0% on HM3D, compared with 65.4% for SGImagineNav, which suggests that explicit world modeling contributes to the additional gains.
MSGNav (Huang et al., 13 Nov 2025): constructs a multi-modal 3D scene graph with image-based relational edges, and introduces closed-loop reasoning, adaptive vocabularies, and last-mile view selection. It achieves SR = 48.3% on HM3D-OVON, highlighting the efficacy of structured memory and VLM-based control.
SGN-CIRL (Oskolkov et al., 4 Jun 2025): leverages scene-graph-embedded RL with imitation and curriculum learning, reinforcing the value of object-centric, relational representations for policy optimization.
SGImagineNav combines these ideas by (i) maintaining a hierarchical symbolic world model, (ii) proactively completing that model through speculative VLM queries, and (iii) combining semantic and geometric information gain for action selection.

7. Limitations and Prospective Developments

Primary limitations identified in SGImagineNav include:

False positives in detection (e.g., confusing beds with sofas).
Incomplete or noisy mesh environments leading to occluded walls or missing semantic nodes.
Difficulty in stair-climbing or multi-floor transition planning.
Redundant exploration due to lack of explicit temporal planning.
No auxiliary RL or imitation learning: while zero-shot capability ensures semantic generalization, tailored policies might further improve efficiency.

Proposed research directions include sequence-level planning over the imagined graph to reduce trajectory redundancy, adaptive exploitation thresholds, environment-specific semantic prior adaptation, and extension of imagination to predictive belief updates. Integrating sign-based semantic guidance, multi-agent memory, and affordance prediction is also discussed as a promising direction, building on related work such as SignNav (Sun et al., 17 Mar 2026).

Key references:

"Imaginative World Modeling with Scene Graphs for Embodied Agent Navigation" (Hu et al., 9 Aug 2025)
"ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination" (Zhao et al., 2024)
"MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation" (Huang et al., 13 Nov 2025)
"SGN-CIRL: Scene Graph-based Navigation with Curriculum, Imitation, and Reinforcement Learning" (Oskolkov et al., 4 Jun 2025)
"SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments" (Sun et al., 17 Mar 2026)