SG-Nav: Semantic-Geometric Navigation

Updated 3 June 2026

SG-Nav is a unified navigation approach that integrates rich 3D scene graphs and hierarchical planners to combine semantic context with precise geometric mapping.
The framework employs composite cost functions that blend distance, clearance, and semantic penalties, optimizing route efficiency and reducing collision risks.
SG-Nav leverages large vision-language models and adaptive memory modules, enhancing decision-making and achieving state-of-the-art performance on embodied navigation benchmarks.

Semantic-Geometric Navigation (SG-Nav) frameworks are a class of navigation architectures that fuse semantic (high-level, context-rich) scene understanding with geometric (metric or topological) reasoning. Motivated by the need to improve the efficiency, robustness, and generalization of embodied agents, particularly in object-goal, vision-language, and lifelong navigation tasks, SG-Nav approaches employ rich 3D scene graph representations, hierarchical planners, and multimodal prompt-based decision modules. This synthesis enables agents to reason spatially (about connectivity, traversability, or search efficiency) and semantically (about object classes, spatial context, and open-vocabulary goals) in open and potentially unknown environments.

1. The Rationale Beyond Pure Geometry: Motivation and Formulation

Traditional geometric navigation approaches typically exploit only low-level spatial features such as occupancy, signed distance fields, or localization over a map. While such planners (e.g., IRRT*, A*) excel at collision avoidance and metric optimality, they cannot differentiate between contextually distinct regions or incorporate task-relevant semantics, such as room types, preferred paths, or forbidden transitions. Conversely, planners that exclusively rely on visual semantics lack robust geometric parsing or obstacle sensitivity.

SG-Nav frameworks address these deficiencies by jointly minimizing a composite cost that fuses geometric and semantic objectives. In the prototypical setting (Kremer et al., 2023):

$J[x(\cdot),a(\cdot)] = \sum_{t=0}^{T-1} C(x_t, a_t), \ C(x,a) = w_g C_\mathrm{geom}(x,a) + w_s C_\mathrm{sem}(x,a)$

where $C_\mathrm{geom}$ encodes physical path length and clearance, and $C_\mathrm{sem}$ penalizes semantically costly transitions (e.g., door crossings). The weights $w_g$ and $w_s$ set the trade-off between the two terms. The resulting policy simultaneously enforces collision-free dynamical constraints, semantic admissibility (e.g., room or object constraints), and global task goals.
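The composite cost above can be illustrated with a minimal Python sketch. It is not the formulation of Kremer et al.: the concrete choices (path length plus a clearance-margin penalty for $C_\mathrm{geom}$, a flat door-crossing penalty for $C_\mathrm{sem}$) and the names `step_cost`, `trajectory_cost`, `door_penalty`, and `min_clearance` are assumptions made for illustration only.

```python
import math

def step_cost(p, q, room_p, room_q, clearance_q,
              w_g=1.0, w_s=1.0, door_penalty=5.0, min_clearance=0.5):
    """C(x_t, a_t) for one transition p -> q (all defaults are illustrative)."""
    # Geometric term: step length plus a penalty when clearance drops below a margin.
    c_geom = math.dist(p, q) + max(0.0, min_clearance - clearance_q)
    # Semantic term: flat penalty when the step changes rooms (a door crossing).
    c_sem = door_penalty if room_p != room_q else 0.0
    return w_g * c_geom + w_s * c_sem

def trajectory_cost(points, rooms, clearances, **weights):
    """J = sum over t of C(x_t, a_t) along a discretized path."""
    return sum(
        step_cost(points[t], points[t + 1], rooms[t], rooms[t + 1],
                  clearances[t + 1], **weights)
        for t in range(len(points) - 1)
    )

# Hypothetical three-waypoint path that ends by crossing from room A into room B.
points = [(0.0, 0.0), (3.0, 4.0), (3.0, 5.0)]
rooms = ["A", "A", "B"]
clearances = [1.0, 1.0, 0.2]

total = trajectory_cost(points, rooms, clearances)
geometry_only = trajectory_cost(points, rooms, clearances, w_s=0.0)
```

Setting `w_s=0.0` recovers a purely geometric planner cost, so the difference between `total` and `geometry_only` is exactly the weighted semantic penalty accumulated along the path.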

2. 3D Scene Graph and Semantic Representation Paradigms

A core element underlying all SG-Nav frameworks is the construction and exploitation of a hierarchical 3D scene graph. This graph formalizes the relational, semantic, and geometric structure of the environment at multiple levels:

S-Graphs (Kremer et al., 2023): Nodes for rooms and doorways; edges labeling adjacency and predicates (e.g., $\mathrm{connects}(r_i,d_j)\in\{0,1\}$) with associated traversal cost $c_{ij}$. Room and doorway nodes carry centroids, widths, and relational attributes.
Multi-modal 3D Scene Graphs (M3DSG) (Huang et al., 13 Nov 2025): Object-centric nodes augmented with RGB-D, CLIP embeddings, point clouds, segmentation masks, and room/region tags; edge features are dynamic image sets capturing co-occurrence rather than static text.
Spatial Scene Graphs (SSG) (Zhang et al., 11 Jan 2026): Nodes for floors, rooms, and objects each paired with geometric (centroids, bounding volumes) and semantic (category, learned embedding) feature vectors; containment and adjacency relations.
Online 3D Scene Graphs (SG-Nav-GPT) (Yin et al., 2024): Nodes for objects, object-groups, and rooms with spatial and affiliation edges; used for chain-of-thought traversal and context-driven LLM reasoning.
Visual-Geometry GP Space (Ali et al., 2024): Not a discrete graph, but fuses dual Gaussian process fields modeling geometric traversability and semantic/navigability likelihood.
Semantic Skeleton Memory Graph (SSMG) (Niu et al., 2 Mar 2026): Persistent memory as a free-space skeleton graph with topological keypoints (junctions, endpoints), each node carrying both object-level and space-level semantic descriptors.
3D Gaussian Splatting Memory (3DGSNav) (Zheng et al., 12 Feb 2026): Continuous, actively optimized Gaussians in $\mathbb{R}^3$ supporting physically grounded, view-synthesized scene graphs enabling both spatial and semantic querying.

These representations enable efficient reasoning, persistent memory, dynamic vocabulary expansion (via interaction with VLMs), and flexibility in associating visual, geometric, and symbolic cues.
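As a rough illustration of the hierarchical structure these representations share, the following Python sketch stores typed nodes and labeled edges and answers a simple containment query. The class names, fields, and the `contains` relation label are assumptions for illustration and do not reproduce any of the cited systems' data models.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    level: str        # e.g. "room", "group", or "object"
    category: str     # semantic label
    centroid: tuple   # geometric attribute, (x, y, z)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # (src_id, dst_id) -> relation label

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, relation):
        # Refuse edges that reference nodes not yet in the graph.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(src, dst)] = relation

    def children(self, parent_id, relation="contains"):
        """Nodes linked from parent_id by the given relation, in insertion order."""
        return [self.nodes[dst] for (src, dst), rel in self.edges.items()
                if src == parent_id and rel == relation]

# Hypothetical two-room scene.
graph = SceneGraph()
graph.add_node(Node("room_0", "room", "kitchen", (0.0, 0.0, 0.0)))
graph.add_node(Node("room_1", "room", "bedroom", (6.0, 0.0, 0.0)))
graph.add_node(Node("obj_0", "object", "fridge", (0.5, 1.0, 0.9)))
graph.add_node(Node("obj_1", "object", "sink", (-1.0, 0.5, 0.8)))
graph.add_node(Node("obj_2", "object", "bed", (6.5, 0.2, 0.4)))
graph.add_edge("room_0", "obj_0", "contains")
graph.add_edge("room_0", "obj_1", "contains")
graph.add_edge("room_1", "obj_2", "contains")
graph.add_edge("room_0", "room_1", "adjacent")

kitchen_objects = [n.category for n in graph.children("room_0")]
```

Keeping geometric attributes (centroids) and semantic attributes (categories) on the same node is what lets a planner ask both kinds of question of one structure.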

3. Hierarchical and Modular Planning Architectures

SG-Nav frameworks deploy hierarchical planning architectures to decouple high-level structural reasoning from low-level geometric policy. The canonical workflow consists of:

High-level semantic search: A* or greedy search on the abstracted semantic graph (e.g., rooms, junctions, subgraphs), yielding waypoints or “candidate destinations.”
Subproblem decomposition: The semantic path is split into subproblems (e.g., room-to-doorway traversals or region-restricted local goals), enabling local geometric planners to focus sampling or optimization inside constrained, task-relevant domains (Kremer et al., 2023, Niu et al., 2 Mar 2026).
Geometric/metric optimization: Path planning (IRRT*, OMPL-A*, FMM) using only the geometric or traversable portion of the space, often with region-restricted sampling, signed distance fields, or GP-predicted safe zones (Ali et al., 2024).
Long-horizon policy and memory utilization: Persistent graph memory (e.g., SSMG) supports global policies such as expected-cost minimization over candidate destinations; plans are generated to minimize not just distance but also expected target visitation cost, leveraging local search heuristics over the skeleton (Niu et al., 2 Mar 2026).
Active viewpoint and re-verification modules: To handle perceptual ambiguity, modules may optimize for maximum target-visibility (via geometric ray-tracing, free-viewpoint optimization, or VLM-driven action selection) and re-validate candidate detections (Zheng et al., 12 Feb 2026).

Typical pipelines are modular: perception, planning, and (re-)verification are loosely coupled and exchange information through graph memory, multimodal prompts, and active observation strategies.
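The first two stages of this workflow, search on the abstract graph followed by decomposition into local subproblems, can be sketched as follows. The example uses uniform-cost search (equivalent to A* with a zero heuristic) over a hypothetical room-doorway graph; the function names, node names, and edge costs are illustrative assumptions, and the local geometric planner that would solve each subproblem is left out.

```python
import heapq

def semantic_search(graph, start, goal):
    """Uniform-cost search over an abstract graph given as graph[u] = {v: cost}."""
    frontier = [(0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already expanded
        for nxt, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt, path + [nxt]))
    return float("inf"), []  # goal unreachable

def decompose(path):
    """Split a semantic path into consecutive (from, to) subproblems."""
    return list(zip(path, path[1:]))

# Hypothetical directed room/doorway graph with traversal costs.
rooms = {
    "hall": {"door_hk": 2.0, "door_hb": 1.0},
    "door_hk": {"kitchen": 1.0},
    "door_hb": {"bedroom": 1.0},
    "bedroom": {"door_bk": 4.0},
    "door_bk": {"kitchen": 1.0},
}

cost, path = semantic_search(rooms, "hall", "kitchen")
subproblems = decompose(path)
```

Each `(from, to)` pair would then be handed to a metric planner restricted to the corresponding region, which is what keeps sampling or optimization confined to task-relevant space.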

4. Integration with Large Vision-Language Models and Multimodal Reasoning

A distinguishing property of recent SG-Nav variants is close integration with LLMs or VLMs, enabling multimodal zero-shot object and instruction-following navigation.

Scene graph prompting: The current graph (or its subgraphs) is serialized to LLM-friendly format; hierarchical chain-of-thought (CoT) prompting invokes LLMs to reason about subgraph-to-goal proximity, ask clarifying questions, and interpolate scores over frontiers (Yin et al., 2024).
Closed-loop context: Decision memory accumulates past model outputs, which are fed to the VLM along with key subgraphs for more consistent, history-aware reasoning (Huang et al., 13 Nov 2025).
Adaptive vocabulary expansion: At each reasoning round, new object-category proposals are dynamically injected into the vocabulary through VLM prompting, enhancing generalization to out-of-vocabulary targets (Huang et al., 13 Nov 2025, Zhang et al., 11 Jan 2026).
Structured visual prompts: Composite images (BEV maps with trajectory overlays, annotated novel-viewpoint renders) are integrated with CoT textual analysis; VLMs output ranking or action choices (Zheng et al., 12 Feb 2026).
Re-perception and target re-verification: By actively seeking observationally optimal (high-visibility) viewpoints for ambiguous or low-confidence detections, or conducting multiple confirmation rounds, frameworks mitigate false positives and perception-induced failures (Yin et al., 2024, Zheng et al., 12 Feb 2026).

Across these modules, the agent’s decision-making is explainable, as reasoning traces and scoring justifications are accessible at each step.
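Scene graph prompting depends on turning a subgraph into text a language model can read. The sketch below shows one plausible serialization; the prompt wording, the `serialize_subgraph` helper, and the node and edge layout are hypothetical and are not the prompt format of any cited system. No model call is made.

```python
def serialize_subgraph(nodes, edges, goal):
    """Render a subgraph as a plain-text prompt (hypothetical format)."""
    lines = [f"Goal object: {goal}", "Nodes:"]
    # Sort by node id so the same subgraph always yields the same prompt.
    for node_id, attrs in sorted(nodes.items()):
        lines.append(f"- {node_id} ({attrs['level']}): {attrs['category']}")
    lines.append("Edges:")
    for src, relation, dst in edges:
        lines.append(f"- {src} --{relation}--> {dst}")
    lines.append(
        "Question: On a scale from 0 to 1, how likely is the goal object "
        "to be found near this subgraph? Explain step by step."
    )
    return "\n".join(lines)

# Hypothetical subgraph: one room containing one object.
nodes = {
    "room_1": {"level": "room", "category": "kitchen"},
    "obj_3": {"level": "object", "category": "fridge"},
}
edges = [("room_1", "contains", "obj_3")]

prompt = serialize_subgraph(nodes, edges, goal="mug")
print(prompt)
```

A deterministic serialization such as this one makes it easier to compare model outputs across steps, since identical subgraphs always produce identical prompts.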

5. Cost Functions, Heuristics, and Semantic-Geometric Fusion Strategies

All SG-Nav frameworks employ explicit fusion of geometric and semantic cues at the level of cost function design, candidate scoring, or navigable-space definition:

Composite costs: Additive or weighted geometric (distance, clearance) and semantic (boundary penalties, region preferences) terms guide policies; tuning (e.g., $p_d$ penalty in S-Nav) modulates topological versus metric optimization (Kremer et al., 2023).
Gaussian Process-based fields: Spatial predictions from LiDAR (geometry) and camera-based segmentation (semantics) fields are fused, either as joint thresholds or multiplicative navigability scores, defining the safely traversable region (Ali et al., 2024).
Belief–cost joint planning: Probabilities over candidate destinations (from VLM-prompted scoring) are combined with skeleton distances, and the plan with minimum expected search cost is chosen (Niu et al., 2 Mar 2026).
Visibility-based candidate selection: For “last-mile” action selection, visibility scores based on ray-tracing through reconstructed point clouds or 3DGS memory are maximized when selecting ultimate candidate view positions (Huang et al., 13 Nov 2025, Zheng et al., 12 Feb 2026).
Uncertainty-aware scoring: GP-predicted variance penalizes high-uncertainty regions, encouraging exploration and planning in more reliable subspaces (Ali et al., 2024).

This variety reflects the flexibility of the SG-Nav paradigm in supporting application-specific optimization criteria.
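To make the belief-cost idea concrete, the following sketch scores every visiting order of a small candidate set by its expected travel distance until the target is found. It assumes that the target sits at exactly one candidate, that the beliefs sum to one, and that pairwise skeleton distances are known; the brute-force enumeration and all names and numbers are illustrative rather than the procedure of Niu et al.

```python
from itertools import permutations

def expected_search_cost(order, start, belief, dist):
    """Expected distance travelled before reaching the target under `belief`."""
    travelled, expected, position = 0.0, 0.0, start
    for candidate in order:
        travelled += dist[position][candidate]
        # If the target is here, the search ends after `travelled` distance.
        expected += belief[candidate] * travelled
        position = candidate
    return expected

def best_visit_order(start, belief, dist):
    """Brute-force the order with minimum expected cost (small sets only)."""
    return min(permutations(belief),
               key=lambda order: expected_search_cost(order, start, belief, dist))

# Hypothetical beliefs (e.g. from VLM-prompted scoring) and skeleton distances.
belief = {"a": 0.7, "b": 0.3}
dist = {
    "s": {"a": 10.0, "b": 2.0},
    "a": {"s": 10.0, "b": 9.0},
    "b": {"s": 2.0, "a": 9.0},
}

order = best_visit_order("s", belief, dist)
cost_best = expected_search_cost(order, "s", belief, dist)
cost_greedy = expected_search_cost(("a", "b"), "s", belief, dist)
```

In this example the cheapest plan visits the nearby low-probability candidate `b` first, so neither the highest belief nor the shortest distance alone determines the order.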

6. Experimental Benchmarks, Performance, and Limitations

SG-Nav architectures have been evaluated across standard embodied navigation benchmarks—including MP3D, HM3D, RoboTHOR, GOAT-Bench, R2R, and RxR-CE—using metrics such as Success Rate (SR), Success weighted by Path Length (SPL), path efficiency, and back-and-forth maneuver rate.
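For reference, SR and SPL can be computed as in the sketch below, which follows the commonly used definition $\mathrm{SPL} = \frac{1}{N}\sum_i S_i \, \ell_i / \max(p_i, \ell_i)$, where $S_i$ is the binary success indicator, $\ell_i$ the shortest-path length, and $p_i$ the length of the path the agent actually took. The episode data are made up for illustration.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not s:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # A zero-length episode that succeeds counts as perfectly efficient.
        total += shortest / denom if denom > 0 else 1.0
    return total / len(successes)

# Three hypothetical episodes: optimal success, inefficient success, failure.
successes = [1, 1, 0]
shortest_lengths = [5.0, 4.0, 6.0]
path_lengths = [5.0, 8.0, 3.0]

sr = success_rate(successes)
spl_value = spl(successes, shortest_lengths, path_lengths)
```

The second episode succeeds but takes twice the shortest path, so it contributes only 0.5 to SPL while counting fully toward SR.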

Empirical findings report:

Consistent improvements over pure-geometry (G-Nav), visual-only (V-Nav), and early zero-shot baselines (e.g., ESC, L3MVN, CoW). For instance, SG-Nav-GPT yields $+11.5\%$ SR on MP3D and $+14.8\%$ SR on HM3D over the best prior zero-shot baselines, even exceeding supervised methods on MP3D (Yin et al., 2024).
Efficiency in planning and path quality: Hierarchical decomposition and region-based sampling drastically reduce variance and improve sample efficiency (Kremer et al., 2023).
Robustness to perception and reasoning failures: Modules for re-perception, persistent memory, and active viewpoint search reduce false stops and backtracking, yielding higher SPL and fewer direction reversals (Niu et al., 2 Mar 2026, Zheng et al., 12 Feb 2026).
Zero-shot and lifelong learning gains: On lifelong and open-vocabulary tasks, adaptive memory and vocabulary modules enable state-of-the-art generalization (Huang et al., 13 Nov 2025, Niu et al., 2 Mar 2026).

However, performance is bounded by inference latency (due to multiple VLM calls), detector or segmentation failures, and, for GP-based fusion, the quality of input sensor modalities. Real-time deployment may be limited by computational overhead, and certain “last-mile” ambiguities persist without explicit active recognition modules.

7. Variants, Extensions, and Future Directions

Contemporary research continues to extend the SG-Nav paradigm:

Memory architectures: Experiments with more persistent and topologically aware memories (SSMG) support policy improvements in lifelong settings.
Modeling modalities: Ongoing work targets integration of volumetric, non-planar, or occluded space (3DGS), and more sophisticated uncertainty modeling.
Policy learning: While most systems remain modular with hand-tuned fusion, future approaches may leverage reinforcement learning or differentiable cost shaping over multi-modal graph structures.
Open-world generalization: Dynamic vocabulary expansion and more powerful VLM integration enable generalization to unseen objects, free-form human instructions, or cross-modal targets.
Active and embodied recognition: Bundling navigation and perceptual policy (e.g., via active viewpoint re-verification and free-viewpoint rendering) is a principal frontier in reducing persistent ambiguities in open-ended embodied AI.

The SG-Nav class is a prominent direction in unified semantic-geometric navigation, providing both practical advances and a blueprint for compositional, context-aware embodied agents (Kremer et al., 2023, Huang et al., 13 Nov 2025, Zhang et al., 11 Jan 2026, Yin et al., 2024, Ali et al., 2024, Niu et al., 2 Mar 2026, Zheng et al., 12 Feb 2026).