3D Scene Graphs for Scene Understanding

Updated 16 June 2026

3D scene graphs are structured representations that model environments using nodes for objects and edges for semantic and spatial relations.
They integrate geometric cues with semantic labels via segmentation, feature encoding, and relational classifiers to build multi-level hierarchies.
Emerging techniques leverage graph neural networks and diffusion models to enhance scene synthesis, robotic mapping, and interactive simulation.

A 3D scene graph is a graph-based, metric-semantic representation of a three-dimensional environment in which nodes model scene entities (objects, parts, places, rooms, or cameras) and edges represent semantic or spatial relationships among these entities. This structured abstraction integrates 3D geometry, semantic categories, and relational information to support perception, reasoning, planning, and simulation in both embodied robotics and interactive computer vision. The 3D scene graph paradigm is foundational to contemporary research in physical scene understanding, language-grounded interaction, simulation-driven generative modeling, and semantic mapping.

1. Formal Structure and Definition

A 3D scene graph $G = (V, E)$ is a directed (sometimes typed) graph where:

$V$, the set of nodes, represents physical scene elements: typically objects (with geometry, category, and appearance) and, depending on the abstraction, parts, rooms, regions, or agents.
$E \subseteq V \times V$, the set of edges, encodes relationships through spatial (e.g., “on top of,” “next to”), functional (e.g., “has part,” “affords pulling”), kinematic, or comparative (e.g., “larger than”) predicates. Edges are commonly directed and typed.

Each object node encapsulates properties such as:

Semantic label(s) and associated confidence (from classifiers)
3D position and extent (represented variously as SE(3)-registered poses, centroids with bounding boxes, or Gaussian ellipsoids)
Additional attributes: color distributions, instance masks, affordance types, material composition, or multi-modal descriptors
Optionally, multi-scale or hierarchical indicators (object/part/scene context)

Relationship edges may carry:

Predicate names (text)
Spatial/kinematic parameters (relative pose, proximity, attachment)
Probability or confidence scores
Attributes relevant for specific tasks (e.g., physical constraint, room membership)

This unified structure can hierarchically integrate multiple semantic layers (e.g., building → room → object → part in (Armeni et al., 2019); object → functional part in (Rotondi et al., 10 Mar 2025, Li et al., 7 Jun 2026); spatial → topological in (Samuelson et al., 6 Jun 2025)).
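The node and edge properties listed above can be captured in a small data structure. The following is a minimal sketch, not the representation used by any of the cited systems; the class names (`Node`, `Edge`, `SceneGraph`), field names, and example values are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    node_id: str
    label: str                    # semantic category
    confidence: float             # classifier confidence in [0, 1]
    centroid: Vec3                # 3D position
    extent: Vec3                  # axis-aligned bounding-box size
    layer: str = "object"         # e.g. "room", "object", "part"
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    predicate: str                # e.g. "on top of", "has part"
    confidence: float = 1.0
    params: Dict[str, object] = field(default_factory=dict)


class SceneGraph:
    """Directed, typed graph G = (V, E) with attributed nodes and edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        # E is a subset of V x V, so both endpoints must already exist.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be nodes of the graph")
        self.edges.append(edge)

    def outgoing(self, node_id: str, predicate: Optional[str] = None) -> List[Edge]:
        """Edges leaving node_id, optionally filtered by predicate name."""
        return [
            e for e in self.edges
            if e.source == node_id and (predicate is None or e.predicate == predicate)
        ]


# Hypothetical two-object scene.
graph = SceneGraph()
graph.add_node(Node("table_1", "table", 0.93, (1.0, 2.0, 0.4), (1.2, 0.8, 0.8)))
graph.add_node(Node("cup_1", "cup", 0.81, (1.1, 2.1, 0.85), (0.08, 0.08, 0.1),
                    attributes={"material": "ceramic"}))
graph.add_edge(Edge("cup_1", "table_1", "on top of", confidence=0.9))
```

Storing the predicate as a plain string keeps the edge vocabulary open; a closed-vocabulary system could swap in an enumeration instead.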

2. Scene Graph Construction Methodologies

2.1 Geometric-Semantic Fusion

The standard pipeline for constructing 3D scene graphs from sensory data (RGB-D, point clouds, images) involves:

Segmentation: Partition the observed scene into objects (and/or parts), often via over-segmentation, instance segmentation via 2D/3D detectors, or clustering/graph-cut over superpoints (Wu et al., 2021, Zhan et al., 16 Jun 2025, Qiu et al., 2023).
Feature Encoding: Compute per-node descriptors by fusing geometric cues (3D point clouds, bounding boxes, splatting parameters) and visual-semantic embedding (CLIP, DINO, VLM/VLM+SAM) (Li et al., 7 Jun 2026, Wang et al., 6 Mar 2025, Zhan et al., 16 Jun 2025).
Edge Initialization: Generate candidate relationship edges based on adjacency, geometric proximity, or semantic cues; labels are predicted via relational classifiers, LLMs/VLMs, or geometric primitives (Wu et al., 2021, Madhavaram et al., 31 Jan 2026, Rotondi et al., 10 Mar 2025).
Graph Merging/Pruning: Enforce temporal consistency and avoid duplicate or spurious nodes/edges by similarity thresholds, confidence aggregation, and non-max suppression (Kim et al., 2019, Olivastri et al., 2024).

Representative pipeline pseudocode appears in (Armeni et al., 2019), covering 2D detection, framing, multi-view projection, clustering, semantic attribute assignment, and graph construction.
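Two of the steps above, merging duplicate detections and proposing candidate edges by geometric proximity, can be sketched with simple geometry. This is an illustrative simplification rather than the procedure of any cited paper: it assumes axis-aligned boxes, a greedy same-label suppression rule, and hypothetical thresholds (`iou_thresh`, `max_dist`).

```python
import math
from itertools import combinations


def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes, each given as (min_xyz, max_xyz)."""
    inter = 1.0
    for k in range(3):
        lo, hi = max(a[0][k], b[0][k]), min(a[1][k], b[1][k])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = math.prod(a[1][k] - a[0][k] for k in range(3))
    vol_b = math.prod(b[1][k] - b[0][k] for k in range(3))
    return inter / (vol_a + vol_b - inter)


def merge_duplicates(detections, iou_thresh):
    """Greedy non-max suppression: keep the most confident detection and
    drop same-label detections whose box overlaps a kept one too much."""
    kept = []
    for det in sorted(detections, key=lambda d: d["conf"], reverse=True):
        duplicate = any(
            det["label"] == k["label"] and box_iou_3d(det["box"], k["box"]) >= iou_thresh
            for k in kept
        )
        if not duplicate:
            kept.append(det)
    return kept


def centroid(box):
    return tuple((box[0][k] + box[1][k]) / 2 for k in range(3))


def propose_edges(nodes, max_dist):
    """Candidate (unlabeled) edges between nodes whose centroids are close."""
    return [
        (i, j)
        for i, j in combinations(range(len(nodes)), 2)
        if math.dist(centroid(nodes[i]["box"]), centroid(nodes[j]["box"])) <= max_dist
    ]


# Hypothetical detections: two overlapping chair hypotheses and one table.
detections = [
    {"label": "chair", "conf": 0.9, "box": ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))},
    {"label": "chair", "conf": 0.6, "box": ((0.1, 0.0, 0.0), (1.1, 1.0, 1.0))},
    {"label": "table", "conf": 0.8, "box": ((2.0, 0.0, 0.0), (3.0, 1.0, 1.0))},
]
nodes = merge_duplicates(detections, iou_thresh=0.5)
candidate_edges = propose_edges(nodes, max_dist=2.5)
```

The candidate pairs would then be passed to a relation classifier or language model for predicate labeling, which is outside the scope of this sketch.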

2.2 Functional and Physical Augmentation

Recent work extends node definitions to incorporate:

Functional Segmentation: Detect affordance-relevant functional parts (e.g., handles, switches, levers) as explicit nodes and join to parent objects via “has-part” edges (Rotondi et al., 10 Mar 2025, Li et al., 7 Jun 2026). Detection relies on projecting 3D functional annotations into 2D, fine-tuning open-vocabulary detectors, and reconstructing 3D part segments.
Physics-aware Augmentation: Attach physical properties (material class, mass, joint articulations, elasticity) and kinematic edges. PhysGraph (Li et al., 7 Jun 2026) uses 3D Gaussian splatting, part segmentation, LLM reasoning over materials, and geometric/LLM-guided joint inference.

2.3 Hierarchical Structure

Multiple layers can exist:

Layer	Node Type	Attributes/Geometry
0 (root)	Global	Map/frame of reference
1	Region	Cells/rooms/regions
2	Object	3D box, semantic label
3	Part	Segments, joint params
4	Place	Terrain patches

Edges can express containment (object–room), spatial adjacency (place–place), support or affordance (object–part), and kinematic (part–part) relations (Samuelson et al., 6 Jun 2025, Armeni et al., 2019).
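Containment edges make it possible to answer questions such as "which room is this part in?" by walking up the hierarchy. The sketch below assumes each node has at most one containing parent and uses hypothetical node names and layer labels.

```python
def ancestors(parent, node):
    """Return the chain of containing nodes, nearest first."""
    chain, seen = [], {node}
    while node in parent:
        node = parent[node]
        if node in seen:
            raise ValueError("containment edges must not form a cycle")
        seen.add(node)
        chain.append(node)
    return chain


def enclosing(parent, layer_of, node, layer):
    """First ancestor of `node` that belongs to the requested layer, or None."""
    for candidate in ancestors(parent, node):
        if layer_of[candidate] == layer:
            return candidate
    return None


# Hypothetical hierarchy: global -> region -> object -> part.
parent = {"handle_1": "fridge_1", "fridge_1": "kitchen", "kitchen": "map"}
layer_of = {"handle_1": "part", "fridge_1": "object", "kitchen": "region", "map": "global"}

room = enclosing(parent, layer_of, "handle_1", "region")
```

Here `room` evaluates to `"kitchen"`, since the handle is contained in the fridge, which is contained in the kitchen.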

3. Learning and Inference: Graph Neural and Retrieval-Augmented Models

Modern 3D scene graph systems leverage powerful neural architectures and large-scale knowledge for scene understanding and synthesis:

Graph Neural Networks (GNN): Encode mutual object–relation context, propagate features for both recognition and relation reasoning (Wu et al., 2021, Dhamo et al., 2021, Koch et al., 2023).
Relational Graph Convolutions: RGCN blocks enable relation-type-aware propagation (Naanaa et al., 2023), while feature-wise (multi-head) attention mechanisms selectively attend to salient neighbor features in partial graphs (Wu et al., 2021).
Retrieval-Augmented Generation and LLM/VLM Integration: Systems such as SGR³ (Wang et al., 4 Mar 2026) bypass explicit reconstruction by pairing a multi-modal LLM with retrieved prior graphs that supply relational structure, integrating the retrieved edges directly via cross-attention during graph token generation.
Self-supervised and Knowledge-Augmented Training: Reconstruction-based pre-training with geometric bottlenecks (SGRec3D (Koch et al., 2023)) and message-passing from external commonsense KGs (Qiu et al., 2023) enhance label-efficiency and relational accuracy.
Zero-shot and Open-vocabulary Methods: Recent models employ large vision-language models (LVLMs) and LLMs, together with patch-based similarity retrieval, to align arbitrary language queries with 3D semantics (Zhan et al., 16 Jun 2025, Zhu et al., 17 Mar 2026).
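Relation-type-aware propagation, as in the relational graph convolutions mentioned above, is commonly written as $h_i' = \mathrm{ReLU}\big(W_0 h_i + \sum_r \sum_{j \in N_r(i)} \tfrac{1}{|N_r(i)|} W_r h_j\big)$, with one weight matrix $W_r$ per relation type. The sketch below implements a single such layer in plain Python to make the data flow explicit. It is not the architecture of any cited paper, and the weights are hand-set for illustration rather than learned.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rgcn_layer(h, edges, W_rel, W_self):
    """One relational message-passing step.

    h:      dict node -> feature vector
    edges:  list of (src, relation, dst); messages flow src -> dst
    W_rel:  dict relation -> weight matrix
    W_self: weight matrix for the self-connection
    """
    # Group incoming neighbours by (destination, relation type).
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault((dst, rel), []).append(src)

    out = {}
    for node, feat in h.items():
        acc = matvec(W_self, feat)
        for (dst, rel), sources in incoming.items():
            if dst != node:
                continue
            for src in sources:
                msg = matvec(W_rel[rel], h[src])
                # Normalise by the number of neighbours under this relation.
                acc = [a + m / len(sources) for a, m in zip(acc, msg)]
        out[node] = [max(0.0, a) for a in acc]  # ReLU
    return out


# Hypothetical two-node graph with 2-D features.
h = {"cup": [1.0, 0.0], "table": [0.0, 1.0]}
edges = [("cup", "on", "table")]
W_self = [[1.0, 0.0], [0.0, 1.0]]            # identity
W_rel = {"on": [[0.0, 1.0], [1.0, 0.0]]}     # swaps the two channels

h_next = rgcn_layer(h, edges, W_rel, W_self)
```

In this example the table's feature becomes `[0.0, 2.0]` (its own feature plus the cup's swapped feature), while the cup, which receives no messages, keeps `[1.0, 0.0]`.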

4. Generative Models, Control, and Scene Synthesis

3D scene graphs serve as both input and output interfaces in generative modeling for scene synthesis:

Graph-to-Scene: End-to-end GCN-based VAEs map a semantic graph (object nodes + relation edges) to a distribution over scene layouts and shapes, supporting one-to-many layout/shape generation, editing by graph manipulation, and adversarial constraint satisfaction (Dhamo et al., 2021, Ruiz et al., 18 Nov 2025).
Diffusion-based Synthesis: Scene graph–conditioned diffusion models (both discrete and continuous) permit high-fidelity sampling of structure-constrained 3D layouts (Ruiz et al., 18 Nov 2025, Naanaa et al., 2023, Liu et al., 10 Mar 2025). Methodologies include relational GCN denoisers, classifier-free guidance, and SE(3)-equivariant fusion for text-and-graph conditioning.
Sparse-to-Dense Control: Outdoor-scale systems (e.g., for urban scene generation) map a sparse user-authored graph (object nodes, road connectivity) to a dense BEV embedding, which acts as conditioning signal for cascaded 2D+3D diffusion (Liu et al., 10 Mar 2025).
Physics and Simulation Integration: Scene graphs augmented with physical parameters can be automatically translated into executable simulation environments, e.g., MuJoCo XML with full joint and material specification (Li et al., 7 Jun 2026).
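To illustrate how a physics-augmented graph might be serialized for simulation, the sketch below writes a small body tree to MuJoCo's MJCF XML format using the standard `body`, `joint`, and `geom` elements. It is a simplified illustration, not PhysGraph's export procedure: the input dictionary schema (`name`, `parent`, `pos`, `half_size`, `mass`, `joint`) is a hypothetical stand-in for graph nodes and kinematic edges, every shape is approximated by a box, and the output has not been validated in a simulator here.

```python
import xml.etree.ElementTree as ET


def fmt(vec):
    """Format a numeric tuple as a space-separated MJCF attribute value."""
    return " ".join(f"{v:g}" for v in vec)


def to_mjcf(bodies, model_name="scene"):
    """Serialize a list of body dicts (parents listed before children)."""
    root = ET.Element("mujoco", model=model_name)
    world = ET.SubElement(root, "worldbody")
    elements = {}
    for b in bodies:
        parent_name = b.get("parent")
        # A KeyError here means a child was listed before its parent.
        parent = world if parent_name is None else elements[parent_name]
        body = ET.SubElement(parent, "body", name=b["name"], pos=fmt(b["pos"]))
        if "joint" in b:
            # Kinematic edge to the parent, e.g. a hinge for a door.
            ET.SubElement(body, "joint", name=b["name"] + "_joint",
                          type=b["joint"]["type"], axis=fmt(b["joint"]["axis"]))
        # Box geoms take half-extents as their size.
        ET.SubElement(body, "geom", type="box",
                      size=fmt(b["half_size"]), mass=f"{b['mass']:g}")
        elements[b["name"]] = body
    return ET.tostring(root, encoding="unicode")


# Hypothetical cabinet with a hinged door.
bodies = [
    {"name": "cabinet", "pos": (0, 0, 0.5), "half_size": (0.3, 0.3, 0.5), "mass": 20.0},
    {"name": "door", "parent": "cabinet", "pos": (0.3, 0, 0),
     "half_size": (0.01, 0.3, 0.5), "mass": 2.0,
     "joint": {"type": "hinge", "axis": (0, 0, 1)}},
]
mjcf_xml = to_mjcf(bodies)
```

Nesting the door's `body` inside the cabinet's mirrors the part–part kinematic edge in the graph, and positions are expressed relative to the parent body.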

5. Applications: Mapping, Reasoning, Interaction, and Affordance

3D scene graphs function across multiple downstream applications:

Semantic Mapping and Robotic Reasoning: Scene graphs provide the basis for spatial queries (“Where is object X?”), object grounding, and plan synthesis in dynamic, shared environments (Kim et al., 2019, Olivastri et al., 2024).
Language-based and Free-form Querying: Systems like FreeQ-Graph (Zhan et al., 16 Jun 2025) enable chain-of-thought LLM reasoning over graphs for arbitrarily complex queries involving semantic labels, attributes, and spatial relations.
Affordance and Task-driven Interaction: Incorporating functional element nodes with affordance attributes allows open-vocabulary grounding of interaction queries (e.g., “open the freezer drawer handle” resolves to the appropriate part node via the graph structure) (Rotondi et al., 10 Mar 2025, Li et al., 7 Jun 2026).
Incremental and Dynamic Update: In lifelong operation, multimodal 3DSG updaters ingest new perceptions, actions, human inputs, and time-based priors to maintain graph consistency as environments evolve (Olivastri et al., 2024).
Open-world and Real-time Robotics: Methods such as OGScene3D (Zhu et al., 17 Mar 2026), GaussianGraph (Wang et al., 6 Mar 2025), and SceneGraphFusion (Wu et al., 2021) perform incremental, open-vocabulary mapping and scene graph construction at interactive rates for real-world robots.
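The affordance-grounding example above (“open the freezer drawer handle”) amounts to selecting a part node by following “has part” edges from the right parent object. The sketch below shows that lookup with exact string matching on labels. This is a deliberate simplification: the cited systems use open-vocabulary embeddings or language models for matching, and all node identifiers and labels here are hypothetical.

```python
def resolve_part(nodes, edges, object_label, part_label):
    """Return ids of part nodes with `part_label` attached to an object
    labeled `object_label` via a "has part" edge."""
    return [
        dst
        for src, predicate, dst in edges
        if predicate == "has part"
        and nodes[src]["label"] == object_label
        and nodes[dst]["label"] == part_label
    ]


# Hypothetical kitchen graph with two handles on different objects.
nodes = {
    "n1": {"label": "freezer drawer", "layer": "object"},
    "n2": {"label": "handle", "layer": "part", "affordance": "pull"},
    "n3": {"label": "cabinet", "layer": "object"},
    "n4": {"label": "handle", "layer": "part", "affordance": "pull"},
}
edges = [
    ("n1", "has part", "n2"),
    ("n3", "has part", "n4"),
    ("n1", "next to", "n3"),
]

targets = resolve_part(nodes, edges, "freezer drawer", "handle")
```

Both handles share the label "handle", so the label alone is ambiguous; the graph structure is what disambiguates the query, leaving only `"n2"`.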

6. Evaluation Protocols, Metrics, and Empirical Results

Research evaluates scene graph construction and utility across several axes:

Node and Edge Prediction: Metrics include object/predicate/relationship recall@K (R@K), node/edge mAP, and semantic segmentation scores (mIoU, mAcc).
Grounding and Querying: Acc@IoU for object localization, open-vocabulary query success (e.g., top-1 recall in text-queried grounding on Replica/Nr3D datasets).
Scene Graph Consistency: Relationship Alignment Score (RAS) measures how well generated scenes satisfy edge constraints (Naanaa et al., 2023).
Functional/Affordance Tasks: IoU-based measures for task-driven retrieval of affordance elements, as well as manual evaluation on free-form interaction queries.
Graph and System Efficiency: Runtime per frame, scalability, completeness, and labor savings for semi-automatic pipelines (Armeni et al., 2019, Kim et al., 2019).
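For relationship prediction, recall@K is typically computed as the fraction of ground-truth (subject, predicate, object) triples that appear among the K highest-scoring predictions. A minimal sketch follows, assuming exact triple matching and ignoring the localization criteria that some benchmarks add; the example triples and scores are made up.

```python
def recall_at_k(predictions, ground_truth, k):
    """predictions: list of (score, (subject, predicate, object));
    ground_truth: iterable of (subject, predicate, object) triples."""
    gt = set(ground_truth)
    if not gt:
        raise ValueError("ground truth must contain at least one triple")
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    top_k = {triple for _, triple in ranked}
    return len(top_k & gt) / len(gt)


# Hypothetical predictions for a scene with two annotated relations.
predictions = [
    (0.9, ("cup", "on", "table")),
    (0.7, ("chair", "on", "table")),        # wrong predicate
    (0.6, ("chair", "next to", "table")),
]
ground_truth = [("cup", "on", "table"), ("chair", "next to", "table")]

r_at_1 = recall_at_k(predictions, ground_truth, k=1)
r_at_3 = recall_at_k(predictions, ground_truth, k=3)
```

With these inputs, recall@1 is 0.5 and recall@3 is 1.0, because the second correct triple is only ranked third.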

Empirically, integration of knowledge graphs (Qiu et al., 2023), self-supervised graph bottlenecks (Koch et al., 2023), and retrieval augmentation (Wang et al., 4 Mar 2026) are each reported to improve recall, label efficiency, or relation accuracy over their respective baselines.

7. Limitations, Open Challenges, and Future Directions

Segmentation and Association Quality: Scene graph accuracy is bounded by the quality of geometric segmentation, semantic grounding, and cross-view/data association. Partial or over-merged segments, unrecognized objects, and missing modalities degrade performance (Wu et al., 2021, Wang et al., 6 Mar 2025).
Ambiguity and View-Dependence: Conventional “left/right” predicates are viewpoint dependent; recent work such as VIZOR (Madhavaram et al., 31 Jan 2026) addresses this with object-centric axes.
Scalability and Real-time Operation: While current systems run at interactive rates, open-world incremental mapping at city or campus scale remains a challenge. Hierarchical and online graph summarization are promising directions.
Commonsense and Physical Reasoning: Integration of external KGs improves relationship prediction, but coverage and granularity remain issues for novel or rare relations (Qiu et al., 2023). Physics-aware graphs (Li et al., 7 Jun 2026) show potential for bridging perception and simulation.
Open Vocabulary and Language Grounding: Reliance on LLM/VLMs increases flexibility but introduces sensitivity to prompt design and inference latency. Hybrid neuro-symbolic approaches and explicit retrieval of visual-structural priors (e.g., SGR³ (Wang et al., 4 Mar 2026)) are active research areas.
Simulation and Generative Quality: Physically plausible, controllable scene generation (including fine-grained geometry/material realism) is under continuous improvement (Ruiz et al., 18 Nov 2025, Liu et al., 10 Mar 2025, Dhamo et al., 2021).
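The view-dependence of “left/right” noted above can be made concrete with a planar example. The sketch below tests whether a target lies to the left of a reference relative to a chosen heading: passing the camera's heading gives a viewpoint-dependent answer, while passing the reference object's own facing direction gives an object-centric one. This is a 2D illustration under assumed conventions (counterclockwise angles, left at +90° from forward), not VIZOR's formulation.

```python
import math


def is_left_of(target, reference, heading):
    """True if `target` is to the left of `reference` with respect to a
    forward direction at angle `heading` (radians, counterclockwise)."""
    # Forward is (cos h, sin h); rotating it by +90 degrees gives left.
    left = (-math.sin(heading), math.cos(heading))
    dx, dy = target[0] - reference[0], target[1] - reference[1]
    return dx * left[0] + dy * left[1] > 0


# Hypothetical top-down layout.
chair = (0.0, 0.0)
lamp = (0.0, 1.0)

# Viewpoint-dependent: the answer flips when the camera turns around.
from_camera_a = is_left_of(lamp, chair, heading=0.0)        # camera looks along +x
from_camera_b = is_left_of(lamp, chair, heading=math.pi)    # camera looks along -x

# Object-centric: use the chair's own facing direction instead.
chair_facing = 0.0
object_centric = is_left_of(lamp, chair, heading=chair_facing)
```

The two camera headings disagree about whether the lamp is left of the chair, whereas the object-centric result depends only on how the chair itself is oriented.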

In summary, 3D scene graphs provide an expressive, extensible, and operational paradigm for structured 3D scene understanding. Their integration with semantic/geometric perception, language and knowledge models, generative frameworks, and physical reasoning enables new levels of abstraction, control, and generalization across robotics, simulation, and vision. Ongoing research advances the representational richness, scalability, zero-shot generalization, and interactive capacity of 3D scene graphs, consolidating their foundational role in the future of embodied AI and semantic 3D modeling.