3D Scene Graphs: Foundations and Advances

Updated 29 November 2025
  • 3D scene graphs are structured, layered graphs that represent objects, spaces, and their relationships in 3D environments.
  • Advances integrate hierarchical semantics, open-vocabulary labels, and multi-modal fusion to enhance scene reasoning and generative synthesis.
  • They enable practical applications in robotics, VR, and planning by unifying spatial configurations with semantic and functional insights.

A 3D scene graph is a layered, attributed graph representation that formally organizes entities, spatial configurations, and semantic relationships in three-dimensional environments. Nodes correspond to meaningful entities—such as objects, rooms, humans, functional elements, terrains, or places—while directed edges encode spatial, semantic, functional, or social relations. This abstraction supports reasoning, generation, manipulation, and planning tasks spanning robotics, vision-language understanding, interactive simulation, and embodied AI. 3D scene graphs unify metric geometry, hierarchical semantics, and relational structures, with recent advances extending them to open-vocabulary, zero-shot, and multi-agent contexts.

1. Formal Definitions and Representational Foundations

A 3D scene graph is defined as a directed (occasionally undirected), attributed graph $G = (V, E, A)$, where $V$ is the set of entity nodes, $E$ the set of typed relational edges between them, and $A$ the geometric, semantic, and relational attributes attached to nodes and edges.

Hierarchies are common. For instance, KeySG defines five levels: building, floor, room, object, and functional element, with containment edges only (Werby et al., 1 Oct 2025). Social 3D Scene Graphs introduce human-agent nodes and activity edges (e.g., "Person –speaking to→ AnotherPerson") using open-vocabulary relations from VerbAtlas (Bartoli et al., 29 Sep 2025). Outdoor graphs use terrain-aware layers: metric-semantic point cloud, objects, generalized Voronoi place nodes, regions, and a map root (Samuelson et al., 6 Jun 2025).

Attributes:

  • Geometric: 3D position, size, centroid, orientation, bounding boxes, mesh segmentation.
  • Semantic: Class labels (fixed or open-set), CLIP/text embeddings, affordance labels, human posture/gaze.
  • Relational: Relation label (typed by predicate), geometric vector, support, proximity, style, functionality.
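A minimal Python record combining these attribute families might look as follows; the field names and types are illustrative assumptions rather than a schema from any of the cited papers.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class SceneNode:
    # Geometric attributes
    centroid: np.ndarray                      # (3,) position in the world frame
    bbox: np.ndarray                          # (2, 3) axis-aligned min/max corners
    # Semantic attributes
    label: str                                # fixed or open-set class label
    embedding: Optional[np.ndarray] = None    # e.g., a CLIP feature vector
    affordances: list = field(default_factory=list)  # e.g., ["grasp", "fill"]

# Example instantiation for a tabletop object:
mug = SceneNode(centroid=np.array([0.4, 1.1, 0.8]),
                bbox=np.array([[0.35, 1.05, 0.7], [0.45, 1.15, 0.9]]),
                label="mug", affordances=["grasp", "fill"])
```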

Edges are often represented by typed adjacency tensors $A \in \{0,1\}^{N \times N \times R}$, where $N$ is the number of nodes and $R$ the number of relation types, supporting multi-relational modeling (Naanaa et al., 2023, Liu et al., 10 Mar 2025).
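As a concrete illustration, the sketch below builds such a tensor for a toy scene; the relation vocabulary, node names, and helper function are hypothetical, not drawn from any cited system.

```python
import numpy as np

# Hypothetical relation vocabulary and node list; real pipelines derive
# these from the detection and relation-prediction stages.
RELATIONS = ["supports", "next_to", "part_of"]  # R = 3 relation types
nodes = ["table", "mug", "handle"]              # N = 3 entity nodes

N, R = len(nodes), len(RELATIONS)
A = np.zeros((N, N, R), dtype=np.uint8)  # A in {0,1}^{N x N x R}

def add_edge(A, src, dst, rel):
    """Set A[src, dst, rel] = 1 for a directed, typed edge."""
    A[nodes.index(src), nodes.index(dst), RELATIONS.index(rel)] = 1

add_edge(A, "table", "mug", "supports")   # table -supports-> mug
add_edge(A, "mug", "table", "next_to")    # symmetric relations need both directions
add_edge(A, "handle", "mug", "part_of")

# Query: which objects does the table support?
i = nodes.index("table"); r = RELATIONS.index("supports")
print([nodes[j] for j in range(N) if A[i, j, r]])  # ['mug']
```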

2. Construction Pipelines and Prediction Methodologies

Scene graph construction pipelines process raw inputs (RGB-D frames, LiDAR scans, panoramic images, or point clouds) through the following stages:

  1. Instance Segmentation and Detection: Objects and primitives are identified, typically via PointNet/PointNet++, Mask R-CNN, YOLO, or FastSAM. Keyframes and segmentation masks ensure geometric diversity and high coverage (Werby et al., 1 Oct 2025, Wu et al., 2021, Armeni et al., 2019, Rotondi et al., 10 Mar 2025).
  2. Attribute Extraction: Per-node embeddings are computed (PointNet features, CLIP embeddings, class probabilities, pose, size). Label hierarchies and attributes (color, material, affordances, behavior descriptors) are aggregated (Wald et al., 2020, Bartoli et al., 29 Sep 2025).
  3. Edge Creation: Spatial edges are determined by proximity, containment, or explicit relational annotation (e.g., "next to," "support," "adjacent"). Advanced pipelines prompt VLMs or LLMs on multi-view images for open-vocabulary and long-range relation detection (Rotondi et al., 10 Mar 2025, Koch et al., 19 Feb 2024, Saxena et al., 24 Oct 2025).
  4. Multi-Modal Fusion: Cross-modal embeddings fuse 2D and 3D features for open-set classification and relationship reasoning via joint vision-language co-embedding spaces; grounding aligns predictions with foundation models (Koch et al., 19 Feb 2024, Saxena et al., 24 Oct 2025).
  5. Incremental Updates: New detections are incrementally fused; node features and spatial locations are updated by exponential moving average, and redundant nodes and edges are pruned by attribute similarity and spatial proximity (Saxena et al., 24 Oct 2025, Wu et al., 2021). A minimal fusion sketch follows this list.
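The fusion step in stage 5 can be sketched as follows; the similarity and distance thresholds, the EMA coefficient, and the dict-based node representation are illustrative assumptions, not values from the cited systems.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fuse_detection(graph, det, alpha=0.9, sim_thresh=0.85, dist_thresh=0.5):
    """Merge a new detection into the graph or add it as a new node.

    graph: list of dicts with 'feature' (e.g., a CLIP embedding) and 'position'.
    det:   dict with the same keys for the current frame's detection.
    alpha, sim_thresh, dist_thresh are illustrative hyperparameters.
    """
    for node in graph:
        sim = cosine(node["feature"], det["feature"])
        dist = np.linalg.norm(node["position"] - det["position"])
        if sim > sim_thresh and dist < dist_thresh:
            # Exponential moving average of features and positions.
            node["feature"] = alpha * node["feature"] + (1 - alpha) * det["feature"]
            node["position"] = alpha * node["position"] + (1 - alpha) * det["position"]
            return node
    graph.append(det)  # unmatched detection becomes a new node
    return det
```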

Graph neural networks (GCN, GAT, R-GCN, transformers with self- and cross-attention, Feature-wise Attention, message-passing architectures) propagate context, enabling object and predicate prediction, high-order relational inference, and denoising in generative pipelines (Dhamo et al., 2021, Naanaa et al., 2023, Wu et al., 2021, Kamarianakis et al., 2023, Werby et al., 1 Oct 2025). Commonsense knowledge graphs—external (Visual Genome, ConceptNet, WordNet) or internal—are injected for improved recall and robustness (Qiu et al., 2023).
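A single simplified R-GCN-style propagation step over the typed adjacency tensor from Section 1 might look like the sketch below; it omits basis decomposition and bias terms for brevity, so it is a schematic of the technique rather than a faithful reimplementation of any cited model.

```python
import numpy as np

def rgcn_step(H, A, W_rel, W_self):
    """One simplified R-GCN-style propagation step.

    H:      (N, d) node features
    A:      (N, N, R) typed adjacency tensor
    W_rel:  (R, d, d) per-relation weight matrices
    W_self: (d, d) self-loop weight matrix
    """
    N, d = H.shape
    R = A.shape[2]
    out = H @ W_self  # self-loop term
    for r in range(R):
        # Degree normalization per node and relation type.
        deg = A[:, :, r].sum(axis=1, keepdims=True).clip(min=1)
        out += (A[:, :, r] @ (H @ W_rel[r])) / deg
    return np.maximum(out, 0.0)  # ReLU nonlinearity
```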

3. Generative Modeling, Manipulation, and Control

3D scene graphs are employed as semantic control structures for scene generation and manipulation. They serve as input interfaces to conditional generative models, enabling controllable, diverse, and semantically consistent 3D scene synthesis.

Generative Methods: Graph-convolutional variational autoencoders generate object layouts and geometry jointly from the input graph (Dhamo et al., 2021), while more recent diffusion-based models condition the denoising process on graph structure (Zhai et al., 2023, Liu et al., 10 Mar 2025).

Manipulation:

  • Editing the input scene graph allows direct modification of the output scene (addition/deletion of objects, change of relations). Latent graph transformers re-encode edited graphs and propagate updated latent states for re-generation (Dhamo et al., 2021).
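The edit-then-regenerate pattern can be sketched with a plain dict-based graph; here `regenerate` is a hypothetical stand-in for re-encoding the edited graph with a latent graph transformer, as in Graph-to-3D-style pipelines.

```python
# A minimal, dict-based scene graph; schema is illustrative only.
scene_graph = {
    "nodes": {"sofa": {"class": "sofa"}, "table": {"class": "table"}},
    "edges": [("table", "next_to", "sofa")],
}

def add_object(graph, name, attrs, relations):
    """Insert a new node plus its (src, predicate, dst) relation triples."""
    graph["nodes"][name] = attrs
    graph["edges"].extend(relations)

def change_relation(graph, old, new):
    """Replace one relation triple with another."""
    graph["edges"] = [new if e == old else e for e in graph["edges"]]

add_object(scene_graph, "lamp", {"class": "lamp"},
           [("lamp", "standing_on", "table")])
change_relation(scene_graph,
                ("table", "next_to", "sofa"),
                ("table", "in_front_of", "sofa"))
# regenerate(scene_graph)  # hypothetical: decode the edited graph to geometry
```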

Metrics for Control: Control fidelity is typically measured as the fraction of input-graph relational constraints (e.g., "left of," "bigger than") that hold in the generated scene, complemented by sample diversity and perceptual quality measures (Dhamo et al., 2021).

4. Hierarchical, Social, and Functionality-Aware Extensions

Recent research extends scene graphs to encompass hierarchical containment, dynamic entities, social/human-agent relations, and functional/part-level resolution:

  • Hierarchical Scene Graphs: KeySG encodes environments as multilevel graphs (building, floor, room, object, functional element) leveraging keyframes for coverage. Retrieval-Augmented Generation (RAG) enables fast context extraction for complex queries, bypassing LLM context window limitations (Werby et al., 1 Oct 2025).
  • Social Scene Graphs: S³DSG integrates human nodes with posture, gaze, activity descriptors, and open-vocabulary social relations. VLMs and LLMs permit long-range remote relation reasoning; benchmark datasets enable reproducible evaluation on spatial, activity, and functional queries (Bartoli et al., 29 Sep 2025).
  • Functionality-Aware Graphs: FunGraph models affordance-relevant parts (handles, knobs) as first-class graph nodes linked to objects. Detectors trained on synthesized 2D projections allow fine-grained grounding for robot manipulation tasks. Explicit affordance edges ("rotate," "pull") support language-prompted interaction (Rotondi et al., 10 Mar 2025).
  • Open-Vocabulary and Zero-Shot Reasoning: ZING-3D and Open3DSG leverage VLM and LLM foundation models for open-class detection and unconstrained relationship assignment, supporting continuous, incremental graph growth and queryable open-set semantics in point clouds (Saxena et al., 24 Oct 2025, Koch et al., 19 Feb 2024). A minimal embedding-based query sketch follows this list.
  • Geometric Algebra-Unified Graphs: UniSGGA encodes object transforms as Projective/Conformal GA multivectors, streamlining scene synthesis and topology prediction via GNNs and CGVAEs for generative tasks (Kamarianakis et al., 2023).
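Open-vocabulary querying over such graphs often reduces to embedding similarity. The sketch below ranks nodes against a free-form text query, assuming CLIP-style features are stored per node and that some text encoder (a hypothetical `embed_text("something to sit on")`) produces the query embedding.

```python
import numpy as np

def query_nodes(node_embeddings, node_names, text_embedding, top_k=3):
    """Rank scene graph nodes by cosine similarity to a text query.

    node_embeddings: (N, d) array of per-node features (e.g., CLIP)
    node_names:      list of N node identifiers
    text_embedding:  (d,) embedding of the query string
    """
    e = node_embeddings / np.linalg.norm(node_embeddings, axis=1, keepdims=True)
    q = text_embedding / np.linalg.norm(text_embedding)
    scores = e @ q
    order = np.argsort(-scores)[:top_k]
    return [(node_names[i], float(scores[i])) for i in order]
```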

5. Applications: Planning, Reasoning, Robotics, and Scene Understanding

3D scene graphs underpin a range of downstream embodied, planning, and reasoning tasks:

  • Path Planning and Spatial Reasoning: S-Path and the Hydra family exploit the metric-semantic structure to decompose indoor environments, supporting high-level semantic A* search and parallel sampling-based planning; efficient replanning mechanisms bias the search toward reusing previously solved subproblems (Ejaz et al., 8 Aug 2025, Chang et al., 2023, Strader et al., 9 Jun 2025). A minimal room-level planning sketch follows this list.
  • Task and Motion Planning: Scene graphs are exported to PDDL for symbolic reasoning, with LLM-driven translation of natural-language instructions into graph-grounded goals. Multi-robot fusion and relocalization enable robust operation in large-scale, dynamic environments (Strader et al., 9 Jun 2025).
  • Visual Question Answering (VQA): Graph queries facilitate answering spatial, attribute, and count-based questions (e.g., "How many mugs on the shelf?") (Kim et al., 2019, Werby et al., 1 Oct 2025).
  • Scene Retrieval and Matching: Cross-domain retrieval leverages graph similarity (node/edge sets, Jaccard/Simpson metrics) for 2D→3D and 3D→3D matching despite object rearrangement or occlusion (Wald et al., 2020).
  • Navigation and Manipulation: Room/place graphs define topological maps. Affordance-aware functional elements support precision planning and manipulation routines for robotics (Rotondi et al., 10 Mar 2025, Ejaz et al., 8 Aug 2025).
  • Simulated Scene Generation and VR: Generative models conditioned on scene graphs synthesize realistic indoor and outdoor scenes with controlled diversity, supporting training and evaluation for simulation environments (Zhai et al., 2023, Liu et al., 10 Mar 2025).
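At the room/place layer, high-level planning reduces to graph search. The sketch below runs plain Dijkstra over a hypothetical room-adjacency graph; real systems such as the Hydra family couple this kind of high-level search with geometric planners and replanning heuristics.

```python
import heapq

# Hypothetical room-adjacency graph; edge weights are traversal costs.
room_graph = {
    "hallway": {"kitchen": 4.0, "office": 6.0},
    "kitchen": {"hallway": 4.0, "pantry": 2.0},
    "office":  {"hallway": 6.0},
    "pantry":  {"kitchen": 2.0},
}

def plan(graph, start, goal):
    """Dijkstra search over the room layer; returns (cost, room sequence)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, room, path = heapq.heappop(frontier)
        if room == goal:
            return cost, path
        if room in visited:
            continue
        visited.add(room)
        for nxt, w in graph[room].items():
            if nxt not in visited:
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

print(plan(room_graph, "office", "pantry"))
# (12.0, ['office', 'hallway', 'kitchen', 'pantry'])
```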

6. Limitations, Evaluation, and Future Directions

Current challenges and open issues include scaling VLM/LLM-based relation extraction to large scenes within context-window and latency budgets, maintaining graph consistency under dynamic and long-term scene change, fusing graphs reliably across multiple agents, and establishing standardized benchmarks for open-vocabulary relation prediction.

7. Key Research and Benchmarks

Notable datasets and frameworks include the 3D Scene Graph dataset built on Gibson environments (Armeni et al., 2019), the 3DSSG relationship annotations over 3RScan reconstructions (Wald et al., 2020), and real-time construction and planning systems in the Hydra family (Chang et al., 2023, Strader et al., 9 Jun 2025).

Research directions likely to expand 3D scene graph capabilities include open-ended semantic parsing, behavior/affordance injection, distributed multi-agent fusion, outdoor scene graphs, active graph learning, and cross-modal (text/image/graph) reasoning.


References: Kim et al., 2019; Wald et al., 2020; Armeni et al., 2019; Wu et al., 2021; Dhamo et al., 2021; Chang et al., 2023; Zhai et al., 2023; Kamarianakis et al., 2023; Naanaa et al., 2023; Qiu et al., 2023; Koch et al., 19 Feb 2024; Liu et al., 10 Mar 2025; Rotondi et al., 10 Mar 2025; Samuelson et al., 6 Jun 2025; Strader et al., 9 Jun 2025; Ejaz et al., 8 Aug 2025; Bartoli et al., 29 Sep 2025; Werby et al., 1 Oct 2025; Saxena et al., 24 Oct 2025.
