3D Scene Graphs: Foundations and Advances

Updated 29 November 2025
  • 3D scene graphs are structured, layered graphs that represent objects, spaces, and their relationships in 3D environments.
  • Advances integrate hierarchical semantics, open-vocabulary labels, and multi-modal fusion to enhance scene reasoning and generative synthesis.
  • They enable practical applications in robotics, VR, and planning by unifying spatial configurations with semantic and functional insights.

A 3D scene graph is a layered, attributed graph representation that formally organizes entities, spatial configurations, and semantic relationships in three-dimensional environments. Nodes correspond to meaningful entities—such as objects, rooms, humans, functional elements, terrains, or places—while directed edges encode spatial, semantic, functional, or social relations. This abstraction supports reasoning, generation, manipulation, and planning tasks spanning robotics, vision-language understanding, interactive simulation, and embodied AI. 3D scene graphs unify metric geometry, hierarchical semantics, and relational structures, with recent advances extending them to open-vocabulary, zero-shot, and multi-agent contexts.

1. Formal Definitions and Representational Foundations

A 3D scene graph is defined as a directed (occasionally undirected), attributed graph $G = (V, E, A)$, where $V$ is the set of entity nodes, $E$ the set of typed relational edges between them, and $A$ the geometric, semantic, and relational attributes attached to nodes and edges.

Hierarchies are common. For instance, KeySG defines five levels: building, floor, room, object, and functional element, with containment edges only (Werby et al., 1 Oct 2025). Social 3D Scene Graphs introduce human-agent nodes and activity edges (e.g., "Person –speaking to→ AnotherPerson") using open-vocabulary relations from VerbAtlas (Bartoli et al., 29 Sep 2025). Outdoor graphs use terrain-aware layers: metric-semantic point cloud, objects, generalized Voronoi place nodes, regions, and a map root (Samuelson et al., 6 Jun 2025).

Attributes:

  • Geometric: 3D position, size, centroid, orientation, bounding boxes, mesh segmentation.
  • Semantic: Class labels (fixed or open-set), CLIP/text embeddings, affordance labels, human posture/gaze.
  • Relational: Relation label (typed by predicate), geometric vector, support, proximity, style, functionality.
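A minimal Python record combining these attribute families might look as follows; the field names and types are illustrative assumptions rather than a schema from any of the cited papers.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class SceneNode:
    # Geometric attributes
    centroid: np.ndarray                      # (3,) position in the world frame
    bbox: np.ndarray                          # (2, 3) axis-aligned min/max corners
    # Semantic attributes
    label: str                                # fixed or open-set class label
    embedding: Optional[np.ndarray] = None    # e.g., a CLIP feature vector
    affordances: list = field(default_factory=list)  # e.g., ["grasp", "fill"]

# Example instantiation for a tabletop object:
mug = SceneNode(centroid=np.array([0.4, 1.1, 0.8]),
                bbox=np.array([[0.35, 1.05, 0.7], [0.45, 1.15, 0.9]]),
                label="mug", affordances=["grasp", "fill"])
```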

Edges are often represented by typed adjacency tensors $A \in \{0,1\}^{N \times N \times R}$, where $N$ is the number of nodes and $R$ the number of relation types, supporting multi-relational modeling (Naanaa et al., 2023, Liu et al., 10 Mar 2025).
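As a concrete illustration, the sketch below builds such a tensor for a toy scene; the relation vocabulary, node names, and helper function are hypothetical, not drawn from any cited system.

```python
import numpy as np

# Hypothetical relation vocabulary and node list; real pipelines derive
# these from the detection and relation-prediction stages.
RELATIONS = ["supports", "next_to", "part_of"]  # R = 3 relation types
nodes = ["table", "mug", "handle"]              # N = 3 entity nodes

N, R = len(nodes), len(RELATIONS)
A = np.zeros((N, N, R), dtype=np.uint8)  # A in {0,1}^{N x N x R}

def add_edge(A, src, dst, rel):
    """Set A[src, dst, rel] = 1 for a directed, typed edge."""
    A[nodes.index(src), nodes.index(dst), RELATIONS.index(rel)] = 1

add_edge(A, "table", "mug", "supports")   # table -supports-> mug
add_edge(A, "mug", "table", "next_to")    # symmetric relations need both directions
add_edge(A, "handle", "mug", "part_of")

# Query: which objects does the table support?
i = nodes.index("table"); r = RELATIONS.index("supports")
print([nodes[j] for j in range(N) if A[i, j, r]])  # ['mug']
```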

2. Construction Pipelines and Prediction Methodologies

Scene graph construction pipelines process raw inputs (RGB-D frames, LiDAR scans, panoramic images, or point clouds) through the following stages:

  1. Instance Segmentation and Detection: Objects and primitives are identified, typically via PointNet/PointNet++, Mask R-CNN, YOLO, or FastSAM. Keyframes and segmentation masks ensure geometric diversity and high coverage (Werby et al., 1 Oct 2025, Wu et al., 2021, Armeni et al., 2019, Rotondi et al., 10 Mar 2025).
  2. Attribute Extraction: Per-node embeddings are computed (PointNet features, CLIP embeddings, class probabilities, pose, size). Label hierarchies and attributes (color, material, affordances, behavior descriptors) are aggregated (Wald et al., 2020, Bartoli et al., 29 Sep 2025).
  3. Edge Creation: Spatial edges are determined by proximity, containment, or explicit relational annotation (e.g., "next to," "support," "adjacent"). Advanced pipelines prompt VLMs or LLMs on multi-view images for open-vocabulary and long-range relation detection (Rotondi et al., 10 Mar 2025, Koch et al., 19 Feb 2024, Saxena et al., 24 Oct 2025).
  4. Multi-Modal Fusion: Cross-modal embeddings fuse 2D and 3D features for open-set classification and relationship reasoning via joint vision-language co-embedding spaces; grounding aligns predictions with foundation models (Koch et al., 19 Feb 2024, Saxena et al., 24 Oct 2025).
  5. Incremental Updates: New detections are incrementally fused; node features and spatial locations are updated by exponential moving average, and redundant nodes and edges are pruned by attribute similarity and spatial proximity (Saxena et al., 24 Oct 2025, Wu et al., 2021). A minimal fusion sketch follows this list.
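The fusion step in stage 5 can be sketched as follows; the similarity and distance thresholds, the EMA coefficient, and the dict-based node representation are illustrative assumptions, not values from the cited systems.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fuse_detection(graph, det, alpha=0.9, sim_thresh=0.85, dist_thresh=0.5):
    """Merge a new detection into the graph or add it as a new node.

    graph: list of dicts with 'feature' (e.g., a CLIP embedding) and 'position'.
    det:   dict with the same keys for the current frame's detection.
    alpha, sim_thresh, dist_thresh are illustrative hyperparameters.
    """
    for node in graph:
        sim = cosine(node["feature"], det["feature"])
        dist = np.linalg.norm(node["position"] - det["position"])
        if sim > sim_thresh and dist < dist_thresh:
            # Exponential moving average of features and positions.
            node["feature"] = alpha * node["feature"] + (1 - alpha) * det["feature"]
            node["position"] = alpha * node["position"] + (1 - alpha) * det["position"]
            return node
    graph.append(det)  # unmatched detection becomes a new node
    return det
```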

Graph neural networks (GCN, GAT, R-GCN, transformers with self- and cross-attention, Feature-wise Attention, message-passing architectures) propagate context, enabling object and predicate prediction, high-order relational inference, and denoising in generative pipelines (Dhamo et al., 2021, Naanaa et al., 2023, Wu et al., 2021, Kamarianakis et al., 2023, Werby et al., 1 Oct 2025). Commonsense knowledge graphs—external (Visual Genome, ConceptNet, WordNet) or internal—are injected for improved recall and robustness (Qiu et al., 2023).
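A single simplified R-GCN-style propagation step over the typed adjacency tensor from Section 1 might look like the sketch below; it omits basis decomposition and bias terms for brevity, so it is a schematic of the technique rather than a faithful reimplementation of any cited model.

```python
import numpy as np

def rgcn_step(H, A, W_rel, W_self):
    """One simplified R-GCN-style propagation step.

    H:      (N, d) node features
    A:      (N, N, R) typed adjacency tensor
    W_rel:  (R, d, d) per-relation weight matrices
    W_self: (d, d) self-loop weight matrix
    """
    N, d = H.shape
    R = A.shape[2]
    out = H @ W_self  # self-loop term
    for r in range(R):
        # Degree normalization per node and relation type.
        deg = A[:, :, r].sum(axis=1, keepdims=True).clip(min=1)
        out += (A[:, :, r] @ (H @ W_rel[r])) / deg
    return np.maximum(out, 0.0)  # ReLU nonlinearity
```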

3. Generative Modeling, Manipulation, and Control

3D scene graphs are employed as semantic control structures for scene generation and manipulation. They serve as input interfaces to conditional generative models, enabling controllable, diverse, and semantically consistent 3D scene synthesis.

Generative Methods: Graph-convolutional variational autoencoders generate object layouts and geometry jointly from the input graph (Dhamo et al., 2021), while more recent diffusion-based models condition the denoising process on graph structure (Zhai et al., 2023, Liu et al., 10 Mar 2025).

Manipulation:

  • Editing the input scene graph allows direct modification of the output scene (addition/deletion of objects, change of relations). Latent graph transformers re-encode edited graphs and propagate updated latent states for re-generation (Dhamo et al., 2021).
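The edit-then-regenerate pattern can be sketched with a plain dict-based graph; here `regenerate` is a hypothetical stand-in for re-encoding the edited graph with a latent graph transformer, as in Graph-to-3D-style pipelines.

```python
# A minimal, dict-based scene graph; schema is illustrative only.
scene_graph = {
    "nodes": {"sofa": {"class": "sofa"}, "table": {"class": "table"}},
    "edges": [("table", "next_to", "sofa")],
}

def add_object(graph, name, attrs, relations):
    """Insert a new node plus its (src, predicate, dst) relation triples."""
    graph["nodes"][name] = attrs
    graph["edges"].extend(relations)

def change_relation(graph, old, new):
    """Replace one relation triple with another."""
    graph["edges"] = [new if e == old else e for e in graph["edges"]]

add_object(scene_graph, "lamp", {"class": "lamp"},
           [("lamp", "standing_on", "table")])
change_relation(scene_graph,
                ("table", "next_to", "sofa"),
                ("table", "in_front_of", "sofa"))
# regenerate(scene_graph)  # hypothetical: decode the edited graph to geometry
```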

Metrics for Control: Control fidelity is typically measured as the fraction of input-graph relational constraints (e.g., "left of," "bigger than") that hold in the generated scene, complemented by sample diversity and perceptual quality measures (Dhamo et al., 2021).

4. Hierarchical, Social, and Functionality-Aware Extensions

Recent research extends scene graphs to encompass hierarchical containment, dynamic entities, social/human-agent relations, and functional/part-level resolution:

  • Hierarchical Scene Graphs: KeySG encodes environments as multilevel graphs (building, floor, room, object, functional element) leveraging keyframes for coverage. Retrieval-Augmented Generation (RAG) enables fast context extraction for complex queries, bypassing LLM context window limitations (Werby et al., 1 Oct 2025).
  • Social Scene Graphs: S³DSG integrates human nodes with posture, gaze, activity descriptors, and open-vocabulary social relations. VLMs and LLMs permit long-range remote relation reasoning; benchmark datasets enable reproducible evaluation on spatial, activity, and functional queries (Bartoli et al., 29 Sep 2025).
  • Functionality-Aware Graphs: FunGraph models affordance-relevant parts (handles, knobs) as first-class graph nodes linked to objects. Detectors trained on synthesized 2D projections allow fine-grained grounding for robot manipulation tasks. Explicit affordance edges ("rotate," "pull") support language-prompted interaction (Rotondi et al., 10 Mar 2025).
  • Open-Vocabulary and Zero-Shot Reasoning: ZING-3D and Open3DSG leverage VLM and LLM foundation models for open-class detection and unconstrained relationship assignment, supporting continuous, incremental graph growth and queryable open-set semantics in point clouds (Saxena et al., 24 Oct 2025, Koch et al., 19 Feb 2024). A minimal embedding-based query sketch follows this list.
  • Geometric Algebra-Unified Graphs: UniSGGA encodes object transforms as Projective/Conformal GA multivectors, streamlining scene synthesis and topology prediction via GNNs and CGVAEs for generative tasks (Kamarianakis et al., 2023).
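Open-vocabulary querying over such graphs often reduces to embedding similarity. The sketch below ranks nodes against a free-form text query, assuming CLIP-style features are stored per node and that some text encoder (a hypothetical `embed_text("something to sit on")`) produces the query embedding.

```python
import numpy as np

def query_nodes(node_embeddings, node_names, text_embedding, top_k=3):
    """Rank scene graph nodes by cosine similarity to a text query.

    node_embeddings: (N, d) array of per-node features (e.g., CLIP)
    node_names:      list of N node identifiers
    text_embedding:  (d,) embedding of the query string
    """
    e = node_embeddings / np.linalg.norm(node_embeddings, axis=1, keepdims=True)
    q = text_embedding / np.linalg.norm(text_embedding)
    scores = e @ q
    order = np.argsort(-scores)[:top_k]
    return [(node_names[i], float(scores[i])) for i in order]
```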

5. Applications: Planning, Reasoning, Robotics, and Scene Understanding

3D scene graphs underpin a range of downstream embodied, planning, and reasoning tasks:

  • Path Planning and Spatial Reasoning: S-Path and the Hydra family exploit the metric-semantic structure to decompose indoor environments, supporting high-level semantic A* search and parallel sampling-based planning; efficient replanning mechanisms bias the search toward reusing previously solved subproblems (Ejaz et al., 8 Aug 2025, Chang et al., 2023, Strader et al., 9 Jun 2025). A minimal room-level planning sketch follows this list.
  • Task and Motion Planning: Scene graphs are exported to PDDL for symbolic reasoning, with LLM-driven translation of natural-language instructions into graph-grounded goals. Multi-robot fusion and relocalization enable robust operation in large-scale, dynamic environments (Strader et al., 9 Jun 2025).
  • Visual Question Answering (VQA): Graph queries facilitate answering spatial, attribute, and count-based questions (e.g., "How many mugs on the shelf?") (Kim et al., 2019, Werby et al., 1 Oct 2025).
  • Scene Retrieval and Matching: Cross-domain retrieval leverages graph similarity (node/edge sets, Jaccard/Simpson metrics) for 2D→3D and 3D→3D matching despite object rearrangement or occlusion (Wald et al., 2020).
  • Navigation and Manipulation: Room/place graphs define topological maps. Affordance-aware functional elements support precision planning and manipulation routines for robotics (Rotondi et al., 10 Mar 2025, Ejaz et al., 8 Aug 2025).
  • Simulated Scene Generation and VR: Generative models conditioned on scene graphs synthesize realistic indoor and outdoor scenes with controlled diversity, supporting training and evaluation for simulation environments (Zhai et al., 2023, Liu et al., 10 Mar 2025).
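At the room/place layer, high-level planning reduces to graph search. The sketch below runs plain Dijkstra over a hypothetical room-adjacency graph; real systems such as the Hydra family couple this kind of high-level search with geometric planners and replanning heuristics.

```python
import heapq

# Hypothetical room-adjacency graph; edge weights are traversal costs.
room_graph = {
    "hallway": {"kitchen": 4.0, "office": 6.0},
    "kitchen": {"hallway": 4.0, "pantry": 2.0},
    "office":  {"hallway": 6.0},
    "pantry":  {"kitchen": 2.0},
}

def plan(graph, start, goal):
    """Dijkstra search over the room layer; returns (cost, room sequence)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, room, path = heapq.heappop(frontier)
        if room == goal:
            return cost, path
        if room in visited:
            continue
        visited.add(room)
        for nxt, w in graph[room].items():
            if nxt not in visited:
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

print(plan(room_graph, "office", "pantry"))
# (12.0, ['office', 'hallway', 'kitchen', 'pantry'])
```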

6. Limitations, Evaluation, and Future Directions

Current challenges and open issues include scaling VLM/LLM-based relation extraction to large scenes within context-window and latency budgets, maintaining graph consistency under dynamic and long-term scene change, fusing graphs reliably across multiple agents, and establishing standardized benchmarks for open-vocabulary relation prediction.

7. Key Research and Benchmarks

Notable datasets and frameworks include the 3D Scene Graph dataset built on Gibson environments (Armeni et al., 2019), the 3DSSG relationship annotations over 3RScan reconstructions (Wald et al., 2020), and real-time construction and planning systems in the Hydra family (Chang et al., 2023, Strader et al., 9 Jun 2025).

Research directions likely to expand 3D scene graph capabilities include open-ended semantic parsing, behavior/affordance injection, distributed multi-agent fusion, outdoor scene graphs, active graph learning, and cross-modal (text/image/graph) reasoning.


References: Kim et al., 2019; Wald et al., 2020; Armeni et al., 2019; Wu et al., 2021; Dhamo et al., 2021; Chang et al., 2023; Zhai et al., 2023; Kamarianakis et al., 2023; Naanaa et al., 2023; Qiu et al., 2023; Koch et al., 19 Feb 2024; Liu et al., 10 Mar 2025; Rotondi et al., 10 Mar 2025; Samuelson et al., 6 Jun 2025; Strader et al., 9 Jun 2025; Ejaz et al., 8 Aug 2025; Bartoli et al., 29 Sep 2025; Werby et al., 1 Oct 2025; Saxena et al., 24 Oct 2025.
