Semantic Scene Graphs: Structured Scene Abstraction

Updated 5 June 2026
  • Semantic Scene Graphs (SSG) are graph-structured abstractions that encode scene entities and their semantically meaningful relationships as nodes and directed, attributed edges.
  • They enable multi-modal, fine-grained scene understanding and reasoning across applications such as robotics, computer vision, embodied AI, and autonomous systems.
  • SSG construction methods range from modular entity-detection pipelines to end-to-end transformer models, with relational, hierarchical, and cross-modal enhancements for dynamic environments.

A Semantic Scene Graph (SSG) is a graph-structured abstraction that encodes entities (objects, agents, places) present in a scene as nodes and explicit, semantically meaningful relationships among them as directed, attributed edges. SSGs serve as high-level, structured representations for visual, textual, or multimodal environments, supporting fine-grained scene understanding, multi-modal reasoning, and downstream decision making in fields such as computer vision, robotics, embodied AI, natural language processing, and autonomous systems. SSGs generalize classic scene graphs by insisting on rich, interpretable semantic predicates, algebraic relation properties (e.g., symmetry, transitivity), spatial and temporal alignment, and explicit grounding in source data.

1. Formal Definitions and Core Components

Mathematically, an SSG is a labeled directed graph $G = (V, E)$, where:

  • $V = \{ v_1, \dots, v_N \}$ is the set of nodes corresponding to entities (objects, agents, regions).
  • $E \subseteq V \times R \times V$ is a set of directed edges, each edge $(v_i, r, v_j)$ encoding a semantic relationship $r \in R$ (the predicate set), such as “on”, “holds”, “assist”, “next to”, or more domain-specific relations (e.g., “cuts”, “feeds”, “supports”) (Özsoy et al., 2022).

Node types are application-specific (e.g., human roles, vehicles, furniture) and may include composite or virtual nodes that aggregate small or indistinct entities (e.g., a single "instrument" node in surgical SSGs) (Özsoy et al., 2022).

Edges encapsulate both semantic class and, optionally, continuous attributes (e.g., relative distances, velocities, geometric offsets) to support fine-grained context (Zipfl et al., 2021).
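The definition above maps naturally onto a small data structure. The following is a minimal sketch, not taken from any of the cited systems: the class names `Edge` and `SceneGraph`, the method names, and the `distance_m` attribute are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """Directed edge (v_i, r, v_j) with optional continuous attributes."""
    subject: str
    predicate: str
    obj: str
    attributes: tuple = ()  # e.g. (("distance_m", 0.4),)


@dataclass
class SceneGraph:
    """Labeled directed graph G = (V, E), with E a subset of V x R x V."""
    predicates: frozenset                      # the predicate set R
    nodes: dict = field(default_factory=dict)  # node id -> entity type
    edges: set = field(default_factory=set)

    def add_node(self, node_id, entity_type):
        self.nodes[node_id] = entity_type

    def add_edge(self, subject, predicate, obj, **attributes):
        # Both endpoints must be in V and the predicate must be in R.
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be nodes in V")
        if predicate not in self.predicates:
            raise ValueError(f"unknown predicate: {predicate!r}")
        edge = Edge(subject, predicate, obj, tuple(sorted(attributes.items())))
        self.edges.add(edge)
        return edge

    def relations_of(self, subject):
        """Sorted (predicate, object) pairs for all edges leaving `subject`."""
        return sorted((e.predicate, e.obj) for e in self.edges if e.subject == subject)


graph = SceneGraph(predicates=frozenset({"on", "holds", "next to"}))
graph.add_node("cup_1", "cup")
graph.add_node("table_1", "table")
graph.add_node("person_1", "person")
graph.add_edge("cup_1", "on", "table_1", distance_m=0.0)
graph.add_edge("person_1", "next to", "table_1", distance_m=0.8)
```

Storing edges as hashable triples with attached attributes keeps the semantic class and the continuous context together, and makes duplicate relations collapse automatically.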

Extensions to SSGs include:

2. SSG Construction Methodologies

2.1 Two-Stage and End-to-End Pipelines

The dominant paradigm decomposes SSG construction into modular steps:

  1. Entity Detection: Localize and classify objects with bounding boxes, segmentation masks, or 3D regions (Zhu et al., 2022, Hou et al., 26 Jul 2025).
  2. Feature Extraction: Encode each entity with appearance, geometric, spatial, and semantic embeddings (CNN/PointNet backbones, CLIP/LLMs for semantics, DINOv3 for vision–language features) (Renz et al., 15 Sep 2025, Günther et al., 3 Feb 2026).
  3. Relationship Prediction: For each candidate subject–object (or multi-node) pair, compute fused features and predict predicate class(es) using multilayer perceptrons, GNNs, or transformer architectures (Zhu et al., 2022, Özsoy et al., 2022, Lv et al., 2023, Kim et al., 2023).
  4. Graph Construction: Form the final SSG by thresholding predicate scores and, if necessary, fusing across multiple views/frames or modalities (Hou et al., 26 Jul 2025, Wu et al., 2023, Fang et al., 13 Feb 2026).
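The final graph-construction step can be sketched as follows. This is an illustrative assumption rather than the procedure of any cited paper: per-view predicate scores are fused by taking the maximum (one simple choice among many), and the function names `fuse_views` and `build_graph`, as well as the 0.5 threshold, are hypothetical.

```python
def fuse_views(per_view_scores):
    """Merge per-view scores {(subj, obj): {predicate: score}} by max score."""
    fused = {}
    for view in per_view_scores:
        for pair, scores in view.items():
            slot = fused.setdefault(pair, {})
            for predicate, score in scores.items():
                slot[predicate] = max(score, slot.get(predicate, 0.0))
    return fused


def build_graph(entities, pair_scores, threshold=0.5):
    """Keep (subj, predicate, obj, score) triplets scoring at or above threshold."""
    known = set(entities)
    triplets = []
    for (subj, obj), scores in pair_scores.items():
        # Skip self-pairs and pairs that mention undetected entities.
        if subj == obj or subj not in known or obj not in known:
            continue
        for predicate, score in scores.items():
            if score >= threshold:
                triplets.append((subj, predicate, obj, score))
    # Highest-confidence relations first.
    triplets.sort(key=lambda t: -t[3])
    return triplets


view_a = {("cup", "table"): {"on": 0.9, "next to": 0.2},
          ("person", "cup"): {"holds": 0.4}}
view_b = {("person", "cup"): {"holds": 0.7}}

fused = fuse_views([view_a, view_b])
triplets = build_graph(["cup", "table", "person"], fused, threshold=0.5)
# triplets == [("cup", "on", "table", 0.9), ("person", "holds", "cup", 0.7)]
```

Here the "holds" relation falls below the threshold in the first view but survives after fusion with the second, which is the kind of effect multi-view fusion is meant to provide.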

Joint approaches—one-stage, transformer-style, autoregressive graph generators—bypass explicit factorization by directly mapping raw inputs (images, point clouds, image sequences) to SSG structures using end-to-end differentiable architectures (Zhang et al., 2024, Garg et al., 2021, Lv et al., 2023).

2.2 Relational and Semantic Enhancements

  • Relation-Centric Design: Edge-dual and dual-MPNN architectures propagate information not just across object nodes but also directly among relationship-encoding nodes, mitigating long-tail bias and enabling higher-order reasoning (Kim et al., 2023).
  • Implicit Language Reasoning: Recent advances leverage LLMs as scene reasoners by discretizing vision features into tokenized pseudo-language, then decoding implicit scene structure via transformer decoders (Zhang et al., 2024).
  • Zero-Shot and Cross-Modal Transfer: Universal SSGs extend to multiple modalities (image, video, 3D, text) and their combinations. Modality-specific decoders and associators align nodes and edges across domains (e.g., textual “Peter” ↔ image “person”) (Wu et al., 19 Mar 2025).
  • Pixel-Level Grounding: Segmentation-grounded models infer object masks and spatially ground predicates at pixel-level via cross-domain transfer and learned attention over object region masks (Khandelwal et al., 2021).
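To make the relation-centric idea concrete, the sketch below builds a simple edge-dual (line-graph-style) structure in which each relationship becomes a node, and two relationship nodes are linked when they share an entity. This is a simplified assumption for illustration; the exact construction and message-passing scheme in Kim et al. (2023) may differ. The function names and the scalar features are hypothetical.

```python
from itertools import combinations


def edge_dual(triplets):
    """Return (dual_nodes, dual_edges); dual edges link triplets sharing an entity."""
    dual_nodes = list(triplets)
    dual_edges = set()
    for a, b in combinations(range(len(dual_nodes)), 2):
        ends_a = {dual_nodes[a][0], dual_nodes[a][2]}
        ends_b = {dual_nodes[b][0], dual_nodes[b][2]}
        if ends_a & ends_b:
            dual_edges.add((a, b))
    return dual_nodes, dual_edges


def propagate(features, dual_edges):
    """One round of mean aggregation over each dual node and its neighbours."""
    neighbours = {i: [] for i in range(len(features))}
    for a, b in dual_edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return [
        (features[i] + sum(features[j] for j in nbrs)) / (1 + len(nbrs))
        for i, nbrs in neighbours.items()
    ]


triplets = [("person", "holds", "cup"),
            ("cup", "on", "table"),
            ("lamp", "next to", "sofa")]
dual_nodes, dual_edges = edge_dual(triplets)
# dual_edges == {(0, 1)}: only the first two relations share an entity ("cup").

updated = propagate([1.0, 3.0, 5.0], dual_edges)
# updated == [2.0, 2.0, 5.0]: the isolated relation keeps its own feature.
```

Passing messages on this dual graph lets relationship features influence each other directly, rather than only through object nodes.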

3. Spatio-Temporal and Hierarchical Extensions

  • Dynamic SSGs: For video, multi-frame, or robotic settings, SSGs capture the temporal evolution of entities and relationships. Architectures such as SceneLLM incorporate video-to-language mapping, spatial aggregation, and optimal transport to encode spatio-temporal context into discrete scene tokens (Zhang et al., 2024).
  • Hierarchical SSGs: Explicit multilevel graphs (Floor–Room–Area–Object) enable logical, semantic, and retrieval operations aligned with human intent. Event-triggered updates and asynchronous processing maintain graph sparsity and temporal coherence during scene evolution (Fang et al., 13 Feb 2026).
  • Open-Set and Incremental Mapping: Online systems fuse current sensor data with global memory, using semantic and geometric matching, cross-modal embeddings (e.g., CLIP), and recursive update routines to scale to open environments and new object categories (Günther et al., 3 Feb 2026, Renz et al., 15 Sep 2025).
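The incremental-mapping idea can be sketched as a matching routine that merges a new detection into global memory when it is both geometrically close and semantically similar to an existing node, and otherwise adds a new node. This is a minimal sketch under stated assumptions: the function `integrate`, the dictionary layout, the running-mean update, and both thresholds are hypothetical, and the embeddings stand in for features such as CLIP vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def integrate(memory, detection, sim_threshold=0.8, max_distance=0.5):
    """Merge `detection` into the best-matching memory node, or append it."""
    best, best_sim = None, sim_threshold
    for node in memory:
        # Geometric gate: ignore nodes that are too far away.
        if math.dist(node["position"], detection["position"]) > max_distance:
            continue
        # Semantic gate: keep the most similar node above the threshold.
        sim = cosine(node["embedding"], detection["embedding"])
        if sim >= best_sim:
            best, best_sim = node, sim
    if best is None:
        memory.append({**detection, "count": 1})
        return memory[-1]
    # Running mean of embedding and position over all merged observations.
    n = best["count"]
    best["embedding"] = [(n * e + d) / (n + 1)
                         for e, d in zip(best["embedding"], detection["embedding"])]
    best["position"] = [(n * p + q) / (n + 1)
                        for p, q in zip(best["position"], detection["position"])]
    best["count"] = n + 1
    return best


memory = []
integrate(memory, {"embedding": [1.0, 0.0], "position": [0.0, 0.0, 0.0]})
integrate(memory, {"embedding": [0.9, 0.1], "position": [0.1, 0.0, 0.0]})  # merged
integrate(memory, {"embedding": [0.0, 1.0], "position": [0.1, 0.0, 0.0]})  # new: dissimilar
integrate(memory, {"embedding": [1.0, 0.0], "position": [5.0, 0.0, 0.0]})  # new: too far
```

After these four calls the memory holds three nodes, the first of which has absorbed two observations. Because no fixed label set is involved, an unseen object category simply becomes a new node.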

4. Evaluation Metrics and Benchmarking

SSG quality is assessed using:

5. Strengths, Challenges, and Limitations

Advantages

Challenges

Future Directions

6. Major Application Domains and Impact

SSGs are established as a foundational abstraction for:

  • Robotics and Embodied AI: Navigation, interaction, and adaptive planning via environment understanding and task-level reasoning (Hou et al., 26 Jul 2025, Wu et al., 2023, Günther et al., 3 Feb 2026, Kueble et al., 26 Mar 2026).
  • Surgical and Safety-Critical Environments: Automated monitoring, role prediction, and intelligent assistance in complex, multi-actor domains (e.g., operating rooms) (Özsoy et al., 2022).
  • Autonomous Driving and Traffic Scene Understanding: Topological abstraction and reasoning independent of raw coordinates or geometry; scenario-based validation (Zipfl et al., 2021).
  • Vision–Language and Multimodal Systems: Unified understanding of images, text, video, and 3D observations by merging all observed semantics and resolving ambiguities (Wu et al., 19 Mar 2025, Wang et al., 4 Mar 2026).
  • Generative Scene Modeling and Completion: Autoregressive, unconditional, and completion-based synthesis of novel, semantically grounded scenes (Garg et al., 2021).

7. Representative Methods, Architectures, and Benchmarks

| Reference | Domain / Input Modality | SSG Methodology | Key Innovations | Metric/Result Highlights |
|---|---|---|---|---|
| (Zhang et al., 2024) (SceneLLM) | Video | V2L + LLM + LoRA + OT | Implicit language reasoning, dynamic SGG | R@20 (SGCLS): 55.0% |
| (Özsoy et al., 2022) (4D-OR) | Surgery (OR, RGB-D, 3D) | End-to-end PointNet/GNN | Annotated 4D-OR, clinical role prediction | Macro-F1 rel: 0.75; role 0.85 |
| (Kim et al., 2023) (EdgeSGG) | Image (VG, OpenImages) | Dual-MPNN on edge-dual graph | Relation-centric context, long-tail handling | VG mR@50: 34.7 |
| (Hou et al., 26 Jul 2025) (FROSS) | RGB-D stream, 3D | 2D SGG + 3D Gaussian lifting | Latency ~7 ms/frame, ReplicaSSG benchmark | 3DSSG RelR: 27.9%, 144 FPS |
| (Günther et al., 3 Feb 2026) (Open Set) | RGB-D, 3D, open-set mapping | Incremental matching, CLIP features | SSG as backbone for whole mapping process | 30 Hz, real-world examples |
| (Wu et al., 19 Mar 2025) (USG) | Image/Text/Video/3D | Modular USG-Par, associator | Universal SSG, text-centric contrastive loss | PSG (Img) R@50: 46.4 |
| (Lv et al., 2023) (SGFormer) | 3D Point Cloud, 3DSSG | Graph Transformer + LLM injection | Global attention, zero-shot, long-tail gains | 3DSSG Rel R@50: 56.25 |
| (Khandelwal et al., 2021) (Segm-SGG) | Image (+aux segm), pix-level SSG | Multi-task, lingual similarity | Gaussian attention, mask refinement | mR@20 (VCTree): +12.6% |
| (Fang et al., 13 Feb 2026) (INHerit-SG) | RGB-D, 3D, navigation | 4-level hierarchy, RAG, event-update | Hard-soft filtering, LLM-guided retrieval | HM3DSem-SQR: best-in-class |
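
Several results above are reported as R@K (recall at K) and mR@K (mean recall at K, averaged per predicate class). The sketch below shows how these are commonly computed, in simplified form: it assumes a single scene and exact matching on entity identifiers, whereas real benchmarks typically also match localizations (for example by box overlap) and aggregate over a dataset. The function names are hypothetical.

```python
from collections import defaultdict


def top_k_triplets(predicted, k):
    """Set of the k highest-scoring (subj, predicate, obj) triplets."""
    ranked = sorted(predicted, key=lambda t: -t[3])[:k]
    return {t[:3] for t in ranked}


def recall_at_k(predicted, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    return len(top_k_triplets(predicted, k) & ground_truth) / len(ground_truth)


def mean_recall_at_k(predicted, ground_truth, k):
    """Recall computed per predicate class, then averaged over classes."""
    top_k = top_k_triplets(predicted, k)
    hits, totals = defaultdict(int), defaultdict(int)
    for triplet in ground_truth:
        totals[triplet[1]] += 1
        hits[triplet[1]] += triplet in top_k
    if not totals:
        return 0.0
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


ground_truth = {("cup", "on", "table"),
                ("book", "on", "table"),
                ("person", "holds", "cup")}
predicted = [("cup", "on", "table", 0.9),
             ("book", "on", "table", 0.8),
             ("person", "next to", "cup", 0.7),
             ("person", "holds", "cup", 0.3)]

r3 = recall_at_k(predicted, ground_truth, k=3)        # 2 of 3 triplets recovered
mr3 = mean_recall_at_k(predicted, ground_truth, k=3)  # "on": 2/2, "holds": 0/1
```

In this example R@3 is 2/3 while mR@3 is 0.5, because the missed "holds" relation counts as a whole class in the mean. This is why mR@K is the usual choice when reporting long-tail behavior.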

The SSG formalism continues to generalize and unify high-level semantic scene abstractions, supporting advances in open-world perception, compositional reasoning, and multi-agent/human–machine collaboration.
