
Dynamic Scene Graphs (DSGs)

Updated 3 December 2025
  • Dynamic Scene Graphs are structured, temporally evolving graphs that encode entities, spatial context, and dynamic interactions in both physical and simulated environments.
  • They are created through multi-modal data fusion pipelines involving detection, semantic parsing, and online graph update mechanisms for real-time adaptation.
  • Advancements in DSGs enable precise long-term forecasting, task planning, and embodied AI applications, enhancing human–robot interaction and dynamic scene reasoning.

Dynamic Scene Graphs (DSGs) are structured, temporally evolving, graph-based representations that explicitly model the entities, spatial structure, semantic types, and dynamic interactions within physical or simulated environments. DSGs capture both static and dynamic aspects of the scene, supporting fine-grained reasoning about changing environments in robotics, computer vision, trajectory prediction, and embodied AI. They generalize static scene graphs by introducing temporally indexed node and edge sets, temporal and causal relationships, and mechanisms for online graph update and adaptation. Recent advances position DSGs as a core substrate for sequential perception, task planning, human–robot interaction, simulation, and long-horizon prediction.

1. Mathematical Formalism and Core Structure

Let $G_t = (V_t, E_t, A_t)$ denote the dynamic scene graph at time $t$, where $V_t$ comprises nodes (e.g., places, objects, agents, rooms), $E_t$ encodes edges representing spatial, hierarchical, or spatio-temporal relations, and $A_t$ gives node and edge attributes (Olivastri et al., 5 Nov 2024, Rosinol et al., 2021, Rosinol et al., 2020). The graph typically manifests a layered or hierarchical structure:

  • Low-level/Mesh nodes: fine spatial sampling points (e.g., mesh vertices), supporting metric accuracy.
  • Object nodes: labeled entities, each with pose, bounding box, class label (static or dynamic).
  • Agent nodes: humans/robots, often parameterized by pose-graphs and temporal trajectories.
  • Place/room/building nodes: abstracted spatial regions for hierarchical querying and planning (Rosinol et al., 2021, Rosinol et al., 2020).

Temporal evolution is modeled by indexed sequences $\{G_t\}_{t=1}^{T}$. Edges $(v_i \to v_j)$ are annotated by predicates (e.g., inclusion, adjacency, support, traversability, temporal association). Temporal links often connect agent states at different times (e.g., $x_v^t \to x_v^{t+1}$), enabling action tracklet formation and dynamic relationship reasoning (Ruschel et al., 3 Dec 2024).

Node and edge attributes can include semantic labels, 3D positions, shape representations, visual-language embeddings, decay rates, and action/event histories. The DSG supports high-order relations via composite edges or graph tensors $\mathcal{A}$ (Gorlo et al., 1 May 2024).
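
A minimal sketch of this formalism as a Python data structure is given below; the class and field names are illustrative assumptions, not interfaces from the cited systems.

from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    layer: str                     # "mesh" | "object" | "agent" | "place" | "room" | "building"
    attributes: dict = field(default_factory=dict)   # pose, class label, embedding, decay rate, ...

@dataclass
class Edge:
    source: str
    target: str
    predicate: str                 # e.g., "inside", "adjacent_to", "supports", "temporal_link"
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraphSnapshot:
    # One time-indexed snapshot G_t = (V_t, E_t, A_t); attributes live on nodes and edges.
    t: int
    nodes: dict = field(default_factory=dict)        # node_id -> Node
    edges: list = field(default_factory=list)        # list of Edge

# The DSG itself is the indexed sequence {G_t}_{t=1..T}.
dsg = []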

2. Algorithms for DSG Construction and Maintenance

DSGs are constructed from multi-modal sensory data (RGB(-D) video, depth, IMU, language, action logs) via staged pipelines. Key steps include detection, semantic parsing, feature aggregation, data association, and dynamic update:

A representative high-level pipeline, in Python-style pseudocode (helper names are illustrative), is:

for t in time_steps:
    obs = gather_inputs(t)                       # multi-modal inputs: perception, human text, actions, temporal priors
    detections = detect_and_segment(obs)         # detection and semantic segmentation
    matches = associate(detections, dsg.nodes)   # data association via similarity metrics
    reports = diff_changes(matches, dsg)         # change reports: add / move / delete
    dsg.apply(reports)                           # graph-editing primitives
    dsg.recompute_edges(reports.changed_nodes)   # local edge recomputation around changed nodes
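
As a concrete instance of the association step, a minimal nearest-neighbor matcher over node embeddings is sketched below; this is an assumed illustration (cosine similarity with a fixed threshold), not the exact criterion used in the cited systems.

import numpy as np

def associate(det_embeddings, node_embeddings, threshold=0.8):
    # Match detections to existing DSG nodes by cosine similarity.
    # det_embeddings:  (D, k) array, one embedding per detection
    # node_embeddings: (N, k) array, one embedding per existing node
    # Returns (detection_index, node_index or None) pairs; None marks
    # a detection that should spawn a new node.
    det = det_embeddings / np.linalg.norm(det_embeddings, axis=1, keepdims=True)
    nod = node_embeddings / np.linalg.norm(node_embeddings, axis=1, keepdims=True)
    sim = det @ nod.T                            # (D, N) cosine similarities
    matches = []
    for i, row in enumerate(sim):
        j = int(np.argmax(row))
        matches.append((i, j if row[j] >= threshold else None))
    return matches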

3. Temporal and Spatio-Semantic Reasoning

DSGs enable explicit temporal linking of object/agent identities and inter-object relationships, resolving fundamental challenges in dynamic visual reasoning such as maintaining consistent identities across frames.

The complexity of dynamic scene understanding is decomposed into entity detection, relationship inference, association/matching, update, and planning modules, each grounded in well-defined mathematical operations.
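
A common building block for such temporal association, shown here as an illustrative sketch rather than a method from any single cited paper, is minimum-cost bipartite matching between the node sets of consecutive snapshots:

import numpy as np
from scipy.optimize import linear_sum_assignment

def temporal_link(prev_feats, curr_feats, max_cost=0.5):
    # Link nodes of G_{t-1} to nodes of G_t by minimum-cost matching.
    # prev_feats: (N, k) and curr_feats: (M, k) feature arrays
    # (e.g., position concatenated with a semantic embedding).
    cost = np.linalg.norm(prev_feats[:, None, :] - curr_feats[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    # Reject matches whose cost exceeds max_cost (treated as appearances/disappearances).
    return {int(r): int(c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost}

Chaining accepted links across successive time steps yields action tracklets and persistent entity identities.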

4. Embodiment, Planning, and Real-World Deployment

DSGs serve as a substrate for high-level reasoning and planning in robotics and embodied AI:

  • Language-guided task planning: DSG structure enables grounding of abstract subtask plans (e.g., "pick the cup on the table") by matching semantic embeddings to DSG nodes and exploiting spatial relations (Yan et al., 15 Oct 2024); a minimal grounding sketch follows this list.
  • Efficient retrieval and subgraph extraction: To mitigate LLM context and inference costs, retrieved subgraphs adapt to task relevance and environment changes, leveraging vector-store indexing and retrieval-augmented pipelines (Booker et al., 31 Oct 2024).
  • Real-time, multi-agent simulation: DES-fused DSGs allow simulation of complex, partially observed, stochastic environments for benchmarking and training of embodied agents (Ohnemus et al., 10 Oct 2025).
  • Long-term autonomy and memory management: DSGs enable selective forgetting/pruning, hierarchical compression, and symbolic summarization (Rosinol et al., 2021, Rosinol et al., 2020).
  • Hierarchical path-planning and collision checking: Multilevel DSG structure accelerates global and local planning in large-scale environments via coarse-to-fine search and collision refinement (Rosinol et al., 2021, Rosinol et al., 2020).
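
The grounding step referenced above can be sketched as embedding-similarity retrieval over DSG object nodes; the function below is a hedged illustration with assumed inputs (any real system would use its own visual-language encoder and DSG API).

import numpy as np

def ground_query(query_embedding, dsg_nodes, top_k=3):
    # Rank DSG object nodes by cosine similarity to a language query,
    # e.g., the embedding of "the cup on the table".
    # dsg_nodes: list of (node_id, embedding) pairs for candidate objects.
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = [(nid, float(emb @ q) / float(np.linalg.norm(emb))) for nid, emb in dsg_nodes]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]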

This integration of DSGs with LLM-driven task planning, robotic perception, simulation engines, and reasoning tools supports scalable applications in manipulation, navigation, human interaction, and surveillance.

5. Benchmark Tasks, Metrics, and Empirical Results

The evaluation of DSG-related methods is carried out on tasks such as dynamic scene graph generation with temporally consistent tracklets, scene graph forecasting, language-guided manipulation and navigation, and 3D semantic segmentation (see the table below).

Typical quantitative metrics include Recall@K, mean Recall (long-tail), temporal Recall@K for action tracklets, negative log-likelihood and displacement error for trajectories, semantic segmentation accuracy (mIoU), memory/latency, and qualitative visualizations.
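
For concreteness, Recall@K scores the fraction of ground-truth relationship triplets recovered among the top-K predictions; a minimal sketch (per frame or per video, depending on the benchmark) is:

def recall_at_k(predicted_triplets, gt_triplets, k=50):
    # predicted_triplets: (subject, predicate, object) tuples sorted by score, best first.
    # gt_triplets: collection of ground-truth triplets for the same frame/video.
    top_k = set(predicted_triplets[:k])
    return len(top_k & set(gt_triplets)) / max(len(gt_triplets), 1)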

Table: Empirical Results on Action Genome (Selected Tasks)

| Method | Task | R@50 (%) | tR@50 (%) | mIoU (%) | Task Success (%) |
|---|---|---|---|---|---|
| TCDSG (Ruschel et al., 3 Dec 2024) | DSGG Tracklets | 47.8 | 30.2 | – | – |
| FDSG (Yang et al., 2 Jun 2025) | SG Forecast | 49.8 | – | – | – |
| DovSG (Yan et al., 15 Oct 2024) | Manip. (Nav.) | – | – | – | 95.1 |
| DynamicGSG (Ge et al., 21 Feb 2025) | 3D Segmentation | – | – | 31.1 | 88.8† |

†: Environment adaptation success rate (lab), as defined in (Ge et al., 21 Feb 2025).

Advances in one-stage set-prediction (Wang et al., 27 May 2024, Yang et al., 2 Jun 2025), temporally consistent matching (Ruschel et al., 3 Dec 2024), and multi-modal fusion (Olivastri et al., 5 Nov 2024, Yan et al., 15 Oct 2024) consistently yield state-of-the-art performance across dynamic video understanding, embodied robotic manipulation, and environment forecasting tasks.

6. Limitations, Open Challenges, and Future Directions

Current research on DSGs identifies several limitations and directions for extension:

  • Perceptual bottlenecks: Scene graph update reliability is limited by detection failures in RGB-D, particularly for small or occluded objects (Olivastri et al., 5 Nov 2024).
  • Learning/fusion of multimodal confidence: Most present systems rely on rule-based multimodal fusion; learning-based calibration and temporal persistence modeling are nascent areas (Olivastri et al., 5 Nov 2024).
  • Dataset and annotation gaps: Large-scale, richly annotated datasets with persistent object/agent IDs, human–robot dialogs, and fine-grained spatio-temporal relations remain few, though efforts such as MEVA augmentation are progressing (Ruschel et al., 3 Dec 2024).
  • Scalability and efficiency: LLM-based planners face context bottlenecks with large or complex DSGs; retrieval-augmented and local-update mechanisms partially alleviate this (Booker et al., 31 Oct 2024, Yan et al., 15 Oct 2024).
  • Semantics–geometry coupling: Integrating symbolic reasoning (language, affordances, intent) with high-fidelity geometric adaptation (e.g., differentiable Gaussians) to support robust lifelong learning and planning remains an open problem (Ge et al., 21 Feb 2025).
  • Theoretical understanding of temporal dynamics: Temporal modeling decisions (delta supervision, SDE dynamics, causal relation propagation) merit further interpretability analysis and learning-theoretic study (Wang et al., 2023, Yang et al., 2 Jun 2025).
  • Online, real-robot validation: Many multimodal update and perception pipelines are still validated primarily in simulation and await robust real-world deployment (Olivastri et al., 5 Nov 2024, Ohnemus et al., 10 Oct 2025).

Planned directions include integrating time-driven active reobservation, learned modality fusion, richer hierarchical structures, and deployment in shared human–robot environments (Olivastri et al., 5 Nov 2024, Ohnemus et al., 10 Oct 2025).

7. Comparative Perspectives and Research Landscape

DSGs integrate concepts from geometric SLAM, static scene graphs, temporal action localization, and symbolic AI. Approaches differ chiefly in how they structure the graph, how they update it over time, and how they balance learned and classical components.

The field is rapidly converging on highly structured, incrementally updated DSGs as the substrate for large-scale, temporally consistent perception, reasoning, and action, supported by both deep learning and classical algorithmic tools. The integration of open-vocabulary, geometric, and multi-modal cues is a distinctive feature of current state-of-the-art systems.
