Traffic Scene Graphs for Autonomous Driving
- Traffic scene graphs are structured graph-based representations that model dynamic traffic scenarios by encoding entities and their spatial, semantic, and temporal relationships.
- They are constructed using multi-sensor data fusion and map-based extraction, enabling precise trajectory prediction and robust scene understanding in autonomous driving.
- They support applications such as behavior analysis, synthetic scene generation, and HD map construction, thereby enhancing explainability and real-time decision-making.
A traffic scene graph is a structured, graph-based representation of a dynamic traffic scenario, in which entities (such as vehicles, pedestrians, cyclists, lanes, and traffic infrastructure) are modeled as nodes and their spatial, semantic, or temporal relationships are encoded as edges. This abstraction provides a unified, machine-readable representation for downstream machine learning tasks in autonomous driving, including trajectory prediction, scene understanding, behavior analysis, similarity retrieval, synthetic scene generation, and high-definition map construction. The traffic scene graph framework is foundational in recent research addressing the complexity, diversity, and explainability requirements of autonomous systems.
1. Formal Definitions and Taxonomies
Traffic scene graphs are generally formalized as directed, attributed, and often heterogeneous graphs $\mathcal{G} = (\mathcal{V}, \mathcal{E}, X, R, \tau, \rho)$, with
- $\mathcal{V}$: nodes representing traffic participants, infrastructure, or map elements;
- $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$: directed edges corresponding to pairwise relations;
- $X \in \mathbb{R}^{|\mathcal{V}| \times d_v}$: node feature matrix (position, velocity, type, etc.);
- $R \in \mathbb{R}^{|\mathcal{E}| \times d_e}$: edge feature matrix (relation type, distance, probabilities, etc.);
- $\tau: \mathcal{V} \to T_V$: node-type mapping (e.g., vehicle, pedestrian, lane, crosswalk, light, stop) (Monninger et al., 2023, Mlodzian et al., 2023, Sun et al., 30 Apr 2024);
- $\rho: \mathcal{E} \to T_E$: edge relation mapping, e.g., follows, lateral, intersects, isOnMapElement, controls (Mlodzian et al., 2023, Zipfl et al., 2022, Sun et al., 30 Apr 2024).
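The tuple above maps directly onto a simple data structure. Below is a minimal Python sketch using plain dataclasses and NumPy; all field names and the two-node example are illustrative, not drawn from any of the cited frameworks:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrafficSceneGraph:
    """Directed, attributed, heterogeneous scene graph G = (V, E, X, R, tau, rho)."""
    num_nodes: int
    edge_index: np.ndarray      # shape (2, |E|): directed (source, target) pairs
    node_features: np.ndarray   # X, shape (|V|, d_v): position, velocity, ...
    edge_features: np.ndarray   # R, shape (|E|, d_e): relation type, distance, ...
    node_types: list[str]       # tau: e.g. "vehicle", "pedestrian", "lane"
    edge_types: list[str]       # rho: e.g. "follows", "isOnMapElement"

# Hypothetical two-node example: a vehicle assigned to a lane.
g = TrafficSceneGraph(
    num_nodes=2,
    edge_index=np.array([[0], [1]]),                # vehicle 0 -> lane 1
    node_features=np.array([[12.3, 4.5, 8.9, 0.0],  # x, y, speed, heading
                            [0.0, 0.0, 0.0, 0.0]]),
    edge_features=np.array([[0.97]]),               # e.g. lane-assignment probability
    node_types=["vehicle", "lane"],
    edge_types=["isOnMapElement"],
)
```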
Edge semantics are tightly coupled to domain knowledge: topological (e.g., lane adjacency; longitudinal, lateral, or intersecting relations), behavioral (e.g., following/overtaking, yield/go/ignore), and infrastructure-based (e.g., a traffic light controls a lane, a stop sign causes a stop at an area) (Monninger et al., 2023, Mlodzian et al., 2023, Kumar et al., 2020). Heterogeneous graphs explicitly encode multiple node and edge types and support high-fidelity representation of both dynamic context (agents’ positions, velocities) and static map context (lanes, connectors, signals) (Meyer et al., 2023, Sun et al., 30 Apr 2024).
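As a toy illustration of how such domain predicates might be realized, the following sketch types an actor–actor edge from Frenet-frame coordinates; the thresholds and inputs are assumptions for illustration, not values from the cited works:

```python
def classify_pair_relation(s_i, d_i, s_j, d_j, same_lane, lanes_intersect,
                           lateral_gap_max=3.5):
    """Toy rule-based typing of an actor-actor edge from Frenet coordinates.

    s: longitudinal position along the lane, d: signed lateral offset.
    Thresholds are illustrative only.
    """
    if lanes_intersect:
        return "intersecting"
    if same_lane:
        return "following" if s_i < s_j else "leading"
    if abs(d_i - d_j) <= lateral_gap_max:
        return "lateral"
    return "unrelated"

# Vehicle i is 7 m behind j in the same lane -> "following".
print(classify_pair_relation(5.0, 0.2, 12.0, -0.1,
                             same_lane=True, lanes_intersect=False))
```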
Prominent taxonomies—such as those underpinning the nuScenes Knowledge Graph (nSKG) (Mlodzian et al., 2023) and SemanticFormer (Sun et al., 30 Apr 2024)—define 20+ node types and 30+ relation types, enabling detailed cross-layer semantic reasoning.
2. Construction Methodologies
The construction pipeline for a traffic scene graph depends on input modality and representation scope.
A. Sensor/Perception Data Extraction
- Dynamic nodes (agents) are extracted from multi-sensor fusion (LiDAR, radar, camera), tracking histories, and inference of current kinematics (Monninger et al., 2023, Tian et al., 2020).
- Static nodes (lanes, crosswalks, signals) originate from processed HD maps, map polylines, or synthetic map generation (Mlodzian et al., 2023, Sun et al., 30 Apr 2024, Lv et al., 28 Nov 2024).
- Agent nodes (vehicles and other actors) are populated with geometric and motion features, optionally including past/future trajectory histories (Zipfl et al., 2022, Wang et al., 16 Apr 2024).
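A hedged sketch of the node-population step from the last bullet, assembling one agent's feature vector from a tracked state; the `track` dictionary layout is hypothetical, not a dataset API:

```python
import numpy as np

def agent_node_features(track):
    """Build a node feature vector from one tracked agent's state."""
    x, y = track["position"]
    vx, vy = track["velocity"]
    length, width = track["extent"]
    return np.concatenate([
        [x, y, vx, vy, track["heading"], length, width],
        track["class_onehot"],        # e.g. [vehicle, pedestrian, cyclist]
        track["history_xy"].ravel(),  # optional past trajectory, flattened
    ])

# Hypothetical tracked state: 7 kinematic + 3 class + 10 history dims = 20.
track = {"position": (12.3, 4.5), "velocity": (8.9, 0.1), "heading": 0.02,
         "extent": (4.6, 1.9), "class_onehot": np.array([1, 0, 0]),
         "history_xy": np.zeros((5, 2))}
feat = agent_node_features(track)
```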
B. Edge and Relation Extraction
- Map-based topological search (graph traversal, Dijkstra/A*) is used to determine semantic actor–actor relationships, employing Frenet or lane adjacency criteria (Zipfl et al., 2021, Zipfl et al., 2022).
- Probabilistic assignment of participants to lanes (using distance and orientation likelihoods) enables computation of relation probabilities (Zipfl et al., 2021, Zipfl et al., 2022); see the sketch after this list.
- For traffic infrastructure, logical and geometric predicates infer relations such as controls, signals, stops, overlaps, and road connectivity (Monninger et al., 2023, Mlodzian et al., 2023, Lv et al., 28 Nov 2024).
- In synthetic or simulation contexts, nodes and edges encode additional data: spatial grid positions, masks, occlusion classes, and depth ordering (Savkin et al., 2023).
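The probabilistic lane assignment referenced above can be sketched as independent Gaussian likelihoods over distance and heading error, normalized across candidate lanes; the sigma values are illustrative assumptions, not parameters from the cited papers:

```python
import numpy as np

def lane_assignment_probs(dists, heading_errs, sigma_d=1.5, sigma_h=0.4):
    """Soft assignment of one actor to candidate lanes.

    dists: lateral distances (m) to each candidate lane centerline.
    heading_errs: heading differences (rad) to each lane direction.
    Combines independent Gaussian likelihoods and normalizes.
    """
    dists = np.asarray(dists, dtype=float)
    heading_errs = np.asarray(heading_errs, dtype=float)
    lik = np.exp(-0.5 * (dists / sigma_d) ** 2) \
        * np.exp(-0.5 * (heading_errs / sigma_h) ** 2)
    return lik / lik.sum()

# Example: two candidate lanes; the nearer, better-aligned lane dominates.
print(lane_assignment_probs([0.3, 2.8], [0.05, 0.6]))
```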
C. Engineering and Serialization
- Scene graphs are serialized using adjacency and attribute matrices (COO format, PyTorch Geometric HeteroData) or exported as RDF/OWL-based triples in knowledge-graph frameworks for large-scale data sharing (Mlodzian et al., 2023, Meyer et al., 2023); a serialization sketch follows this list.
- Task-specific subgraphs can be extracted dynamically for model input (neighborhood pruning, anchor selection, temporal slicing) (Wang et al., 16 Apr 2024, Grimm et al., 2023).
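A minimal serialization sketch with PyTorch Geometric's `HeteroData`, as referenced in the first bullet; the node/edge type names and feature sizes are illustrative, not a fixed schema:

```python
import torch
from torch_geometric.data import HeteroData  # pip install torch_geometric

data = HeteroData()
data["vehicle"].x = torch.randn(3, 6)   # 3 vehicles, 6 features each
data["lane"].x = torch.randn(4, 8)      # 4 lane segments, 8 features each

# COO edge indices per relation type: row 0 = source nodes, row 1 = targets.
data["vehicle", "follows", "vehicle"].edge_index = torch.tensor([[0, 1],
                                                                 [1, 2]])
data["vehicle", "isOnMapElement", "lane"].edge_index = torch.tensor([[0, 1, 2],
                                                                     [0, 0, 3]])
data["vehicle", "isOnMapElement", "lane"].edge_attr = torch.tensor([[0.97],
                                                                    [0.88],
                                                                    [0.75]])

torch.save(data, "scene_0001.pt")       # simple on-disk serialization
```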
3. Algorithmic Processing and Learning Architectures
Traffic scene graphs underpin a spectrum of deep learning architectures, leveraging both classical and advanced GNN designs:
A. Message Passing Neural Networks (MPNNs) and Graph Attention
- MPNN frameworks update node embeddings via neighbor aggregation and edge-feature transformation, encoding both local and relational semantics (Zipfl et al., 2022, Wang et al., 16 Apr 2024); a minimal layer of this kind is sketched after this list.
- Heterogeneous Graph Attention (HetEdgeGAT, HAN) explicitly fuses information across multiple node and relation types, with meta-path attention supporting high-level reasoning over allowed maneuvers (e.g., lane-changes, permitted connectors) (Monninger et al., 2023, Sun et al., 30 Apr 2024).
- High-order aggregation methods (variance, moments, median) further refine context extraction in evolving temporal graphs (Humnabadkar et al., 17 Sep 2024).
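A minimal message-passing layer of the kind described in the first bullet, written in plain PyTorch; it is a generic sketch of the edge-conditioned MPNN pattern, not a re-implementation of any cited model:

```python
import torch
import torch.nn as nn

class EdgeConditionedMPNNLayer(nn.Module):
    """One message-passing step: messages computed from (sender state,
    edge feature), sum-aggregated at the receiver, then fused with the
    receiver's previous state."""

    def __init__(self, node_dim, edge_dim, hidden_dim):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(node_dim + edge_dim, hidden_dim), nn.ReLU())
        self.upd = nn.Sequential(nn.Linear(node_dim + hidden_dim, node_dim), nn.ReLU())

    def forward(self, x, edge_index, edge_attr):
        src, dst = edge_index                      # each of shape (|E|,)
        m = self.msg(torch.cat([x[src], edge_attr], dim=-1))
        agg = torch.zeros(x.size(0), m.size(-1), device=x.device)
        agg.index_add_(0, dst, m)                  # sum-aggregate per receiver
        return self.upd(torch.cat([x, agg], dim=-1))

# Toy usage: 4 nodes, 3 directed edges.
x = torch.randn(4, 6)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
edge_attr = torch.randn(3, 2)
x_new = EdgeConditionedMPNNLayer(6, 2, 16)(x, edge_index, edge_attr)
```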
B. Hybrid Architectures: Transformers, GCNs, and Multimodal Pipelines
- Scene graph node embeddings are often processed with spatial and temporal modules (e.g., Temporal Transformers, bidirectional LSTMs) to capture temporal dependencies in prediction and classification (Wu et al., 2023, Lohner et al., 8 Jul 2024); see the sketch after this list.
- For collaborative decision-making, scene graph outputs are fused with occupancy grid representations using Transformer encoders, then integrated into multi-agent MDP/RL frameworks (Hu et al., 3 Nov 2024).
- Multimodal pipelines align graph representations with vision and language using contrastive embedding spaces, augmenting visual-linguistic perception in anomaly detection and accident understanding (Lohner et al., 8 Jul 2024).
- Generative models conditioned on scene graphs are applied to synthetic data generation and downstream photo-realistic image synthesis (Savkin et al., 2023).
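A sketch of the "spatial GNN + temporal module" pattern from the first bullet: a sequence of per-timestep scene-graph embeddings is encoded with a standard Transformer. Dimensions, pooling, and the choice of the last token are illustrative assumptions:

```python
import torch
import torch.nn as nn

T, D = 10, 64                          # 10 timesteps, 64-dim scene embedding
scene_seq = torch.randn(1, T, D)       # (batch, time, dim), e.g. pooled GNN outputs

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
temporal_context = encoder(scene_seq)  # (1, T, D) temporally contextualized
prediction_token = temporal_context[:, -1]  # last step summarizes the history
```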
C. Training Objectives and Evaluation
- Embedding learning employs contrastive (triplet, NT-Xent) and self-supervised losses to capture scene similarity and support clustering (Zipfl et al., 2023, Zipfl et al., 2022); see the sketch after this list.
- Reconstruction and downstream tasks use regression or classification losses on node/trajectory targets or on edge existence/type, and are commonly evaluated with precision/recall, average and final displacement error (ADE/FDE), and clustering metrics such as silhouette score and triplet accuracy (Zipfl et al., 2023, Mlodzian et al., 2023, Sun et al., 30 Apr 2024).
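Hedged sketches of two items above: an NT-Xent contrastive loss over scene embeddings (a standard SimCLR-style formulation) and the ADE/FDE trajectory metrics; the temperature and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss over two views of N scene embeddings."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d)
    sim = (z @ z.t()) / tau                             # scaled cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)                # positives: the paired views

def ade_fde(pred, gt):
    """Average / final displacement error for trajectories shaped (N, T, 2)."""
    dist = (pred - gt).norm(dim=-1)                     # (N, T) Euclidean errors
    return dist.mean().item(), dist[:, -1].mean().item()

# Toy usage with random tensors.
loss = nt_xent(torch.randn(8, 32), torch.randn(8, 32))
ade, fde = ade_fde(torch.randn(8, 12, 2), torch.randn(8, 12, 2))
```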
4. Application Domains and Practical Impact
Traffic scene graphs furnish a common abstraction for a range of high-impact autonomous vehicle tasks:
- Trajectory and Behavior Prediction: Heterogeneous traffic scene graphs enable state-of-the-art forecasting accuracy and provide interpretable reasoning via explicit relation semantics (e.g., “vehicle i yields to vehicle j”) (Kumar et al., 2020, Wu et al., 2023, Grimm et al., 2023, Zipfl et al., 2022, Sun et al., 30 Apr 2024). Structured context (agents, lanes, anchors) improves uncertainty modeling and reduces off-road rate (Grimm et al., 2023).
- Scenario Clustering and Test Space Reduction: Embedding and clustering of scene graphs supports scenario-based test case reduction, identifying representative, non-redundant traffic situations for validation of automated driving systems (Zipfl et al., 2023, Zipfl et al., 2022). Clustered traffic situations correspond to interpretable traffic patterns (e.g., short queues, platoons).
- Synthetic Data Generation: Graph-conditioned generative models synthesize realistic images or semantic layouts, supporting domain-invariant simulation and data augmentation (Savkin et al., 2023).
- Scene Understanding and Accident Analysis: Spatio-temporal scene graphs facilitate accident classification, risk detection, and accident sequence understanding through multi-modal learning and graph-based reasoning (Lohner et al., 8 Jul 2024).
- Topology Reasoning and HD Map Construction: Scene graphs that incorporate explicit lane topology (Traffic Topology Scene Graph: T²SG) provide strong performance for map building and topology reasoning, leveraging dedicated transformer modules for geometry-guided attention and causal interventions (Lv et al., 28 Nov 2024).
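As a toy companion to the lane-topology bullet above, the following sketch infers directed lane-successor edges purely from centerline endpoint proximity; it is a geometric illustration under an assumed gap threshold, not the T²SG/TopoFormer procedure itself:

```python
import numpy as np

def lane_successor_edges(centerlines, gap_max=0.5):
    """Infer directed successor edges between lane centerlines.

    centerlines: list of (K_i, 2) waypoint arrays. Lane j succeeds lane i
    when i's endpoint lies within gap_max meters of j's start point.
    """
    edges = []
    for i, a in enumerate(centerlines):
        for j, b in enumerate(centerlines):
            if i != j and np.linalg.norm(a[-1] - b[0]) < gap_max:
                edges.append((i, j))
    return edges

# Two lanes laid end to end: lane 0's endpoint meets lane 1's start.
c0 = np.array([[0.0, 0.0], [10.0, 0.0]])
c1 = np.array([[10.2, 0.0], [20.0, 0.0]])
print(lane_successor_edges([c0, c1]))  # -> [(0, 1)]
```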
5. Limitations, Challenges, and Future Directions
Notwithstanding the rapid progress, several open problems and limitations remain:
- Scalability and Complexity: Large scene graphs (on the order of thousands of nodes or tens of thousands of edges) challenge both memory and convergence, particularly in meta-path attention and heterogeneous graph transformers (Mlodzian et al., 2023, Sun et al., 30 Apr 2024). Overfitting and stability are recurrent themes in ablation studies.
- Semantic Coverage: Many approaches omit certain elements (road geometry, static infrastructure, rich motion cues) due to annotation or modeling complexity (Zipfl et al., 2023, Zipfl et al., 2022, Grimm et al., 2023). The inclusion of more comprehensive infrastructure (signs, traffic lights, temporal signal phases) is vital for broader context capture (Sun et al., 30 Apr 2024, Mlodzian et al., 2023).
- Dynamic and Temporal Reasoning: The majority of models focus on per-timestep snapshots; integration of spatio-temporal graphs, recurrent architectures, or evolving graphs is an active area for capturing maneuvers and longer-term scene dynamics (Meyer et al., 2023, Wu et al., 2023, Humnabadkar et al., 17 Sep 2024, Zipfl et al., 2022).
- Data and Annotation: Current datasets are limited in scope (number of annotated frames, rare event inclusion, scene diversity), constraining the generalizability and transferability of learned models (Tian et al., 2020).
- Explainability and Interpretation: While scene graphs improve interpretability by design, further elaboration of causal, temporal, and counterfactual reasoning capabilities remains a focus area (e.g., via meta-paths and explicit edge-mode inference) (Sun et al., 30 Apr 2024, Lv et al., 28 Nov 2024, Kumar et al., 2020).
- Fusion with Other Modalities: Although initial efforts show promise in aligning scene graph embeddings with visual and language modalities, further exploration of hybrid and end-to-end models is ongoing (Lohner et al., 8 Jul 2024, Savkin et al., 2023).
6. Benchmarking, Standardization, and Reproducibility
Several benchmarks and open-source frameworks have emerged:
| Name / Paper | Graph Types | Node Types | Edge Types | Dataset | Availability |
|---|---|---|---|---|---|
| nSKG (Mlodzian et al., 2023) | Heterogeneous KG | 20+ | 30+ semantic/map/temporal | nuScenes | Released |
| CommonRoad-Geometric (Meyer et al., 2023) | Heterogeneous | vehicles, lanes | v2v, v2l, l2l, l2v, vtv | CommonRoad/NuPlan | Released |
| Road Scene Graph (Tian et al., 2020) | Multigraph | 4–8 | 8–12, incl. kinematic, signal | nuScenes, CARLA | Released |
| SCENE (Monninger et al., 2023) | Ontology-directed | agents, lanes,… | agent-agent, agent-lane,… | In-house | Proprietary |
| T²SG / TopoFormer (Lv et al., 28 Nov 2024) | Lane topology | lanes | adjacency, signal-control, etc. | OpenLane-V2 | Pending |
Standardized datasets and public codebases—together with PyTorch-Geometric or similar libraries—have underpinned rapid progress and reproducibility. Datasets in this domain typically range from hundreds to tens of thousands of scenes, containing up to thousands of nodes and edges per graph (Mlodzian et al., 2023, Meyer et al., 2023).
In conclusion, the traffic scene graph is a foundational data structure for real-time understanding, reasoning, and generation of complex driving environments. Its evolution aligns with advances in graph representation learning, self-supervised and contrastive pre-training, meta-path reasoning, and interpretable modeling, positioning it as a critical abstraction for safe and explainable automated driving systems (Monninger et al., 2023, Mlodzian et al., 2023, Meyer et al., 2023, Sun et al., 30 Apr 2024, Lv et al., 28 Nov 2024).