
NuScenes Knowledge Graph (nSKG)

Updated 13 January 2026
  • NuScenes Knowledge Graph (nSKG) is a comprehensive semantic graph-based representation of traffic scenes that integrates entities, spatial relationships, and temporal dynamics for autonomous driving.
  • It employs a detailed OWL-DL ontology and systematic graph transformation pipeline to formalize scene participants, map elements, and temporal contexts.
  • nSKG supports scalable trajectory prediction and scene understanding through seamless integration with graph neural networks and transformer-based models.

The NuScenes Knowledge Graph (nSKG) is a comprehensive semantic graph-based representation of traffic scenes designed for downstream trajectory prediction and scene understanding in autonomous driving. Constructed from the nuScenes dataset, nSKG formalizes all scene participants and road elements along with their semantic and spatial relationships, supporting symbolic reasoning, graph neural network (GNN) processing, and language-based foundation models. It is structured around a rich OWL-DL ontology covering a detailed hierarchy of agents, map elements, and temporal containers, with explicit schema, construction pipelines, and serialization protocols facilitating integration into contemporary GNN and transformer architectures (Mlodzian et al., 2023, Zhou et al., 24 Mar 2025).

1. Formal Structure and Ontological Schema

nSKG is built as a typed, attributed, directed graph $G = (V, E, X^V, X^E, R)$, where:

  • $V$ is the set of scene entities (lanes, agents, traffic lights, etc.), each a node in the ontology.
  • $E \subset V \times V$ is the set of directed edges capturing semantic and spatial relationships.
  • $R = \{r_1, \ldots, r_k\}$ denotes the finite set of relation (edge) types.
  • $X^V \in \mathbb{R}^{|V| \times d_v}$ and $X^E \in \mathbb{R}^{|E| \times d_e}$ are node and edge feature matrices embedding attributes (geometry, orientation, etc.).
  • Data is represented and stored as RDF triples (subject–predicate–object) with numeric properties (positions, widths, lengths) and object-properties for linkage.
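As a hedged illustration of this triple encoding (the released ontology defines its own IRIs; the namespace and the isLaneSliceOf property below are hypothetical, while laneSliceHasWidth appears in the schema), a lane and one of its slices could be asserted with rdflib as follows:

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Hypothetical namespace; the released ontology defines its own IRIs.
NSKG = Namespace("http://example.org/nskg#")

g = Graph()
lane = NSKG["lane_042"]
lane_slice = NSKG["laneSlice_042_007"]

# Typed nodes (elements of V) as RDF class assertions.
g.add((lane, RDF.type, NSKG.Lane))
g.add((lane_slice, RDF.type, NSKG.LaneSlice))

# Object property linking entities (an edge in E; isLaneSliceOf is hypothetical).
g.add((lane_slice, NSKG.isLaneSliceOf, lane))

# Data-type property carrying a numeric attribute (an entry of X^V).
g.add((lane_slice, NSKG.laneSliceHasWidth, Literal(3.6, datatype=XSD.float)))

print(g.serialize(format="turtle"))
```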

The OWL-DL ontology adopts the SROIQ(D) profile and defines 42 classes, 10 object properties, and 24 data-type properties. The top-level modules are:

  • Agent Module—Vehicle (Car, Truck, Bus, EmergencyPolice, Trailer, ConstructionVehicle), Human (Adult, Child, ConstructionWorker), Micro-mobility (Bicycle, Motorcycle, Stroller, PushablePullable), Static Obstacle (Barrier, TrafficCone, Debris).
  • Map Module—RoadTopology (RoadSegment, Lane, LaneSnippet, LaneSlice), Intersection, PedestrianCrossing, Walkway, TurnStopArea, CarParkArea, TrafficLightStopArea, PedCrossStopArea, StopSignArea.
  • Scene Module—encodes temporal context, scene state, timestamp, scene participants and linkage to previous/next scenes.

Ontology-level reasoning axioms (e.g., transitivity, reflexivity) allow derived memberships and states to be inferred (e.g., a LaneSlice inherits membership in its Lane; Scene objects derive from previous states) (Mlodzian et al., 2023, Zhou et al., 24 Mar 2025).
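A minimal owlready2 sketch of this schema style is shown below; the class and property names follow the paper, but the axioms are abbreviated, and declaring hasNextScene transitive is an assumption for illustration:

```python
from owlready2 import Thing, TransitiveProperty, get_ontology

onto = get_ontology("http://example.org/nskg.owl")  # hypothetical IRI

with onto:
    # Agent module (abbreviated hierarchy).
    class Agent(Thing): pass
    class Vehicle(Agent): pass
    class Car(Vehicle): pass

    # Scene module with temporal linkage.
    class Scene(Thing): pass

    # Declaring the scene-chaining property transitive (an assumption here)
    # lets a reasoner derive reachability across whole scene chains.
    class hasNextScene(Scene >> Scene, TransitiveProperty): pass
```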

2. Entity Types, Attributes, and Relations

Entity categories are organized into five groups:

  1. Temporal Containers: Sequence and Scene. Scene objects carry hasTimestamp $\in \mathbb{Z}$ and are linked by hasNextScene/hasPreviousScene (temporal transitions).
  2. Traffic Participants: Participant (time-agnostic, e.g., Car, Bicycle, Pedestrian; 23 subclasses), and SceneParticipant (agent at a scene instant). Relation properties: isSceneParticipantOf (time association), inNextScene (chronological agent link).
  3. Lanes and Topology: Lane (edges: hasNextLane, hasPreviousLane, hasLeftLane, hasRightLane), LaneConnector, LaneSlice (geometry, width via laneSliceHasWidth), LaneSnippet (max 20 m, border-type, snippetHasLength).
  4. Road Infrastructure: StopArea (PedCrossingStopArea, TrafficLightStopArea), TrafficLight (hasTrafficLightType, trafficLightHasPose), PedCrossing (connectsWalkways), Walkway, CarparkArea, Intersection, RoadBlock.
  5. Geometric Primitives: sf:Point, sf:Polygon, sf:Geometry (GeoSPARQL).

Principal relations include temporal transitions, agent–scene associations, inter-agent dynamics (follows, intersecting, parallel), map-topology edges, stop/crossing inducements, spatial proximity, and block connectivity. Edge features are typically encoded one-hot or via learned type embeddings, without heuristic continuous weights; spatial edge types encode semantic but not numeric distances.
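Since edge features are one-hot over relation types, the encoding reduces to an index lookup. A minimal sketch, assuming a small hypothetical subset of the relation inventory $R$:

```python
import torch

# Hypothetical subset of the relation inventory R.
RELATION_TYPES = ["hasNextLane", "hasLeftLane", "isSceneParticipantOf", "follows"]

def one_hot_edge_features(rel_names):
    """Encode each edge's relation type as a one-hot vector (no continuous weights)."""
    idx = torch.tensor([RELATION_TYPES.index(r) for r in rel_names])
    return torch.nn.functional.one_hot(idx, num_classes=len(RELATION_TYPES)).float()

feats = one_hot_edge_features(["hasNextLane", "follows"])  # shape (2, 4)
```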

3. Graph Construction and Transformation Pipeline

The construction pipeline involves:

  • Raw Input Integration: Using nuScenes sample_annotation (agent tracks), ego_pose, lane/arcline data, stop_line, traffic_light, walkway, carpark_area records.
  • Ontology Transformation: Instantiating OWL classes and constraints for each record, assigning data-properties and object-properties through spatial tests/lookup.
  • LaneSnippet Extraction: Traversal of each lane’s border-type, splitting at border changes or segments >20 m, connecting adjacent snippets via switchViaX relations.
  • LaneSlice Geometry Extraction: For every 2 m-spaced center arcline point $p$, extract $p_L, p_R$ as the closest left/right border points and compute $\text{width} = \|p_L - p_R\|$ (a minimal sketch follows this list).
  • StopArea/Intersection Grouping: Mapping stop_lines and traffic_lights to StopAreas, constructing PedCrossings by linking walkways <5 m apart.
  • RoadBlock Grouping: Clustering lanes sharing road surfaces/directions, linking via lane connectivity.
  • RDF Serialization: Dumping triples to triplestores (Blazegraph, Owlready2).
  • Subgraph Extraction (nSTP): For trajectory prediction, extracting a local spatio-temporal KG around the target SceneParticipant and its recent context across reachable map elements.
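The LaneSlice width computation referenced above can be sketched in a few lines of NumPy; the border polylines and their sampling are simplified relative to the actual arcline handling:

```python
import numpy as np

def lane_slice_width(p, left_border, right_border):
    """Width of the lane slice anchored at centerline point p.

    left_border / right_border are (N, 2) arrays of border polyline points;
    the closest point on each side is paired and their distance returned.
    """
    p_l = left_border[np.argmin(np.linalg.norm(left_border - p, axis=1))]
    p_r = right_border[np.argmin(np.linalg.norm(right_border - p, axis=1))]
    return float(np.linalg.norm(p_l - p_r))
```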

Coordinate normalization (shift- and rotation-invariance) is enforced:

$$p_{\text{local}} = R_{\text{target}}^T (p_g - p_{\text{target}}), \qquad R_{\text{local}} = R_{\text{target}}^T R_g$$

where $p_g, R_g$ are the global pose and $p_{\text{target}}, R_{\text{target}}$ the reference pose of the target agent (Mlodzian et al., 2023).
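This normalization translates directly into code; a minimal NumPy sketch, assuming 2D positions and $2 \times 2$ rotation matrices:

```python
import numpy as np

def to_local_frame(p_g, R_g, p_target, R_target):
    """Express a global pose (p_g, R_g) in the target agent's frame.

    p_* are 2D positions, R_* are 2x2 rotation matrices; the transform makes
    all features invariant to global shifts and rotations.
    """
    p_local = R_target.T @ (p_g - p_target)
    R_local = R_target.T @ R_g
    return p_local, R_local
```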

4. Data API and Downstream Usage

Extracted subgraphs are converted into PyTorch-Geometric HeteroData objects for GNN processing. Each example $(x_i, y_i)$ comprises:

  • $x_i$: a PyG HeteroData graph (multi-type nodes/features, multi-relation edge indices).
  • $y_i \in \mathbb{R}^{12 \times 2}$: the future trajectory (6 s at 2 Hz, 12 steps of 2D positions).

Node features are stored per node type as $x \in \mathbb{R}^{N_{\text{ntype}} \times d_v}$, with edge indices for each relation $(src, rel, dst)$ stored as tensors in $\mathbb{Z}^{2 \times E_{rel}}$. Target-node masks and historical positions are provided for sequence modeling.

Loading and batching are supported via torch_geometric.loader.DataLoader(dataset, batch_size=…). All code, the ontology, and data artifacts are open-sourced (github.com/boschresearch/nuScenes_Knowledge_Graph) (Mlodzian et al., 2023).
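A hedged sketch of the per-sample structure is given below; the node/edge type names, feature dimensions, and the target_mask attribute are illustrative, not the released schema:

```python
import torch
from torch_geometric.data import HeteroData
from torch_geometric.loader import DataLoader

sample = HeteroData()
# Node features per type (type names and dimensions are illustrative).
sample["scene_participant"].x = torch.randn(14, 9)
sample["lane_slice"].x = torch.randn(320, 4)
# Boolean mask marking the prediction target(s) among scene participants.
sample["scene_participant"].target_mask = torch.zeros(14, dtype=torch.bool)
sample["scene_participant"].target_mask[0] = True
# One edge-index tensor per (src, rel, dst) relation.
sample["scene_participant", "follows", "scene_participant"].edge_index = \
    torch.randint(0, 14, (2, 20))
sample["lane_slice", "hasNextLaneSlice", "lane_slice"].edge_index = \
    torch.randint(0, 320, (2, 300))
# Supervision target: 12 future steps of 2D displacements.
sample.y = torch.randn(12, 2)

loader = DataLoader([sample] * 8, batch_size=4)
batch = next(iter(loader))  # batched HeteroData with concatenated stores
```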

5. Integration with Graph Neural Networks and Symbolic Foundation Models

Graph neural architectures ingest nSKG via heterogeneous message-passing (e.g., HeteroConv in PyG), with per-relation processing:

  • Each $(src, rel, dst)$ relation passes through its own GNN block, with results aggregated over the destination node type.

For trajectory forecasting, target-node embeddings are regressed via an attached MLP head to predict $12 \times 2$ displacement vectors. The standard loss is mean squared error (MSE):

$$L = \frac{1}{12} \sum_{t=1}^{12} \| \hat{Y}_t - Y_t \|^2$$

Variants include edge-type embeddings, Graph Transformers/HGT for type-aware attention, and multi-modal losses (e.g., minimum Average Displacement Error and maneuver classification) (Mlodzian et al., 2023).
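A compact sketch of such a pipeline (not the authors' exact architecture) combines HeteroConv with a regression head; the relation inventory and the target_mask attribute are assumptions carried over from the data sketch above:

```python
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

class TrajectoryGNN(torch.nn.Module):
    """Hedged sketch: per-relation message passing plus an MLP regression head."""
    def __init__(self, hidden=64, horizon=12):
        super().__init__()
        # One GNN block per (src, rel, dst) relation; messages arriving at the
        # same destination node type are aggregated ("sum").
        self.conv = HeteroConv({
            ("scene_participant", "follows", "scene_participant"):
                SAGEConv((-1, -1), hidden),
            ("lane_slice", "hasNextLaneSlice", "lane_slice"):
                SAGEConv((-1, -1), hidden),
        }, aggr="sum")
        # MLP head regressing horizon x 2 displacement vectors per target node.
        self.head = torch.nn.Sequential(
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, horizon * 2),
        )
        self.horizon = horizon

    def forward(self, data):
        x_dict = self.conv(data.x_dict, data.edge_index_dict)
        # target_mask is an assumed per-type boolean attribute marking targets.
        target = x_dict["scene_participant"][data["scene_participant"].target_mask]
        return self.head(target).view(-1, self.horizon, 2)

# Training step with the MSE loss above (batch.y reshaped to (B, 12, 2)):
# loss = torch.nn.functional.mse_loss(model(batch), batch.y.view(-1, 12, 2))
```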

nSKG also supports serialization into BEV-symbolic grids for large-scale symbolic foundation modeling (FM4SU) (Zhou et al., 24 Mar 2025). Scenes are encoded as grid matrices $A_t \in (2^E)^{n \times m}$ with explicit semantic and spatial cell delimiters, serialized to token sequences incorporating metadata (country, distance, orientation), spatial order, and special tokens for masking and segmentation. These representations are amenable to transformer modeling (T5, etc.) for tasks such as next-scene prediction and masked-span object recovery.
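The exact token format of FM4SU is not reproduced here, but the core idea of serializing a set-valued grid with cell and row delimiters can be sketched as follows (the delimiter tokens are hypothetical):

```python
# Delimiter tokens are hypothetical; FM4SU's actual vocabulary differs.
CELL_SEP, ROW_SEP, EMPTY = "<c>", "<r>", "<empty>"

def serialize_grid(grid):
    """Flatten a set-valued BEV grid A_t in (2^E)^(n x m) into a token string,
    keeping row-major spatial order with explicit cell and row delimiters."""
    rows = []
    for row in grid:
        cells = [" ".join(sorted(cell)) if cell else EMPTY for cell in row]
        rows.append(f" {CELL_SEP} ".join(cells))
    return f" {ROW_SEP} ".join(rows)

grid = [[{"Car"}, set()],
        [{"Lane", "Pedestrian"}, {"Walkway"}]]
print(serialize_grid(grid))
# Car <c> <empty> <r> Lane Pedestrian <c> Walkway
```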

6. Experimental Statistics and Quantitative Results

nSKG comprises ~43 million RDF triples across ~30,000 scenes (10 Hz, 1,000 scenarios, 20 s sequences), with ~24,000/3,000/3,000 train/val/test splits (Zhou et al., 24 Mar 2025):

| Concept | Scene Occurrences |
|---|---|
| Walkway | 61,879 |
| Intersection | 51,285 |
| Pedestrian Crossing | 22,448 |
| Turn Stop Area | 16,684 |
| Car Park Area | 14,575 |
| Traffic Light Stop Area | 13,618 |
| Ped. Cross. Stop Area | 9,337 |
| Child | 9 |
| Stroller | 3 |

Dynamic objects number 9,374 and static objects 371,271. The average lane graph degree is 2.3 (connectivity), with scene temporal chains of length 200 (20 s at 10 Hz).

Key model benchmarks on serialized BEV-symbolic nSKG representations:

  • Scene Object Prediction: T5-Base (fine-tuned): 88.7% accuracy, 86.6% precision, 74.4% recall, F1 78.6%.
  • Next Scene Prediction: T5-Base: 86.7% accuracy, 61.8% precision, 59.4% recall, F1 60.3%. T5-Large: 86.1% accuracy. Grid resolution ablation (2 m vs. 5 m cells): 88.7% → 39.6% accuracy.
  • Dynamic vs. Static performance: ~80–89% accuracy for static objects, ~40–84% for dynamic depending on task.
  • Zero-shot LLM baselines: ChatGPT-3.5/4o: 35–41%, LLaMA3.1-8B/70B: ~20–22%, untrained baseline: ~37.4%.

Pre-training on masked-span tasks accelerates next-scene training; recall is explicitly prioritized due to the significance of false negatives in safety contexts (Zhou et al., 24 Mar 2025).

7. Impact, Limitations, and Prospects

nSKG delivers a uniform, semantically rich representation bridging raw perception, detailed map structure, explicit traffic rules, and temporal evolution. It enables advanced reasoning and learning in both GNN and transformer-based settings for trajectory and scene forecasting. The principal contributions are the ontology, graph schema, and large-scale extraction pipeline enabling scalable graph regression and foundational scene understanding (Mlodzian et al., 2023, Zhou et al., 24 Mar 2025).

Limitations noted include the absence of end-to-end GNN benchmarking or ablation in the initial release; empirical validation is left for future work. Graph sizes (1,000–2,000 nodes per sample) mandate use of scalable GNNs or subgraph sampling methods. The representation enforces rotation and shift invariance by construction. All relevant artifacts and datasets are publicly available to support further research in graph-based autonomous driving and foundation models (Mlodzian et al., 2023, Zhou et al., 24 Mar 2025).

A plausible implication is that highly structured, semantically explicit knowledge graphs such as nSKG provide a critical substrate for next-generation scene understanding models, facilitating integrated reasoning across perception, map context, and spatio-temporal dynamics.
