nuScenes Knowledge Graph (nSKG)
- nuScenes Knowledge Graph (nSKG) is a comprehensive, ontologically grounded representation of autonomous driving scenes, modeling dynamic agents, static map features, and temporal relations.
- It integrates multi-sensor data from LiDAR, camera, radar, and HD maps to construct richly typed graphs that support both symbolic reasoning and graph neural network inference.
- Empirical evaluations show improved trajectory prediction metrics and robust scene understanding, making nSKG a pivotal tool for advancing autonomous driving research.
The nuScenes Knowledge Graph (nSKG) is a comprehensive, ontologically grounded scene representation for autonomous driving research, modeling all dynamic and static entities, their semantic, spatial, and temporal relationships, and supporting downstream tasks such as trajectory prediction and foundation model-based scene understanding. Built from the nuScenes dataset, nSKG encapsulates agents, map elements, traffic rules, and interactions as richly typed graphs, facilitating both symbolic processing and graph neural network (GNN) inference (Zhou et al., 24 Mar 2025, Mlodzian et al., 2023).
1. Ontology and Semantic Schema
nSKG formalizes traffic scenes as directed, labeled graphs G = (V, E), where the node set V contains entities and literals and the edges E represent relations. All facts are encoded as (subject, predicate, object) triples, integrating both TBox (schema-level classes and properties) and ABox (instance-level) elements. The schema divides into:
- Agent module: Classes encompass vehicles (Car, Bus, Truck), humans (Adult, Child, ConstructionWorker), and two-wheelers (Bicycle, Motorcycle), with relations such as located_at, has_state, and various semantic interaction predicates (follows, overtakes, interacts_with).
- Map module: Nodes include Lane, LaneSnippet, LaneSlice, Intersections, PedestrianCrossing, DrivableArea, Walkway, TrafficSignArea. Relations encode connectivity (follows, hasNextLane, hasPreviousLane), adjacency, containment, and domain-specific properties such as speed limits.
- Scene module: Each Scene node comprises the scene state, temporal links to the preceding and following scenes, a timestamp, and the set of participating agents.
- Instances: Over 43 million triples represent concrete agents, map features, and interrelations, ensuring that ontological axioms (e.g., transitivity, reflexivity) are materialized for effective symbolic reasoning (Zhou et al., 24 Mar 2025, Mlodzian et al., 2023).
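As a concrete illustration of ABox triples and axiom materialization, the sketch below builds a tiny triple store and closes a transitive relation so symbolic queries need no runtime inference; entity and relation names are illustrative, not the full nSKG schema.

```python
# Minimal sketch of triple materialization; names are illustrative.
from itertools import product

# ABox facts as (subject, predicate, object) triples
triples = {
    ("lane_1", "hasNextLane", "lane_2"),
    ("lane_2", "hasNextLane", "lane_3"),
    ("car_7", "located_at", "lane_1"),
}

def materialize_transitive(triples, pred):
    """Add the transitive closure of `pred` to the triple set."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        edges = [(s, o) for s, p, o in closed if p == pred]
        for (s1, o1), (s2, o2) in product(edges, edges):
            if o1 == s2 and (s1, pred, o2) not in closed:
                closed.add((s1, pred, o2))
                changed = True
    return closed

kg = materialize_transitive(triples, "hasNextLane")
# the inferred triple (lane_1, hasNextLane, lane_3) is now explicit
```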
2. Construction Pipeline and Data Sources
The graph construction process ingests LiDAR, camera, radar, and HD-map annotations. For each time step t, a nuScenes sample yields dynamic objects (bounding boxes, semantic labels) and static map features (lane polygons, crosswalks, walkways). The pipeline proceeds via:
- Geo-registration: All detections use a unified world frame via the ego vehicle’s GPS/IMU.
- Entity instantiation: Each object/feature yields a unique node, with label alignment to ontology classes.
- Relation construction: Spatial ‘located_on’ is asserted if object footprints lie within map polygons; adjacency utilizes Delaunay neighborhoods among LaneSlices; semantic interactions are induced by bounding box overlap and spatial orientation.
- Granularity and filtering: Lanes are discretized into equi-spaced LaneSlices (~1–2 m), with LaneSnippets segmented based on border transitions and length. Low-confidence (<0.3) detections are discarded (Zhou et al., 24 Mar 2025, Mlodzian et al., 2023).
- Semantic and temporal enrichment: Agent–Map and Agent–Agent edges, as well as SceneParticipant temporal links (inNextScene), are incorporated.
Across evaluated splits, each per-scene graph comprises ~1,450 nodes and ~5,800 edges, of which roughly 30 are agent nodes and ~1,400 are map nodes, with temporally consecutive agent trajectories linking scenes (Mlodzian et al., 2023).
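The spatial located_on assertion above can be sketched as a point-in-polygon test; the real pipeline uses full object footprints and HD-map polygons, so the names and geometry here are illustrative.

```python
# Illustrative 'located_on' edge construction via a ray-casting
# point-in-polygon test on agent centers (the pipeline uses footprints).

def point_in_polygon(pt, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):               # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

drivable = [(0.0, 0.0), (10.0, 0.0), (10.0, 4.0), (0.0, 4.0)]  # lane polygon
agents = {"car_7": (3.0, 2.0), "ped_1": (12.0, 2.0)}

edges = [(a, "located_on", "lane_1")
         for a, center in agents.items()
         if point_in_polygon(center, drivable)]
# only car_7 lies inside the polygon
```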
3. Integration of Domain Knowledge and Complex Interactions
The nSKG design embeds explicit domain knowledge:
- Road topology: Lane graphs encode directed acyclic (or cyclic) connectivity. RoadBlocks group lanes by direction; intersections and connectors structure the global scene geometry.
- Traffic rules: Relations such as priority_over distinguish right-of-way; traffic controls (StopSignArea, TrafficLightArea) enforce stop predicates.
- Legal/illegal maneuvers: LaneSnippet border types and switchVia relations denote permissible/impermissible transitions.
- Complex interactions: Dedicated subgraphs capture merging (agents on converging LaneSlices linked by merging_with) and overtaking (asserted when spatial/temporal criteria are met: agent A passes B on a lane within Δt < 2s).
- Temporal slicing: Scenes are partitioned to provide both past (for historical context) and future (prediction scope), with coordinate normalization (target-centric alignment, heading normalization) applied for learning invariance (Zhou et al., 24 Mar 2025, Mlodzian et al., 2023).
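The overtaking criterion above (agent A passes B within Δt < 2 s) can be sketched as a check over longitudinal lane positions; the actual pipeline adds spatial-orientation tests not shown here, and the track format is an assumption.

```python
# Hedged sketch of the overtaking predicate: A is behind B at one timestamp
# and ahead at the next, within the 2 s window stated in the text.

def overtakes(track_a, track_b, dt_max=2.0):
    """track_*: list of (timestamp, longitudinal_position_on_lane) pairs."""
    for (t0, a0), (t1, a1) in zip(track_a, track_a[1:]):
        b0 = next((s for t, s in track_b if t == t0), None)
        b1 = next((s for t, s in track_b if t == t1), None)
        if b0 is None or b1 is None:
            continue                            # no matching timestamp for B
        if a0 < b0 and a1 > b1 and (t1 - t0) < dt_max:
            return True                         # A passed B inside the window
    return False

a = [(0.0, 0.0), (1.0, 12.0)]   # fast agent
b = [(0.0, 5.0), (1.0, 10.0)]   # slower agent, ahead at t = 0
```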
4. Mathematical Representation and Graph Neural Network Integration
Each nSKG instance is a heterogeneous graph with the following structure:
- Adjacency structure: one adjacency matrix (equivalently, an edge-index list) per relation type.
- Node features: vectors storing positional, orientation, and dynamic attributes for agents, and geometric and categorical features for map nodes.
- Edge features: e.g., pairwise distances, relative orientations, border types.
- Target and label: Each example designates a target agent node and ground-truth trajectory (6 seconds at 2 Hz).
GNNs (e.g., HeteroConv, HAN, RGCN, Graph Transformer) operate via multi-type message passing: for each relation type r, neighbor features are transformed by relation-specific weights and sum-aggregated at the destination node, h_v' = σ(Σ_r Σ_{u ∈ N_r(v)} W_r h_u), with final outputs mapped via MLPs. The PyTorch-Geometric (PyG) implementation enables direct consumption of HeteroData objects containing all node/edge modalities (Mlodzian et al., 2023).
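A minimal per-relation message-passing step can be written in plain NumPy, mirroring what HeteroConv-style layers do over a HeteroData object; node types, relations, feature sizes, and weights below are illustrative, not the paper's.

```python
import numpy as np

# One layer of typed message passing over a tiny heterogeneous graph.
rng = np.random.default_rng(0)

x = {"agent": rng.normal(size=(3, 4)), "lane_slice": rng.normal(size=(5, 4))}
# edge_index[(src_type, relation, dst_type)] = (src_ids, dst_ids)
edge_index = {
    ("agent", "located_on", "lane_slice"): (np.array([0, 1]), np.array([2, 2])),
    ("lane_slice", "hasNextLane", "lane_slice"): (np.array([0, 1]), np.array([1, 2])),
}
W = {key: rng.normal(size=(4, 4)) for key in edge_index}  # per-relation weights

def hetero_layer(x, edge_index, W):
    out = {t: np.zeros_like(v) for t, v in x.items()}
    for key, (src, dst) in edge_index.items():
        src_t, _, dst_t = key
        msg = x[src_t][src] @ W[key]          # relation-specific transform
        np.add.at(out[dst_t], dst, msg)       # sum-aggregate at destinations
    return {t: np.maximum(v, 0.0) for t, v in out.items()}  # ReLU

h = hetero_layer(x, edge_index, W)
```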
5. Symbolic Scene Foundation Model (FM4SU) and BEV Sequence Extraction
FM4SU employs nSKG-derived BEV symbolic extraction for foundation model training:
- Ego-centric coordinate transformation: All object coordinates are rotated/translated into the ego-vehicle frame, p' = R(−θ_ego)(p − p_ego), where p_ego and θ_ego denote the ego position and heading.
- Area discretization: The ego frame is gridded into fixed-size cells, each cell listing all object/map element types present.
- Token sequence serialization: The grid is linearized with special concept/descriptive tokens (<scene_start>, <col_sep>, <row_sep>, <concept_sep>, <empty>, <M_i>), along with metadata (country, distance, orientation_diff). Sequences are constructed for both current and next scenes (Zhou et al., 24 Mar 2025).
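The three steps above can be sketched end-to-end as follows; the 4×4 grid and 10 m cell size are assumptions, and only a subset of the special tokens is used.

```python
import math

# Hedged sketch of BEV symbolic extraction: rotate objects into the ego
# frame, bin them into a coarse grid, and serialize row by row.

def to_ego_frame(p, ego_pos, ego_yaw):
    """Rotate/translate a world point into the ego-vehicle frame."""
    dx, dy = p[0] - ego_pos[0], p[1] - ego_pos[1]
    c, s = math.cos(-ego_yaw), math.sin(-ego_yaw)
    return (c * dx - s * dy, s * dx + c * dy)

def serialize_bev(objects, ego_pos, ego_yaw, n=4, cell=10.0):
    grid = [[[] for _ in range(n)] for _ in range(n)]
    half = n * cell / 2
    for label, p in objects:
        x, y = to_ego_frame(p, ego_pos, ego_yaw)
        if -half <= x < half and -half <= y < half:
            grid[int((y + half) // cell)][int((x + half) // cell)].append(label)
    tokens = ["<scene_start>"]
    for row in grid:
        for cell_objs in row:
            tokens += cell_objs or ["<empty>"]   # list types, or mark empty
            tokens.append("<col_sep>")
        tokens.append("<row_sep>")
    return tokens

toks = serialize_bev([("Car", (5.0, 3.0))], ego_pos=(0.0, 0.0), ego_yaw=0.0)
```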
6. Model Training and Evaluation Protocols
For GNN-based trajectory prediction (nSTP), training follows:
- Optimizer: Adam with weight_decay = 1e-5; batch sizes 16–32.
- Epochs: 80–120, employing early stopping on validation ADE.
- Augmentation: Random flips, translations; rotation invariance via coordinate normalization.
For FM4SU, the T5-base encoder-decoder Transformer (220 M parameters) is extended with nSKG-specific tokens and fine-tuned for:
- Masked Scene Object Prediction: Random grid cells in the tokenized BEV are masked, model predicts missing spans (<M_i>).
- Next Scene Prediction: The model is tasked to generate the complete token sequence for the future scene, given the current (Zhou et al., 24 Mar 2025).
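Masked Scene Object Prediction follows T5-style span corruption: masked positions become sentinel tokens <M_i> and the target pairs each sentinel with the original content. A minimal sketch, with mask positions chosen by hand rather than randomly:

```python
# T5-style sentinel masking over a tokenized BEV sequence.

def mask_cells(tokens, mask_positions):
    """Replace tokens at mask_positions with sentinels <M_k>; the target
    pairs each sentinel with the token it replaced."""
    inp, tgt = [], []
    k = 0
    for i, tok in enumerate(tokens):
        if i in mask_positions:
            sentinel = f"<M_{k}>"
            inp.append(sentinel)
            tgt += [sentinel, tok]
            k += 1
        else:
            inp.append(tok)
    return inp, tgt

inp, tgt = mask_cells(["Car", "<col_sep>", "<empty>", "<col_sep>"], {0, 2})
# inp: ['<M_0>', '<col_sep>', '<M_1>', '<col_sep>']
# tgt: ['<M_0>', 'Car', '<M_1>', '<empty>']
```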
Hyperparameters include AdamW without warm-up, up to 20 epochs on an 80 GB A100, and input lengths up to 880 tokens per sequence. The data split is 80% training, 10% validation, 10% test, over ~30,000 scenes.
7. Metrics and Empirical Performance
Evaluation employs:
- Trajectory prediction: ADE₆s (Average Displacement Error over 6s), FDE₆s (Final Displacement Error), MR (Miss Rate: FDE > 2m), OffRoad% (percent predictions outside driveable area).
- FM4SU scene prediction: Accuracy (exact token match), Precision, Recall, F1 (recall is emphasized).
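The trajectory metrics above can be computed directly from predicted and ground-truth positions; the sketch below assumes (batch, T, 2) arrays and uses the 2 m miss threshold from the text.

```python
import numpy as np

# ADE, FDE, and Miss Rate over a batch of predicted trajectories.

def trajectory_metrics(pred, gt, miss_thresh=2.0):
    """pred, gt: (batch, T, 2) arrays of xy positions."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # (batch, T) per-step errors
    ade = dists.mean()                           # Average Displacement Error
    fde_per = dists[:, -1]                       # final-step error per sample
    fde = fde_per.mean()                         # Final Displacement Error
    mr = (fde_per > miss_thresh).mean()          # Miss Rate
    return ade, fde, mr

pred = np.zeros((2, 12, 2))
gt = np.zeros((2, 12, 2))
gt[1, -1] = [3.0, 0.0]                           # one trajectory misses by 3 m
ade, fde, mr = trajectory_metrics(pred, gt)
# fde = 1.5, mr = 0.5
```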
Comparative metrics on nuScenes (test split):
| Method | ADE₆s (m) | FDE₆s (m) | MR (%) | OffRoad (%) |
|---|---|---|---|---|
| VectorNet | 1.82 | 3.75 | 12.5 | 3.9 |
| LaneGCN | 1.75 | 3.62 | 12.0 | 3.6 |
| PGP | 1.68 | 3.55 | 11.6 | 3.5 |
| HDGT | 1.62 | 3.42 | 11.5 | 3.3 |
| nSTP + HeteroGNN | 1.54 | 3.26 | 11.3 | 3.1 |
FM4SU foundation model results:
| Task | Accuracy (%) | Precision | Recall | F1 |
|---|---|---|---|---|
| Masked Object Pred | 88.7 | 0.866 | 0.744 | 0.786 |
| Next Scene Pred | 86.7 | 0.618 | 0.594 | 0.603 |
Ablation studies show meaningful performance drops: removing metadata lowers accuracy to 82.4%, coarsening the grid lowers it to 39.6%, and excluding key semantic relations increases ADE by up to 0.12 m (without map relations) and 0.07 m (without semantic relations) (Zhou et al., 24 Mar 2025, Mlodzian et al., 2023).
8. Implications and Research Significance
nSKG advances the modeling of urban driving scenes by enabling rigorous graph-based representation of heterogeneous domain knowledge, supporting both heterogeneous GNN architectures and symbolic transformer foundation models. This modular pipeline—from ontology to open-source implementation and empirical assessment—demonstrates quantifiable gains in trajectory prediction (up to 5% better ADE/FDE than vectorized-state baselines) and enhanced scene understanding, and makes comprehensive semantic context available for downstream autonomous driving tasks (Mlodzian et al., 2023, Zhou et al., 24 Mar 2025).
A plausible implication is that the integration of richly typed knowledge graphs and symbolic representations will continue to drive progress in interpretable, generalizable machine learning frameworks for dynamic urban environments.