
Argoverse Motion Forecasting Dataset

Updated 13 January 2026
  • Argoverse is a motion forecasting dataset offering extensive vehicle trajectory, sensor, and high-definition map data for autonomous driving research.
  • It integrates detailed road geometry and traffic context, enabling the development of advanced prediction models and rigorous benchmarking.
  • Benchmark studies show the dataset improves trajectory prediction accuracy and safety in autonomous navigation systems.

The nuScenes Knowledge Graph (nSKG) is a comprehensive semantic traffic scene representation constructed over the nuScenes autonomous driving dataset. It formalizes map topology, agents, traffic rules, and their semantic-spatial interactions through a typed attributed directed graph for downstream tasks such as trajectory prediction and scene understanding. nSKG leverages an OWL-DL ontology (SROIQ(D)) and stores all data as RDF triples, supporting explicit reasoning and scalable integration into graph neural network (GNN) and foundation model pipelines (Mlodzian et al., 2023, Zhou et al., 24 Mar 2025).

1. Formal Structure and Ontology Specification

nSKG is defined formally as a heterogeneously typed attributed directed graph:

G = (V, E, X^V, X^E, R)

with $V$ denoting all ontology entities (nodes), $E \subseteq V \times V$ the set of directed edges encoding semantic/spatial relations, $R = \{r_1, \ldots, r_k\}$ the finite set of edge relation types, $X^V \in \mathbb{R}^{|V| \times d_v}$ the node-feature matrix, and $X^E \in \mathbb{R}^{|E| \times d_e}$ the edge-feature matrix.
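As a concrete, library-free illustration of this structure, the sketch below packs $V$, $E$, $R$, $X^V$, and $X^E$ into one container; the class name, entity identifiers, and feature values are invented for illustration and are not part of the nSKG release.

```python
from dataclasses import dataclass, field

@dataclass
class TypedGraph:
    """Schematic container mirroring G = (V, E, X^V, X^E, R)."""
    nodes: list              # V: entity identifiers
    edges: list              # E: (source, relation, target) triples
    relations: set           # R: finite set of edge relation types
    node_feats: dict = field(default_factory=dict)  # X^V: node -> feature vector
    edge_feats: dict = field(default_factory=dict)  # X^E: edge -> feature vector

    def neighbors(self, node, relation=None):
        """Targets reachable from `node`, optionally filtered by relation type."""
        return [t for (s, r, t) in self.edges
                if s == node and (relation is None or r == relation)]

g = TypedGraph(
    nodes=["car_1", "lane_4", "scene_t"],
    edges=[("car_1", "isOn", "lane_4"), ("scene_t", "hasParticipant", "car_1")],
    relations={"isOn", "hasParticipant"},
)
print(g.neighbors("car_1", "isOn"))  # ['lane_4']
```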

The ontology comprises 42 classes, 10 object properties, and 24 data-type properties (Zhou et al., 24 Mar 2025). Major schema modules include:

  • Agent Module: Vehicle, Human, Micro-mobility, Static Obstacle subclasses with fine-grained types (e.g., Car, Truck, ConstructionWorker, Stroller).
  • Map Module: Hierarchical, with RoadSegment, Lane, LaneSnippet, LaneSlice, Intersection, PedCrossing, Walkway, CarParkArea, TrafficLightStopArea, and more.
  • Scene Module: Scene entity $S_t = (B, C, T, P)$, capturing state, temporal linkage (prevScene, nextScene), timestamp, and participant links (hasParticipant).

Object properties encode topological, temporal, and physical relationships (e.g., hasParticipant, prevScene/nextScene, onLane, connectedTo, stopAt). Reasoning axioms enable multi-scale and transitive closure (e.g., LaneSlice $\xrightarrow{\text{partOf}}$ LaneSnippet $\xrightarrow{\text{partOf}}$ Lane) (Mlodzian et al., 2023).
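To make the transitive-closure axiom concrete, the following sketch computes the closure of asserted partOf edges in plain Python — a simplified stand-in for what an OWL-DL reasoner would infer; the entity names are invented.

```python
def transitive_closure(edges):
    """All (a, b) pairs reachable via one or more partOf hops."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # chain a -partOf-> b and b -partOf-> d into a -partOf-> d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# LaneSlice_7 partOf LaneSnippet_2 partOf Lane_1 entails LaneSlice_7 partOf Lane_1
asserted = {("LaneSlice_7", "LaneSnippet_2"), ("LaneSnippet_2", "Lane_1")}
inferred = transitive_closure(asserted)
print(("LaneSlice_7", "Lane_1") in inferred)  # True
```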

2. Graph Construction Pipeline

Raw nuScenes sensor streams (LiDAR, multi-camera, radar, IMU/GNSS) supply perception and map data. nSKG construction follows an ontology-driven transformation pipeline:

  • Perception Stage: 3D object detection, semantic segmentation, instance segmentation, BEV-projection for dynamic/static elements.
  • ABox Population: Scene entity instantiation per timestamp (typically at 10 Hz), agents with class labels, location, heading, velocity; spatial joins via Geo-SPARQL (e.g., agent location within LaneSlice polygons) (Zhou et al., 24 Mar 2025).
  • Temporal and Spatial Linking: Build temporal scene chains (prevScene/nextScene) and agent trajectories (inNextScene). Map topology linked with Lane, LaneSnippet, LaneSlice, LaneConnector, road blocks.
  • LaneSnippet Extraction: Borders are split at type changes or length > 20 m. Adjacent snippets are connected via semantic switchViaX relations (SingleSolid, DoubleDashed).
  • LaneSlice Geometry: For center arcline points, derive left/right border proximity and compute width, then instantiate LaneSlice nodes—linked sequentially (hasNextLaneSlice).
  • Stop Area and Crossings: Map stop_line to StopArea and traffic_light to TrafficLightStopArea; identify pedestrian crossings by spatial proximity of walkways < 5 m apart.
  • RoadBlock Grouping: Lane clustering by shared surface/direction, connective linkage.
  • RDF Conversion: Dump entire entity-relation set as RDF triples, loaded to triplestores (Blazegraph, Owlready2).
  • nSTP Extraction: For trajectory prediction, extract 2 s of agent history, spatially reachable map elements, and other SceneParticipants, and normalize coordinates (shift- and rotation-invariant; $p_{\text{local}} = R_{\text{target}}^T (p_g - p_{\text{target}})$) (Mlodzian et al., 2023).
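The normalization step at the end of the pipeline can be sketched with NumPy; the rotation convention (heading measured counter-clockwise from +x) is an assumption for illustration, not taken from the nSKG code.

```python
import numpy as np

def normalize(points_global, p_target, heading):
    """Target-centric frame: p_local = R_target^T (p_g - p_target).

    Translate global points by the target agent's position, then rotate by
    the transpose of its heading rotation so the target faces local +x.
    """
    c, s = np.cos(heading), np.sin(heading)
    R_target = np.array([[c, -s], [s, c]])        # target frame in world coords
    # row-vector form: (R^T v)^T == v_row @ R, applied to every point at once
    return (points_global - p_target) @ R_target

# A point 2 m east of a north-facing target lands at local (0, -2): to its right.
pts = np.array([[3.0, 4.0], [5.0, 4.0]])
local = normalize(pts, p_target=np.array([3.0, 4.0]), heading=np.pi / 2)
print(np.round(local, 6))
```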

3. Semantic and Spatial Relations

Principal relation types in nSKG encapsulate broad scene semantics. Object properties (edges) include temporal, participant, agent-agent, map-topology, geometry anchoring, and physical proximity:

| Group | Relation | Meaning / Edge Feature |
| --- | --- | --- |
| Temporal | hasNextScene / prevScene | Scene chain t → t+1 / t → t−1 |
| Participant | isSceneParticipantOf | Agent ↔ Scene at time t |
| Inter-Agent | follows, parallel | Precedence, lateral alongside |
| Map-Topology | hasNextLane, hasLeftLane | Lane graph adjacency |
| Geometry | laneHasSlice, connectorHasPose | Anchor geometry |
| Crossings | causesStopAt | Light → stop area causality |
| Proximity | isOn, walkwayIsNextTo | Agent@time → map element |
| RoadBlocks | hasNextRoadBlock | Forward block adjacency |

In practice, relations are represented as RDF triples and edge features typically as one-hot or learned embeddings, with spatial edges carrying their type as feature. Continuous distances are encoded implicitly via existence thresholds, not as explicit edge features (Mlodzian et al., 2023).
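A minimal sketch of this encoding, with an invented four-relation vocabulary (the triples and relation list are illustrative, not the nSKG release format):

```python
# One-hot edge features over a small relation vocabulary.
RELATIONS = ["hasNextScene", "isSceneParticipantOf", "isOn", "hasNextLane"]

def one_hot(relation):
    """Edge feature: one-hot vector over the relation vocabulary."""
    vec = [0.0] * len(RELATIONS)
    vec[RELATIONS.index(relation)] = 1.0
    return vec

triples = [
    ("scene_12", "hasNextScene", "scene_13"),
    ("car_3", "isOn", "laneSlice_88"),
]
edge_features = {t: one_hot(t[1]) for t in triples}
print(edge_features[("car_3", "isOn", "laneSlice_88")])  # [0.0, 0.0, 1.0, 0.0]
```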

4. PyTorch-Geometric API and Downstream Integration

For trajectory prediction, nSKG data is transformed to PyG HeteroData objects, supporting heterogeneous GNN architectures:

  • Node Features: Each node type (SceneParticipant, LaneSlice, LaneConnector, etc.) has an $N_{\text{ntype}} \times d_v$ feature tensor.
  • Edge Indexing: Per-relation edge index tensors encode source/target node connectivity, facilitating message passing per-edge type.
  • Past and Future Trajectories: SceneParticipant nodes carry boolean target masks and past-position features; regression targets $y \in \mathbb{R}^{12 \times 2}$ encode 6 s of future displacement at 2 Hz.
  • Loader: Batch loading via torch_geometric.loader.DataLoader.
  • Model Architecture: Heterogeneous message passing (e.g., PyG HeteroConv) by relation type; MLP head attached to target node embedding for trajectory regression. Mean-squared error is the canonical loss. Alternatives include aggregated edge-type embeddings, Graph Transformer/HGT layers for richer attention, and multi-modal losses (min-ADE, maneuver classification) (Mlodzian et al., 2023).
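The layout above can be sketched without the library, using plain dicts that stand in for torch_geometric tensors (field names mimic PyG conventions; all numbers are invented):

```python
# Library-free mock of a heterogeneous graph sample in HeteroData style.
hetero = {
    "node_feats": {
        # node type -> N_ntype x d_v feature rows
        "SceneParticipant": [[0.1, 0.2, 0.0]],
        "LaneSlice":        [[1.0, 0.5], [0.9, 0.5]],
    },
    "edge_index": {
        # (src_type, relation, dst_type) -> [source_ids, target_ids]
        ("SceneParticipant", "isOn", "LaneSlice"): [[0], [1]],
    },
    # per target node: 12 future (x, y) waypoints at 2 Hz = 6 s horizon
    "y": {"SceneParticipant": [[[0.5 * k, 0.0] for k in range(1, 13)]]},
    "target_mask": {"SceneParticipant": [True]},
}

srcs, dsts = hetero["edge_index"][("SceneParticipant", "isOn", "LaneSlice")]
print(len(hetero["y"]["SceneParticipant"][0]))  # 12 future waypoints
```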

Scalability considerations arise due to average subgraph sizes of 1,000–2,000 nodes; subgraph sampling strategies such as ClusterGCN, GraphSAINT are recommended.

5. BEV Symbolic Representation and Foundation Models

nSKG enables formal Bird’s Eye View (BEV) symbolic extraction for foundation model training (Zhou et al., 24 Mar 2025). Around each ego-vehicle, a BEV grid is constructed:

A_t = [c_{ij}^{(T)}]_{i=1..n,\ j=1..m}, \quad c_{ij}^{(T)} \subseteq O_t

with the $n \times m$ grid (20 × 11 cells) tiling 2 m × 2 m world-coordinate patches, each cell encoding its scene object(s) by ontology label.

BEV grids are serialized into token sequences:

  • Metadata prefix: <country>, <dist> (Δd), <orientation_diff> (Δθ), <scene_start>.
  • Grid cell serialization: concept tokens/emission (<concept_sep>), <col_sep>, <row_sep>, <empty> for blanks.
  • Scene pairs: $A_t$ and $A_{t+1}$ concatenated.
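A minimal serialization sketch under these token conventions (the grid contents are invented, and the row-major traversal order is an assumption):

```python
def serialize(grid):
    """Flatten a 2-D grid of concept-label lists into a token sequence."""
    tokens = []
    for row in grid:
        for cell in row:
            if not cell:
                tokens.append("<empty>")
            else:
                # multiple concepts in one cell are joined with <concept_sep>
                for i, concept in enumerate(cell):
                    if i:
                        tokens.append("<concept_sep>")
                    tokens.append(concept)
            tokens.append("<col_sep>")
        tokens[-1] = "<row_sep>"  # the last cell of a row closes with <row_sep>
    return tokens

grid = [[["Car"], []], [["Walkway", "Stroller"], ["Lane"]]]
seq = serialize(grid)
print(seq)
```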

For training, a token-masking strategy is employed (random span masking via sentinel <Mᵢ> tokens) for span prediction and next-scene prediction. The vocabulary unites 28+ ontology concepts and spatial delimiters.
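The span-masking scheme can be sketched as follows; span positions are fixed here for illustration rather than randomly sampled as during training:

```python
def mask_spans(tokens, spans):
    """T5-style span corruption: replace each (start, length) span with a
    sentinel <M_i> token and emit the dropped spans as the prediction target."""
    masked, target = [], []
    i, sid = 0, 0
    for start, length in spans:
        masked.extend(tokens[i:start])
        masked.append(f"<M_{sid}>")
        target.append(f"<M_{sid}>")
        target.extend(tokens[start:start + length])
        i, sid = start + length, sid + 1
    masked.extend(tokens[i:])
    return masked, target

toks = ["Car", "<col_sep>", "Walkway", "<col_sep>", "Lane", "<row_sep>"]
masked, target = mask_spans(toks, spans=[(2, 2)])
print(masked)   # ['Car', '<col_sep>', '<M_0>', 'Lane', '<row_sep>']
print(target)   # ['<M_0>', 'Walkway', '<col_sep>']
```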

6. Quantitative Characteristics and Experimental Findings

Quantitative statistics for nSKG (Zhou et al., 24 Mar 2025) include:

  • RDF triple count: ~43 million (static + dynamic)
  • Scene count: ~30,000 (20 s at 10 Hz, ~1,000 scenarios)
  • Object occurrence frequencies:
| Concept | Count |
| --- | --- |
| Walkway | 61,879 |
| Intersection | 51,285 |
| Pedestrian Crossing | 22,448 |
| Car Park Area | 14,575 |
| Traffic Light Stop Area | 13,618 |
| Child | 9 |
| Stroller | 3 |
  • Dynamic objects: 9,374; static objects: 371,271
  • Lane graph average degree: ~2.3; scene chain length: 200

Key experiments using pre-trained T5 models yielded:

  • Scene object prediction (T5-Base): Accuracy 88.7%, Precision 86.6%, Recall 74.4%, F1 78.6%
  • Next scene prediction (T5-Base): Accuracy 86.7%, Precision 61.8%, Recall 59.4%, F1 60.3%
  • Ablations: Finer grid resolution (2 m) yielded superior accuracy versus coarser (5 m); recall is favored as the key metric given the safety-critical cost of missed objects.

Zero-shot baselines (LLaMA3.1, ChatGPT) performed significantly below fine-tuned T5 (20–41% acc.), and pre-training (masked span fill) accelerated convergence for scene-prediction tasks.

7. Significance, Current Usage, and Future Directions

nSKG provides an explicit, semantically rich graph representation for traffic scene understanding, unifying raw sensor streams with topological, contextual, and temporal relationships. Its main contributions are the open ontological schema, knowledge graph construction, and dataset release of >40,000 heterograph regression examples (Mlodzian et al., 2023).

Empirical validation for trajectory prediction architectures remains open; nSKG subgraphs are designed to maximize shift- and rotation-invariance. Extension to symbolic foundation models for autonomous driving demonstrates strong spatial and temporal reasoning and forms the basis for further research in comprehensive scene understanding (Zhou et al., 24 Mar 2025).

All scripts, ontology files, RDF triples, and PyG HeteroData artifacts are publicly available at the project repositories, facilitating reproducibility and broad downstream integration.
