
NuScenes vs Argoverse Datasets

Updated 24 September 2025
  • NuScenes and Argoverse are large-scale autonomous driving datasets featuring diverse sensor arrays, extensive spatiotemporal annotations, and HD maps.
  • They support research in perception, 3D detection, tracking, motion forecasting, and simulation, offering robust real-world benchmarks.
  • Comparative analysis reveals distinct sensor setups, annotation scopes, and evaluation metrics that influence performance in AV applications.

NuScenes and Argoverse datasets are foundational resources for the development and benchmarking of perception, tracking, mapping, forecasting, and simulation methods in autonomous driving. Both datasets are characterized by their large-scale multimodal sensor suites, rich spatiotemporal annotations, and detailed HD mapping components. They have enabled new lines of research on sensor fusion, robust 3D detection, precise trajectory prediction, map-based context modeling, and large-scale scenario simulation.

1. Sensor Suite Composition and Data Collection

nuScenes and Argoverse datasets are both constructed from driving data collected with vehicles outfitted with heterogeneous sensor suites to capture the complex dynamics and geography of urban traffic scenes.

  • nuScenes features a sensor configuration comprising 6 surround-view cameras (each 1600×900 pixels at 12 Hz, jointly covering a 360° FOV), a 32-beam spinning lidar operating at 20 Hz (range ~70 m, ±2 cm accuracy), 5 radars (77 GHz, FMCW, 13 Hz, range ~250 m, ±0.1 km/h velocity accuracy), and high-frequency GPS/IMU localization (1 kHz, RTK, ~20 mm accuracy); all sensors are precisely synchronized and motion-compensated for tight alignment across modalities (Caesar et al., 2019).
  • Argoverse deploys a similar multi-sensor setup: 7 externally mounted “ring” RGB cameras (1920×1200 at 30 Hz, jointly covering a 360° FOV), 2 front-facing high-resolution stereo cameras (2056×2464 at 5 Hz, 0.3 m baseline; Argoverse is the only major dataset to offer this), 2 roof-mounted 32-beam spinning lidars (10 Hz, range ~200 m, ~107k points/sweep), and 6-DOF localization via GPS and dead reckoning, expressed in a per-city coordinate system (Chang et al., 2019). Argoverse 2 further expands the sensor suite to 9 cameras (adding 2 stereo), retains the dual-lidar configuration, and improves synchronization to within ±1.4 ms (Wilson et al., 2023).
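As a concrete illustration of how these synchronized streams are exposed, here is a minimal sketch using the official nuscenes-devkit; the dataroot path and the mini split are assumptions, and any installed split works the same way:

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split; substitute version='v1.0-trainval' for the full dataset.
nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

# Each 2 Hz keyframe ("sample") bundles synchronized pointers to all sensors.
sample = nusc.sample[0]
for channel in ('CAM_FRONT', 'LIDAR_TOP', 'RADAR_FRONT'):
    sd = nusc.get('sample_data', sample['data'][channel])
    print(channel, sd['filename'], sd['timestamp'])
```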

Both datasets sample from geographically and climatically diverse U.S. cities (nuScenes: Boston, Singapore; Argoverse: Pittsburgh, Miami; Argoverse 2: Austin, Detroit, Miami, Palo Alto, Pittsburgh, Washington D.C.), increasing the diversity of driving scenes, weather, and infrastructure.

2. Annotation Scope, Taxonomy, and HD Maps

Annotations in nuScenes and Argoverse cover 3D objects, trajectory histories, and richly detailed scene semantics, with a corresponding set of structured HD maps.

  • nuScenes contains 1,000 scenes of 20 s each; every keyframe (sampled at 2 Hz) is labeled with 3D bounding boxes (23 classes, 8 attributes) and semantic map layers (road, sidewalk, etc.), totaling 1.4 million 3D box annotations over 40,000 keyframes, an order of magnitude more than KITTI (Caesar et al., 2019). A sketch for browsing these annotations follows this list.
  • Argoverse’s 3D Tracking set provides dense 3D cuboids for 15 object classes in logs of 15–30 s (Chang et al., 2019), with Argoverse 2 expanding to 30 object categories with sufficient sampling for 3D detection/forecasting (Wilson et al., 2023). Annotations are spatially restricted to objects within 5 m of a driveable region (as determined by the HD map), and faces/license plates are blurred for privacy.
  • Maps: nuScenes provides rasterized (1 m grid) top-down semantic maps, whereas Argoverse offers vector map elements (3D lane centerlines: polylines with connectivity and semantic metadata such as intersection status and turn direction), high-resolution rasterized ground height, and driveable-area masks. Argoverse’s map-based driveable-region definitions support filtering and context-enriched post-processing (e.g., map-based ground removal, centerline alignment), providing substantially stronger priors for tracking and forecasting (Chang et al., 2019).
  • Motion Forecasting Annotations: the Argoverse Motion Forecasting dataset includes >300k 5-second sequences (320 hours in total, manually mined for “interesting” behavior), each with rich agent trajectories, map context, and social context (Chang et al., 2019). Argoverse 2 expands this to 250,000 11-second multi-actor scenarios mined for interaction-critical events (Wilson et al., 2023); nuScenes’ motion forecasting benchmark is smaller in scale.
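To inspect the nuScenes taxonomy and class balance described above, the per-category box counts can be tallied directly from the devkit’s annotation table. A minimal sketch (the dataroot path is an assumption):

```python
from collections import Counter

from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes')

# Tally 3D box annotations per category; on the full trainval split this
# reproduces the long-tail class distribution discussed in Section 6.
counts = Counter(ann['category_name'] for ann in nusc.sample_annotation)
for name, n in counts.most_common():
    print(f'{name:45s} {n:7d}')
```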

3. Evaluation Metrics and Baseline Methods

Both datasets have driven the development and standardization of novel metrics to reflect the unique challenges of 3D detection, tracking, and forecasting in automated driving.

A. 3D Detection & Tracking (nuScenes, Argoverse, Argoverse 2)

  • nuScenes introduces a center-distance-based matching protocol (distance between 2D box centers on the ground plane, with a configurable threshold, e.g., 0.5–4 m) instead of standard IoU, to accommodate small or slender objects. Average Precision (AP) is then averaged over the class set \mathcal{C} and distance thresholds \mathcal{D}:

\text{mAP} = \frac{1}{|\mathcal{C}| \cdot |\mathcal{D}|} \sum_{c \in \mathcal{C}} \sum_{d \in \mathcal{D}} \text{AP}_{c,d}

  • Multiple true-positive (TP) error metrics are defined: Average Translation Error (ATE), Average Scale Error (ASE, computed as 1 − IoU), Average Orientation Error (AOE), Average Velocity Error (AVE), and Average Attribute Error (AAE). The consolidated nuScenes Detection Score (NDS) combines mAP with the class-averaged TP metrics (see the sketch after this list):

\text{NDS} = \frac{1}{10} \left[ 5 \cdot \text{mAP} + \sum_{\text{mTP} \in \text{TP}} \left( 1 - \min(1, \text{mTP}) \right) \right]

  • Argoverse 2 applies conceptually similar metrics, with analogous formulas for AP, ATE, ASE, and AOE, plus a Composite Detection Score (CDS, the product of mAP and the complements of the normalized TP errors), with detection and tracking assessed over 30 object classes (Wilson et al., 2023).
  • Baseline methods: for nuScenes, lidar-based detection (PointPillars, with multi-frame sweeps for velocity estimation), monocular detection (OFT, MonoDIS), and tracking-by-detection with AB3DMOT; for Argoverse, map priors are used to correct orientation, filter ground points, and produce more reliable association (Caesar et al., 2019, Chang et al., 2019, Wilson et al., 2023).
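The NDS formula above transcribes directly into code. A minimal sketch (the input values in the usage line are illustrative, not reported results):

```python
def nuscenes_detection_score(m_ap: float, tp_errors: dict) -> float:
    """NDS = (1/10) * [5*mAP + sum over mTP metrics of (1 - min(1, mTP))].

    tp_errors holds the class-averaged mATE, mASE, mAOE, mAVE, mAAE;
    each error is clipped at 1 so a very poor metric cannot go negative.
    """
    assert set(tp_errors) == {'mATE', 'mASE', 'mAOE', 'mAVE', 'mAAE'}
    tp_term = sum(1.0 - min(1.0, e) for e in tp_errors.values())
    return 0.1 * (5.0 * m_ap + tp_term)

# Illustrative values only:
print(nuscenes_detection_score(
    0.45, {'mATE': 0.30, 'mASE': 0.25, 'mAOE': 0.45, 'mAVE': 0.35, 'mAAE': 0.20}))
```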

B. Motion Forecasting

  • Standard metrics (common to both datasets): Minimum Average Displacement Error (minADE), Minimum Final Displacement Error (minFDE), Miss Rate (MR), and Drivable Area Compliance (DAC) (Chang et al., 2019); a sketch of the displacement metrics appears after this list.
  • Usage of HD maps as priors: Map-based pruning (candidate paths along centerlines) and curvilinear transforms (transforming agent positions into lane-aligned systems) result in higher forecasting accuracy.
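A minimal sketch of the displacement metrics for a single agent with K candidate futures; the 2.0 m miss threshold follows the Argoverse convention, and the function name is ours:

```python
import numpy as np

def displacement_metrics(preds: np.ndarray, gt: np.ndarray,
                         miss_threshold: float = 2.0):
    """minADE / minFDE / miss flag for one agent.

    preds: (K, T, 2) candidate future trajectories; gt: (T, 2) ground truth.
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # (K, T) per-step errors
    min_ade = dists.mean(axis=1).min()                 # best trajectory by average error
    min_fde = dists[:, -1].min()                       # best trajectory at the final step
    missed = bool(min_fde > miss_threshold)            # MR = fraction missed over agents
    return min_ade, min_fde, missed
```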

C. BEV Map Estimation

  • Occupancy-grid IoU and vectorized-layout mAP are the standard metrics for BEV map estimation, with approaches integrating Bayesian fusion and uncertainty measures (Roddick et al., 2020, Zhang et al., 3 Nov 2024). For example, VQ-Map achieves 62.2% surround-view mIoU on nuScenes and 73.4% monocular mIoU on Argoverse (Zhang et al., 3 Nov 2024). RelMap advances online vectorized HD map construction by integrating class-aware spatial priors and MoE-based semantic priors, reaching 77.1 mAP on nuScenes alongside strong results on Argoverse 2 (Cai et al., 29 Jul 2025).
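For reference, the per-class BEV IoU behind these mIoU numbers reduces to a few lines over binary occupancy masks. A minimal sketch (mIoU simply averages this over the map classes):

```python
import numpy as np

def bev_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two (H, W) binary BEV masks for a single map class."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return float('nan')  # class absent from both prediction and ground truth
    return np.logical_and(pred, gt).sum() / union
```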

4. Research Applications, Community Impact, and Downstream Use

nuScenes and Argoverse underpin a broad and growing ecosystem of research directions across perception, prediction, scene understanding, and simulation:

  • Perception and Tracking: The canonical nature of the datasets (rich sensor fusion, diversity, dense annotations) has established them as standard benchmarks for training and evaluating end-to-end 3D detection, tracking, and motion forecasting models. Analysis and baselines on these data motivated innovations like multi-frame input, velocity estimation from consecutive sweeps, and context-based association (Caesar et al., 2019, Chang et al., 2019).
  • Semantic Mapping and BEV Estimation: The inclusion of surround-view imagery, lidar, and annotated maps enabled the design and validation of approaches for monocular BEV semantic mapping, Bayesian occupancy grids, transformer-based mapping (e.g., Pyramid Occupancy Networks, GitNet, VQ-Map), and token-based map layout estimation (Roddick et al., 2020, Gong et al., 2022, Zhang et al., 3 Nov 2024).
  • Trajectory Prediction and Map-Aware Modeling: Argoverse’s trajectory data and lane centerlines serve as priors for LSTM, GNN, and diffusion-based prediction models. Map-based ground removal and lane-aligned coordinates are demonstrably beneficial for orientation accuracy and forecasting performance (Chang et al., 2019, Mlodzian et al., 2023).
  • Simulation, Synthetic Data, and Benchmark Creation: Both datasets are repurposed for advanced simulation (ScenarioNet, TRoVE) by converting annotated scenes into digital twins or photorealistic synthetic data for benchmarking perception, imitation learning, and reinforcement learning (Dokania et al., 2022, Li et al., 2023).
  • Vision-Language and Spatial QA Benchmarks: NuScenes-QA and NuScenes-SpatialQA leverage 3D annotations to validate multi-modal VQA and spatial reasoning in driving context, systematically generating millions of question-answer pairs grounded in 3D scene graphs (Qian et al., 2023, Tian et al., 4 Apr 2025).
  • Dataset Management and Quality Tools: Tools like ReBound facilitate interactive and cross-dataset 3D annotation, re-annotation, and active learning, directly enhancing data quality and supporting iterative dataset curation (Chen et al., 2023).

5. Comparative Analysis and Notable Differences

Table: Key Features of nuScenes vs. Argoverse/Argoverse 2

| Feature/Aspect | nuScenes | Argoverse 1 / Argoverse 2 |
| --- | --- | --- |
| Cameras | 6 RGB, 1600×900, 12 Hz | 7 ring (1920×1200, 30 Hz) + 2 stereo (5 Hz) |
| Lidar | 1× 32-beam, 20 Hz | 2× 32-beam, 10 Hz |
| Radar | 5× FMCW, 77 GHz | Not present |
| Stereoscopic imagery | No | Yes (front-facing stereo) |
| HD maps | Rasterized; driveable area, sidewalks | Vector lanes with connectivity, plus ground height |
| Object classes | 23 (1.4M boxes) | 15 / 30 (Argoverse 2), within 5 m of driveable area |
| Annotations | 360°, 20 s scenes, full sensor suite | 15–30 s logs, 300k+ forecasting sequences, HD-map-aligned |
| Notable baselines | PointPillars, OFT, MonoDIS, AB3DMOT | Map-based orientation, ground removal, LSTM |
| Motion forecasting | Limited set | Extensive (300k+ sequences), lane priors |
| Scene diversity | Boston, Singapore | Multiple U.S. cities (Pittsburgh, Miami, etc.) |

Notably, Argoverse is the only major AV dataset to include forward-facing stereo imagery, and its HD maps offer explicit lane-centerline connectivity, facilitating downstream map automation. Argoverse 2 extends the number of cities, the sensor modalities (including stereo), the object classes, sampling rates, and the scale of lidar-only data for self-supervised learning (Wilson et al., 2023).

nuScenes, in contrast, is the canonical early multimodal AV dataset unifying 360° image, lidar, radar, and map annotations, as well as pioneering evaluation protocols (center-distance, TID/LGD). Map information in nuScenes is rasterized—less structurally explicit than Argoverse’s vector maps, but sufficient for BEV segmentation/semantic tasks. Both have provided public devkits, leaderboards, and serve as primary testbeds for the field.

6. Limitations, Challenges, and Best Practices

  • Geographical Data Leakage: standard splits in both datasets use time-based partitioning, leading to high overlap between training and test/validation locations (over 80% of nuScenes’ validation/test frames lie within 5 m of a training location, ~40% for Argoverse). This “localization leakage” lets models inflate performance by memorizing local appearance-to-map correspondences rather than generalizing to unseen geography (Lilja et al., 2023); a sketch for quantifying such overlap appears after this list. Geographically disjoint splits (e.g., “Near/Far Extrapolation”) reveal sharp drops in mAP (MapTRv2 loses >45 mAP on nuScenes), demonstrating the challenge of true generalization.
  • Annotation Boundaries: Both datasets restrict annotations spatially (e.g., Argoverse to a 5 m driveable buffer), which, while practical for labeling, introduces boundary effects for models operating at greater range or with uncertain map context.
  • Class Imbalance & Long-tail: As shown in nuScenes analysis, the severely imbalanced occurrence of rare classes (e.g., construction vehicles) presents an open challenge for detection/forecasting—long-tail effects persist in learned system performance (Caesar et al., 2019).
  • Role of Map Priors: While map context aids tracking and trajectory forecasting (especially in Argoverse), map automation (automatic inference/extension of maps from sensor data) remains challenging and under-explored. Most methods still rely on human-annotated, high-definition base maps.
  • Sensor-Driven Discrepancies: Different acquisition modalities, frame rates, and noise characteristics drive cross-dataset discrepancies, presenting obstacles to direct model transfer. Newer methods (e.g., NSDE-based sequence models) and uncertainty estimation techniques aim to mitigate these effects and improve cross-dataset generalization (Park et al., 2023, Gilles et al., 2022).
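As referenced in the first bullet above, the geographic overlap between splits can be quantified by nearest-neighbor search over ego-poses in the shared city frame. A minimal sketch, assuming 2D ego positions have already been extracted from each split (the function name is ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def localization_overlap(train_xy: np.ndarray, eval_xy: np.ndarray,
                         radius: float = 5.0) -> float:
    """Fraction of evaluation ego-poses lying within `radius` metres of any
    training ego-pose; values near 1.0 indicate geographic leakage
    (Lilja et al., 2023).

    train_xy, eval_xy: (N, 2) ego positions in a shared city coordinate frame.
    """
    nearest_dist, _ = cKDTree(train_xy).query(eval_xy, k=1)
    return float((nearest_dist <= radius).mean())
```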

7. Future Directions and Community Impact

nuScenes and Argoverse (and Argoverse 2) have catalyzed progress in AV perception, prediction, and simulation:

  • Advances in Perception, Mapping, Forecasting: Standardized datasets and metrics have enabled rapid benchmarking and innovation, including new transformer/fusion-based architectures for BEV perception, multi-scale tracking, and map construction (Pyramid Occupancy Networks, GitNet, RelMap, VQ-Map).
  • Rich Simulation and Synthetic Data: The data structure and annotation fidelity of both datasets have made them ideal “substrates” for photorealistic simulation, synthetic dataset generation (Panacea+, TRoVE), and digital twin creation for robust, scalable imitation learning and RL.
  • Emergence of Knowledge and Scene Graphs: Efforts like the nuScenes Knowledge Graph (nSKG) formalize the use of semantic reasoning and graph neural networks for trajectory prediction, potentially paving the way for neuro-symbolic pipeline integration (Mlodzian et al., 2023).
  • Comprehensive Multi-modal VQA/Spatial Reasoning Tasks: Benchmarks such as NuScenes-QA and NuScenes-SpatialQA systematically raise the bar for vision-language and spatial reasoning evaluations in AV, directly leveraging the scale and grounding of original dataset annotations (Qian et al., 2023, Tian et al., 4 Apr 2025).
  • Recommendations for Practice: To evaluate generalization, geographically disjoint data splits are now advocated. Additionally, dataset management tools such as ReBound facilitate iterative, cross-domain annotation improvement and conversion, enhancing dataset longevity and impact (Chen et al., 2023, Lilja et al., 2023).

Ongoing development, exploration of richer semantic priors, robust handling of sensor and annotation artifacts, and rigorous scenario-level simulation and evaluation continue to drive the field, with nuScenes and Argoverse datasets remaining central to both methodological innovation and empirical benchmarking.
