nuScenes: Autonomous Driving Benchmark
- nuScenes is a large-scale, multimodal benchmark dataset for autonomous driving, featuring 360° sensor coverage and detailed 3D annotations in urban environments.
- It integrates synchronized data from cameras, LiDAR, radar, and GPS/IMU, and introduces evaluation metrics tailored to 3D detection and tracking.
- The dataset supports advanced research in sensor fusion, class imbalance modeling, and robust prediction, setting new standards in autonomous vehicle perception.
nuScenes Dataset
nuScenes is a large-scale, multimodal benchmark for autonomous driving perception and prediction. It provides a comprehensive sensor suite, extensive annotations, and novel performance metrics, and it serves as a standard benchmark for research in 3D detection, tracking, and sensor fusion. The dataset was collected in dense urban environments in Boston and Singapore and is distinguished by complete 360° sensor coverage, dense 3D annotations of objects and their states, and public releases of a development kit, evaluation code, and standardized protocols (Caesar et al., 2019).
1. Sensor Configuration and Synchronization
nuScenes is the first autonomous driving dataset to capture the full sensor suite typically mounted on a research vehicle, enabling holistic 360° perception:
Modality | Sensor Details |
---|---|
Camera | 6 RGB cameras: 5 × 70° FOV (front, sides), 1 × 110° FOV (rear), synchronized 360° coverage |
LiDAR | 1 × 32-beam spinning Velodyne at 20 Hz, 360° horizontal, –30° to +10° vertical FOV, ~70 m range |
Radar | 5 × 77 GHz Continental ARS408 FMCW, 13 Hz, up to 250 m, ±0.1 km/h velocity accuracy |
Localization | GPS/IMU at 1000 Hz, RTK, 20 mm accuracy |
Precise temporal synchronization is a defining feature: each camera's exposure is triggered when the top LiDAR beam sweeps across the center of that camera's field of view. This yields tightly aligned multimodal sensor data and addresses the calibration and registration problems encountered in prior datasets.
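As a quick orientation, the following minimal sketch uses the official nuscenes-devkit (`pip install nuscenes-devkit`) to list the synchronized sensor readings attached to one annotated keyframe; the `dataroot` path and the `v1.0-mini` split are placeholder assumptions for illustration.

```python
# Minimal sketch: inspect the synchronized sensor suite for one keyframe
# using the nuscenes-devkit. The dataroot path and mini split are assumptions.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

sample = nusc.sample[0]  # one annotated keyframe (2 Hz)
for channel, sd_token in sample['data'].items():
    sd = nusc.get('sample_data', sd_token)
    # Timestamps are in microseconds; nearby values across channels reflect
    # the LiDAR-triggered camera exposure scheme described above.
    print(f"{channel:17s} t={sd['timestamp']} file={sd['filename']}")
```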
2. Dataset Structure, Size, and Annotation Schema
nuScenes comprises 1,000 driving scenes of 20 seconds each (about 5.5 hours of data in total), collected in urban areas of Boston and Singapore. The annotated core consists of keyframes sampled at 2 Hz, yielding the following statistics:
- ~1.4 million images (~100× KITTI)
- ~400,000 LiDAR point clouds, ~1.3 million radar sweeps
- Full 360° annotation coverage—not limited to the forward view as in KITTI
Each keyframe is richly annotated with:
- 3D bounding boxes for 23 semantic classes (including both prevalent and rare types, such as construction vehicles, trailers, traffic cones)
- 8 object attributes (e.g., pedestrian pose, vehicle state)
- Full geometric box description: center position (x, y, z), width, length, height, and yaw angle (see the devkit sketch below)
This scale yields ≈7× more annotations than KITTI, and the dataset exhibits strong class imbalance (a ratio of roughly 1:10,000 between the rarest and most common classes), motivating research on long-tail distribution modeling.
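To make the annotation schema concrete, here is a hedged devkit sketch that reads the geometric box description and attributes for a few annotations of one keyframe; field names follow the public nuScenes schema, and the `dataroot` path is again a placeholder.

```python
# Sketch: read the geometric box description and attributes of one keyframe's
# annotations via the nuscenes-devkit (dataroot path is an assumed example).
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)
sample = nusc.sample[0]
for ann_token in sample['anns'][:5]:
    ann = nusc.get('sample_annotation', ann_token)
    x, y, z = ann['translation']        # box center in the global frame (m)
    w, l, h = ann['size']               # width, length, height (m)
    quat = ann['rotation']              # orientation as a (w, x, y, z) quaternion
    attrs = [nusc.get('attribute', t)['name'] for t in ann['attribute_tokens']]
    print(ann['category_name'], (x, y, z), (w, l, h), quat, attrs)
```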
3. Benchmarking: Novel Detection, Tracking Metrics, and Baselines
nuScenes introduced evaluation metrics tailored to the challenges of 3D detection and tracking in safety-critical, sensor-fusion scenarios:
Detection
- Center-distance-based average precision (AP) with class-independent thresholds: instead of intersection-over-union (IoU), predicted and ground-truth boxes are matched by their 2D center distance on the ground plane, using the same thresholds $\mathbb{D} = \{0.5, 1, 2, 4\}$ m for every class; mAP averages AP over all thresholds and classes.
- nuScenes Detection Score (NDS): captures overall quality by balancing mAP with five true-positive (TP) error metrics:
- Average Translation Error (ATE, meters)
- Average Scale Error (ASE, computed as $1 - \mathrm{IoU}$ after aligning center and orientation)
- Average Orientation Error (AOE, radians)
- Average Velocity Error (AVE, m/s)
- Average Attribute Error (AAE, computed as $1 - \mathrm{acc}$, where acc is attribute classification accuracy)
- $\mathrm{NDS} = \tfrac{1}{10}\big[5\,\mathrm{mAP} + \sum_{\mathrm{mTP}} \big(1 - \min(1, \mathrm{mTP})\big)\big]$, where the sum runs over the five TP error metrics above (a numeric sketch follows this list)
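As a worked example of the NDS formula above, the short sketch below combines an illustrative (made-up) mAP value with five made-up TP errors; it is not the official evaluation code.

```python
# Worked sketch of the NDS formula, with illustrative (made-up) numbers.
def nds(mAP: float, tp_errors: dict) -> float:
    """nuScenes Detection Score: NDS = 1/10 * (5*mAP + sum(1 - min(1, err)))."""
    bounded = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return 0.1 * (5.0 * mAP + bounded)

example_errors = {"ATE": 0.30, "ASE": 0.25, "AOE": 0.40, "AVE": 0.35, "AAE": 0.15}
print(nds(mAP=0.40, tp_errors=example_errors))  # -> 0.555
```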
Tracking
- sAMOTA (scaled MOT accuracy), sMOTA (confidence-calibrated MOTA)
- Track Initialization Duration (TID): time from an object's first appearance until the tracker first tracks it
- Longest Gap Duration (LGD): longest interval during which a tracked object is missed (both illustrated in the sketch below)
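The following is an illustrative sketch, not the official tracking evaluation, of how TID and LGD can be computed for a single ground-truth object from a per-keyframe "tracked" flag sequence, assuming the 2 Hz keyframe rate (0.5 s per frame).

```python
# Illustrative sketch: TID and LGD for one ground-truth object, given a
# boolean per-keyframe sequence indicating whether the object was tracked.
from typing import List

KEYFRAME_PERIOD_S = 0.5  # keyframes are annotated at 2 Hz

def tid(tracked: List[bool]) -> float:
    """Time from the object's first appearance until it is first tracked."""
    for i, hit in enumerate(tracked):
        if hit:
            return i * KEYFRAME_PERIOD_S
    return len(tracked) * KEYFRAME_PERIOD_S  # never tracked

def lgd(tracked: List[bool]) -> float:
    """Longest run of consecutive frames in which the object is not tracked."""
    longest = current = 0
    for hit in tracked:
        current = 0 if hit else current + 1
        longest = max(longest, current)
    return longest * KEYFRAME_PERIOD_S

flags = [False, False, True, True, False, True, False, False, False, True]
print(tid(flags), lgd(flags))  # -> 1.0 1.5
```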
Baselines
- LiDAR: PointPillars with temporal accumulation of 10 sweeps and velocity regression (multi-sweep accumulation is sketched after this list)
- Image: Orthographic Feature Transform (OFT, with SSD head), MonoDIS
- Tracking: Adapted AB3DMOT for both LiDAR and image-based detection
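To illustrate the temporal accumulation used by the PointPillars baseline, the sketch below concatenates several LiDAR sweeps into the keyframe's coordinate frame and appends the relative time lag as an extra point feature; `to_reference` is a hypothetical ego-motion transform supplied by the caller, and the dummy data is made up.

```python
# Sketch of multi-sweep LiDAR accumulation with a relative time-lag channel.
# `to_reference(t)` is a hypothetical 4x4 transform mapping the sweep captured
# at time t into the keyframe's coordinate frame.
import numpy as np

def accumulate_sweeps(sweeps, timestamps, ref_time, to_reference):
    """sweeps: list of (N_i, 3) xyz arrays; timestamps: per-sweep times in s."""
    decorated = []
    for pts, t in zip(sweeps, timestamps):
        homo = np.hstack([pts, np.ones((len(pts), 1))])        # (N, 4) homogeneous
        aligned = (to_reference(t) @ homo.T).T[:, :3]          # into keyframe frame
        dt = np.full((len(pts), 1), ref_time - t)              # time-lag feature
        decorated.append(np.hstack([aligned, dt]))             # (N, 4): x, y, z, dt
    return np.vstack(decorated)

# Usage with dummy data: two sweeps and an identity ego-motion transform.
sweeps = [np.random.rand(5, 3), np.random.rand(4, 3)]
pts = accumulate_sweeps(sweeps, [0.45, 0.50], 0.50, lambda t: np.eye(4))
print(pts.shape)  # (9, 4)
```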
4. Dataset Analysis, Spatial Coverage, and Class Imbalance
Spatial and statistical analyses reveal:
- Intersections are overrepresented (reflecting real-world conflict points)
- On average, each keyframe contains ≈7 pedestrians and ≈20 vehicles, but strong long-tail distributions persist across rare classes
- Distributional histograms for box sizes, spatial positions, and yaw demonstrate pronounced diversity in object geometry and scenario complexity
- Experiments confirm that a 2 m center-distance matching threshold yields more informative cross-modality (LiDAR vs. image) rankings than IoU, which disproportionately penalizes near-miss errors on small objects (see the matching sketch after this list)
- Accumulating additional LiDAR sweeps yields measurable improvements in both detection AP and velocity estimation, especially for dynamically moving objects
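The center-distance matching rule referenced above can be sketched as follows; this is a simplified, illustrative greedy matcher (one class, descending confidence), not the official evaluation code.

```python
# Illustrative sketch of center-distance matching: a prediction matches a
# ground-truth box if the 2D distance between their centers on the ground
# plane is below a threshold (2 m here), regardless of box IoU.
import numpy as np

def match_by_center_distance(pred_centers, gt_centers, scores, thresh=2.0):
    """Greedy matching in descending score order; returns (pred, gt) index pairs."""
    order = np.argsort(-np.asarray(scores))
    unmatched_gt = set(range(len(gt_centers)))
    matches = []
    for p in order:
        if not unmatched_gt:
            break
        dists = {g: np.linalg.norm(np.asarray(pred_centers[p])[:2] -
                                   np.asarray(gt_centers[g])[:2])
                 for g in unmatched_gt}
        g_best = min(dists, key=dists.get)
        if dists[g_best] < thresh:
            matches.append((int(p), g_best))
            unmatched_gt.remove(g_best)
    return matches

print(match_by_center_distance([(0.5, 0.2), (10.0, 10.0)],
                               [(0.0, 0.0), (30.0, 30.0)],
                               scores=[0.9, 0.8]))  # -> [(0, 0)]
```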
5. Comparison with Legacy Datasets
nuScenes represents an order-of-magnitude advance over previous datasets:
Dataset | # Images | # 3D Boxes | Sensor Coverage | Radar | Environment Diversity |
---|---|---|---|---|---|
KITTI | ~15k | ~200k | Front only | No | Limited, daylight |
nuScenes | ~1.4M | ~1.4M | 360°, full suite | Yes | Urban (Boston, Singapore); day/night, rain |
- nuScenes is the first to offer radar, 360° multi-modality, and urban diversity under varied environmental conditions.
6. Research Applications and Implications
nuScenes enables:
- Advanced benchmarks for 3D object detection, tracking, trajectory prediction, and sensor fusion—within a realistic, fused-sensor urban context
- Design and evaluation of algorithms under severe class imbalance, supporting progress in rare-event detection and long-tail modeling
- Study of sensor synchronization and calibration, critical for robust perception in real-world deployments
- Exploration of robustness to geographic and environmental domain shifts, with standardized evaluation practices
Semantic maps and human-authored scene descriptions are included, supporting research in semantic localization, behavior modeling, and prior-based scene understanding. Open-sourced tooling and code facilitate reproduction and comparability.
7. Impact and Future Directions
nuScenes, through its multimodal design, exhaustive annotations, tailored metrics, and open protocols, has established itself as a foundational benchmark for perception in autonomous urban driving. Its structure directly addresses limitations seen in earlier efforts—such as insufficient sensor diversity, annotation sparsity, and narrow operational domains—and sets high standards for future dataset development in the field. The dataset has catalyzed research in class-imbalanced detection, robust tracking, and multi-sensor fusion, and continues to serve as the de facto evaluation bedrock for large-scale autonomous vehicle research.