nuScenes: Autonomous Driving Benchmark
- nuScenes is a large-scale, multimodal benchmark dataset for autonomous driving, featuring 360° sensor coverage and detailed 3D annotations in urban environments.
- It integrates synchronized data from cameras, LiDAR, radar, and GPS/IMU, and introduces evaluation metrics tailored to 3D detection and tracking.
- The dataset supports advanced research in sensor fusion, class imbalance modeling, and robust prediction, setting new standards in autonomous vehicle perception.
nuScenes Dataset
nuScenes is a large-scale, multimodal benchmark for autonomous driving perception and prediction. It provides a comprehensive sensor suite, extensive annotations, and novel performance metrics, and it serves as a standard benchmark for research in 3D detection, tracking, and sensor fusion. The dataset was collected in dense urban environments in Boston and Singapore and is distinguished by complete 360° sensor coverage, dense 3D annotations of objects and their states, and public releases of a development kit, evaluation code, and standardized protocols (Caesar et al., 2019).
1. Sensor Configuration and Synchronization
nuScenes is the first autonomous driving dataset to capture the full sensor suite typically mounted on a research vehicle, enabling holistic 360° perception:
Modality | Sensor Details |
---|---|
Camera | 6 RGB cameras: 5 × 70° FOV (front, sides), 1 × 110° FOV (rear), synchronized 360° coverage |
LiDAR | 1 × 32-beam spinning Velodyne at 20 Hz, 360° horizontal, –30° to +10° vertical FOV, ~70 m range |
Radar | 5 × 77 GHz Continental ARS408 FMCW, 13 Hz, up to 250 m, ±0.1 km/h velocity accuracy |
Localization | GPS/IMU at 1000 Hz, RTK, 20 mm accuracy |
Precise temporal synchronization is a defining feature: each camera's exposure is triggered when the top LiDAR beam sweeps across the center of that camera's field of view. This yields tightly aligned multimodal sensor data and addresses the calibration and registration problems encountered in prior datasets.
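As a quick orientation, the following minimal sketch uses the official nuscenes-devkit (`pip install nuscenes-devkit`) to list the synchronized sensor readings attached to one annotated keyframe; the `dataroot` path and the `v1.0-mini` split are placeholder assumptions for illustration.

```python
# Minimal sketch: inspect the synchronized sensor suite for one keyframe
# using the nuscenes-devkit. The dataroot path and mini split are assumptions.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

sample = nusc.sample[0]  # one annotated keyframe (2 Hz)
for channel, sd_token in sample['data'].items():
    sd = nusc.get('sample_data', sd_token)
    # Timestamps are in microseconds; nearby values across channels reflect
    # the LiDAR-triggered camera exposure scheme described above.
    print(f"{channel:17s} t={sd['timestamp']} file={sd['filename']}")
```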
2. Dataset Structure, Size, and Annotation Schema
nuScenes comprises 1,000 driving scenes of 20 seconds each (about 5.5 hours of data in total), collected in urban areas of Boston and Singapore. The annotated core consists of keyframes sampled at 2 Hz, yielding the following statistics:
- ~1.4 million images (~100× KITTI)
- ~400,000 LiDAR point clouds, ~1.3 million radar sweeps
- Full 360° annotation coverage—not limited to the forward view as in KITTI
Each keyframe is richly annotated with:
- 3D bounding boxes for 23 semantic classes (including both prevalent and rare types, such as construction vehicles, trailers, traffic cones)
- 8 object attributes (e.g., pedestrian pose, vehicle state)
- Full geometric box description: center position (x, y, z), width, length, height, and yaw angle (see the devkit sketch below)
This scale yields ≈7× more annotations than KITTI, and the dataset exhibits strong class imbalance (a ratio of roughly 1:10,000 between the rarest and most common classes), motivating research on long-tail distribution modeling.
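To make the annotation schema concrete, here is a hedged devkit sketch that reads the geometric box description and attributes for a few annotations of one keyframe; field names follow the public nuScenes schema, and the `dataroot` path is again a placeholder.

```python
# Sketch: read the geometric box description and attributes of one keyframe's
# annotations via the nuscenes-devkit (dataroot path is an assumed example).
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)
sample = nusc.sample[0]
for ann_token in sample['anns'][:5]:
    ann = nusc.get('sample_annotation', ann_token)
    x, y, z = ann['translation']        # box center in the global frame (m)
    w, l, h = ann['size']               # width, length, height (m)
    quat = ann['rotation']              # orientation as a (w, x, y, z) quaternion
    attrs = [nusc.get('attribute', t)['name'] for t in ann['attribute_tokens']]
    print(ann['category_name'], (x, y, z), (w, l, h), quat, attrs)
```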
3. Benchmarking: Novel Detection, Tracking Metrics, and Baselines
nuScenes introduced evaluation metrics tailored to the challenges of 3D detection and tracking in safety-critical, sensor-fusion scenarios:
Detection
- Center-distance-based average precision (AP) with class-independent thresholds: instead of intersection-over-union (IoU), predicted and ground-truth boxes are matched by their 2D center distance on the ground plane, using the same thresholds $\mathbb{D} = \{0.5, 1, 2, 4\}$ m for every class; mAP averages AP over all thresholds and classes.
- nuScenes Detection Score (NDS): captures overall quality by balancing mAP with five true-positive (TP) error metrics:
- Average Translation Error (ATE, meters)
- Average Scale Error (ASE, computed as $1 - \mathrm{IoU}$ after aligning center and orientation)
- Average Orientation Error (AOE, radians)
- Average Velocity Error (AVE, m/s)
- Average Attribute Error (AAE, computed as $1 - \mathrm{acc}$, where acc is attribute classification accuracy)
- $\mathrm{NDS} = \tfrac{1}{10}\big[5\,\mathrm{mAP} + \sum_{\mathrm{mTP}} \big(1 - \min(1, \mathrm{mTP})\big)\big]$, where the sum runs over the five TP error metrics above (a numeric sketch follows this list)
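As a worked example of the NDS formula above, the short sketch below combines an illustrative (made-up) mAP value with five made-up TP errors; it is not the official evaluation code.

```python
# Worked sketch of the NDS formula, with illustrative (made-up) numbers.
def nds(mAP: float, tp_errors: dict) -> float:
    """nuScenes Detection Score: NDS = 1/10 * (5*mAP + sum(1 - min(1, err)))."""
    bounded = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return 0.1 * (5.0 * mAP + bounded)

example_errors = {"ATE": 0.30, "ASE": 0.25, "AOE": 0.40, "AVE": 0.35, "AAE": 0.15}
print(nds(mAP=0.40, tp_errors=example_errors))  # -> 0.555
```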
Tracking
- sAMOTA (scaled MOT accuracy), sMOTA (confidence-calibrated MOTA)
- Track Initialization Duration (TID): time from an object's first appearance until the tracker first tracks it
- Longest Gap Duration (LGD): longest interval during which a tracked object is missed (both illustrated in the sketch below)
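The following is an illustrative sketch, not the official tracking evaluation, of how TID and LGD can be computed for a single ground-truth object from a per-keyframe "tracked" flag sequence, assuming the 2 Hz keyframe rate (0.5 s per frame).

```python
# Illustrative sketch: TID and LGD for one ground-truth object, given a
# boolean per-keyframe sequence indicating whether the object was tracked.
from typing import List

KEYFRAME_PERIOD_S = 0.5  # keyframes are annotated at 2 Hz

def tid(tracked: List[bool]) -> float:
    """Time from the object's first appearance until it is first tracked."""
    for i, hit in enumerate(tracked):
        if hit:
            return i * KEYFRAME_PERIOD_S
    return len(tracked) * KEYFRAME_PERIOD_S  # never tracked

def lgd(tracked: List[bool]) -> float:
    """Longest run of consecutive frames in which the object is not tracked."""
    longest = current = 0
    for hit in tracked:
        current = 0 if hit else current + 1
        longest = max(longest, current)
    return longest * KEYFRAME_PERIOD_S

flags = [False, False, True, True, False, True, False, False, False, True]
print(tid(flags), lgd(flags))  # -> 1.0 1.5
```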
Baselines
- LiDAR: PointPillars with temporal accumulation of 10 sweeps and velocity regression (multi-sweep accumulation is sketched after this list)
- Image: Orthographic Feature Transform (OFT, with SSD head), MonoDIS
- Tracking: Adapted AB3DMOT for both LiDAR and image-based detection
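To illustrate the temporal accumulation used by the PointPillars baseline, the sketch below concatenates several LiDAR sweeps into the keyframe's coordinate frame and appends the relative time lag as an extra point feature; `to_reference` is a hypothetical ego-motion transform supplied by the caller, and the dummy data is made up.

```python
# Sketch of multi-sweep LiDAR accumulation with a relative time-lag channel.
# `to_reference(t)` is a hypothetical 4x4 transform mapping the sweep captured
# at time t into the keyframe's coordinate frame.
import numpy as np

def accumulate_sweeps(sweeps, timestamps, ref_time, to_reference):
    """sweeps: list of (N_i, 3) xyz arrays; timestamps: per-sweep times in s."""
    decorated = []
    for pts, t in zip(sweeps, timestamps):
        homo = np.hstack([pts, np.ones((len(pts), 1))])        # (N, 4) homogeneous
        aligned = (to_reference(t) @ homo.T).T[:, :3]          # into keyframe frame
        dt = np.full((len(pts), 1), ref_time - t)              # time-lag feature
        decorated.append(np.hstack([aligned, dt]))             # (N, 4): x, y, z, dt
    return np.vstack(decorated)

# Usage with dummy data: two sweeps and an identity ego-motion transform.
sweeps = [np.random.rand(5, 3), np.random.rand(4, 3)]
pts = accumulate_sweeps(sweeps, [0.45, 0.50], 0.50, lambda t: np.eye(4))
print(pts.shape)  # (9, 4)
```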
4. Dataset Analysis, Spatial Coverage, and Class Imbalance
Spatial and statistical analyses reveal:
- Intersections are overrepresented (reflecting real-world conflict points)
- On average, each keyframe contains ≈7 pedestrians and ≈20 vehicles, but strong long-tail distributions persist across rare classes
- Distributional histograms for box sizes, spatial positions, and yaw demonstrate pronounced diversity in object geometry and scenario complexity
- Experiments confirm that a 2 m center-distance matching threshold yields more informative cross-modality (LiDAR vs. image) rankings than IoU, which disproportionately penalizes near-miss errors on small objects (see the matching sketch after this list)
- Accumulating additional LiDAR sweeps yields measurable improvements in both detection AP and velocity estimation, especially for dynamically moving objects
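The center-distance matching rule referenced above can be sketched as follows; this is a simplified, illustrative greedy matcher (one class, descending confidence), not the official evaluation code.

```python
# Illustrative sketch of center-distance matching: a prediction matches a
# ground-truth box if the 2D distance between their centers on the ground
# plane is below a threshold (2 m here), regardless of box IoU.
import numpy as np

def match_by_center_distance(pred_centers, gt_centers, scores, thresh=2.0):
    """Greedy matching in descending score order; returns (pred, gt) index pairs."""
    order = np.argsort(-np.asarray(scores))
    unmatched_gt = set(range(len(gt_centers)))
    matches = []
    for p in order:
        if not unmatched_gt:
            break
        dists = {g: np.linalg.norm(np.asarray(pred_centers[p])[:2] -
                                   np.asarray(gt_centers[g])[:2])
                 for g in unmatched_gt}
        g_best = min(dists, key=dists.get)
        if dists[g_best] < thresh:
            matches.append((int(p), g_best))
            unmatched_gt.remove(g_best)
    return matches

print(match_by_center_distance([(0.5, 0.2), (10.0, 10.0)],
                               [(0.0, 0.0), (30.0, 30.0)],
                               scores=[0.9, 0.8]))  # -> [(0, 0)]
```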
5. Comparison with Legacy Datasets
nuScenes represents an order-of-magnitude advance over previous datasets:
Dataset | # Images | # 3D Boxes | Sensor Coverage | Radar | Environment Diversity |
---|---|---|---|---|---|
KITTI | ~15k | ~200k | Front only | No | Limited, daylight |
nuScenes | ~1.4M | ~1.4M | 360°, full suite | Yes | Urban (Boston, Singapore); day/night, rain |
- nuScenes is the first to offer radar, 360° multi-modality, and urban diversity under varied environmental conditions.
6. Research Applications and Implications
nuScenes enables:
- Advanced benchmarks for 3D object detection, tracking, trajectory prediction, and sensor fusion—within a realistic, fused-sensor urban context
- Design and evaluation of algorithms under severe class imbalance, supporting progress in rare-event detection and long-tail modeling
- Study of sensor synchronization and calibration, critical for robust perception in real-world deployments
- Exploration of robustness to geographic and environmental domain shifts, with standardized evaluation practices
Semantic maps and human-authored scene descriptions are included, supporting research in semantic localization, behavior modeling, and prior-based scene understanding. Open-sourced tooling and code facilitate reproduction and comparability.
7. Impact and Future Directions
nuScenes, through its multimodal design, exhaustive annotations, tailored metrics, and open protocols, has established itself as a foundational benchmark for perception in autonomous urban driving. Its structure directly addresses limitations seen in earlier efforts—such as insufficient sensor diversity, annotation sparsity, and narrow operational domains—and sets high standards for future dataset development in the field. The dataset has catalyzed research in class-imbalanced detection, robust tracking, and multi-sensor fusion, and continues to serve as the de facto evaluation bedrock for large-scale autonomous vehicle research.