nuScenes: A multimodal dataset for autonomous driving (1903.11027v5)

Published 26 Mar 2019 in cs.LG, cs.CV, cs.RO, and stat.ML

Abstract: Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.

Overview of nuScenes: A Multimodal Dataset for Autonomous Driving

In the paper "nuScenes: A multimodal dataset for autonomous driving," Holger Caesar et al. present a comprehensive dataset designed to support the development of autonomous vehicle (AV) technologies. The dataset, nuScenes, distinguishes itself by providing a rich multimodal collection of sensory data from a full AV sensor suite: six cameras, five radars, and one lidar, together covering a 360-degree field of view. The dataset is instrumental in advancing tasks such as 3D object detection, tracking, and behavior modeling under diverse environmental conditions.
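The dataset ships with an open-source Python devkit (nuscenes-devkit) that exposes its relational schema of scenes, samples, sensor data, and annotations. The snippet below is a minimal sketch of iterating through one scene; the install location (/data/sets/nuscenes) and the use of the v1.0-mini split are illustrative assumptions, not requirements from the paper.

```python
# Minimal sketch of browsing nuScenes with the official Python devkit
# (assumes `pip install nuscenes-devkit` and the v1.0-mini split extracted
# under /data/sets/nuscenes; adjust the path to your environment).
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

# Each scene is a 20-second clip; annotated keyframe "samples" occur at 2 Hz.
scene = nusc.scene[0]
sample_token = scene['first_sample_token']

while sample_token:
    sample = nusc.get('sample', sample_token)
    # Every sample references synchronized data from all 12 sensors
    # (6 cameras, 5 radars, 1 lidar) plus its 3D box annotations.
    lidar_rec = nusc.get('sample_data', sample['data']['LIDAR_TOP'])
    boxes = [nusc.get('sample_annotation', t) for t in sample['anns']]
    print(lidar_rec['filename'], len(boxes), 'annotated boxes')
    sample_token = sample['next']  # empty string at the end of the scene
```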

Key Features and Contributions

The nuScenes dataset is notable for several reasons:

  1. Multimodal Sensor Suite: It includes synchronized data from six cameras, five radars, and one lidar, capturing the environment around the vehicle in 360 degrees. This integration of multiple sensor modalities addresses the inherent limitations of using a single type of sensor, providing a more robust dataset for training and evaluating AV perception systems.
  2. Granular Annotations: nuScenes contains 1000 scenes, each lasting 20 seconds, and provides detailed annotations of 3D bounding boxes for 23 object classes and 8 attributes. This extensive labeling surpasses the annotation volume of previous datasets, such as KITTI, by a significant margin.
  3. Detection and Tracking Metrics: The dataset introduces novel metrics for 3D object detection and tracking, such as Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), and Average Velocity Error (AVE). These metrics aim to offer a holistic evaluation of model performance beyond raw detection accuracy; a minimal sketch of the underlying error terms appears after this list.
  4. Scene Diversity: The data collection spans different geographical locations (Boston and Singapore), weather conditions (rain, clear), and times of the day (day and night). This variety ensures that models trained on this dataset are exposed to the broad range of scenarios they might face in real-world situations.
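As a rough illustration of what those error terms measure, the functions below reimplement them for a single matched prediction/ground-truth pair under simplifying assumptions (2D center distance, yaw-only orientation). This is an illustrative sketch, not the official evaluation code.

```python
import numpy as np

def translation_error(pred_center, gt_center):
    # ATE: Euclidean center distance on the ground plane, in meters.
    return float(np.linalg.norm(np.asarray(pred_center[:2]) - np.asarray(gt_center[:2])))

def scale_error(pred_wlh, gt_wlh):
    # ASE: 1 - 3D IoU after the boxes are aligned in translation and yaw,
    # which reduces to a volume overlap of the (w, l, h) size boxes.
    pred, gt = np.asarray(pred_wlh), np.asarray(gt_wlh)
    inter = np.prod(np.minimum(pred, gt))
    union = np.prod(pred) + np.prod(gt) - inter
    return float(1.0 - inter / union)

def orientation_error(pred_yaw, gt_yaw):
    # AOE: smallest absolute yaw difference, in radians.
    diff = (pred_yaw - gt_yaw + np.pi) % (2.0 * np.pi) - np.pi
    return abs(diff)

def velocity_error(pred_vel, gt_vel):
    # AVE: L2 velocity difference on the ground plane, in m/s.
    return float(np.linalg.norm(np.asarray(pred_vel[:2]) - np.asarray(gt_vel[:2])))
```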

Dataset Composition and Methodologies

Drive Planning and Car Setup

The dataset's acquisition involved careful drive planning to capture diverse and representative urban driving conditions. Two identically equipped Renault Zoe electric cars carrying the full AV sensor suite were deployed in Boston and Singapore. Localization accuracy was maintained within 10 cm using a Monte Carlo Localization scheme based on lidar and odometry information.
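The paper does not spell out the localization algorithm beyond naming Monte Carlo Localization, so the following is only a generic particle-filter sketch of that technique; the scan_match_score map-matching callable is a hypothetical placeholder, not a nuScenes component.

```python
import numpy as np

def mcl_step(particles, odom_delta, lidar_scan, scan_match_score,
             motion_noise=(0.02, 0.02, 0.005)):
    """One predict/update/resample step of a 2D particle filter (MCL).

    particles:        (N, 3) array of [x, y, yaw] pose hypotheses.
    odom_delta:       (dx, dy, dyaw) body-frame motion since the last step.
    scan_match_score: callable scoring how well `lidar_scan` fits the HD map
                      at a candidate pose -- a hypothetical placeholder here.
    """
    # Predict: rotate the body-frame increment into each particle's heading
    # and perturb with Gaussian motion noise.
    dx, dy, dyaw = odom_delta
    cos_t, sin_t = np.cos(particles[:, 2]), np.sin(particles[:, 2])
    particles = particles.copy()
    particles[:, 0] += cos_t * dx - sin_t * dy
    particles[:, 1] += sin_t * dx + cos_t * dy
    particles[:, 2] += dyaw
    particles += np.random.normal(0.0, motion_noise, size=particles.shape)

    # Update: weight each particle by its lidar-to-map agreement.
    weights = np.array([scan_match_score(p, lidar_scan) for p in particles])
    weights /= weights.sum()

    # Resample: keep hypotheses in proportion to their weights.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]

    # Pose estimate: mean of the resampled particle set.
    return particles, particles.mean(axis=0)
```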

Data Annotation and Privacy Protection

Annotation efforts were extensive, involving expert annotators to provide high-fidelity 3D bounding boxes and semantic map data. To protect privacy, automated blurring techniques were employed to obscure identifiable information such as faces and license plates captured in the images.
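The exact anonymization pipeline is not described in detail; the sketch below only illustrates the general detect-then-blur idea with OpenCV, using its stock face cascade purely as a stand-in detector (the detectors actually used for nuScenes are not public, and the file names are hypothetical).

```python
import cv2

def anonymize(image_bgr, boxes):
    """Blur rectangular regions (e.g., faces, license plates) in an image.

    boxes: iterable of (x, y, w, h) detections in pixel coordinates.
    """
    out = image_bgr.copy()
    for (x, y, w, h) in boxes:
        roi = out[y:y + h, x:x + w]
        # A large Gaussian kernel renders the region unrecognizable while
        # keeping overall image statistics intact.
        out[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return out

# Usage with OpenCV's stock frontal-face Haar cascade as a stand-in detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
img = cv2.imread('sample_camera_frame.jpg')  # hypothetical file name
faces = face_cascade.detectMultiScale(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
cv2.imwrite('sample_camera_frame_blurred.jpg', anonymize(img, faces))
```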

Experimental Setup and Baselines

The authors present several baseline models that leverage lidar and camera data for object detection and tracking. Among these, the PointPillars method demonstrated substantial gains from accumulating multiple lidar sweeps to densify its point clouds (see the sketch below). Another baseline, MonoDIS, uses monocular camera data and performs comparatively well on small objects such as bicycles and traffic cones.
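For the lidar baselines, the devkit provides a helper that aggregates past sweeps into the reference keyframe's coordinate frame. The snippet below is a minimal sketch of that accumulation, again assuming the v1.0-mini split at a local path; the choice of 10 sweeps mirrors the multi-sweep setting discussed in the paper.

```python
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes')

# Accumulate 10 lidar sweeps into the coordinate frame of the current
# keyframe; each point also carries a time-lag channel so a network can
# reason about motion when boxes are predicted from the densified cloud.
sample = nusc.get('sample', nusc.scene[0]['first_sample_token'])
point_cloud, times = LidarPointCloud.from_file_multisweep(
    nusc, sample, chan='LIDAR_TOP', ref_chan='LIDAR_TOP', nsweeps=10)

print(point_cloud.points.shape)  # (4, N): x, y, z, intensity
print(times.shape)               # (1, N): per-point time lag in seconds
```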

Results and Analysis

The paper provides an in-depth analysis of the baseline performance, highlighting critical insights:

  • Impact of Multimodal Data: Models benefit significantly from the rich multimodal data provided by nuScenes. Lidar-based detection methods were particularly effective in general object detection, whereas camera-based methods were better for smaller and finely detailed objects.
  • Data Augmentation and Pretraining: Pretraining on related datasets (e.g., KITTI) yielded only marginal improvements in final performance compared with training on nuScenes alone, underscoring the value of large-scale in-domain training data.
  • Novel Metrics: The newly proposed metrics, most notably the nuScenes detection score (NDS), offer a more comprehensive assessment by combining mean average precision with translation, scale, orientation, velocity, and attribute errors; a worked example of the score follows this list.
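Concretely, the NDS consolidates mean AP with the five mean true-positive errors as NDS = (1/10)[5·mAP + Σ(1 − min(1, mTP))]. The small sketch below computes that combination; the numbers in the usage example are hypothetical, not results from the paper.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (1/10) * [5 * mAP + sum over TP metrics of (1 - min(1, error))].

    tp_errors: dict with the five mean true-positive errors
               mATE, mASE, mAOE, mAVE, mAAE (all >= 0).
    """
    tp_scores = [1.0 - min(1.0, tp_errors[k])
                 for k in ('mATE', 'mASE', 'mAOE', 'mAVE', 'mAAE')]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# Usage with hypothetical numbers:
print(nuscenes_detection_score(
    0.30, {'mATE': 0.52, 'mASE': 0.29, 'mAOE': 0.45, 'mAVE': 0.63, 'mAAE': 0.35}))
```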

Implications and Future Directions

The introduction of the nuScenes dataset has substantial implications for the AV research community. It enables the development and rigorous evaluation of more sophisticated models capable of handling the complex and dynamic nature of urban environments. The dataset's rich annotations and diverse scenarios support advancements in multimodal sensor fusion and robust perception algorithms.

Looking forward, the paper hints at future expansion areas, such as adding image-level and point-level semantic labels and establishing a benchmark for trajectory prediction. These enhancements will further solidify nuScenes' role as a crucial resource for AV research, driving progress toward safer and more reliable autonomous driving systems.

Authors (10)
  1. Holger Caesar (31 papers)
  2. Varun Bankiti (3 papers)
  3. Alex H. Lang (8 papers)
  4. Sourabh Vora (7 papers)
  5. Venice Erin Liong (3 papers)
  6. Qiang Xu (129 papers)
  7. Anush Krishnan (3 papers)
  8. Yu Pan (154 papers)
  9. Giancarlo Baldan (1 paper)
  10. Oscar Beijbom (15 papers)
Citations (4,894)