nuImages: 2D Benchmark for Autonomous Driving
- nuImages is a large-scale 2D benchmark dataset for autonomous driving, featuring 93,000 keyframes with 13-frame clips from diverse urban settings.
- It provides dense annotations including 2D bounding boxes, polygon instance masks, and pixel-wise semantic segmentation for 23 object classes.
- The dataset’s temporal context and geographic diversity enable robust evaluation of object detection, instance segmentation, and OOD generalization.
nuImages is a large-scale, image-only benchmark dataset designed to advance research in 2D object detection, instance segmentation, and semantic segmentation for autonomous driving and related computer vision domains. As an extension of the nuScenes ecosystem, nuImages provides dense annotations, rich semantic diversity, and substantial geographic and environmental coverage, forming a foundation for robust perception model evaluation under realistic urban driving scenarios.
1. Dataset Composition, Annotations, and Acquisition
nuImages consists of 93,000 keyframes sampled from urban driving sequences, each keyframe accompanied by six preceding and six following frames at 2 Hz, resulting in short 13-frame video clips and approximately 1.2 million raw images. The data was recorded using six Basler acA1600-60gc global-shutter cameras mounted on two Renault Zoe vehicles operating in Boston and Singapore. Frame resolution is 1600×900 pixels (cropped from native 1600×1200). Camera field of view is 70° for frontal and side views and 110° for the rear camera (Fong et al., 2 Dec 2025).
Annotations are provided for the reference (middle) frame of each clip:
- 2D bounding boxes for all visible instances of 23 object classes (same taxonomy as nuScenes), including vehicles, pedestrians, bicycles, and more.
- Polygon instance masks precisely tracing each object silhouette.
- Per-pixel panoptic segmentation masks (covering foreground and background classes).
- No keypoints or 3D boxes are included; nuImages is strictly 2D-focused.
An explicit emphasis is placed on semantic diversity: collection routes span four map areas (Boston Seaport, Singapore One North, Queenstown, Holland Village), thus covering night/day, varying weather, left/right-hand driving, and rare object classes.
Annotation underwent a two-pass independent labeling procedure, with a third annotator resolving disagreements exceeding a 5 px IoU threshold. Automated consistency checks and targeted QA sweeps yielded inter-annotator agreement of 0.97 (box-IoU) and 0.93 (mask-IoU) on a held-out set, supporting benchmark reliability (Fong et al., 2 Dec 2025).
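For reference, the box-IoU agreement figure above is the standard intersection-over-union between two annotators' boxes for the same object. A minimal sketch, assuming axis-aligned boxes in `[x_min, y_min, x_max, y_max]` pixel format (the box encoding here is an illustrative assumption, not a statement about the annotation tooling):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as [x_min, y_min, x_max, y_max]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```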
2. Dataset Splits, Organization, and Access
nuImages does not ship with a fixed train/validation/test split. Instead, frame selection is driven 75% by active-learning (targeting uncertainty, rare cases, adverse conditions) and 25% by uniform random sampling for representativeness. Users are expected to partition the 93,000 annotated frames using stratification by camera viewpoint, weather, and class frequency; the MMDetection3D splits (~71k/11k/11k) are commonly adopted, though not canonical (Fong et al., 2 Dec 2025).
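As an illustration of the stratified partitioning described above, the following is a minimal sketch that groups frames by a metadata key and splits each stratum proportionally. The `camera` and `weather` field names and the split fractions are assumptions for illustration, not a canonical protocol:

```python
import random
from collections import defaultdict

def stratified_split(images, key=lambda im: (im['camera'], im['weather']),
                     fractions=(0.77, 0.115, 0.115), seed=0):
    """Partition per-frame metadata dicts into train/val/test within each stratum.

    The 'camera' and 'weather' fields are illustrative; substitute whatever
    metadata columns are available for stratification.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for im in images:
        strata[key(im)].append(im)

    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n = len(group)
        cut1 = int(fractions[0] * n)
        cut2 = cut1 + int(fractions[1] * n)
        train.extend(group[:cut1])
        val.extend(group[cut1:cut2])
        test.extend(group[cut2:])
    return train, val, test
```

Splitting within each stratum keeps rare conditions (e.g., night-time rain from a side camera) represented in every partition rather than concentrated in one split.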
File organization follows a COCO-compatible manifest:
- `images/CAM_<VIEW>/*.jpg` for raw frames
- `annotations/nuimages_{train,val,test}.json` for ground truth (with "images", "annotations", and "categories" fields)
- Supplementary `samples.csv` and `categories.csv` for quick metadata lookup
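Because the manifest follows COCO conventions, it can be consumed with plain `json`. The sketch below assumes the standard COCO record fields (`id`, `image_id`, `category_id`, `bbox`, `file_name`) in addition to the top-level keys listed above; adjust paths to your local extraction:

```python
import json
from collections import defaultdict

# Filename follows the layout listed above; adjust the path as needed.
with open('annotations/nuimages_train.json') as f:
    manifest = json.load(f)

categories = {c['id']: c['name'] for c in manifest['categories']}

# Group box/mask annotations by the image they belong to.
anns_by_image = defaultdict(list)
for ann in manifest['annotations']:
    anns_by_image[ann['image_id']].append(ann)

for img in manifest['images'][:5]:
    boxes = [(categories[a['category_id']], a['bbox'])
             for a in anns_by_image[img['id']]]
    print(img['file_name'], len(boxes), 'annotations')
```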
Access and integration are facilitated by the nuScenes devkit Python API:
```python
from nuimages import NuImages

# Load the training split; dataroot points to the extracted dataset.
nuim = NuImages(dataroot='/data/sets/nuimages', version='v1.0-train',
                verbose=True, lazy=True)

for sample in nuim.sample:  # one record per annotated keyframe
    key_frame = nuim.get('sample_data', sample['key_camera_token'])
    print(key_frame['filename'], key_frame['timestamp'])

    # 2D boxes and instance masks attached to this keyframe.
    anns = [a for a in nuim.object_ann
            if a['sample_data_token'] == sample['key_camera_token']]
```
3. Supported Tasks and Evaluation Metrics
nuImages was architected for three principal 2D vision tasks:
- Object Detection: single-threshold AP₅₀ and multi-threshold COCO-style AP (averaged over IoU thresholds 0.50:0.05:0.95); AP is computed from precision-recall curves under standard conventions.
- Instance Segmentation: Mask-AP at identical IoU thresholds; masks penalized for insufficient overlap with ground-truth polygons.
- Semantic Segmentation: Mean intersection-over-union (mIoU), formally:

$$\text{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$

where $P_c$ and $G_c$ denote the sets of predicted and ground-truth pixels for class $c$, and $C$ is the number of classes.
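A minimal NumPy sketch of this metric, assuming integer label maps of equal shape and an optional ignore index (the ignore value 255 is an illustrative assumption, not a nuImages convention):

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """mIoU between integer label maps `pred` and `gt` of identical shape."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:               # class absent in both prediction and GT; skip
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float('nan')
```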
No multi-object tracks or 3D tasks are supported natively in nuImages (Fong et al., 2 Dec 2025).
4. Benchmarking, Generalization, and Out-of-Distribution Studies
nuImages is utilized extensively both as an in-distribution and out-of-distribution (OOD) testbed:
OOD Generalization:
- EvCenterNet (Nallapareddy et al., 2023) uses the nuImages validation set (3,249 images) as a zero-shot benchmark: models trained on KITTI without seeing nuImages data are evaluated on the "car" and "pedestrian" classes (due to differing label semantics for "cyclist"). Frames are resized to 896×512, and no dataset-specific tuning is performed.
- Evaluation metrics include per-class AP at IoU=0.50, mean AP, Expected Calibration Error (ECE) for objectness, and Uncertainty Boundary Quality (UBQ) for bounding box variance. EvCenterNet achieves 46.5% (car) and 26.3% (pedestrian) AP, outperforming MC-Dropout, ensemble, and CertainNet baselines under domain shift. Their qualitative analysis demonstrates that predicted variance accurately correlates with object localization quality, even in OOD scenes.
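As a generic illustration of the calibration metric above, the sketch below computes ECE over detection confidences that have already been matched to ground truth. The equal-width binning and the IoU-based matching indicated in the docstring are assumptions for illustration, not the exact EvCenterNet protocol:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE over detection objectness scores.

    confidences: (N,) predicted objectness/confidence values in [0, 1].
    correct:     (N,) 1 if the detection matched ground truth (e.g., IoU >= 0.5), else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:                      # include exact zeros in the first bin
            mask |= confidences == 0.0
        if not mask.any():
            continue
        acc = correct[mask].mean()         # empirical precision in this bin
        conf = confidences[mask].mean()    # mean predicted confidence in this bin
        ece += mask.mean() * abs(acc - conf)
    return ece
```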
Self-Supervised, Zero-Label, and Feature Learning:
- Self-supervised representation learning via temporal ordering (TempO) leverages the nuImages 13-frame clip context for region-level proposal learning (Lang et al., 2023). Pretraining with a temporal ordering loss on native nuImages sequences yields up to a 4.3 AP gain over COCO-initialized baselines (Sparse R-CNN: 34.6 vs. 30.3 AP), with the greatest benefits for medium and large objects. This exploits nuImages' video context, whose density and annotation coverage are unusual among driving benchmarks (a toy sketch of an ordering-based objective follows this list).
- Label-free scene understanding (Chen et al., 2023) applies a Cross-modality Noisy Supervision pipeline with CLIP and SAM to nuImages, enabling semantic segmentation training without access to any ground-truth labels. Panoptic segmentations are supervised with CLIP-derived pseudo-labels refined by SAM. The CNS approach produces 22.1% mIoU (+3.5 points over prior CLIP2Scene), with the improvement attributed to feature-space consistency regularization and noisy pseudo-label handling.
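To make ordering-based pretraining concrete, here is a toy pairwise temporal-ordering objective over per-frame embeddings. The bilinear comparator `W` and the frame-level (rather than region-proposal-level) granularity are simplifications for illustration and do not reproduce TempO's actual formulation:

```python
import numpy as np

def pairwise_ordering_loss(frame_embeddings, W):
    """Toy pairwise temporal-ordering objective.

    frame_embeddings: (T, D) array, one embedding per frame of a clip.
    W:                (D, D) comparator weights (learnable in a real setup).
    For every pair (i, j) with i < j, the comparator should assign a high
    score to the correctly ordered pair.
    """
    T = frame_embeddings.shape[0]
    loss, pairs = 0.0, 0
    for i in range(T):
        for j in range(i + 1, T):
            s = frame_embeddings[i] @ W @ frame_embeddings[j]  # "i before j" score
            loss += np.log1p(np.exp(-s))                       # logistic loss
            pairs += 1
    return loss / pairs
```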
Efficient Out-of-Distribution Detection:
- β-VAE-based detectors (Ramakrishna et al., 2021) utilize nuImages with its per-image semantic metadata (weather, traffic density, pedestrian count) as a real-world test case for sequential OOD change-point detection. Martingale-based conformal prediction enables both anomaly flagging and feature-level attribution (e.g., identifying which latent dimension corresponds to pedestrian count shifts), achieving 92% recall on OOD events with latency dictated by martingale window size.
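A schematic sketch of a martingale test over a stream of nonconformity scores follows; the score function (e.g., a β-VAE reconstruction error), the p-value smoothing, and the choice of ε are illustrative assumptions rather than the exact detector of Ramakrishna et al.:

```python
import numpy as np

def power_martingale(stream_scores, calibration_scores, epsilon=0.92):
    """Schematic ICAD power martingale over a stream of nonconformity scores.

    calibration_scores: nonconformity scores (e.g., beta-VAE reconstruction
                        errors) from in-distribution data.
    stream_scores:      scores computed online on incoming frames.
    Returns the log-martingale trajectory; a sustained rise signals an OOD shift.
    """
    cal = np.sort(np.asarray(calibration_scores, dtype=float))
    log_m = 0.0
    trajectory = []
    for s in stream_scores:
        # Conformal p-value: fraction of calibration scores at least as extreme.
        p = (np.sum(cal >= s) + 1) / (len(cal) + 1)
        log_m += np.log(epsilon) + (epsilon - 1.0) * np.log(p)
        trajectory.append(log_m)
    return np.array(trajectory)
```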
Synthetic Data, Open-Vocabulary Testing, and Dataset Challenges:
- Synthetic data pipelines (Mütze et al., 30 Jun 2025) apply Stable Diffusion inpainting to front-camera nuImages frames, replacing native objects with out-of-context concepts (e.g., "walrus", "sofa") for diagnostic assessment of open-vocabulary detectors. This enables location-bias analyses, with findings that state-of-the-art models like Grounding DINO are more sensitive to spatial placement than to semantic novelty.
- GRAID (Elmaaroufi et al., 25 Oct 2025) systematically generates over 2.4M QA pairs on nuImages using geometric box relations, producing VQA datasets with 91% human-validated accuracy. nuImages annotations, with an average of 10–12 objects per image, facilitate spatial reasoning evaluation (e.g., left-of, right-of, size, appearance, ranking).
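A toy sketch of generating geometric QA pairs directly from 2D boxes is shown below; the question templates and the centre/area heuristics are illustrative assumptions and not GRAID's validated generation rules:

```python
def geometric_qa_from_boxes(objects):
    """Toy generation of left-of / larger-than questions from 2D boxes.

    objects: list of (label, [x_min, y_min, x_max, y_max]) in pixel coordinates.
    Returns (question, answer) pairs derived purely from box geometry.
    """
    qa = []
    for i, (name_a, box_a) in enumerate(objects):
        for name_b, box_b in objects[i + 1:]:
            if name_a == name_b:
                continue
            cx_a = (box_a[0] + box_a[2]) / 2.0
            cx_b = (box_b[0] + box_b[2]) / 2.0
            qa.append((f"Is the {name_a} to the left of the {name_b}?",
                       "yes" if cx_a < cx_b else "no"))
            area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
            area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
            qa.append((f"Does the {name_a} appear larger than the {name_b}?",
                       "yes" if area_a > area_b else "no"))
    return qa
```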
5. Distinctive Features, Impact, and Comparisons
nuImages advances the field beyond nuScenes and comparable benchmarks via several core innovations (Fong et al., 2 Dec 2025):
- Active-learning selection: 75% of frames are "hard" cases (rare classes, occlusions, adverse weather), maximizing the dataset's value for robustness benchmarking.
- Short video context: Each keyframe is part of a 13-frame window, supporting temporal modeling methods (object tracking, video segmentation).
- Dense mask coverage: Foreground instance segmentation and per-pixel semantic maps enable panoptic segmentation studies unrealized in prior driving datasets.
- High inter-annotator agreement: Extensive quality-control and multi-pass annotation ensure high reliability for supervised and self-supervised training alike.
- Geographic and environmental variation: 500+ logs from two continents allow cross-domain generalization experiments.
- Accessibility and integration: COCO-style formats and nuScenes-compatible SDKs ease direct adoption into standard vision pipelines.
A noteworthy implication is that while nuImages is strictly 2D, its temporal context and annotation density uniquely support both single-frame and video-based algorithmic advances, especially those requiring hard negatives and rare object inclusion.
6. Applications and Research Directions
nuImages supports a broad spectrum of research directions:
- Robust object detection, segmentation, and panoptic tasks: The dataset's annotation richness underpins work in semi-supervised learning, OOD detection, uncertainty quantification, open-vocabulary detection, and geometric VQA (Nallapareddy et al., 2023, Elmaaroufi et al., 25 Oct 2025).
- Spatial reasoning in vision-language modeling: GRAID's question/answer generation exploits nuImages for high-fidelity spatial VQA, a capability infeasible in sparser datasets (Elmaaroufi et al., 25 Oct 2025).
- Synthetic data generation and failure mode elicitation: nuImages enables controlled synthesis of rare or OOD objects for challenging detector evaluation (Mütze et al., 30 Jun 2025).
- Temporal and sequential learning: 13-frame clips across diverse settings support research in temporal feature learning, self-supervised clip ordering, and cross-frame context integration (Lang et al., 2023).
Empirically, models pre-trained or benchmarked on nuImages demonstrate superior generalization to urban domain shift, semantic edge cases, and location-dependent blind spots, driving both foundational and applied advancements in machine perception for automated driving.
Table: nuImages Dataset Properties and Supported Tasks
| Property | Value / Description | Reference |
|---|---|---|
| Annotated keyframes | 93,000 | (Fong et al., 2 Dec 2025) |
| Total frames (with clips) | ~1.2 million (13-frame clip per keyframe) | (Fong et al., 2 Dec 2025) |
| 2D object classes | 23 (matches nuScenes taxonomy) | (Fong et al., 2 Dec 2025) |
| Instance segmentation | Polygon masks for each visible object | (Fong et al., 2 Dec 2025) |
| Semantic segmentation | Full-scene pixel-wise annotation | (Fong et al., 2 Dec 2025) |
| Temporal context | 6 past / 6 future at 2 Hz per keyframe | (Fong et al., 2 Dec 2025) |
| Sensor config | 6 cameras, 1600×900 px, multi-view, day/night | (Fong et al., 2 Dec 2025) |
| Benchmark tasks | Detection, instance/semantic segmentation | (Fong et al., 2 Dec 2025) |
| OOD / zero-shot studies | Supported via external studies | (Nallapareddy et al., 2023; Ramakrishna et al., 2021) |
| COCO-compatibility | Yes (JSON manifests, evaluation metrics) | (Fong et al., 2 Dec 2025) |
nuImages thus occupies a central role in contemporary perception research, providing a testbed for diverse methodological advances and critical evaluation under both standard and adversarial conditions.