Occ3D-nuScenes: 3D Occupancy Benchmark

Updated 27 December 2025
  • Occ3D-nuScenes is a benchmark dataset that provides voxel-level, multimodal annotations and occlusion-aware 3D occupancy labels for autonomous driving.
  • It employs an automated label generation pipeline integrating LiDAR densification, occlusion reasoning, and image-guided voxel refinement to ensure high-quality, fine-grained annotations.
  • The dataset offers comprehensive metrics and baseline results, guiding the development of advanced scene-centric occupancy prediction models in urban environments.

Occ3D-nuScenes is a large-scale benchmark dataset for fine-grained, visibility-aware 3D occupancy prediction in autonomous driving, derived from the nuScenes multimodal urban driving corpus. It enables research on scene-centric perception by providing dense, auto-generated labels for both geometric occupancy and semantic content, addressing the need for granular, voxel-level 3D supervision. Occ3D-nuScenes extends the original nuScenes framework by introducing advanced label-generation pipelines that combine LiDAR, camera, and annotation information, resolving occlusion ambiguities and supporting supervised learning for voxel-based occupancy models (Tian et al., 2023).

1. Dataset Structure and Coverage

Occ3D-nuScenes consists of 900 scenes (600 train, 150 val, 150 test) with a total of 40,000 keyframes sampled at 2 Hz, matching the nuScenes keyframe annotation rate. Each keyframe includes:

  • Surround-view imagery from 6 cameras (raw resolution 1600×900 pixels; commonly padded to 928×1600 for network input).
  • Single-sweep (or multi-sweep aggregated) 360° LiDAR point clouds (32-beam, nuScenes standard).
  • Dense 3D occupancy grid: 200×200×16 voxels covering [x, y, z] ∈ [−40, 40] × [−40, 40] × [−1, 5.4] meters, voxel size 0.4 m.
  • Per-voxel semantic class label, occupancy state, LiDAR and camera visibility masks.
  • Sensor intrinsics/extrinsics, ego pose information, and scene/frame indices.

The dataset provides ground-truth voxel grids that densely annotate both static and dynamic parts of the scene, leveraging both multi-sweep fusion and annotation-driven label propagation (Tian et al., 2023).
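
For orientation, the snippet below sketches loading one keyframe annotation and recovering metric voxel centers. The file name and array keys ("semantics", "mask_lidar", "mask_camera") are assumptions consistent with the description above, not a guaranteed specification of the released format.

```python
import numpy as np

# Minimal sketch of reading one Occ3D-nuScenes keyframe annotation.
# The file name and key names ("semantics", "mask_lidar", "mask_camera")
# are assumptions based on the dense-array-plus-masks description above.
data = np.load("labels.npz")

semantics   = data["semantics"]        # (200, 200, 16) per-voxel class indices
mask_lidar  = data["mask_lidar"]       # (200, 200, 16), 1 = observed by LiDAR
mask_camera = data["mask_camera"]      # (200, 200, 16), 1 = observed by a camera

# Map grid indices back to metric voxel centers: the grid spans
# [-40, 40] x [-40, 40] x [-1, 5.4] m with a 0.4 m voxel size.
xs = -40.0 + 0.4 * (np.arange(200) + 0.5)
ys = -40.0 + 0.4 * (np.arange(200) + 0.5)
zs =  -1.0 + 0.4 * (np.arange(16) + 0.5)

print(semantics.shape, mask_lidar.mean(), mask_camera.mean())
```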

2. Automated Label Generation Pipeline

The Occ3D-nuScenes annotation pipeline proceeds in three algorithmic stages to address LiDAR sparsity, occlusion, and multiview consistency:

  1. Voxel Densification:
    • Raw LiDAR points are partitioned into static and dynamic subsets using the annotated 3D bounding boxes.
    • Multi-sweep aggregation is applied: static points are accumulated in global coordinates, while each dynamic object is fused in its own box-aligned local frame.
    • KNN label propagation and VDBFusion-based mesh reconstruction fill spatial holes.
    • Mesh sampling increases density; semantics are re-propagated by KNN (see pseudocode in (Tian et al., 2023)).
  2. Occlusion Reasoning:
    • For each keyframe, LiDAR-based ray casting identifies occluded, free, and observed voxels. For each LiDAR point p, the ray o → p from the sensor origin o traverses voxels in sequence, incrementing per-voxel occupancy/free counters (a simplified ray-traversal sketch follows this list).
    • Camera-based visibility is computed by casting rays from camera centers to voxel centers, only retaining as "visible" those voxels observed in both LiDAR and at least one camera view.
    • Only voxels observed by both modalities participate in subsequent training and validation.
  3. Image-Guided Voxel Refinement:
    • 2D semantic segmentation on the camera images (a per-pixel label L_x) is projected into the scene: along each pixel's camera ray, the semantic label is assigned to the first voxel whose label matches L_x, and all preceding voxels along the ray are forced to "free."
    • This step corrects over-propagated labels at 3D–2D boundaries, mitigating LiDAR shadow artifacts and yielding sharper object contours (Tian et al., 2023).

A dense and visibility-aware 3D occupancy annotation is thus achieved for all keyframes.
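
The occlusion-reasoning stage can be illustrated with a simplified LiDAR visibility pass over the grid defined in Section 1. The fixed-step ray sampling, function name, and synthetic points below are illustrative assumptions; the released pipeline may use a different traversal scheme, and camera visibility is handled analogously with rays from camera centers to voxel centers.

```python
import numpy as np

# Simplified LiDAR visibility pass over the Occ3D-nuScenes grid.
# Grid parameters follow Section 1; the fixed-step sampling is an
# illustrative stand-in for an exact voxel-traversal algorithm.
ORIGIN = np.array([-40.0, -40.0, -1.0])   # metric corner of the grid (ego frame)
VOXEL = 0.4                               # voxel edge length in meters
DIMS = np.array([200, 200, 16])           # grid dimensions

def lidar_visibility(sensor_origin, points, step=0.2):
    """Mark voxels along each sensor->point ray as free and each endpoint voxel as occupied."""
    free = np.zeros(DIMS, dtype=bool)
    occupied = np.zeros(DIMS, dtype=bool)
    for p in points:
        direction = p - sensor_origin
        length = np.linalg.norm(direction)
        direction = direction / length
        # Sample the ray, stopping short of the endpoint so the hit voxel stays occupied.
        for t in np.arange(0.0, max(length - VOXEL, 0.0), step):
            idx = np.floor((sensor_origin + t * direction - ORIGIN) / VOXEL).astype(int)
            if np.all(idx >= 0) and np.all(idx < DIMS):
                free[tuple(idx)] = True
        end = np.floor((p - ORIGIN) / VOXEL).astype(int)
        if np.all(end >= 0) and np.all(end < DIMS):
            occupied[tuple(end)] = True
    observed = free | occupied            # voxels touched by at least one ray
    return occupied, free, observed

# Example with two synthetic returns seen from a roof-mounted sensor at (0, 0, 1.8).
pts = np.array([[10.0, 5.0, 0.2], [-12.0, 3.0, 1.0]])
occ, free, obs = lidar_visibility(np.array([0.0, 0.0, 1.8]), pts)
```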

3. Semantic Taxonomy, Masking, and Annotation Format

The semantic label set comprises 16 in-vocabulary classes (barrier, bicycle, bus, car, construction vehicle, motorcycle, pedestrian, traffic cone, trailer, truck, drivable surface, other flat, sidewalk, terrain, manmade, vegetation) and one "General Objects (GO)" class for voxels associated with out-of-taxonomy objects. Each per-sample file contains:

  • 200×200×16 float32 occupancy and class arrays.
  • 3D binary masks for LiDAR and camera visibility (per-voxel, 1 = observed, 0 = unobserved).
  • All per-sample metadata (extrinsics, intrinsics, pose).

Train/eval protocols subsample only voxels where both modality masks are 1. All occluded/out-of-FOV voxels are set to state "unobserved" and ignored during training and metrics. No human manual labeling is employed; all labels derive from the algorithmic pipeline (Tian et al., 2023).
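
A minimal sketch of this masking protocol is shown below, assuming the same per-sample keys as above; the ignore value (255) is an illustrative choice, not the official specification.

```python
import numpy as np

# Sketch of the masking protocol: voxels not observed by *both* modalities
# are mapped to an ignore label so they contribute neither to the loss nor
# to the metrics. Key names and the ignore value are assumptions.
IGNORE = 255

data = np.load("labels.npz")
observed = data["mask_lidar"].astype(bool) & data["mask_camera"].astype(bool)

targets = data["semantics"].copy()
targets[~observed] = IGNORE   # occluded / out-of-FOV voxels are dropped

# A per-voxel classifier can then use e.g. CrossEntropyLoss(ignore_index=IGNORE).
```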

4. Benchmark Task Definition and Metrics

The canonical Occ3D-nuScenes task is semantic 3D occupancy prediction: given T past frames (surround images, LiDAR, calibrations), a model must infer, for every voxel v:

  • Occupancy state s(v) ∈ {free, occupied, unobserved}
  • Class label ℓ(v) ∈ {0, …, C−1, GO}

Evaluation is restricted to observed voxels. The principal metrics are:

  • Per-class Intersection-over-Union:

\mathrm{IoU}_k = \frac{|\mathrm{TP}_k|}{|\mathrm{TP}_k| + |\mathrm{FP}_k| + |\mathrm{FN}_k|}

  • Mean IoU (mIoU):

\mathrm{mIoU} = \frac{1}{C} \sum_{k=1}^{C} \mathrm{IoU}_k

  • Precision and recall are computed per class; F1 can be derived.

Scores are computed only over visible voxels; unobserved voxels contribute neither to the loss nor to the metrics. Official splits: 24,000 frames (train), 6,000 (val), 10,000 (test) (Tian et al., 2023).
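
A hedged sketch of this metric computation over observed voxels only (class indexing and the handling of empty classes are illustrative choices):

```python
import numpy as np

# Per-class IoU and mIoU computed only over voxels visible in both
# modality masks, following the protocol described above.
def occupancy_miou(pred, gt, observed, num_classes):
    """pred, gt: (200, 200, 16) integer class grids; observed: boolean mask of scored voxels."""
    pred, gt = pred[observed], gt[observed]
    ious = []
    for k in range(num_classes):
        tp = np.sum((pred == k) & (gt == k))
        fp = np.sum((pred == k) & (gt != k))
        fn = np.sum((pred != k) & (gt == k))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else np.nan)
    return np.array(ious), float(np.nanmean(ious))
```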

5. Baseline Results and Model Architectures

Baseline results as presented in (Tian et al., 2023):

| Method | mIoU | Car IoU | Pedestrian IoU |
|---|---|---|---|
| MonoScene | 6.06 | 9.38 | 3.01 |
| BEVDet | 19.38 | 34.47 | 10.36 |
| BEVFormer | 26.88 | 42.43 | 21.81 |
| TPVFormer | 27.83 | 45.90 | 18.85 |
| OccFormer | 21.93 | 39.17 | 17.22 |
| CTF-Occ | 28.53 | 42.24 | 22.72 |

CTF-Occ (Coarse-to-Fine Occupancy) uses a ResNet-101 image backbone with a multi-stage voxel encoder; deformable cross-attention fuses image features onto 3D voxel tokens, and a coarse-to-fine pyramid incrementally refines predictions. Only the top-k most uncertain voxels are propagated to finer stages for computational efficiency.

Losses include OHEM cross-entropy on semantic classes (over occupied voxels only) and a binary occupancy head trained with cross-entropy at every pyramid stage. The best result reported in the original paper is 28.53 mIoU on the camera-only benchmark (Tian et al., 2023). For multi-modal fusion settings (outside the camera-only evaluation), OccFusion reports robust results on the nuScenes-Occ3D split (Zhang et al., 8 Mar 2024).
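
A minimal sketch of how such top-k uncertainty selection could be implemented is shown below; the entropy criterion and the value of k are assumptions, not the CTF-Occ specification.

```python
import torch

# Top-k uncertain-voxel selection for coarse-to-fine refinement.
# Binary-occupancy entropy and a fixed k are illustrative assumptions.
def select_uncertain_voxels(occ_logits, k):
    """occ_logits: (N,) binary occupancy logits; returns indices of the k least certain voxels."""
    p = torch.sigmoid(occ_logits)
    entropy = -(p * torch.log(p + 1e-6) + (1 - p) * torch.log(1 - p + 1e-6))
    return torch.topk(entropy, k).indices

# Example: refine 20k of the 200 * 200 * 16 = 640k voxels at the next stage.
refine_idx = select_uncertain_voxels(torch.randn(200 * 200 * 16), k=20000)
```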

6. Data Accessibility, Tools, and Integration

The dataset, including generated annotation files, code, and tutorials, is hosted at https://tsinghua-mars-lab.github.io/Occ3D/ (Tian et al., 2023). Key integration aspects:

  • Per-sample format: dense arrays, masks, and metadata in standard binary/numpy container.
  • All camera and sensor calibrations are expressed in the ego-vehicle frame for direct transform chaining, closely matching the protocols of the nuScenes devkit (Caesar et al., 2019).
  • The upstream nuScenes toolkit provides functions for I/O, transformation, and visualization; Occ3D extensions build on these for voxel visualization and metric reporting.
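
As a small illustration of this transform chaining, the sketch below projects ego-frame points (e.g., voxel centers) into one camera; the matrix names cam_T_ego and K are placeholders rather than nuScenes devkit identifiers.

```python
import numpy as np

# Sketch of ego-frame -> camera projection using per-sample calibration.
# "cam_T_ego" (4x4 extrinsic) and "K" (3x3 intrinsic) are placeholder names,
# not nuScenes devkit identifiers; 1600x900 is the raw nuScenes image size.
def project_to_camera(points_ego, cam_T_ego, K, img_w=1600, img_h=900):
    pts_h = np.concatenate([points_ego, np.ones((len(points_ego), 1))], axis=1)
    pts_cam = (cam_T_ego @ pts_h.T)[:3]                 # ego frame -> camera frame
    in_front = pts_cam[2] > 0.1                         # keep points ahead of the camera
    uv = (K @ pts_cam)[:2] / pts_cam[2]                 # perspective projection to pixels
    in_image = (uv[0] >= 0) & (uv[0] < img_w) & (uv[1] >= 0) & (uv[1] < img_h)
    return uv.T, in_front & in_image
```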

7. Relationship to nuScenes and Broader Benchmarks

Occ3D-nuScenes is constructed atop the nuScenes multimodal driving dataset (Caesar et al., 2019), inheriting precise sensor extrinsics, timestamp synchronization, and detailed city-scale urban coverage. It differs fundamentally from traditional bounding-box-centric datasets by providing direct voxel-level ground truth. The name nuScenes-Occ3D is used interchangeably in recent literature (Zhang et al., 8 Mar 2024), while other nuScenes-derived occupancy datasets may use similar splits yet differ in annotation protocol and grid resolution.

The Occ3D labeling pipeline draws on advances in voxel densification and visibility-aware reasoning and complements object-centric 3D benchmarks by enabling fine-grained, compositional scene understanding. It defines a rigorous protocol for observed-vs-unobserved region scoping—a critical factor for honest evaluation under occlusion and limited FOV settings.

A plausible implication is that Occ3D-nuScenes will serve as a reference scene-centric occupancy benchmark for the field, supporting monocular, multi-view, and multi-sensor 3D perception research in autonomous vehicles (Tian et al., 2023, Zhang et al., 8 Mar 2024).
