Occ3D Benchmark Overview
- Occ3D Benchmark is a large-scale framework for 3D occupancy prediction that unifies grid definitions, annotation protocols, and evaluation metrics across urban datasets.
- It integrates multi-view camera and LiDAR inputs to support precise semantic volumetric scene understanding, with extensions to dynamic occupancy forecasting.
- The benchmark employs standardized splits and metrics such as mIoU and RayIoU, driving innovation in hybrid representations and robust performance assessment.
The Occ3D benchmark is a large-scale, visibility-aware framework for 3D occupancy prediction in autonomous driving, designed to evaluate semantic volumetric scene understanding from multi-view imagery. It unifies ground-truth curation, grid definitions, annotation protocols, evaluation metrics, and experimental splits across urban datasets, enabling rigorous assessment of camera-based, LiDAR-based, and multimodal approaches. Occ3D, in its nuScenes (Occ3D-nuScenes) and Waymo (Occ3D-Waymo) instantiations, has become the definitive standard for the structured evaluation and comparison of semantic 3D occupancy methods, serving as the reference benchmark for recent research on fine-grained volumetric scene perception, hybrid representations, and dynamic forecasting (Tian et al., 2023).
1. Dataset Construction, Structure, and Modalities
Occ3D-nuScenes and Occ3D-Waymo benchmarks are derived from nuScenes and Waymo Open, respectively. Each consists of hundreds of autonomous driving scenes captured with tightly synchronized 360° camera rigs and high-performance automotive LiDAR.
- nuScenes: 6 surround RGB cameras (70° FOV each, 110° for the rear camera), 32-beam 20 Hz LiDAR, 1000 scenes of 20 s each. Standard Occ3D split: 700 train / 150 validation / 150 test (Tian et al., 2023, Shi et al., 8 Jun 2025, Chen et al., 3 Jul 2025, Shi et al., 2024).
- Waymo: 5 cameras (Front, FL, FR, SL, SR), 20 Hz LiDAR, 798 train / 202 validation / 150 test scenes (Shi et al., 11 Jun 2025, Shi et al., 8 Jun 2025, Ye et al., 25 Jul 2025).
- Frames and grid: Each scene provides frames subsampled at 2 Hz, typically totaling 40 frames per scene. The canonical grid covers $[-40\,\text{m}, 40\,\text{m}]$ along $x$ and $y$ and $[-1\,\text{m}, 5.4\,\text{m}]$ along $z$ in the ego frame, discretized into $200 \times 200 \times 16$ voxels (0.4 m cell size).
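As a concrete illustration of this grid convention, the sketch below maps ego-frame points to voxel indices under the ranges above; the names (`GRID_MIN`, `points_to_voxel_indices`) are chosen here for illustration, not taken from the Occ3D toolkit.

```python
import numpy as np

# Canonical Occ3D grid (ego frame): x, y in [-40 m, 40 m], z in [-1 m, 5.4 m],
# 0.4 m cells -> a 200 x 200 x 16 voxel grid.
GRID_MIN = np.array([-40.0, -40.0, -1.0])
GRID_MAX = np.array([40.0, 40.0, 5.4])
VOXEL_SIZE = 0.4
GRID_SHAPE = np.round((GRID_MAX - GRID_MIN) / VOXEL_SIZE).astype(int)  # [200, 200, 16]

def points_to_voxel_indices(points: np.ndarray) -> np.ndarray:
    """Map (N, 3) ego-frame points to integer voxel indices, dropping out-of-range points."""
    idx = np.floor((points - GRID_MIN) / VOXEL_SIZE).astype(int)
    in_range = np.all((idx >= 0) & (idx < GRID_SHAPE), axis=1)
    return idx[in_range]

# Example: a point roughly 10 m ahead of the ego vehicle, near ground level.
print(points_to_voxel_indices(np.array([[10.2, 0.2, -0.5]])))  # [[125 100   1]]
```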
Modalities and Preprocessing
- Six (or five) surround RGB images are the default input; raw images are downsampled to reduce computation (e.g., to 704×256).
- For occupancy annotation, LiDAR points are aggregated, voxelized, and semantically labeled. “Occupied” voxels contain at least one LiDAR return in the aggregation window; “free” voxels are traversed by rays but contain no returns; “unobserved” voxels are neither hit nor observed from any view (Tian et al., 2023, Shi et al., 11 Jun 2025).
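A schematic of this three-way labeling, assuming a single sweep and fixed-step ray sampling in place of the benchmark's exact voxel traversal (all names here are illustrative):

```python
import numpy as np

UNOBSERVED, FREE, OCCUPIED = 0, 1, 2

def label_visibility(origin, returns, grid_min, voxel_size, shape, step=0.2):
    """Approximate occupied/free/unobserved states from one LiDAR sweep.

    origin: (3,) sensor position; returns: (N, 3) return points, both in the
    ego frame. Voxels traversed by a ray become FREE, voxels containing a
    return become OCCUPIED (returns take precedence), the rest stay UNOBSERVED.
    """
    state = np.full(shape, UNOBSERVED, dtype=np.uint8)

    def to_idx(p):
        idx = np.floor((p - grid_min) / voxel_size).astype(int)
        return tuple(idx) if np.all((idx >= 0) & (idx < shape)) else None

    # Pass 1: mark every voxel sampled along a ray as FREE.
    for end in returns:
        dist = np.linalg.norm(end - origin)
        if dist < 1e-6:
            continue
        for t in np.arange(0.0, dist, step):
            idx = to_idx(origin + (end - origin) * (t / dist))
            if idx is not None:
                state[idx] = FREE

    # Pass 2: return endpoints override traversal -> OCCUPIED.
    for end in returns:
        idx = to_idx(end)
        if idx is not None:
            state[idx] = OCCUPIED
    return state
```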
2. Semantic Labels, Annotation Pipeline, and Visibility
Semantic Classes
- nuScenes: 17 classes (vehicles, pedestrians, road, sidewalk, terrain, vegetation, manmade, etc.); Waymo: 15 (vehicle, pedestrian, bicyclist, sign, road, vegetation, etc.) (Tian et al., 2023, Shi et al., 11 Jun 2025, Shi et al., 8 Jun 2025).
- Occ3D harmonizes class definitions across datasets for consistent evaluation.
Label Generation
- The Occ3D pipeline (Tian et al., 2023):
- Voxel Densification: Static and dynamic points are aggregated over time, and mesh reconstruction densifies thin or sparsely observed structures.
- Occlusion Reasoning: LiDAR and camera-based visibility masks are computed by ray-casting.
- Image-Guided Voxel Refinement: Semantic 2D labels (from camera segmentation) are projected to refine 3D occupancy boundaries along viewing rays, disambiguating cases where LiDAR is sparse or noisy.
Visibility Constraints
- Evaluation is limited to “visible” regions: voxels observed by any camera or LiDAR. Ray-casting ensures that only voxels intersected along view rays are eligible for scoring; unviewed space is explicitly ignored (Tian et al., 2023, Shi et al., 11 Jun 2025, Shi et al., 2024).
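A frustum-only sketch of a camera visibility mask, projecting voxel centers through a pinhole model (the actual Occ3D masks additionally ray-cast to exclude voxels hidden behind observed surfaces; `T_cam_from_ego` and the other names are assumptions of this sketch):

```python
import numpy as np

def camera_visibility_mask(K, T_cam_from_ego, grid_min, voxel_size, shape, img_hw):
    """Mark voxels whose centers fall inside one camera's frustum.

    K: (3, 3) intrinsics; T_cam_from_ego: (4, 4) ego-to-camera transform;
    img_hw: (height, width). Returns a boolean grid of `shape`.
    """
    # Voxel centers in the ego frame.
    ii, jj, kk = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    centers = (np.stack([ii, jj, kk], axis=-1).reshape(-1, 3) + 0.5) \
              * voxel_size + grid_min
    # Transform into the camera frame and project.
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    cam = (T_cam_from_ego @ homo.T).T[:, :3]
    z = cam[:, 2]
    visible = np.zeros(len(centers), dtype=bool)
    front = z > 1e-6                           # only points in front of the camera
    uvw = (K @ cam[front].T).T
    u, v = uvw[:, 0] / z[front], uvw[:, 1] / z[front]
    h, w = img_hw
    visible[front] = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return visible.reshape(shape)
```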
3. Evaluation Metrics and Protocols
Intersection-over-Union (IoU)
- For each class $c$, $\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}$, computed over visibility-masked voxels.
- Mean IoU (mIoU): $\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c$; universally adopted for Occ3D (Tian et al., 2023, Shi et al., 11 Jun 2025, Shi et al., 8 Jun 2025, Shi et al., 2024, Chen et al., 3 Jul 2025).
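Under this definition, mIoU over visibility-masked voxels can be computed as in the following sketch (illustrative names; classes absent from both prediction and ground truth are skipped, one common convention):

```python
import numpy as np

def mean_iou(pred, gt, num_classes, visible):
    """mIoU over visibility-masked voxels.

    pred, gt: integer class grids of identical shape; visible: boolean grid
    of voxels eligible for scoring (the camera/LiDAR visibility mask).
    """
    p, g = pred[visible], gt[visible]
    ious = []
    for c in range(num_classes):
        tp = np.sum((p == c) & (g == c))
        fp = np.sum((p == c) & (g != c))
        fn = np.sum((p != c) & (g == c))
        if tp + fp + fn > 0:              # skip classes absent from both grids
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```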
Ray-based IoU (RayIoU)
- Measures class prediction accuracy along query rays: a ray cast into the predicted volume counts as a true positive when its first occupied voxel matches the ground-truth class and its depth error falls within a tolerance (typically averaged over 1 m, 2 m, and 4 m); IoU is then computed over these ray-level decisions.
- Emphasizes depth-resolved predictions, providing a complementary perspective to volume-centric mIoU (Chen et al., 3 Jul 2025, Shi et al., 11 Jun 2025).
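A simplified ray-level sketch, assuming first-surface depths and classes have already been extracted per query ray (the published metric additionally computes this per class and averages over the depth tolerances):

```python
import numpy as np

def ray_iou(pred_depth, pred_cls, gt_depth, gt_cls, tol=2.0):
    """Class-agnostic ray-level IoU at one depth tolerance.

    pred_depth, gt_depth: (R,) depth of the first occupied voxel along each
    ray (np.inf where the ray hits nothing); pred_cls, gt_cls: (R,) class at
    that voxel. A ray is a true positive when classes match and the depth
    error is within `tol` metres.
    """
    pred_hit, gt_hit = np.isfinite(pred_depth), np.isfinite(gt_depth)
    tp = pred_hit & gt_hit & (pred_cls == gt_cls) \
         & (np.abs(pred_depth - gt_depth) <= tol)
    inter = tp.sum()
    union = (pred_hit | gt_hit).sum()      # each mismatched ray counted once
    return inter / max(union, 1)
```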
Forecasting Metrics
- For forecasting scenarios, mIoU is reported for each prediction horizon (1 s, 2 s, 3 s); an arithmetic mean summarizes performance across time (Mohan et al., 8 Feb 2026).
Efficiency Metrics
- Inference runtime (ms per input, FPS) and GPU memory usage are consistently reported for competitive analysis (Shi et al., 8 Jun 2025, Chen et al., 3 Jul 2025, Shi et al., 2024).
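A minimal PyTorch timing harness of the kind used to produce such numbers (the `model`/`sample` arguments are placeholders, not a specific method's API; `sample` is assumed to already live on the GPU):

```python
import time
import torch

@torch.no_grad()
def benchmark(model, sample, warmup=10, iters=50):
    """Return (latency in ms, FPS, peak GPU memory in GiB) for one input."""
    model.eval().cuda()
    for _ in range(warmup):                    # warm up kernels and caches
        model(sample)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()                   # bracket GPU work for timing
    start = time.perf_counter()
    for _ in range(iters):
        model(sample)
    torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / iters * 1e3
    return latency_ms, 1e3 / latency_ms, torch.cuda.max_memory_allocated() / 2**30
```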
4. Benchmark Splits, Baseline Methods, and Results
| Dataset | Train | Val | Test | Grid (m) | Voxels | Classes |
|---|---|---|---|---|---|---|
| Occ3D-nuScenes | 700 | 150 | 150 | 80×80×6.4 | 200×200×16 | 17 |
| Occ3D-Waymo | 798 | 202 | 150 | 80×80×6.4 | 200×200×16 | 15 |
Baselines and Top Methods (Occ3D-nuScenes mIoU, val set)
| Method | mIoU |
|---|---|
| MonoScene | 6.06 |
| BEVDet | 19.38 |
| OPUS (1f) | 23.92 |
| BEVFormer | 26.88 |
| TPVFormer | 27.83 |
| CTF-Occ | 28.53 |
| FlashOcc | 29.79 |
| BePo | 32.77 |
| BEVFormer* (with mask) | 37.84 |
| ODG-L (8f) | 38.18 |
| OSP | 39.41 |
| FMOcc (2f) | 39.8 (47.9 with visible mask) |

FMOcc and OSP (point-based) currently achieve the leading mIoU (Tian et al., 2023, Shi et al., 11 Jun 2025, Shi et al., 2024, Shi et al., 8 Jun 2025, Chen et al., 3 Jul 2025).
Forecasting Performance
- ForecastOcc (3 horizons): 22.7, 19.3, and 17.0 mIoU at 1 s, 2 s, and 3 s (val set), outperforming Occ-in and shallow 2D baselines (Mohan et al., 8 Feb 2026).
Strongest camera-only approaches (FMOcc, ODG, OSP, BePo) exploit hybrid scene representations (grids, sparse points, Gaussian splats), advanced feature fusion (cross-attention; flow matching with selective state-space models), and robustness mechanisms (mask training).
5. Influence on Occupancy Networks and Hybrid Representations
The advent of the Occ3D benchmarks has led to methodologically diverse models:
- Volume-based: BEVFormer, TPVFormer, CTF-Occ, FMOcc, STCOcc. These operate on the dense 3D grid, leveraging BEV feature pooling, temporal fusion, or multi-view stereopsis.
- Point/Query-based: OSP recasts the task as a point-set classification problem, enabling flexible region-of-interest sampling and boundary extrapolation (Shi et al., 2024).
- Hybrid BEV + Sparse: BePo fuses the advantages of BEV for flat structures with sparse 3D queries for fine objects, via dual-branch cross-attention (Shi et al., 8 Jun 2025).
- Gaussian-based: ODG and GS-Occ3D employ sparse/hierarchical Gaussian parameterizations, enabling efficient, render-supervised, or vision-only label curation with high geometric fidelity (Shi et al., 11 Jun 2025, Ye et al., 25 Jul 2025).
Each of these paradigms is quantitatively evaluated and ablated on Occ3D, enabling insights on representation bias, efficiency trade-offs, and generalization.
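To make the point-query paradigm concrete, here is a minimal head in the spirit of OSP (a sketch under assumed shapes and names, not its published architecture): arbitrary 3D points are classified by trilinearly sampling a scene feature volume, which is what permits flexible region-of-interest sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointQueryHead(nn.Module):
    """Classify arbitrary 3D query points against a scene feature volume."""
    def __init__(self, feat_dim=64, num_classes=18):  # e.g. 17 semantics + free (assumed)
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_classes))

    def forward(self, volume, points):
        # volume: (B, C, Z, Y, X) scene features; points: (B, N, 3) in [-1, 1],
        # ordered (x, y, z) as grid_sample expects.
        grid = points.reshape(points.shape[0], 1, 1, -1, 3)
        feats = F.grid_sample(volume, grid, align_corners=False)  # (B, C, 1, 1, N)
        feats = feats.flatten(2).transpose(1, 2)                  # (B, N, C)
        return self.mlp(feats)                                    # per-point logits
```

Because queries are decoupled from the grid, such a head can restrict inference to a region of interest or densify sampling near object boundaries, rather than scoring every voxel.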
6. Extensions: Forecasting, Robustness, and Label Curation
Forecasting Benchmarking: Occ3D-nuScenes supports the first semantic occupancy forecasting protocols, spanning multiple time horizons and evaluating direct anticipation of future 3D semantic grids from sequences of images (Mohan et al., 8 Feb 2026).
Robustness Mechanisms: Protocols such as Mask Training (FMOcc) explicitly evaluate model robustness under heavy feature corruption, retaining mIoU above 35% even when 50% of features are dropped (Chen et al., 3 Jul 2025).
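A generic sketch of this style of feature masking during training (not the exact FMOcc mechanism; the `(B, N, C)` token layout is an assumption of this example):

```python
import torch

def mask_features(feats: torch.Tensor, drop_prob: float = 0.5) -> torch.Tensor:
    """Randomly zero a fraction of feature tokens; feats: (B, N, C)."""
    keep = torch.rand(feats.shape[:2], device=feats.device) >= drop_prob
    return feats * keep.unsqueeze(-1).to(feats.dtype)
```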
Vision-only Label Curation: GS-Occ3D demonstrates vision-only occupancy reconstruction by fitting an Octree-Gaussian Surfel field directly to multi-view images, providing competitive labels for downstream Occ3D evaluation—even matching or exceeding LiDAR-based rules in zero-shot transfer settings (Ye et al., 25 Jul 2025).
7. Current Limitations and Open Problems
- Unobserved/occluded space: All training and evaluation metrics are visibility-masked; hallucination of fully unobserved volumes is neither scored nor, in general, attempted (Tian et al., 2023, Shi et al., 2024).
- Single modality bias: Model accuracy is limited by RGB information coverage; LiDAR, radar, and thermal fusion remain open research frontiers (Chen et al., 3 Jul 2025, Tian et al., 2023).
- Dynamic occupancy: Current Occ3D benchmarks annotate only per-frame occupancy; explicit instance-level motion and dynamic occupancy grids are open research directions (Tian et al., 2023).
- Open-vocabulary semantics: The grouping of “general object” or unknown classes highlights the need for finer or open-set label spaces (Tian et al., 2023).
- Efficiency: Trade-offs among accuracy, compute, and memory are not fully resolved; point-based and hybrid designs offer scalable alternatives (Shi et al., 2024, Shi et al., 8 Jun 2025).
References
- Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving (Tian et al., 2023)
- FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model (Chen et al., 3 Jul 2025)
- ODG: Occupancy Prediction Using Dual Gaussians (Shi et al., 11 Jun 2025)
- ForecastOcc: Vision-based Semantic Occupancy Forecasting (Mohan et al., 8 Feb 2026)
- Occupancy as Set of Points (Shi et al., 2024)
- BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction (Shi et al., 8 Jun 2025)
- GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting (Ye et al., 25 Jul 2025)