Occ3D Benchmark Overview

Updated 31 March 2026
  • Occ3D Benchmark is a large-scale framework for 3D occupancy prediction that unifies grid definitions, annotation protocols, and evaluation metrics across urban datasets.
  • It integrates multi-view camera and LiDAR inputs to yield precise semantic volumetric scene understanding and forecast dynamic occupancy.
  • The benchmark employs standardized splits and metrics such as mIoU and RayIoU, driving innovation in hybrid representations and robust performance assessment.

The Occ3D benchmark is a large-scale, visibility-aware framework for 3D occupancy prediction in autonomous driving, designed to evaluate semantic volumetric scene understanding from multi-view imagery. It unifies ground-truth curation, grid definitions, annotation protocols, evaluation metrics, and experimental splits across urban datasets, enabling rigorous assessment of camera-based, LiDAR-based, and multimodal approaches. Occ3D, in its nuScenes (Occ3D-nuScenes) and Waymo (Occ3D-Waymo) instantiations, has become the definitive standard for the structured evaluation and comparison of semantic 3D occupancy methods, serving as the reference benchmark for recent research on fine-grained volumetric scene perception, hybrid representations, and dynamic forecasting (Tian et al., 2023).

1. Dataset Construction, Structure, and Modalities

Occ3D-nuScenes and Occ3D-Waymo benchmarks are derived from nuScenes and Waymo Open, respectively. Each consists of hundreds of autonomous driving scenes captured with tightly synchronized 360° camera rigs and high-performance automotive LiDAR.

  • nuScenes: 6 RGB cameras (60° each), 32-beam 20 Hz LiDAR, 1000 scenes of 20 s. Standard Occ3D split: 700 train / 150 validation / 150 test (Tian et al., 2023, Shi et al., 8 Jun 2025, Chen et al., 3 Jul 2025, Shi et al., 2024).
  • Waymo: 5 cameras (Front, FL, FR, SL, SR), 20 Hz LiDAR, 798 train / 202 validation / 150 test scenes (Shi et al., 11 Jun 2025, Shi et al., 8 Jun 2025, Ye et al., 25 Jul 2025).
  • Frames and grid: Each scene provides frames subsampled at 2 Hz, typically totaling 40 frames per scene. The canonical grid covers [-40, 40] × [-40, 40] × [-1, 5.4] m in x, y, z, discretized into 200 × 200 × 16 voxels (0.4 m cell size).
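The grid parameters above fully determine the mapping from metric ego-frame coordinates to voxel indices. A minimal sketch of that mapping (the function name is illustrative; bounds and cell size are taken from the benchmark specification above):

```python
import numpy as np

# Canonical Occ3D-nuScenes grid: x, y in [-40, 40] m, z in [-1, 5.4] m,
# 0.4 m voxels -> 200 x 200 x 16 cells.
GRID_MIN = np.array([-40.0, -40.0, -1.0])
GRID_MAX = np.array([40.0, 40.0, 5.4])
VOXEL_SIZE = 0.4

def point_to_voxel(points):
    """Map ego-frame points (N, 3) to integer voxel indices, dropping
    points that fall outside the canonical grid."""
    idx = np.floor((points - GRID_MIN) / VOXEL_SIZE).astype(np.int64)
    dims = np.round((GRID_MAX - GRID_MIN) / VOXEL_SIZE).astype(np.int64)  # [200, 200, 16]
    in_range = np.all((idx >= 0) & (idx < dims), axis=1)
    return idx[in_range]

pts = np.array([[0.0, 0.0, 0.0],      # ego origin -> voxel [100, 100, 2]
                [39.9, -39.9, 5.3],   # grid corner -> voxel [199, 0, 15]
                [50.0, 0.0, 0.0]])    # out of range, discarded
print(point_to_voxel(pts))
```

Note that the z-range of 6.4 m divided by the 0.4 m cell size is exactly what yields the 16 vertical voxels.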

Modalities and Preprocessing

  • Six (or five) surround RGB images are the default input; raw images are downsampled for computation (e.g., 704×256).
  • For occupancy annotation, LiDAR points are aggregated, voxelized, and semantically labeled. “Occupied” voxels contain at least one LiDAR return in the aggregation window; “free” voxels are traversed by rays but contain no returns; “unobserved” voxels are neither hit nor observed from any view (Tian et al., 2023, Shi et al., 11 Jun 2025).
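The occupied/free/unobserved trichotomy can be sketched as a toy ray-march over the voxel grid. This is an assumed simplification for illustration (half-voxel ray sampling instead of exact ray-casting; function and constant names are hypothetical), not the Occ3D annotation pipeline:

```python
import numpy as np

OCCUPIED, FREE, UNOBSERVED = 2, 1, 0

def label_visibility(points, origin, dims=(200, 200, 16),
                     grid_min=(-40.0, -40.0, -1.0), voxel=0.4):
    """Toy visibility labeling: voxels holding a LiDAR return -> OCCUPIED;
    voxels a ray traverses before its return -> FREE; all others stay
    UNOBSERVED."""
    grid_min = np.asarray(grid_min)
    state = np.full(dims, UNOBSERVED, dtype=np.int8)

    def to_idx(p):
        i = np.floor((p - grid_min) / voxel).astype(int)
        return i if np.all((i >= 0) & (i < dims)) else None

    for p in points:
        # March from the sensor origin toward the return at half-voxel steps.
        d = p - origin
        n = int(np.linalg.norm(d) / (voxel / 2))
        for t in np.linspace(0.0, 1.0, max(n, 2), endpoint=False)[1:]:
            i = to_idx(origin + t * d)
            if i is not None and state[tuple(i)] == UNOBSERVED:
                state[tuple(i)] = FREE
        i = to_idx(p)
        if i is not None:
            state[tuple(i)] = OCCUPIED  # return overrides any FREE marking
    return state

state = label_visibility(np.array([[10.0, 0.0, 0.0]]), np.array([0.0, 0.0, 1.5]))
print((state == OCCUPIED).sum(), (state == FREE).sum())
```

A single return therefore produces exactly one occupied voxel and a trail of free voxels along the ray; everything off the ray remains unobserved.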

2. Semantic Labels, Annotation Pipeline, and Visibility

Semantic Classes

  • nuScenes: 17 classes (vehicles, pedestrians, road, sidewalk, terrain, vegetation, manmade, etc.); Waymo: 15 (vehicle, pedestrian, bicyclist, sign, road, vegetation, etc.) (Tian et al., 2023, Shi et al., 11 Jun 2025, Shi et al., 8 Jun 2025).
  • Occ3D harmonizes class definitions across datasets for consistent evaluation.

Label Generation

  • The Occ3D pipeline (Tian et al., 2023):
    1. Voxel Densification: Static/dynamic points are aggregated over time, mesh reconstruction is performed to densify thin or sparsely observed structures.
    2. Occlusion Reasoning: LiDAR and camera-based visibility masks are computed by ray-casting.
    3. Image-Guided Voxel Refinement: Semantic 2D labels (from camera segmentation) are projected to refine 3D occupancy boundaries along viewing rays, disambiguating cases where LiDAR is sparse or noisy.

Visibility Constraints

  • Evaluation is limited to “visible” regions: voxels observed by any camera or LiDAR. Ray-casting ensures that only voxels intersected along view rays are eligible for scoring; unviewed space is explicitly ignored (Tian et al., 2023, Shi et al., 11 Jun 2025, Shi et al., 2024).

3. Evaluation Metrics and Protocols

Intersection-over-Union (IoU)

  • For each class c,

$$\mathrm{IoU}_c = \frac{\left|\{v : \hat{y}_v = c\} \cap \{v : y_v = c\}\right|}{\left|\{v : \hat{y}_v = c\} \cup \{v : y_v = c\}\right|}$$

where y_v and ŷ_v denote the ground-truth and predicted semantic labels of voxel v; mIoU averages IoU_c over all classes.
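The per-class IoU above, restricted to the visibility mask described earlier, can be computed directly; a minimal sketch (function name and argument layout are assumptions):

```python
import numpy as np

def per_class_iou(pred, gt, cls, mask=None):
    """IoU for one semantic class, optionally restricted to visible voxels
    via a boolean visibility `mask`, as in Occ3D's masked evaluation."""
    if mask is not None:
        pred, gt = pred[mask], gt[mask]
    inter = np.sum((pred == cls) & (gt == cls))
    union = np.sum((pred == cls) | (gt == cls))
    return inter / union if union > 0 else float("nan")

pred = np.array([1, 1, 2, 0])
gt = np.array([1, 2, 2, 0])
print(per_class_iou(pred, gt, 1))  # intersection 1, union 2 -> 0.5
```

mIoU is then the mean of `per_class_iou` over all class ids present in the label space.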

Ray-based IoU (RayIoU)

  • Measures class prediction accuracy along camera viewing rays:

$$\mathrm{RayIoU} = \frac{\sum_{r} \mathbf{1}[\hat{o}_r = o_r]}{\sum_{r} \mathbf{1}[o_r \neq \mathrm{free}]}$$

where o_r is the ground-truth occupancy class observed along ray r and ô_r the corresponding prediction.
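The ray-level score defined above reduces, in its simplest form, to the fraction of non-free ground-truth rays whose class is predicted correctly. A toy sketch of that simplified form (the full RayIoU protocol is more involved; the free-space label id is an assumption):

```python
import numpy as np

FREE = 0  # assumed label id for free space

def ray_iou(pred_rays, gt_rays):
    """Simplified ray-level accuracy: fraction of non-free ground-truth
    rays whose observed class is predicted correctly."""
    pred_rays = np.asarray(pred_rays)
    gt_rays = np.asarray(gt_rays)
    valid = gt_rays != FREE          # only rays that hit something count
    return (pred_rays[valid] == gt_rays[valid]).mean()

print(ray_iou([3, 3, 0, 5], [3, 4, 0, 5]))  # 2 of 3 non-free rays correct
```

Because rays follow camera viewing geometry, this metric penalizes thin-structure and depth errors more directly than voxel-level IoU.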

Forecasting Metrics

  • For forecasting scenarios, mIoU is reported for each prediction horizon (k = 1, 2, 3 s); the arithmetic mean summarizes across time (Mohan et al., 8 Feb 2026).
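As a concrete illustration of this aggregation, using the ForecastOcc horizon scores reported later in this article (the function name is illustrative):

```python
import numpy as np

def forecast_summary(miou_by_horizon):
    """Summarize forecasting results: sorted horizons plus the arithmetic
    mean used as the single summary number."""
    horizons = sorted(miou_by_horizon)
    mean = float(np.mean([miou_by_horizon[k] for k in horizons]))
    return horizons, mean

# ForecastOcc val-set mIoU at 1 s, 2 s, 3 s horizons.
_, avg = forecast_summary({1: 22.7, 2: 19.3, 3: 17.0})
print(round(avg, 2))  # prints 19.67
```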

Efficiency Metrics

  • Alongside accuracy, methods commonly report inference latency, memory footprint, and parameter count, reflecting the accuracy-efficiency trade-offs discussed in Section 7.

4. Benchmark Splits, Baseline Methods, and Results

Dataset          Train   Val   Test   Grid (m)      Voxels        Classes
Occ3D-nuScenes   700     150   150    80×80×6.4     200×200×16    17
Occ3D-Waymo      798     202   150    80×80×6.4     200×200×16    15

Baselines and Top Methods (Occ3D-nuScenes mIoU, val set)

Forecasting Performance

  • ForecastOcc (3 horizons): 22.7, 19.3, 17.0 mIoU at 1 s, 2 s, 3 s (val set), outperforming Occ-in and shallow 2D baselines (Mohan et al., 8 Feb 2026).

Strongest camera-only approaches (FMOcc, ODG, OSP, BePo) exploit hybrid scene representations (grid, sparse points, Gaussian splats), advanced feature fusion (cross-attention; flow matching selective state-space models), and robustness mechanisms (mask training).

5. Influence on Occupancy Networks and Hybrid Representations

The advent of the Occ3D benchmarks has spurred methodologically diverse models:

  • Volume-based: BEVFormer, TPVFormer, CTF-Occ, FMOcc, STCOcc. These operate on the dense 3D grid, leveraging BEV feature pooling, temporal fusion, or multi-view stereopsis.
  • Point/Query-based: OSP recasts the task as a point-set classification problem, enabling flexible region-of-interest sampling and boundary extrapolation (Shi et al., 2024).
  • Hybrid BEV + Sparse: BePo fuses the advantages of BEV for flat structures with sparse 3D queries for fine objects, via dual-branch cross-attention (Shi et al., 8 Jun 2025).
  • Gaussian-based: ODG and GS-Occ3D employ sparse/hierarchical Gaussian parameterizations, enabling efficient, render-supervised, or vision-only label curation with high geometric fidelity (Shi et al., 11 Jun 2025, Ye et al., 25 Jul 2025).

Each of these paradigms is quantitatively evaluated and ablated on Occ3D, enabling insights on representation bias, efficiency trade-offs, and generalization.

6. Extensions: Forecasting, Robustness, and Label Curation

Forecasting Benchmarking: Occ3D-nuScenes supports the first semantic occupancy forecasting protocols, spanning multiple time horizons and evaluating direct anticipation of future 3D semantic grids from sequences of images (Mohan et al., 8 Feb 2026).

Robustness Mechanisms: Protocols such as Mask Training (FMOcc) explicitly evaluate model robustness under heavy feature corruption, with strong preservation of mIoU (>35% retained at 50% feature loss) (Chen et al., 3 Jul 2025).
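Such a corruption protocol can be sketched as random feature dropout. The function name, masking scheme, and ratio below are illustrative assumptions, not the FMOcc implementation:

```python
import numpy as np

def mask_features(feat, drop_ratio=0.5, rng=None):
    """Illustrative feature-masking corruption: zero out a random fraction
    of feature cells to simulate heavy feature loss at evaluation time."""
    rng = np.random.default_rng(rng)
    keep = rng.random(feat.shape) >= drop_ratio
    return feat * keep

feat = np.ones((4, 200, 200))               # toy BEV feature map
corrupted = mask_features(feat, 0.5, rng=0)
print(f"{(corrupted == 0).mean():.0%} of features dropped")  # ~50%
```

A robust model is then scored on mIoU degradation as `drop_ratio` increases.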

Vision-only Label Curation: GS-Occ3D demonstrates vision-only occupancy reconstruction by fitting an Octree-Gaussian Surfel field directly to multi-view images, providing competitive labels for downstream Occ3D evaluation, even matching or exceeding LiDAR-based pipelines in zero-shot transfer settings (Ye et al., 25 Jul 2025).

7. Current Limitations and Open Problems

  • Unobserved/occluded space: All train/eval metrics are visibility-masked; hallucination of fully unobserved volumes is not scored nor, in general, attempted (Tian et al., 2023, Shi et al., 2024).
  • Single modality bias: Model accuracy is limited by RGB information coverage; LiDAR, radar, and thermal fusion remain open research frontiers (Chen et al., 3 Jul 2025, Tian et al., 2023).
  • Dynamic occupancy: Current Occ3D benchmarks annotate only per-frame occupancy; explicit instance-level motion and dynamic occupancy grids are open research directions (Tian et al., 2023).
  • Open-vocabulary semantics: The grouping of “general object” or unknown classes highlights the need for finer or open-set label spaces (Tian et al., 2023).
  • Efficiency: Trade-offs among accuracy, compute, and memory are not fully resolved; point-based and hybrid designs offer scalable alternatives (Shi et al., 2024, Shi et al., 8 Jun 2025).

References
