Occ3D-Waymo: 3D Occupancy Benchmark
- Occ3D-Waymo is a large-scale, high-fidelity benchmark providing voxel-level occupancy annotations and unified evaluation protocols for dynamic driving scenes.
- It employs both LiDAR-supervised and fully vision-only annotation pipelines, enabling dense, semantically-rich occupancy labels across urban and suburban environments.
- Benchmark results from models like CTF-Occ and BEVFormer-Fusion demonstrate practical insights into sensor fusion, zero-shot transfer, and scalable annotation strategies.
Occ3D-Waymo is a large-scale, 3D occupancy prediction benchmark for autonomous driving, based on the Waymo Open Dataset. It provides dense, high-fidelity visibility-aware occupancy annotations suitable for training and evaluating perception models over complex, real-world urban and suburban scenarios. The benchmark includes multiple label regimes—most notably, LiDAR-supervised and recent fully vision-only alternatives—supporting both semantic and binary occupancy tasks at scale. Given its size, sensor coverage, annotation protocol, and downstream model benchmarks, Occ3D-Waymo is a central resource for research on voxel-level 3D understanding in dynamic, real-world driving environments (Tian et al., 2023, Ye et al., 25 Jul 2025).
1. Dataset Origin and Scope
Occ3D-Waymo derives from the Waymo Open Dataset, which comprises 1,150 driving scenes collected in San Francisco, Mountain View, and Phoenix (76 km² effective coverage after pose dilation). Each scene spans 20 seconds of synchronized 5-camera and 5-LiDAR data at 10 Hz, or about 200 frames (Sun et al., 2019, Tian et al., 2023). For occupancy prediction, Occ3D-Waymo uses 798 training and 202 validation scenes in the original LiDAR-supervised version; later variations such as GS-Occ3D filter out ego-static scenes to yield 637 training and 165 validation scenes (Ye et al., 25 Jul 2025). The original split totals approximately 200,000 frames.
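The frame totals follow from the sampling rate and scene length quoted above. The short check below is only illustrative arithmetic on those figures; the variable names are chosen here for readability and do not come from any official tooling.

```python
# Frame-count arithmetic using only figures quoted in the text.
FRAME_RATE_HZ = 10        # sensor sampling rate
SCENE_DURATION_S = 20     # length of one scene
TRAIN_SCENES, VAL_SCENES = 798, 202  # original LiDAR-supervised split

frames_per_scene = FRAME_RATE_HZ * SCENE_DURATION_S
labeled_scenes = TRAIN_SCENES + VAL_SCENES
total_frames = frames_per_scene * labeled_scenes

print(frames_per_scene, labeled_scenes, total_frames)  # 200 1000 200000
```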
A central distinction of Occ3D-Waymo is its voxel-level annotation covering diverse urban morphologies (downtown, highway, overpass, tunnel). It supports both fine-grained semantic occupancy (multi-class) and binary (occupied/free/unobserved) occupancy paradigms, enabling transparent benchmarking of both generic and tailor-made occupancy inference models.
2. Annotation Pipelines and Label Types
2.1 LiDAR-supervised Pipeline
The original Occ3D-Waymo annotation pipeline is semi-automatic:
- Voxel Densification: Static points (ground, structures) and dynamic object points (vehicles, pedestrians, etc.) are segmented, aggregated across time (in global and box coordinate frames, respectively), and densified through mesh-based hole filling (Tian et al., 2023).
- Label Propagation: Semantic labels, which are available only at the 2 Hz Waymo keyframes, are extended to non-key frames via k-nearest neighbor voting in point cloud space, ensuring all dense points are labeled.
- Occlusion Reasoning: Visibility masks are computed via ray casting. Each voxel traversed by a sensor ray (LiDAR or camera) is labeled as "free" if passed before any return, "occupied" at the first hit, and "unobserved" otherwise.
- Image-guided Refinement: Voxels at the scene boundary are further refined using 2D semantic masks, aligning 3D label boundaries with image-space instance segmentation via ray casting.
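The occlusion-reasoning step can be illustrated with a minimal NumPy sketch. It is not the benchmark's implementation: it approximates voxel traversal by sampling each ray at half-voxel steps rather than using an exact traversal algorithm, handles a single sensor origin, and the function name `label_visibility` and its parameters are hypothetical.

```python
import numpy as np

FREE, OCCUPIED, UNOBSERVED = 0, 1, 2

def label_visibility(origin, hits, grid_min, voxel_size, grid_shape):
    """Label voxels as free/occupied/unobserved from one sensor origin.

    Assumption: rays are sampled at half-voxel steps (approximate traversal).
    """
    labels = np.full(grid_shape, UNOBSERVED, dtype=np.uint8)
    origin = np.asarray(origin, dtype=float)
    hits = np.asarray(hits, dtype=float).reshape(-1, 3)
    grid_min = np.asarray(grid_min, dtype=float)
    shape = np.asarray(grid_shape)

    def to_index(points):
        return np.floor((points - grid_min) / voxel_size).astype(int)

    def in_bounds(idx):
        return np.all((idx >= 0) & (idx < shape), axis=-1)

    hit_idx = to_index(hits)

    # Pass 1: voxels crossed before the return are free.
    for hit, h_idx in zip(hits, hit_idx):
        length = np.linalg.norm(hit - origin)
        n_steps = max(int(np.ceil(length / (0.5 * voxel_size))), 1)
        t = np.linspace(0.0, 1.0, n_steps, endpoint=False)
        idx = to_index(origin + t[:, None] * (hit - origin))
        # Drop out-of-grid samples and samples inside the hit voxel itself.
        keep = in_bounds(idx) & np.any(idx != h_idx, axis=1)
        idx = idx[keep]
        labels[idx[:, 0], idx[:, 1], idx[:, 2]] = FREE

    # Pass 2: voxels containing a return are occupied (overrides free).
    valid = hit_idx[in_bounds(hit_idx)]
    labels[valid[:, 0], valid[:, 1], valid[:, 2]] = OCCUPIED
    return labels

# One ray along +x through a 6 x 2 x 1 grid of 1 m voxels.
labels = label_visibility(
    origin=(0.5, 0.5, 0.5),
    hits=[(4.5, 0.5, 0.5)],
    grid_min=(0.0, 0.0, 0.0),
    voxel_size=1.0,
    grid_shape=(6, 2, 1),
)
print(labels[:, 0, 0].tolist())  # [0, 0, 0, 0, 1, 2]
```

Writing occupied voxels in a second pass means a return always wins over a free vote from another ray, which is one simple way to resolve conflicts; voxels behind the first hit and voxels no ray reaches remain unobserved.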
2.2 Vision-only Label Generation (GS-Occ3D)
A recent branch uses a fully vision-only occupancy labeling protocol (Ye et al., 25 Jul 2025):
- Camera-based Scene Reconstruction: Structure-from-Motion and multi-view stereo pipelines (e.g., LoFTR + SfM) estimate camera intrinsics/extrinsics and a sparse 3D scene skeleton using only multi-camera RGB streams.
- Octree-based Gaussian Surfel Representation: The reconstructed point cloud is adaptively voxelized via an octree. Each voxel may host anisotropic 3D Gaussian "surfels" encoding color, opacity, position, and covariance. Scene depth spread determines octree depth, and surfels are optimized under geometry and appearance losses.
- Occupancy Label Extraction: Following surfel-based geometry optimization, a voxel is marked as occupied if it contains any surfel. Camera rays are again cast through every occupied voxel centroid; the first intersection is "observed occupied," subsequent are "occluded," and untouched voxels are "unobserved."
- Scene Decomposition: The protocol supports decomposing data into static background, ground plane surfaces (via explicit initialization and planar smoothness losses), and dynamic objects (vehicles tracked in image space and reconstructed within dynamic boxes).
This approach enables crowd-scalable, densely labeled occupancy grids without reliance on active depth sensors or ego-motion priors, demonstrating superior cross-domain generalization (Ye et al., 25 Jul 2025).
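The rule that a voxel is occupied if it contains any surfel can be sketched as below. This is a simplified illustration, not the GS-Occ3D code: it uses only surfel centers and ignores each surfel's covariance and opacity, and the function name `surfels_to_occupancy` and the grid parameters are assumptions made for the example.

```python
import numpy as np

def surfels_to_occupancy(centers, grid_min, voxel_size, grid_shape):
    """Mark a voxel occupied if at least one surfel center falls inside it.

    Assumption: surfel extent (covariance) and opacity are ignored.
    """
    centers = np.asarray(centers, dtype=float).reshape(-1, 3)
    grid_min = np.asarray(grid_min, dtype=float)
    idx = np.floor((centers - grid_min) / voxel_size).astype(int)
    # Discard surfels that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[inside]
    occupancy = np.zeros(grid_shape, dtype=bool)
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occupancy

# Hypothetical surfel centers: two share a voxel, one lies outside the grid.
centers = [
    (0.1, 0.1, 0.1),
    (0.3, 0.2, 0.1),
    (1.0, 0.5, 0.1),
    (9.0, 9.0, 9.0),
]
occ = surfels_to_occupancy(centers, grid_min=(0, 0, 0), voxel_size=0.4, grid_shape=(4, 4, 2))
print(int(occ.sum()))  # 2
```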
3. Dataset Structure, Semantic Taxonomy, and File Formats
3.1 Voxel Gridding and Coverage
- Spatial Range: Coverage is ego-centric: […] in LiDAR-based pipelines (Tian et al., 2023); typical vision-only grids cover […] at […] resolution (Ye et al., 25 Jul 2025).
- Resolution: LiDAR-based labels use […] resolution, producing […] voxels/scene. Vision-only pipelines output […] grids per frame.
- Label Values: Each voxel is classified as "free" (0), "occupied" (1), or "unobserved" (2); semantic pipelines additionally assign class IDs (15-class taxonomy).
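As a small illustration of the three-state encoding, the snippet below counts the states in a toy grid and derives an observed mask. The grid contents are invented for the example, and the constant names are not taken from the released data format.

```python
import numpy as np

FREE, OCCUPIED, UNOBSERVED = 0, 1, 2  # values as listed in the text

# Toy 2 x 3 state grid (real grids are 3D; the logic is shape-agnostic).
state = np.array(
    [[FREE, FREE, OCCUPIED],
     [UNOBSERVED, OCCUPIED, UNOBSERVED]],
    dtype=np.uint8,
)

counts = np.bincount(state.ravel(), minlength=3)       # voxels per state
observed = state != UNOBSERVED                         # mask of scored voxels
occupied_ratio = (state[observed] == OCCUPIED).mean()  # occupied share of observed

print(counts.tolist(), int(observed.sum()), float(occupied_ratio))  # [2, 2, 2] 4 0.5
```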
3.2 Semantic Categories
- The semantic taxonomy adopted for multi-class occupancy includes 14 known classes (vehicle, bicyclist, pedestrian, sign, traffic_light, pole, construction_cone, bicycle, motorcycle, building, vegetation, tree_trunk, road, sidewalk) and a "General Object" (GO) catch-all, for 15 classes in total (Tian et al., 2023). This facilitates evaluation of known and out-of-vocabulary object segmentation.
3.3 File Organization
File structure is scene-oriented, with each scene directory comprising per-camera/model metadata, occupancy grids, surfel descriptors, and per-frame dynamic object boxes. Formats include .npz for camera calibration, .bin for occupancy tensor, .pth for surfel parameters, and .json for pose and object annotations (Ye et al., 25 Jul 2025).
4. Evaluation Protocols and Metrics
Evaluation on Occ3D-Waymo is standardized around visibility-limited voxels—i.e., voxels "observed" by both LiDAR and camera, as determined by the visibility masks. The primary metrics are:
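- (Mean) Intersection-over-Union (IoU): For multi-class evaluation, per-class $\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}$, with mean IoU (mIoU) averaged across the $C$ classes ($\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c$).
- F1 Score, Precision, Recall: In binary occupancy evaluation, $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, and $F_1 = \frac{2PR}{P + R}$.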
- Voxel Accuracy (optional): Fraction of correctly labeled voxels among observed voxels.
Only voxels within the evaluation region-of-interest and marked "observed" by both sensor modalities are included in scoring, excluding unobserved or out-of-bounds voxels (Tian et al., 2023, Ye et al., 25 Jul 2025).
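A minimal sketch of visibility-masked scoring follows, using the standard definitions of IoU, precision, recall, and F1. It is not the official evaluation code: the function names are hypothetical, and skipping classes that appear in neither prediction nor ground truth is an assumption that the official scripts may handle differently.

```python
import numpy as np

def masked_miou(pred, gt, mask, num_classes):
    """Mean IoU over classes, computed only on voxels where mask is True."""
    pred, gt = pred[mask], gt[mask]
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:  # assumption: skip classes absent from both pred and gt
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else float("nan")

def masked_binary_prf(pred_occ, gt_occ, mask):
    """Precision, recall, and F1 for binary occupancy on masked voxels."""
    p = pred_occ[mask].astype(bool)
    g = gt_occ[mask].astype(bool)
    tp = np.sum(p & g)
    fp = np.sum(p & ~g)
    fn = np.sum(~p & g)
    precision = float(tp / (tp + fp)) if tp + fp else 0.0
    recall = float(tp / (tp + fn)) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy flattened grids; the last voxel is unobserved and therefore excluded.
gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
mask = np.array([True, True, True, True, True, False])

miou = masked_miou(pred, gt, mask, num_classes=3)
precision, recall, f1 = masked_binary_prf(pred != 0, gt != 0, mask)
print(f"{miou:.4f} {precision:.2f} {recall:.2f} {f1:.4f}")  # 0.7222 0.75 1.00 0.8571
```

Both functions accept arrays of any shape as long as `mask` is a boolean array of the same shape, so the same code applies to full 3D grids.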
5. Baseline Models and Benchmark Results
Multiple detection and occupancy reasoning models have been benchmarked on Occ3D-Waymo:
- CTF-Occ (Coarse-to-Fine Occupancy Network): Uses a ResNet-101 backbone with token selection and 3D spatial attention, achieving the highest image-only mIoU of 18.73% (Tian et al., 2023).
- BEVDet, TPVFormer, BEVFormer: These image-only baselines yield lower mIoUs (up to 16.76%).
- LiDAR-Only and Multi-Sensor Fusion: LiDAR-only baselines reach an mIoU of 29.74%, and fusion (BEVFormer-Fusion) reaches 39.05% (at 0.4 m voxel size).
- CVT-Occ: Tested with both LiDAR and vision-only labels on the binary regime, CVT-Occ achieves IoU=44.7 (vision-only labels, Occ3D-Waymo val) and IoU=57.4 (LiDAR-based, Occ3D-Waymo val) (Ye et al., 25 Jul 2025).
Zero-shot transfer to Occ3D-nuScenes demonstrates that vision-only supervision can match or exceed LiDAR-trained models in cross-dataset generalization (e.g., vision-only CVT-Occ: IoU=33.4 on nuScenes vs. LiDAR-trained: IoU=31.4).
A summary of the Occ3D-Waymo multi-class mIoU results:
| Method | mIoU (%) | Notes |
|---|---|---|
| CTF-Occ | 18.73 | Highest image-only baseline |
| BEVFormer | 16.76 | Image-only |
| LiDAR-Only | 29.74 | Best single-modality result |
| BEVFormer-Fusion | 39.05 | Multi-sensor |
6. Significance, Implications, and Research Directions
Occ3D-Waymo advances 3D driving scene understanding through (a) unified evaluation protocols, (b) large-scale, fine-grained occupancy labels, and (c) baseline results across sensor modalities and methods. The availability of both LiDAR-supervised and vision-only label regimes facilitates analysis of supervision cost, scalability, and transferability. The dataset demonstrates that vision-only occupancy labels can enable strong zero-shot transfer and scalable, crowdsourced annotation pipelines, a potential direction for next-generation autonomy datasets (Ye et al., 25 Jul 2025).
This suggests that scalable, camera-based 3D perception for autonomous vehicles may become feasible as vision-based label generation further matures. However, LiDAR- and fusion-based methods continue to define the upper bounds for in-domain occupancy estimation performance on Waymo scenes.
7. Data Access and Resources
All code, data, and benchmarks associated with Occ3D-Waymo are publicly available. The Occ3D benchmark suite (including label generation, evaluation code, and reference models) is hosted at https://tsinghua-mars-lab.github.io/Occ3D/ (Tian et al., 2023). GS-Occ3D vision-only occupancy labels, code, and model checkpoints are available at https://gs-occ3d.github.io/ (Ye et al., 25 Jul 2025). The underlying Waymo Open Dataset is released at http://www.waymo.com/open (Sun et al., 2019).
Researchers should consult these resources for the latest updates on splits, annotation schemas, and official leaderboard results.