OpenOccupancy: 3D Semantic Benchmark & Modeling

Updated 23 June 2026

OpenOccupancy is a 3D semantic occupancy paradigm that predicts voxel occupancy and categories using densely annotated sensor data.
It employs multi-modal sensor fusion and novel evaluation protocols to enhance scene understanding in autonomous driving and embodied AI.
The framework integrates efficient grid annotation and decoding techniques, enabling real-time inference and robust cross-domain performance.

OpenOccupancy denotes both a family of dense 3D occupancy perception benchmarks and a broader modeling paradigm for semantic occupancy prediction in environments such as autonomous driving and embodied AI. The term typically refers to the task of predicting, for each voxel in a large-scale 3D scene, whether it is occupied and, if so, its semantic category, using high-dimensional sensor data (camera images, LiDAR point clouds, or both). OpenOccupancy datasets are distinguished by comprehensive 360° scene coverage, fine-grained and densely annotated ground truth, and evaluation protocols that support both standard and open-vocabulary semantic queries. The OpenOccupancy research lineage combines large-scale benchmarks, algorithmic innovations in multi-modal and open-vocabulary occupancy estimation, and applications in indoor and outdoor 3D scene understanding.

1. Benchmark Construction and Datasets

OpenOccupancy benchmark datasets are built upon large-scale sensor suites, most notably nuScenes, which provides synchronized 360° camera imagery and multi-sweep LiDAR. The canonical OpenOccupancy benchmark (Wang et al., 2023) significantly augments the nuScenes dataset by introducing:

Dense semantic occupancy labels: Each voxel in a $512\times512\times40$ grid spanning $[-51.2, +51.2]^2 \times [-3, +5]$ m (voxel size = 0.2 m) is densely annotated as either empty or belonging to one of 17 semantic categories.
Augmenting And Purifying (AAP) pipeline: To overcome the sparsity inherent in direct LiDAR point voxelization, OpenOccupancy employs a hybrid pipeline. LiDAR points are superimposed and voxelized to form an initial grid, which is further densified using a multi-modal occupancy prediction model. Human annotators then verify and purify the grid, yielding up to 2× annotation density versus prior work.
Multi-modality: Input modalities include single or multi-sweep LiDAR, monocular or surround-vision cameras, and optional depth sensors.
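
The grid geometry above can be made concrete with a short sketch. It covers only the initial step of scattering labelled LiDAR points into the $512\times512\times40$ grid, not the model-based densification or human purification stages of AAP. The function name `voxelize`, the use of label 0 for empty voxels, and the last-write-wins rule for points sharing a voxel are assumptions made for illustration, not details of the benchmark toolkit.

```python
import numpy as np

# Grid specification taken from the text above.
GRID_MIN = np.array([-51.2, -51.2, -3.0])  # lower corner in metres (x, y, z)
VOXEL_SIZE = 0.2                            # metres per voxel edge
GRID_SHAPE = (512, 512, 40)                 # number of cells along x, y, z
EMPTY = 0                                   # assumption: label 0 marks empty space


def voxelize(points, labels):
    """Scatter labelled points of shape (N, 3) into a dense semantic grid.

    Hypothetical helper: when several points fall into the same voxel,
    the value written last is kept (NumPy's rule for repeated indices).
    """
    points = np.asarray(points, dtype=np.float64)
    labels = np.asarray(labels)
    grid = np.full(GRID_SHAPE, EMPTY, dtype=np.uint8)

    # Continuous coordinates -> integer cell indices.
    idx = np.floor((points - GRID_MIN) / VOXEL_SIZE).astype(np.int64)

    # Discard points that fall outside the annotated volume.
    inside = np.all((idx >= 0) & (idx < np.array(GRID_SHAPE)), axis=1)
    idx, labels = idx[inside], labels[inside]

    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return grid


pts = [[0.1, 0.1, 0.1], [100.0, 0.0, 0.0], [-51.1, -51.1, -2.9]]
grid = voxelize(pts, [5, 3, 7])
occupied = int(np.count_nonzero(grid))  # the point at x = 100 m is out of range
```

Superimposing multiple sweeps before calling such a function increases the number of non-empty cells, which is the sparsity problem the AAP pipeline is described as addressing.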

Other benchmark extensions include OpenOccupancy-nuScenes (Xu et al., 2024), which supports domain-adaptive and multi-LiDAR occupancy prediction, and UniOcc (Wang et al., 31 Mar 2025), a unified benchmark for occupancy forecasting and cooperative multi-agent scenarios integrating data from nuScenes, Waymo, CARLA, and OpenCOOD.

2. Problem Formulation and Evaluation Protocols

Semantic occupancy perception in OpenOccupancy is cast as the dense prediction of per-voxel occupancy and semantics over a full 3D grid:

For each voxel $(x,y,z)$, predict a label $O(x,y,z)\in\{0,\ldots,C_{\rm sem}\}$, where 0 denotes empty space and $1,\ldots,C_{\rm sem}$ index the semantic classes (e.g., $C_{\rm sem}=17$ or 19).
Open-vocabulary occupancy extends this to a function $O:\mathbb{R}^3\times \mathcal{C}\to [0,1]$, where $\mathcal{C}$ is an open set of class names or language prompts, so that the occupancy map can be queried with arbitrary text (Tan et al., 2023, Yu et al., 2024).

Evaluation metrics for OpenOccupancy are standardized:

Class-agnostic (geometry) IoU: measures occupied vs. free space,

$\mathrm{IoU} = \frac{\lvert \mathrm{Pred}_{\mathrm{occ}} \cap \mathrm{GT}_{\mathrm{occ}} \rvert}{\lvert \mathrm{Pred}_{\mathrm{occ}} \cup \mathrm{GT}_{\mathrm{occ}} \rvert}$

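Semantic mean IoU (mIoU): averages the per-class IoU over the $C = C_{\rm sem}$ semantic classes (excluding “empty”), where $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ count the true-positive, false-positive, and false-negative voxels for class $c$,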

$\mathrm{IoU}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$

$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^C \mathrm{IoU}_c$

These metrics are computed on held-out validation and test splits. UniOcc additionally adopts ground-truth-free plausibility metrics, such as temporal foreground/background consistency and object dimension plausibility (Wang et al., 31 Mar 2025).
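
The two metrics defined above can be computed from a pair of integer label grids, as in the following minimal sketch. The function name `occupancy_metrics` and the choice to skip classes that appear in neither prediction nor ground truth are assumptions. Benchmark toolkits usually accumulate the counts over a whole split before dividing, which this single-sample version does not do.

```python
import numpy as np


def occupancy_metrics(pred, gt, num_classes, empty_label=0):
    """Return (geometry IoU, mIoU, per-class IoU list) for two label grids.

    Assumption: labels 1..num_classes are semantic classes and
    `empty_label` marks free space, as in the formulation above.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)

    # Class-agnostic IoU: occupied vs. free space.
    pred_occ = pred != empty_label
    gt_occ = gt != empty_label
    union = np.logical_or(pred_occ, gt_occ).sum()
    inter = np.logical_and(pred_occ, gt_occ).sum()
    geo_iou = float(inter / union) if union else float("nan")

    # Per-class IoU = TP / (TP + FP + FN).
    ious = []
    for c in range(1, num_classes + 1):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        # Assumption: a class absent from both grids is skipped (NaN).
        ious.append(float(tp / denom) if denom else float("nan"))

    valid = [v for v in ious if not np.isnan(v)]
    miou = float(np.mean(valid)) if valid else float("nan")
    return geo_iou, miou, ious


pred = np.array([0, 1, 1, 2, 0, 2])
gt = np.array([0, 1, 2, 2, 1, 0])
geo_iou, miou, per_class = occupancy_metrics(pred, gt, num_classes=2)
```

In this toy example three voxels are occupied in both grids and five in at least one, so the geometry IoU is 3/5, and each class has one true positive, one false positive, and one false negative, so both class IoUs and the mIoU equal 1/3.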

3. Algorithmic Paradigms and Model Architectures

A wide variety of modeling approaches have emerged for OpenOccupancy, spanning sensor modalities and supervision regimes:

Camera-based architectures: Early vision-centric methods utilize 2D CNN backbones (e.g., ResNet+FPN (Hong et al., 2024, Li et al., 2024)), lifting features into 3D using explicit geometric projection (lift-splat) or learned attention-based 2D→3D transforms (Zhuang et al., 2024). Temporal aggregation is addressed by hierarchical context learning and deformable attention mechanisms, which align and refine multi-frame features for robust scene completion (Li et al., 2024).
LiDAR-centric and cylindrical frameworks: PointOcc (Zuo et al., 2023) and PVP (Xue et al., 2024) exploit cylindrical or polar transforms to better match LiDAR spatial densities, with tri-perspective view aggregation, global representation propagation, and convolution modules tailored to mitigate polar distortion.
Multi-modal fusion: MS-Occ (Wei et al., 22 Apr 2025), REO (Zhuang et al., 2024), and PVP (Xue et al., 2024) incorporate both image and LiDAR modalities via hierarchical fusion (depth-aware feature enhancement, cross-modal deformable attention, adaptive voxel fusion), optimizing the joint prediction of geometry and semantics, especially for small, safety-critical objects.
Efficient and scalable decoders: Cascade Occupancy Networks (CONet) (Wang et al., 2023) and query-based schemes (Zhuang et al., 2024) prioritize computational tractability by focusing refinement on predicted occupied voxels and employing coarse-to-fine or query-driven decoding, thereby enabling real-time inference on high-resolution scenes.
Open-vocabulary occupancy: Recent frameworks such as OVO (Tan et al., 2023) and LOcc (Yu et al., 2024) use open-vocabulary vision-language models (e.g., CLIP), distilling pixel- or voxel-level features to match arbitrary text queries, with supervision drawn from dense 2D open-vocabulary segmentation or LVLM-generated labels. FreeOcc (Jiang et al., 30 Apr 2026) pioneers a training-free, SLAM-anchored, language-embedded pipeline that provides globally consistent open-vocabulary maps without 3D annotation.
Statistical occupancy for buildings: OpenOccupancy also denotes statistical sensor-based occupancy detection in commercial buildings, where CO₂ and VOC time series, together with environmental variables, are modeled using SVM, KNN, and RF classifiers for binary presence inference (Varnosfaderani et al., 2022).
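
The cylindrical partition used by the LiDAR-centric methods above can be illustrated with a small coordinate-conversion sketch. It shows only the generic Cartesian-to-cylindrical binning, not the PointOcc or PVP architectures. The function name `cylindrical_indices` and the bin counts are hypothetical, and the range limits are borrowed from the benchmark grid purely for illustration.

```python
import numpy as np


def cylindrical_indices(points, rho_max=51.2, z_min=-3.0, z_max=5.0,
                        bins=(256, 360, 32)):
    """Map Cartesian points (N, 3) to (radius, angle, height) bin indices.

    Hypothetical helper; the bin counts are illustrative assumptions.
    Returns the (N, 3) index array and a boolean mask of in-range points.
    """
    points = np.asarray(points, dtype=np.float64)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    n_rho, n_phi, n_z = bins

    rho = np.hypot(x, y)      # horizontal distance from the sensor
    phi = np.arctan2(y, x)    # azimuth in [-pi, pi]

    i_rho = np.floor(rho / rho_max * n_rho).astype(np.int64)
    # The modulo wraps phi == +pi back into the first angular bin.
    i_phi = np.floor((phi + np.pi) / (2 * np.pi) * n_phi).astype(np.int64) % n_phi
    i_z = np.floor((z - z_min) / (z_max - z_min) * n_z).astype(np.int64)

    valid = (i_rho < n_rho) & (i_z >= 0) & (i_z < n_z)
    return np.stack([i_rho, i_phi, i_z], axis=1), valid


pts = [[10.1, 0.1, 1.1], [100.0, 0.0, 0.0], [-10.0, -0.1, 0.0]]
idx, valid = cylindrical_indices(pts)
```

Because every angular bin spans the same angle, cells near the sensor are physically small and cells far away are large. This roughly follows how LiDAR point density falls off with range, and it is also the source of the polar distortion that the tailored convolution modules are described as mitigating.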

4. Representative Performance and Comparative Results

Performance on OpenOccupancy is typically reported as IoU / mIoU on the validation set. The following table synthesizes representative results for different modalities (higher is better):

Method / Modality	Geometry IoU (%)	Semantic mIoU (%)	Reference
HTCL-M (RGB, temporal)	21.4	14.1	(Li et al., 2024)
UniVision (Camera-only)	—	14.3	(Hong et al., 2024)
JS3C-Net (LiDAR)	30.2	12.5	(Li et al., 2024)
PointOcc (LiDAR)	34.1	23.9	(Zuo et al., 2023)
PVP (Polar, LiDAR)	37.0	25.8	(Xue et al., 2024)
M-CONet (Multi-modal)	29.5	20.1	(Zuo et al., 2023)
MS-Occ (Multi-modal)	32.1	25.3	(Wei et al., 22 Apr 2025)
SliceSemOcc (Multi-modal)	—	22.9	(Huang et al., 4 Sep 2025)
MergeOcc (LiDAR/domain)	38.4	21.4	(Xu et al., 2024)

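In the table, camera-only models trail LiDAR and multi-modal models in geometry IoU (21.4% for HTCL-M versus 29.5% to 38.4%). In semantic mIoU, advanced temporal modeling and calibration-free attention lift camera-only results to 14.1% and 14.3%, above the LiDAR-based JS3C-Net (12.5%) but still well below PointOcc, PVP, and MS-Occ (Li et al., 2024, Hong et al., 2024). The domain-adaptive, unified LiDAR network MergeOcc attains the highest geometry IoU (38.4%) by leveraging diverse cross-dataset distributions, exceeding the multi-modal M-CONet and MS-Occ, although PointOcc, PVP, MS-Occ, and SliceSemOcc report higher semantic mIoU (Xu et al., 2024). Multi-stage and polar representations further boost small-object and long-range prediction.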

5. Open-Vocabulary and Zero-Shot Occupancy Prediction

Open-vocabulary occupancy (“OVO”) methods extend semantic prediction to arbitrary class queries and address label-transfer in new domains:

Distillation-based OVO: OVO (Tan et al., 2023) transfers supervision from pre-trained open-vocabulary 2D segmenters into a 3D voxel space via pixel-pixel, voxel-pixel, and voxel-text alignment losses on filtered voxel-pixel pairs, yielding substantial zero-shot accuracy on novel classes.
Semantic-transitive pseudo-labeling: LOcc (Yu et al., 2024) uses vision-language models to enumerate scene objects, then transfers their text labels from images to projected LiDAR points and voxels, enabling camera-only or multi-modal open-vocabulary occupancy with explicit geometry and language heads.
Training-free SLAM-based OVO: FreeOcc (Jiang et al., 30 Apr 2026) avoids 3D annotation and learning entirely by incrementally fusing SLAM-based 3D Gaussians with language features from vision-language segmentation, projecting to voxels via probabilistic exclusion. FreeOcc achieves >2× accuracy over prior self-supervised methods in indoor settings and transfers robustly to novel environments.

The OVO research trend demonstrates that grounding occupancy grids in language-conditional supervision—either via distillation, transitive mapping, or direct language-embedded volumetric fusion—enables semantic querying far beyond the fixed taxonomies of traditional benchmarks and supports robust zero-shot performance.
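
The language-conditioned querying described above can be illustrated with a toy sketch that scores voxel features against text-prompt embeddings. This is not the OVO, LOcc, or FreeOcc implementation: the function name `query_occupancy`, the temperature value, and the softmax normalisation over the prompt set are assumptions, and the small hand-written vectors stand in for embeddings that would normally come from a vision-language model such as CLIP.

```python
import numpy as np


def query_occupancy(voxel_feats, text_embeds, temperature=0.07):
    """Score each voxel feature against each text prompt.

    voxel_feats: (N, D) language-aligned voxel features (assumed non-zero).
    text_embeds: (K, D) prompt embeddings (toy vectors in this sketch).
    Returns an (N, K) array of scores in [0, 1] whose rows sum to 1.
    """
    v = np.asarray(voxel_feats, dtype=np.float64)
    t = np.asarray(text_embeds, dtype=np.float64)

    # Cosine similarity via L2-normalised dot products.
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / temperature

    # Numerically stable softmax over the K prompts.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy 2-D embeddings: prompt 0 points along x, prompt 1 along y.
voxels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
scores = query_occupancy(voxels, prompts)
best = scores.argmax(axis=1)
```

Since the prompt list is supplied at query time, new class names can be scored without retraining, which is the property that separates this setting from the fixed-taxonomy formulation.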

6. Applications and Extensions

OpenOccupancy advances core tasks in autonomous driving, embodied AI, and indoor scene understanding:

Surround semantic scene completion: Dense, 360° SSC from multi-sensor input, essential for autonomous navigation, long-horizon planning, and small-object recognition.
Unified perception: Models such as OccNet (Sima et al., 2023) and UniVision (Hong et al., 2024) demonstrate that occupancy prediction can be integrated with 3D object detection and planning, reducing collision rates and improving perception robustness.
Occupancy forecasting: UniOcc (Wang et al., 31 Mar 2025) formalizes the spatiotemporal forecasting problem, leveraging large-scale, flow-annotated and simulator-augmented data for multi-step future prediction, with both voxel-wise and object-level plausibility metrics.
Cross-domain robustness: MergeOcc (Xu et al., 2024) empirically confirms that geometric realignment and joint label mapping enable single models to generalize across heterogeneous LiDAR hardware and distinct geographic domains.
Real-time deployment: Coarse-to-fine query-based decoding and efficient fusion pipelines in REO (Zhuang et al., 2024) and Cascade/Polar architectures facilitate real-time high-resolution inference usable in embedded systems.

OpenOccupancy benchmarks, models, and open-vocabulary extensions thus provide a unified foundation for semantic 3D scene representation and have set new standards for large-scale, fine-grained, and language-driven occupancy perception.

References:

(Wang et al., 2023) OpenOccupancy benchmark construction and baseline algorithms.
(Li et al., 2024, Hong et al., 2024) Camera-based and multi-task architectures.
(Zuo et al., 2023, Xue et al., 2024) LiDAR-centric and polar coordinate models.
(Wei et al., 22 Apr 2025, Huang et al., 4 Sep 2025) Multi-stage and vertical-slice fusion.
(Tan et al., 2023, Yu et al., 2024, Jiang et al., 30 Apr 2026) Open-vocabulary, zero-shot frameworks.
(Xu et al., 2024) Domain-adaptive, cross-dataset occupancy learning.
(Wang et al., 31 Mar 2025) Unified occupancy forecasting with multi-modal and multi-agent data.
(Varnosfaderani et al., 2022) Sensor-based statistical occupancy in buildings.