3D Occupancy Estimation: Scene Understanding

Updated 12 April 2026

3D occupancy estimation is the task of predicting dense spatial occupancy using voxel grids or continuous functions to represent physical matter in 3D scenes.
It enables comprehensive scene understanding for applications such as autonomous driving, robotics, and reconstruction by integrating data from multi-view cameras and LiDAR sensors.
Recent research tackles trade-offs between geometric detail, computational cost, and real-time deployment through adaptive representations, multimodal fusion, and self-supervised learning.

3D occupancy estimation is the task of predicting a dense spatial distribution, typically in the form of a voxel grid or continuous function, where each element encodes the likelihood that a 3D location in a scene is occupied by physical matter. In semantic variants, each occupied element is also assigned a class label. This problem is central to autonomous driving, robotics, reconstruction, and occlusion-aware perception, as it enables holistic scene understanding beyond sparse object detection or BEV (bird's-eye view) projections. Technical objectives in modern research include addressing the trade-off between geometric detail and computational cost, bridging camera-LiDAR modality gaps, achieving self- or weakly supervised training regimes, and supporting real-time deployment in safety-critical environments.

1. Problem Definition and Task Formulation

The 3D occupancy estimation task requires inferring a mapping from inputs (e.g., synchronized multi-view RGB images, possibly LiDAR) to a spatially dense 3D volume within some coordinate frame. The most common representation is a 3D voxel grid $O \in [0,1]^{X \times Y \times Z}$, where each element $O(x)$ gives the probability of occupancy at a quantized spatial location $x$. Semantic occupancy augments this with per-voxel class logits or distributions, $\mathbf{O}(x) = \big[p_1(x), \ldots, p_{C}(x)\big]$ with $\sum_{c} p_{c}(x) = 1$ for $C$ categories, or with a one-hot label.
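As a minimal sketch of the discrete representation, the NumPy snippet below quantizes a point set into a binary occupancy grid and turns per-voxel logits into class distributions that sum to one. The function names, the grid origin, and the 0.5 m voxel size are illustrative assumptions, not values from any cited method.

```python
import numpy as np

def voxelize(points, grid_min, voxel_size, grid_shape):
    """Mark every voxel that contains at least one point as occupied."""
    grid = np.zeros(grid_shape, dtype=np.float32)
    # Quantize metric coordinates to integer voxel indices.
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    # Drop points that fall outside the grid extent.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    ix, iy, iz = idx[inside].T
    grid[ix, iy, iz] = 1.0
    return grid

def semantic_probs(logits):
    """Numerically stable softmax over the last (class) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical inputs: three points, a 4 x 4 x 4 grid, three classes.
points = np.array([[0.1, 0.1, 0.1], [1.9, 0.2, 0.3], [5.0, 5.0, 5.0]])
occ = voxelize(points, grid_min=np.zeros(3), voxel_size=0.5, grid_shape=(4, 4, 4))
probs = semantic_probs(np.random.default_rng(0).normal(size=(4, 4, 4, 3)))
```

Here the third point lies outside the grid and is discarded, so only two voxels are marked occupied.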

Continuous alternatives model occupancy as a function $O: \mathbb{R}^3 \to [0,1]$, either via neural fields parameterized by MLPs (Zhang et al., 2023) or by representing the scene as a sum over spatial kernels, as in Gaussian-based approaches (Doruk et al., 30 Jan 2026, Boeder et al., 24 Feb 2025, Shi et al., 11 Jun 2025).
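The kernel-sum view can be illustrated with a short NumPy sketch that evaluates occupancy at arbitrary query points from a set of anisotropic Gaussians. The specific combination rule used here (opacity-weighted kernels summed and clipped to $[0,1]$) is an assumption made for illustration; the cited Gaussian-based methods each define their own aggregation.

```python
import numpy as np

def gaussian_occupancy(x, means, covs, opacities):
    """Occupancy at query points x (Q, 3) from K Gaussian kernels.

    Assumption: kernel k contributes a_k * exp(-0.5 * d^T Sigma_k^{-1} d),
    and contributions are summed and clipped to [0, 1].
    """
    d = x[:, None, :] - means[None, :, :]              # (Q, K, 3) offsets
    prec = np.linalg.inv(covs)                          # (K, 3, 3) precisions
    maha = np.einsum("qki,kij,qkj->qk", d, prec, d)     # squared Mahalanobis
    contrib = opacities[None, :] * np.exp(-0.5 * maha)  # (Q, K)
    return np.clip(contrib.sum(axis=1), 0.0, 1.0)

# Hypothetical scene: one isotropic Gaussian at the origin.
means = np.zeros((1, 3))
covs = np.array([0.25 * np.eye(3)])
opacities = np.array([0.9])
queries = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
occ = gaussian_occupancy(queries, means, covs, opacities)
```

Because the function accepts any query coordinates, the same scene can be sampled at different resolutions without storing a dense grid.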

Classical methods rely on dense 3D labels derived from LiDAR, which are laborious to obtain. Recent pipelines instead exploit pseudo-labels, 2D supervision obtained by volumetric rendering against foundation-model outputs (Boeder et al., 19 Nov 2025, Boeder et al., 2024), or self-supervised photometric consistency (Huang et al., 2023, Gan et al., 2024).

2. Methodological Taxonomy: Representations and Architectures

2.1 Voxel Grid Approaches

Voxel grids are discretized cubic partitions of 3D space, with per-voxel predictions from 3D CNNs, attention mechanisms, or point-based heads (Gan et al., 2023, Lu et al., 2023). Methods such as SurroundOcc (Wei et al., 2023) and Occ3D (Tian et al., 2023) employ multistage feature lifting: they first extract image features, then lift them into 3D via geometric projection, and finally fuse them with multi-scale 3D convolutions. ProtoOcc (Kim et al., 2024) and DA-Occ (Zhou et al., 31 Jul 2025) combine BEV (2D) and 3D voxel branches for efficiency, supplemented with mechanisms to preserve fine vertical or structural information lost in BEV-only schemes.

2.2 Adaptive and Sparse Representations

To address the cubic scaling with resolution, several approaches introduce adaptive spatial allocation:

Octrees: OctreeOcc (Lu et al., 2023) replaces dense grids with hierarchical octrees, adaptively subdividing only high-complexity regions. The octree structure is initialized with 2D semantic priors and refined via learned rectification, which saves compute and memory with minimal loss of detail on thin or small objects.
Gaussian Splatting Approaches: Techniques including GaussianOcc (Gan et al., 2024), GaussianOcc3D (Doruk et al., 30 Jan 2026), GaussianFlowOcc (Boeder et al., 24 Feb 2025), and ODG (Shi et al., 11 Jun 2025) represent 3D space as a set of learnable 3D Gaussian primitives (mean, scale, rotation, opacity). Such representations enable continuous, memory-efficient modeling, with anisotropic kernels capturing fine boundaries and sparse occupancy distributions focusing network capacity where geometry exists. Gaussian Splatting provides rapid “one-pass” rendering and enables fully or weakly self-supervised training through 2D consistency losses.
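To make the saving from adaptive subdivision concrete, the sketch below builds an octree over a known binary grid and splits a cell only when it contains both free and occupied voxels. This is a deliberate simplification: OctreeOcc derives its structure from 2D semantic priors and learned rectification, not from a ground-truth grid, and the function name `build_octree` is hypothetical.

```python
import numpy as np

def build_octree(occ, x0, y0, z0, size, min_size=1):
    """Return leaf cells as (x0, y0, z0, size, value) tuples.

    A cell is subdivided only if it is 'mixed', i.e. it contains both
    free and occupied voxels; uniform cells are stored as one leaf.
    """
    block = occ[x0:x0 + size, y0:y0 + size, z0:z0 + size]
    if size <= min_size or block.min() == block.max():
        return [(x0, y0, z0, size, float(block.flat[0]))]
    half = size // 2
    leaves = []
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                leaves += build_octree(occ, x0 + dx, y0 + dy, z0 + dz, half, min_size)
    return leaves

# Hypothetical 8 x 8 x 8 scene with a single occupied voxel.
occ = np.zeros((8, 8, 8), dtype=np.uint8)
occ[0, 0, 0] = 1
leaves = build_octree(occ, 0, 0, 0, 8)
dense_cells = occ.size
```

In this toy scene the tree needs 22 leaves where the dense grid stores 512 cells, because only the octants around the occupied voxel are refined down to unit size.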

2.3 Hybrid and Multimodal Fusion

Multimodal fusion remains an active area of research. GaussianOcc3D (Doruk et al., 30 Jan 2026) advances multi-sensor occupancy via a continuous Gaussian field, combining LiDAR and camera features through modules for LiDAR Depth Feature Aggregation (LDFA), Entropy-Based Feature Smoothing (EBFS), and Adaptive Camera-LiDAR Fusion (ACLF). Other systems, such as OccFusion (Zhang et al., 2024), perform efficient cross-modal fusion by aligning LiDAR and multi-view image features directly at the voxel or point level, without a depth-prediction pre-processing step.

3. Supervision and Learning Paradigms

3.1 Full Supervision and Pseudo-labels

Benchmarks such as Occ3D (Tian et al., 2023) and SemanticKITTI precompute dense voxel ground-truth by fusing multi-frame LiDAR, semantic interpolation, TSDF surface completion, and explicit occlusion reasoning. Such pipelines are crucial to enable robust evaluation and are referenced in model-specific training regimes (Wei et al., 2023, Kim et al., 2024).

Pseudo-labeling from open-vocabulary 2D/3D foundation models (e.g., GroundedSAM, CLIP) enables scalable semantic occupancy supervision (Kim et al., 2024, Boeder et al., 19 Nov 2025, Boeder et al., 2024). ShelfOcc (Boeder et al., 19 Nov 2025) introduces a data-centric regime for 3D pseudo-label generation using vision-only depth/semantics, framewise consistency filtering, and careful handling of dynamic content, allowing generic 3D occupancy networks to be trained without LiDAR.

3.2 Self- and Weak Supervision

A major research thrust is toward label-free or label-scarce estimation. Models such as SelfOcc (Huang et al., 2023), OccNeRF (Zhang et al., 2023), GaussianOcc (Gan et al., 2024), LangOcc (Boeder et al., 2024), OccFlowNet (Boeder et al., 2024), and others employ self-supervised, photometric, or 2D-foundation-model derived consistency signals. Differentiable volumetric rendering (inspired by NeRFs) is used to render depth or class logits, training the models via losses computed directly against observed images or 2D semantic maps, across multi-view and multi-temporal settings.

Such schemes obviate the need for 6D camera pose supervision (as in GaussianOcc) and allow scale recovery and dynamic object modeling via temporal flow modules or occupancy flow fields (Boeder et al., 2024, Boeder et al., 24 Feb 2025).
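The rendering step used by these self-supervised methods can be sketched for a single ray as NeRF-style alpha compositing, which yields an expected depth that can then be compared with 2D evidence. The snippet below is a NumPy illustration under assumed inputs (per-sample densities and sorted sample distances); in practice the same arithmetic would run in an autodiff framework so that gradients reach the occupancy network.

```python
import numpy as np

def render_depth(density, t):
    """Composite per-sample densities along one ray into an expected depth.

    density: (S,) non-negative values at the samples.
    t:       (S,) sample distances along the ray, sorted ascending.
    """
    # Interval lengths; the last interval reuses the previous spacing.
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    alpha = 1.0 - np.exp(-density * delta)            # per-sample opacity
    # Transmittance: probability the ray reaches sample i unblocked.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return np.sum(weights * t), weights

# Hypothetical ray: free space except for a dense surface at t = 5.
t = np.linspace(1.0, 10.0, 10)
density = np.zeros(10)
density[4] = 1e3
depth, weights = render_depth(density, t)
```

Replacing `t` in the final sum with per-sample class logits or colors gives the rendered semantic or photometric quantities mentioned above.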

4. Algorithmic Components: Modules and Losses

Across architectures, common algorithmic modules include:

Lifting and Fusion: Lifting 2D (multi-view) features to a 3D grid, often via geometric projection and bilinear interpolation (SimpleOccupancy (Gan et al., 2023), ProtoOcc (Kim et al., 2024), DA-Occ (Zhou et al., 31 Jul 2025)) or transformer-style deformable attention lifting (SurroundOcc (Wei et al., 2023), OctreeOcc (Lu et al., 2023), Occ3D (Tian et al., 2023)).
Directional and Multi-scale Attention: DA-Occ uses directional slicing and multi-axis attention to recover vertical information, while ProtoOcc dual-branch encoders exploit different receptive fields in BEV and voxel space.
3D Gaussian Splatting: Models such as GaussianOcc and related works replace traditional volume rendering along rays with Gaussian Splatting, projecting 3D Gaussians into image space to rapidly simulate depth, color, or class logits and compute their respective losses.
Occupancy Flow and Temporal Reasoning: Handling scene dynamics via learned or precomputed per-object/voxel temporal flow, ensuring moving objects are temporally aligned for rendering and supervision (Boeder et al., 24 Feb 2025, Boeder et al., 2024).
Active Decoder and Coarse-to-Fine: CTF-Occ (Tian et al., 2023) and related active approaches (Zhang et al., 2024) allocate computation adaptively: first performing coarse prediction in the full grid, then refining only uncertain voxels with intensive computation.
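The projection-based lifting listed above can be sketched as follows: voxel centers, assumed to be given in the camera frame, are projected with pinhole intrinsics and image features are bilinearly sampled at the resulting pixel locations. The names `lift_features` and `bilinear_sample` and the single-camera setup are illustrative assumptions; multi-view systems repeat this per camera and aggregate the results.

```python
import numpy as np

def bilinear_sample(feat, u, v):
    """Sample feat (H, W, C) at continuous pixel coords (u: width, v: height)."""
    H, W, _ = feat.shape
    u0 = np.clip(np.floor(u).astype(int), 0, W - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, H - 2)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u0 + 1] * du
    bot = feat[v0 + 1, u0] * (1 - du) + feat[v0 + 1, u0 + 1] * du
    return top * (1 - dv) + bot * dv

def lift_features(feat, K, centers_cam):
    """Lift image features to voxel centers given in the camera frame.

    Returns (N, C) features (zeros where a voxel is not visible) and a
    boolean visibility mask.
    """
    H, W, C = feat.shape
    z = centers_cam[:, 2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)       # avoid division by zero
    uvw = centers_cam @ K.T                   # pinhole projection
    u, v = uvw[:, 0] / z_safe, uvw[:, 1] / z_safe
    valid = in_front & (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    out = np.zeros((len(centers_cam), C), dtype=feat.dtype)
    out[valid] = bilinear_sample(feat, u[valid], v[valid])
    return out, valid

# Hypothetical feature map whose two channels store the pixel coordinates.
H, W = 8, 10
uu, vv = np.meshgrid(np.arange(W), np.arange(H))
feat = np.stack([uu, vv], axis=-1).astype(float)
K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 3.0], [0.0, 0.0, 1.0]])
centers = np.array([[0.05, 0.1, 1.0], [0.0, 0.0, -1.0], [10.0, 0.0, 1.0]])
lifted, visible = lift_features(feat, K, centers)
```

Because the test feature map encodes pixel coordinates, the sampled value for the first voxel directly shows where it projects; the second voxel is behind the camera and the third falls outside the image.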

Loss functions typically combine per-voxel cross-entropy, Lovász-softmax (a differentiable surrogate for mIoU), rendering-based reprojection, photometric, or semantic consistency terms, and task-specific regularization (e.g., Eikonal or Chamfer terms for smoothness and surface accuracy).
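As a sketch of the simplest of these terms, the function below computes per-voxel cross-entropy in NumPy. The optional boolean mask, which restricts the loss to a subset of voxels such as those observed by the sensors, is an assumption added for illustration; the other terms are omitted.

```python
import numpy as np

def voxel_cross_entropy(logits, labels, mask=None):
    """Mean cross-entropy over voxels.

    logits: (..., C) raw class scores per voxel.
    labels: (...) integer class index per voxel.
    mask:   (...) optional boolean array; only True voxels contribute.
    """
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the labelled class at each voxel.
    nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    if mask is None:
        mask = np.ones(labels.shape, dtype=bool)
    return nll[mask].mean()

# Hypothetical 2 x 2 x 2 grid with four classes and uninformative logits.
labels = np.zeros((2, 2, 2), dtype=int)
uniform_loss = voxel_cross_entropy(np.zeros((2, 2, 2, 4)), labels)
```

With all-zero logits the prediction is uniform over the four classes, so the loss equals $\log 4$, a useful sanity check when wiring up a training loop.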

5. Efficiency, Resolution, and Structural Preservation

The choice of representation and fusion mechanism heavily influences computational and memory efficiency, fidelity to thin or fine structures, and suitability for real-time applications:

Voxel grids offer regular, GPU-optimized structures but scale poorly at high resolution ($O(N^3)$ in the per-axis resolution $N$). Methods mitigate this by limiting grid size (DA-Occ: $16\times32\times32$ coarse lifting) or by moving heavy computation into lightweight domains (ProtoOcc: large BEV kernels, single-step prototype decoding).
Sparse, adaptive, or hierarchical representations (octrees (Lu et al., 2023), continuous Gaussians (Doruk et al., 30 Jan 2026, Boeder et al., 24 Feb 2025), ODG dual queries (Shi et al., 11 Jun 2025)) provide memory and compute savings and adapt granularity spatially, preserving detail on small or thin objects.
AdaOcc (Chen et al., 2024) addresses the resolution–efficiency dilemma by allocating expensive, fine-grained point-based reconstruction only to regions of interest (object proposals), combining them with holistic coarse grid occupancy.
Directional attention and specialized fusion modules efficiently extract geometric cues while minimizing overhead (DA-Occ, DBE in ProtoOcc), supporting real-time deployment (DA-Occ: up to 39.6 FPS; ProtoOcc: 12.8 FPS).
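The cubic scaling of dense grids is easy to quantify with a few lines of arithmetic. The sketch below computes the voxel count and float32 feature-volume size for a grid and for versions refined by 2x and 4x per axis; the base shape of 200 x 200 x 16 and the 64 feature channels are illustrative assumptions, not figures taken from the methods above.

```python
def dense_grid_cost(x, y, z, channels, bytes_per_value=4):
    """Voxel count and feature-volume size in bytes for a dense X x Y x Z grid."""
    voxels = x * y * z
    return voxels, voxels * channels * bytes_per_value

base = (200, 200, 16)  # illustrative grid shape (an assumption)
costs = {}
for scale in (1, 2, 4):
    dims = tuple(d * scale for d in base)
    voxels, nbytes = dense_grid_cost(*dims, channels=64)
    costs[scale] = (voxels, nbytes)
    print(f"{dims}: {voxels:,} voxels, {nbytes / 2**20:,.1f} MiB of float32 features")
```

Each doubling of per-axis resolution multiplies both numbers by eight, which is why sparse or hierarchical representations become attractive at fine voxel sizes.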

6. Benchmarks, Results, and Research Frontiers

Leading benchmarks such as Occ3D-nuScenes and SemanticKITTI report performance using geometric IoU and semantic mIoU. Top-performing models approach or surpass 45% mIoU with multi-frame input (ProtoOcc: 45.02% mIoU (Kim et al., 2024), GaussianOcc3D: 49.4% mIoU (Doruk et al., 30 Jan 2026)) or achieve rapid inference (DA-Occ: 27.7 FPS at 39.3% mIoU (Zhou et al., 31 Jul 2025)).
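Both metrics can be computed from a pair of integer label grids, as in the following NumPy sketch. The conventions chosen here (class 0 denotes free space, the free class is excluded from mIoU, and classes absent from both grids are skipped) are assumptions; individual benchmarks define their own class lists and visibility masks.

```python
import numpy as np

def occupancy_metrics(pred, gt, num_classes, free_class=0):
    """Geometric IoU (occupied vs. free) and semantic mIoU over non-free classes."""
    pred_occ, gt_occ = pred != free_class, gt != free_class
    union = np.logical_or(pred_occ, gt_occ).sum()
    geo_iou = np.logical_and(pred_occ, gt_occ).sum() / union if union else float("nan")
    ious = []
    for c in range(num_classes):
        if c == free_class:
            continue
        inter = np.sum((pred == c) & (gt == c))
        uni = np.sum((pred == c) | (gt == c))
        if uni > 0:  # skip classes absent from both grids
            ious.append(inter / uni)
    miou = float(np.mean(ious)) if ious else float("nan")
    return float(geo_iou), miou

# Hypothetical flattened grids with classes {0: free, 1, 2}.
gt = np.array([0, 1, 1, 2])
pred = np.array([0, 1, 2, 2])
geo_iou, miou = occupancy_metrics(pred, gt, num_classes=3)
```

In this example every occupied voxel is detected, so the geometric IoU is 1.0, while one voxel carries the wrong class and the semantic mIoU drops to 0.5.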

Ongoing challenges and frontiers include:

Achieving self-supervised or pseudo-labeled performance on par with dense LiDAR label methods (Boeder et al., 19 Nov 2025, Boeder et al., 2024).
Robustness to severe occlusion, adverse weather, and distributional shift (Doruk et al., 30 Jan 2026).
Handling dynamic scenes through explicit temporal modules or occupancy flow (Boeder et al., 24 Feb 2025, Boeder et al., 2024).
Open-vocabulary occupancy, where the prediction space is not restricted to a fixed class set but is aligned with vision-language models (Boeder et al., 2024).
Generalization across datasets and efficient scaling to city-scale scenes, leveraging hierarchical, continuous, or prototype-based decoders.

7. Future Directions and Open Problems

Key future directions identified in state-of-the-art works include:

Unified, single-stage pipelines combining scale reconstruction, pose estimation, and occupancy learning (Gan et al., 2024).
Extensions to open-vocabulary, panoptic, and temporal occupancy tasks, leveraging language alignment and self-supervised scene flow.
Learnable, adaptive allocation of spatial granularity (splitting thresholds in octree models, Gaussian density/scale in continuous representations).
Stronger integration with foundation models for multi-modal input, cross-sensor consistency, and robust pseudo-labeling (Boeder et al., 19 Nov 2025, Boeder et al., 2024).
Optimizing for practical deployment, beyond core accuracy: reducing inference cost, memory overhead, and maintaining prediction quality for long-range or rare categories.

Overall, 3D occupancy estimation has emerged as a foundational task for unified geometric and semantic perception, with active research on representations, learning strategies, and efficient computation.