Vision-Based Occupancy Estimation
- Vision-based occupancy is the use of visual data to create semantic and geometric 3D maps that indicate where objects or people are present.
- The pipeline involves extracting features from multi-view images, projecting them into voxel grids, and refining with spatial-temporal aggregation to mitigate occlusion and depth ambiguity.
- Applications span autonomous driving, robotics, and smart buildings, with reported gains of up to 34% relative mIoU and substantial reductions in computation and memory.
Vision-based occupancy refers to the use of visual inputs—typically camera images, and in some cases, thermal or omnidirectional vision—to infer spatial occupancy information, including where objects or people are present, what kind of objects occupy which regions, and sometimes their geometric extent or semantic class. This field spans both dense 3D scene understanding for autonomous navigation and fine-grained occupancy detection in smart environments. Approaches range from per-voxel semantic scene completion in large autonomous driving datasets to lightweight, privacy-preserving occupancy sensing in buildings.
1. Problem Foundations and Representational Principles
Vision-based occupancy estimation aims to reconstruct a geometric and, often, semantic description of which regions of a scene are occupied or free from purely visual data (e.g., RGB, thermal, panoramic). Depending on the application domain, the output may be:
- A binary or probabilistic 3D voxel grid over a physical extent, where each voxel carries an occupancy probability and often a semantic label distribution (Sima et al., 2023, Zhang et al., 8 Dec 2024, Cui et al., 27 Jul 2025); see the array sketch after this list.
- Coarser 2D/Bird’s-Eye-View (BEV) occupancy, possibly with per-cell heights, trading geometric fidelity for memory and compute efficiency (Xu et al., 21 Nov 2024).
- Per-slot or per-region “occupied vs. free” status for infrastructure sensing (e.g., office HVAC or parking control) (Cui et al., 13 May 2025, Grbić et al., 2023, Callemein et al., 2020).
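For concreteness, the sketch below (Python/NumPy; the grid extents and class count are hypothetical and not tied to any specific benchmark) shows how such dense voxel and BEV outputs are commonly laid out in memory:

```python
import numpy as np

# Hypothetical grid dimensions and class count, chosen only for illustration.
X, Y, Z = 200, 200, 16       # voxels along each spatial axis
N_CLASSES = 17               # e.g., 16 semantic classes + "free"

occ_prob = np.zeros((X, Y, Z), dtype=np.float32)                 # P(occupied) per voxel
sem_logits = np.zeros((X, Y, Z, N_CLASSES), dtype=np.float32)    # per-voxel class scores

# A BEV variant collapses the height axis, trading detail for memory:
bev_occ = occ_prob.max(axis=2)          # pillar is occupied if any voxel in it is
bev_height = occ_prob.argmax(axis=2)    # index of the most confident height bin
```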
Vision is an attractive sensing modality due to its ubiquity, cost-effectiveness, and rich appearance and semantic cues. However, unlike range sensors (LiDAR), vision-based methods must contend with depth ambiguity, occlusion, temporal inconsistency, and, often, harder-to-acquire training signals (Boeder et al., 19 Nov 2025, Zhang et al., 2023).
2. Vision-Based Occupancy Pipelines: From Input to Output
Most state-of-the-art vision-based occupancy systems share a common architectural flow, albeit with domain-specific adaptations:
- Feature Extraction: Images from multiple views or frames are processed by a deep 2D backbone, usually with a Feature Pyramid (e.g., ResNet-FPN) (Wei et al., 2023, Boeder et al., 19 Nov 2025, Ma et al., 2023).
- Lifting or Unprojection to 3D: Pixel-wise features are projected into a structured 3D voxel grid using geometric calibration (intrinsics/extrinsics), or, in BEV approaches, splatted onto the ground plane via per-pixel depth estimates (Wei et al., 2023, Zhang et al., 8 Dec 2024); a minimal lifting sketch follows this list.
- Latent Representation: The lifted features can be stored densely (as tensors), sparsely (as a COO-indexed set of non-empty voxels), or as a differentiable collection of learnable Gaussian primitives (Tang et al., 15 Apr 2024, Ye et al., 25 Jul 2025, Yan et al., 20 Sep 2025).
- Aggregation and Refinement: 3D convolutions, transformer blocks, or spatial-temporal attention refine the latent volume, with explicit mechanisms for aggregating over time or spatial context (Boeder et al., 19 Nov 2025, Xu et al., 21 Nov 2024, Yan et al., 20 Sep 2025, Shi et al., 5 Nov 2025).
- Occupancy and Semantic Decoding: Final per-voxel (or per-region) predictions are decoded—either via standard MLP/softmax heads or specialized mask classification units (Ma et al., 2023, Sima et al., 2023).
- Supervision: Labels can be derived from LiDAR point clouds, synthetic or video-based pseudo-labeling, or even weak signals such as photometric consistency or 2D segmentations (Boeder et al., 19 Nov 2025, Zhang et al., 2023, Ye et al., 25 Jul 2025).
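For illustration, the sketch below shows the geometric core of the lifting step: projecting voxel centers into an image with the camera calibration and sampling 2D features. The function name is hypothetical, and real systems typically use learned depth distributions or deformable attention rather than this single hard pixel lookup:

```python
import numpy as np

def lift_features_to_voxels(feat_2d, K, T_cam_from_world, voxel_centers):
    """Hypothetical, simplified lifting: each voxel center is projected into the
    image via intrinsics K and extrinsics T_cam_from_world; visible voxels copy
    the feature of the pixel they project to."""
    H, W, C = feat_2d.shape
    homog = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    cam = (T_cam_from_world @ homog.T).T[:, :3]              # voxel centers in camera frame
    in_front = cam[:, 2] > 0.1                               # keep points in front of the camera
    pix = (K @ cam.T).T
    pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)      # perspective divide
    uv = np.floor(pix).astype(int)
    u, v = uv[:, 0], uv[:, 1]
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    voxel_feats = np.zeros((len(voxel_centers), C), dtype=feat_2d.dtype)
    voxel_feats[valid] = feat_2d[v[valid], u[valid]]         # sample 2D features at projections
    return voxel_feats, valid
```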
A representative pseudocode fragment for voxel label assignment in native vision-based 3D supervision (ShelfOcc) is:
```
for frame in frames:
    for camera in cameras:
        for pixel in image:
            if pixel is static:
                point = unproject(pixel)
                voxel = find_voxel(point)
                increment(voxel_counter[voxel])
                add_to_static_points(voxel, point)

for voxel in voxels:
    if valid(static points in voxel):
        assign occupancy/semantic labels
```
3. Supervision Paradigms: Native 3D, 2D Render-and-Compare, and Self-supervision
Three supervisory regimes dominate vision-based occupancy, each with distinct advantages and trade-offs; native 3D supervision further splits according to whether labels come from LiDAR or from vision alone:
- LiDAR-Guided Native 3D Supervision: Traditionally, high-quality, dense occupancy ground truth is generated by aggregating and voxelizing LiDAR sweeps, sometimes with Poisson surface reconstruction or majority-vote semantic assignments (Wei et al., 2023, Sima et al., 2023). However, collecting and annotating LiDAR labels is expensive and not scalable to all vehicles or environments.
- Vision-Only Native 3D Supervision: Recent works (e.g., ShelfOcc) generate metrically consistent 3D pseudo-labels directly from video using a cascade of 2D semantic segmentation, monocular depth estimation, and temporal/static-dynamic filtering. This supports explicit, per-voxel 3D supervision without LiDAR (Boeder et al., 19 Nov 2025).
- 2D Render-and-Compare Losses: Weakly-supervised approaches (OccNeRF, GaussianFlowOcc) render the predicted 3D occupancy structure into 2D images along camera rays, enforcing photometric or semantic consistency; however, these suffer from depth bleeding, partial visibility issues, and ambiguity about volumetric extent (Boeder et al., 19 Nov 2025, Zhang et al., 2023). A minimal rendering sketch follows the table below.
- Self-supervision and Foundation Models: To further reduce reliance on human labels, methods now leverage geometry foundation models (e.g., MapAnything FM) and open-vocabulary segmenters (e.g., Grounding DINO + SAM), deploying large-scale prompt-cleaned class masks as pseudo ground truth (Boeder et al., 19 Nov 2025, Zhang et al., 2023).
The following table summarizes supervision strategies:
| Approach Type | Label Source | Limitations |
|---|---|---|
| LiDAR-guided | Dense LiDAR sweeps | Costly, not universal, annotation effort |
| Vision-only 3D | Video + 2D semantics | Pseudo-label noise, dynamic object recall |
| 2D render-loss | Reprojected image | Depth bleeding, ambiguity |
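The sketch below illustrates the core of 2D render-and-compare supervision: alpha-compositing predicted occupancy along a camera ray and comparing the rendered value against the observed pixel. It is a hedged simplification of this family of losses, not the exact formulation of any particular method:

```python
import numpy as np

def composite_along_ray(occ_probs, values):
    """Alpha-composite per-sample values (color, semantics, ...) along one ray,
    given predicted occupancy probabilities at S ray samples (simplified sketch).

    occ_probs: (S,) occupancy probability per sample along the ray
    values:    (S, C) per-sample quantity to render
    """
    # Transmittance: probability the ray reaches sample i without being blocked.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - occ_probs[:-1]]))
    weights = trans * occ_probs                        # contribution of each sample
    rendered = (weights[:, None] * values).sum(axis=0)
    return rendered, weights

# Photometric consistency for one pixel (illustrative):
#   rendered_rgb, _ = composite_along_ray(occ_along_ray, rgb_along_ray)
#   loss = np.abs(rendered_rgb - observed_rgb).mean()
```

Depth bleeding arises because many different weight distributions along a ray can render to the same 2D value, leaving the volumetric extent under-constrained.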
4. Architectural Innovations and Efficiency Considerations
Occupancy estimation is computationally demanding because dense voxel grids scale cubically with resolution. Strategies to mitigate this include:
- Sparse Latent Representations: SparseOcc processes only non-empty voxels, using sparse convolutions and transformer heads adapted from Mask2Former to deliver significant FLOP and memory reductions (up to 74.9% for nuScenes-Occupancy) with improved mIoU (Tang et al., 15 Apr 2024); see the sparse-latent sketch after this list.
- Spatial and Temporal Decoupling: EfficientOCF compresses 3D occupancy into lightweight BEV+height representations and decouples temporal reasoning via flow-based mask propagation for efficiency and sharper dynamic object modeling (Xu et al., 21 Nov 2024).
- Height-Aware Pooling: Methods such as Deep Height Decoupling (DHD) enforce explicit per-class height priors, partitioning 3D space along the height axis and integrating features only within matching height bins (Wu et al., 12 Sep 2024). This filters spurious frustum evidence and sharpens object boundaries.
- Compact and Multi-Perspective 3D Representations: LightOcc uses spatial-to-channel transpositions and Tri-Perspective View (TPV) embeddings, followed by lightweight 2D convolutions and matrix multiplications, to approximate full volumetric reasoning without full 3D CNNs (Zhang et al., 8 Dec 2024).
- Gaussian Splatting and Surfel-based Grids: Some approaches (GS-Occ3D, ST-GS) model occupancy with learnable sets of 3D Gaussians (“surfels”) for flexible, scalable geometry, supporting both spatial (attention-based) and temporal (geometry-aware) fusion mechanisms (Ye et al., 25 Jul 2025, Yan et al., 20 Sep 2025).
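As a rough illustration of the sparse-latent idea, the sketch below (toy grid size and feature width, not SparseOcc's actual implementation) keeps only non-empty voxels as coordinate/feature pairs, so memory and FLOPs scale with the number of occupied voxels rather than the full grid volume:

```python
import numpy as np

# Toy dense occupancy mask; in practice this comes from a lifted feature volume.
dense = np.random.rand(200, 200, 16) > 0.97

coords = np.argwhere(dense)                                    # (M, 3) indices of non-empty voxels
feats = np.random.randn(len(coords), 64).astype(np.float32)    # (M, C) per-voxel features

# Sparse convolutions / transformer heads operate only on (coords, feats);
# a dense grid is reconstructed by scattering features back when needed.
restored = np.zeros((200, 200, 16, 64), dtype=np.float32)
restored[coords[:, 0], coords[:, 1], coords[:, 2]] = feats
```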
5. Applications: Autonomous Driving, Robotics, Smart Buildings, and Parking
Vision-based occupancy has wide-ranging applications:
- Autonomous Driving: The primary research frontier, where dense 3D occupancy grids form the substrate for detection, segmentation, planning, and prediction, supporting complex traffic environments and rare object identification (Wei et al., 2023, Sima et al., 2023, Ma et al., 2023). Occupancy maps are shown to reduce downstream planner collisions by over 15% vs. 3D box-only pipelines (Sima et al., 2023).
- Robotics and Embodied Agents: Legged and humanoid robots leverage panoramic or multimodal occupancy perception for navigation, manipulation, and interaction within complex indoor and outdoor environments (Cui et al., 27 Jul 2025, Shi et al., 5 Nov 2025).
- Building Automation and Privacy-Preserving Sensing: Lightweight camera (RGB/thermal) streams paired with deep detection and classification models (YOLOv5, ResNet-34) yield highly accurate occupancy status for energy optimization (HVAC) or space management without compromising privacy (Cui et al., 13 May 2025, Callemein et al., 2020).
- Parking Lot Management: End-to-end vehicle and slot detection with per-slot occupancy classification enables robust, scalable, and mostly unsupervised deployment, with >99% accuracy across diverse environmental conditions (Grbić et al., 2023).
6. Benchmarking, Impact, and Open Challenges
Standard benchmarks include Occ3D-nuScenes, SemanticKITTI, and recent panoramic datasets for robots. Key metrics include voxel-wise IoU, mean semantic IoU (mIoU), ray-IoU, and task-driven planning metrics (e.g., collision rate, L2 trajectory error).
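A minimal sketch of voxel-wise mean semantic IoU (simplified; benchmark implementations typically also apply camera-visibility masks) is:

```python
import numpy as np

def mean_iou(pred, gt, n_classes, ignore_index=255):
    """Voxel-wise mIoU over predicted vs. ground-truth class labels (sketch)."""
    mask = gt != ignore_index          # drop unlabeled / ignored voxels
    pred, gt = pred[mask], gt[mask]
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```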
Quantitative improvements from vision-specific advances include:
- Up to +34% relative mIoU over the prior weakly-supervised state of the art with vision-only native 3D labeling (ShelfOcc) (Boeder et al., 19 Nov 2025).
- +5.85% mIoU using lightweight spatial embedding (LightOcc) (Zhang et al., 8 Dec 2024).
- 74.9% FLOP reduction via sparse 3D operations (SparseOcc) (Tang et al., 15 Apr 2024).
- Temporal consistency and robustness gains for dynamic scenes using flow-based decoupling, Gaussian temporal fusion, or spatio-temporal refinement (Xu et al., 21 Nov 2024, Yan et al., 20 Sep 2025, Yu et al., 21 Feb 2025).
Persistent challenges include:
- Scalability of annotation for rare or occluded classes
- Efficient yet precise modeling of tall/complex objects
- Robustness under adverse weather, lighting, or sensor/camera perturbation
- Integrating continuous, weak, or self-supervised labels to further unlock large-scale camera-only datasets
Future work anticipates advances in joint 4D spatio-temporal occupancy, scalable auto-labeling pipelines, and tighter fusion of multi-modal and temporal context (Boeder et al., 19 Nov 2025, Shi et al., 13 Mar 2024).
7. References
Key references corresponding to the research above:
- ShelfOcc: "ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation" (Boeder et al., 19 Nov 2025)
- SurroundOcc: "SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving" (Wei et al., 2023)
- Scene as Occupancy: "Scene as Occupancy" (Sima et al., 2023)
- Compact Occupancy Transformer: "COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction" (Ma et al., 2023)
- Deep Height Decoupling: "Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction" (Wu et al., 12 Sep 2024)
- LightOcc: "Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction" (Zhang et al., 8 Dec 2024)
- SparseOcc: "SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction" (Tang et al., 15 Apr 2024)
- GS-Occ3D: "GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting" (Ye et al., 25 Jul 2025)
- ST-GS: "ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting" (Yan et al., 20 Sep 2025)
- Semantic Causality-Aware Transformation: "Semantic Causality-Aware Vision-Based 3D Occupancy Prediction" (Chen et al., 10 Sep 2025)
- OccFiner: "Offboard Occupancy Refinement with Hybrid Propagation for Autonomous Driving" (Shi et al., 13 Mar 2024)
- OccLinker: "OccLinker: Deflickering Occupancy Networks through Lightweight Spatio-Temporal Correlation" (Yu et al., 21 Feb 2025)
- OneOcc: "OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera" (Shi et al., 5 Nov 2025)
- Humanoid Occupancy: "Humanoid Occupancy: Enabling A Generalized Multimodal Occupancy Perception System on Humanoid Robots" (Cui et al., 27 Jul 2025)
- Parking/Building domains: (Cui et al., 13 May 2025, Grbić et al., 2023, Callemein et al., 2020)
- OccNeRF: "OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments" (Zhang et al., 2023)
- Collaborative Perceiver: (Yuan et al., 28 Jul 2025)
These works collectively define the current state, challenges, and emerging directions in vision-based occupancy research.