Occ-ScanNet: Indoor 3D Occupancy Benchmark

Updated 23 June 2026
  • Occ-ScanNet is a comprehensive benchmark dataset for indoor semantic 3D occupancy prediction, offering dense voxel grids with fine-grained labels.
  • It supports both local (single-frame) and embodied (sequential) occupancy tasks, driving advances in online mapping and embodied spatial reasoning.
  • Methods are evaluated with per-class IoU, mIoU, and SC-IoU, and the dataset is substantially larger and more varied than earlier indoor volumetric datasets such as NYUv2.

Occ-ScanNet is a large-scale benchmark dataset for semantic 3D occupancy prediction in indoor environments. Constructed from the ScanNet RGB-D reconstruction corpus, Occ-ScanNet provides per-frame dense voxel grids with fine-grained semantic labels, serving as a foundation for training and evaluating monocular occupancy prediction architectures. The benchmark supports both local (single-frame, frustum-centric) and embodied (sequential, global) occupancy tasks and drives advances in online scene mapping, embodied perception, and semantic spatial reasoning.

1. Dataset Construction and Annotation Protocol

Occ-ScanNet derives its data from the ScanNet indoor dataset, which consists of RGB videos, depth maps, camera intrinsics, and trajectory information for over 650 real-world indoor scenes, including living rooms, bedrooms, kitchens, offices, and corridors (Yu et al., 2024). For Occ-ScanNet, frames are sampled such that each view is represented as a canonical voxel grid of dimensions 60×60×36, corresponding to a camera frustum of 4.8 m × 4.8 m × 2.88 m at 8 cm resolution (Zhang et al., 20 Apr 2025, Wang et al., 24 May 2026).
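The relationship between grid shape, voxel size, and metric extent can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and placing the grid origin at 0 is an assumption rather than something the dataset documentation specifies.

```python
# Grid shape and voxel size as stated in the text.
GRID_SHAPE = (60, 60, 36)  # voxels along the three grid axes
VOXEL_SIZE_M = 0.08        # 8 cm per voxel edge


def grid_extent_m(shape=GRID_SHAPE, voxel_size=VOXEL_SIZE_M):
    """Metric extent of the grid along each axis, rounded to centimetres."""
    return tuple(round(n * voxel_size, 2) for n in shape)


def voxel_centers_1d(n, voxel_size=VOXEL_SIZE_M, origin=0.0):
    """Centre coordinate of each of the n voxels along one axis.

    The origin value is an assumption made for illustration.
    """
    return [origin + (i + 0.5) * voxel_size for i in range(n)]


extent = grid_extent_m()
num_voxels = GRID_SHAPE[0] * GRID_SHAPE[1] * GRID_SHAPE[2]
print(extent)      # (4.8, 4.8, 2.88)
print(num_voxels)  # 129600
```

Each frame therefore carries 129,600 voxel labels.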

Ground-truth annotation is obtained by assigning each voxel the semantic label of the nearest mesh surface point from the ScanNet-fused 3D reconstruction. Only voxels positioned in front of the camera are labeled; frames are filtered if camera parameters are invalid, the frustum is not fully contained in the reconstructed mesh, or the labeled voxel population is insufficiently diverse. This yields dense per-frame occupancy grids where each voxel is one-hot encoded for one of 12 semantic categories: ceiling, floor, wall, window, chair, bed, sofa, table, TVs, furniture, generic objects, and free space (Yu et al., 2024, Zhang et al., 20 Apr 2025, Guo et al., 14 Mar 2026).
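The nearest-surface labeling rule described above can be sketched as follows. This is a minimal illustration, not the dataset's actual generation code: the class index used for free space, the distance cut-off that decides when a voxel stays free, and the function names are all assumptions. The brute-force distance computation is chosen for clarity; a spatial index would be the practical choice at full mesh scale.

```python
import numpy as np

NUM_CLASSES = 12  # 11 semantic categories plus free space, as listed above
FREE_SPACE = 0    # hypothetical class index for free space


def label_voxels(voxel_centers, surface_points, surface_labels, max_dist=0.08):
    """Give each voxel the class of its nearest mesh surface point.

    voxel_centers:  (V, 3) voxel centre coordinates
    surface_points: (S, 3) points sampled on the reconstructed mesh
    surface_labels: (S,)   integer class of each surface point
    max_dist:       assumed cut-off; voxels farther away remain free space
    """
    # Brute-force pairwise squared distances, shape (V, S).
    diff = voxel_centers[:, None, :] - surface_points[None, :, :]
    d2 = np.einsum("vsk,vsk->vs", diff, diff)
    nearest = d2.argmin(axis=1)
    labels = surface_labels[nearest].copy()
    too_far = d2[np.arange(len(voxel_centers)), nearest] > max_dist ** 2
    labels[too_far] = FREE_SPACE
    return labels


def one_hot(labels, num_classes=NUM_CLASSES):
    """One-hot encode integer labels, shape (V, num_classes)."""
    return np.eye(num_classes, dtype=np.uint8)[labels]


# Toy example: two voxels near labelled surface points, one far from any surface.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
points = np.array([[0.01, 0.0, 0.0], [1.0, 0.02, 0.0]])
point_labels = np.array([3, 5])
labels = label_voxels(centers, points, point_labels)
encoded = one_hot(labels)
```

In this toy case the first two voxels inherit classes 3 and 5, and the distant voxel stays free space.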

2. Benchmark Structure and Data Splits

Occ-ScanNet releases two principal versions:

  • Full Dataset: Training set of 45,755 frames and validation set of 19,764 frames. Roughly 100 frames are sampled per scene before filtering, drawn from ≈655 distinct ScanNet scenes.
  • Occ-ScanNet-Mini: Subset for rapid prototyping, with 4,639 train and 2,007 validation frames (Yu et al., 2024, Wang et al., 24 May 2026).

Splits are scene-disjoint and maintain diversity across space types. Each frame provides:

  • RGB image (480×640×3)
  • Camera intrinsics and pose
  • Dense semantic voxel grid (60×60×36), aligned to the current frustum
  • (Optional: depth map)
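The per-frame contents listed above could be held in a small container such as the one below. This is a sketch under stated assumptions: the class name `OccFrame`, the field names, the 3×3 intrinsics and 4×4 pose matrix shapes, and the depth-map shape are hypothetical and do not describe the official file format.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

IMAGE_SHAPE = (480, 640, 3)  # RGB image shape given in the text
GRID_SHAPE = (60, 60, 36)    # voxel grid shape given in the text


@dataclass
class OccFrame:
    """Hypothetical in-memory record for one Occ-ScanNet frame."""

    rgb: np.ndarray                     # (480, 640, 3) image
    intrinsics: np.ndarray              # assumed 3x3 camera matrix
    pose: np.ndarray                    # assumed 4x4 rigid transform
    occupancy: np.ndarray               # (60, 60, 36) integer class per voxel
    depth: Optional[np.ndarray] = None  # assumed (480, 640) when present

    def validate(self) -> None:
        """Raise ValueError if any array has an unexpected shape."""
        expected = {
            "rgb": IMAGE_SHAPE,
            "intrinsics": (3, 3),
            "pose": (4, 4),
            "occupancy": GRID_SHAPE,
        }
        for name, shape in expected.items():
            actual = getattr(self, name).shape
            if actual != shape:
                raise ValueError(f"{name}: expected shape {shape}, got {actual}")
        if self.depth is not None and self.depth.shape != IMAGE_SHAPE[:2]:
            raise ValueError(f"depth: expected shape {IMAGE_SHAPE[:2]}")


frame = OccFrame(
    rgb=np.zeros(IMAGE_SHAPE, dtype=np.uint8),
    intrinsics=np.eye(3),
    pose=np.eye(4),
    occupancy=np.zeros(GRID_SHAPE, dtype=np.int64),
)
frame.validate()  # passes silently for a well-formed frame
```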

Compared to earlier volumetric datasets such as NYUv2 (1,449 labeled frames, 13 labels), Occ-ScanNet is approximately 40× larger and offers greater variety in room structure, depth range (up to 10 m), and semantic granularity (Yu et al., 2024).

3. Occupancy Representation, Tasks, and Evaluation Metrics

Occupancy Representation

For each frame, occupancy is defined over a regular 3D voxel grid (60×60×36 voxels) corresponding to the camera frustum, with each voxel assigned a discrete semantic class or marked as free space. In the local task, the grid encodes only the region visible to a single RGB frame. In the embodied/global setting, per-frame occupancy maps are integrated incrementally into a global map covering sequential frames, using known or estimated camera poses to align and merge voxel grids (Wang et al., 24 May 2026, Zhang et al., 20 Apr 2025).
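A minimal sketch of pose-based merging is given below. It assumes a camera-to-world 4×4 pose, a sparse dictionary as the global map, and a "last observation wins" overwrite rule; all three are illustrative assumptions, and the methods discussed later replace the overwrite rule with learned fusion. The function name `fuse_frame` is hypothetical.

```python
import numpy as np

VOXEL_SIZE = 0.08  # metres, as stated for the benchmark grid


def fuse_frame(global_map, local_labels, local_origin, cam_to_world, free_label=0):
    """Merge one frustum grid into a sparse global map.

    global_map:   dict mapping world voxel index (i, j, k) -> class label
    local_labels: (X, Y, Z) integer grid in the frame's coordinate system
    local_origin: (3,) position of the grid corner in frame coordinates
    cam_to_world: (4, 4) pose, assumed to map frame to world coordinates
    """
    idx = np.argwhere(local_labels != free_label)  # occupied voxel indices
    if len(idx) == 0:
        return global_map
    centers = local_origin + (idx + 0.5) * VOXEL_SIZE  # voxel centres, frame coords
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    world = (cam_to_world @ homo.T).T[:, :3]
    keys = np.floor(world / VOXEL_SIZE).astype(int)
    for key, (i, j, k) in zip(keys, idx):
        # Assumed merge rule: the most recent observation overwrites older ones.
        global_map[tuple(int(v) for v in key)] = int(local_labels[i, j, k])
    return global_map


# Toy example: the same 2x2x2 grid observed from two poses 16 cm apart in x.
grid = np.zeros((2, 2, 2), dtype=int)
grid[1, 0, 0] = 4
global_map = fuse_frame({}, grid, np.zeros(3), np.eye(4))
shifted = np.eye(4)
shifted[0, 3] = 0.16
global_map = fuse_frame(global_map, grid, np.zeros(3), shifted)
```

After both frames are merged, the map holds the occupied voxel at two world locations, two voxels apart along x.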

Core Tasks

  • Local semantic occupancy prediction: Infer per-voxel semantic occupancy within the current camera frustum using a single RGB image and known camera parameters.
  • Embodied/global occupancy prediction: Sequentially aggregate per-frame predictions to build a global semantic occupancy map as an agent moves through a trajectory (Wang et al., 24 May 2026, Guo et al., 14 Mar 2026).

Evaluation Metrics

Standard metrics are defined as:

  • Per-class Intersection-over-Union (IoU): For each class $c$,

$$\mathrm{IoU}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$$

where $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ denote true positives, false positives, and false negatives for class $c$, respectively.

  • Mean IoU (mIoU):

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c$$

with $C$ the number of semantic classes (Zhang et al., 20 Apr 2025, Wang et al., 24 May 2026).

  • Scene completion IoU (SC-IoU):

$$\mathrm{SC\text{-}IoU} = \frac{|P \cap G|}{|P \cup G|}$$

where $P$ and $G$ are the predicted and ground-truth occupied voxel sets (Guo et al., 14 Mar 2026).

  • Accuracy (Acc):

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]$$

with $y_i$ and $\hat{y}_i$ as ground-truth and predicted labels, respectively, over the $N$ evaluated voxels (Yu et al., 2024).
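The four metrics can be computed from integer label arrays as in the sketch below. It is illustrative rather than an official evaluation script: the free-space index, the choice to pass the evaluated class list explicitly, and the handling of classes absent from both prediction and ground truth (reported as NaN and skipped in the mean) are assumptions.

```python
import numpy as np


def per_class_iou(pred, gt, class_ids):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for each class in class_ids."""
    ious = {}
    for c in class_ids:
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        # Assumption: a class absent from both arrays yields NaN.
        ious[c] = float(tp) / float(denom) if denom > 0 else float("nan")
    return ious


def mean_iou(ious):
    """Average of the per-class IoUs, ignoring NaN entries."""
    return float(np.nanmean(list(ious.values())))


def sc_iou(pred, gt, free_label=0):
    """Class-agnostic IoU between predicted and ground-truth occupied voxels."""
    p = pred != free_label
    g = gt != free_label
    union = np.sum(p | g)
    return float(np.sum(p & g)) / float(union) if union > 0 else float("nan")


def accuracy(pred, gt):
    """Fraction of voxels whose predicted label equals the ground truth."""
    return float(np.mean(pred == gt))


# Toy example with free space (0) and two semantic classes (1, 2).
gt = np.array([0, 1, 1, 2, 2, 0])
pred = np.array([0, 1, 2, 2, 2, 1])
ious = per_class_iou(pred, gt, class_ids=[1, 2])
print(ious)                  # class 1: 1/3, class 2: 2/3
print(mean_iou(ious))        # 0.5
print(sc_iou(pred, gt))      # 0.8
print(accuracy(pred, gt))    # 4 of 6 voxels correct
```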

4. Methodological Advances Driven by Occ-ScanNet

Occ-ScanNet catalyzes research in monocular occupancy prediction by providing a high-resolution, semantically rich, and spatially well-aligned evaluation platform. Algorithms are required to estimate fine spatial geometry, model class imbalance (free space vs. objects), and resolve small-object and occlusion challenges.
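The imbalance between free space and occupied classes can be quantified directly from a label grid. The sketch below uses a synthetic grid, not real data, and assumes hypothetical class indices (0 for free space, 2 for floor); it only shows how class frequencies and the occupied fraction would be measured.

```python
import numpy as np

GRID_SHAPE = (60, 60, 36)


def class_frequencies(grid, num_classes=12):
    """Fraction of voxels belonging to each class."""
    counts = np.bincount(grid.ravel(), minlength=num_classes)
    return counts / counts.sum()


def occupied_fraction(grid, free_label=0):
    """Fraction of voxels that are not free space."""
    return float(np.mean(grid != free_label))


# Synthetic grid: everything free except a one-voxel-thick floor slab.
grid = np.zeros(GRID_SHAPE, dtype=np.int64)
grid[:, :, 0] = 2  # hypothetical floor index
freqs = class_frequencies(grid)
occ = occupied_fraction(grid)
print(round(occ, 4))  # 0.0278, i.e. 1/36 of the grid
```

Even in this simplified case, free space outnumbers the single occupied class by 35 to 1, which illustrates why losses and metrics need to account for the skew.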

Representative Architectures

  • ISO (Yu et al., 2024): Introduces a Dual Feature Line of Sight Projection module to fuse 2D depth distributions and image features into the 3D voxel space. Designed to exploit pre-trained depth predictors for improved lifting of monocular features.
  • VEOcc (Wang et al., 24 May 2026): Proposes a recursive voxel-centric paradigm, eliminating scene-size priors for open-ended global mapping. Introduces Cross-Temporal Logit Aggregation (TLA), Reliability-Aware Confidence Modulation (RCM), and Confidence-Driven Incremental State Update (CSU) for robust online semantic fusion.
  • RoboOcc (Zhang et al., 20 Apr 2025): Combines opacity-guided self-encoding and geometry-aware cross-encoding to address semantic ambiguity in overlapping Gaussians.
  • SGR-OCC (Guo et al., 14 Mar 2026): Employs a Soft-Gating Feature Lifter for depth-uncertainty-aware voxel lifting, Dynamic Ray-Constrained Anchor Refinement for sub-voxel geometric accuracy, and semantic-adaptive geometric regularization.

Ablation and Qualitative Results

Leading methods report sharper boundary preservation, better recovery of fine structures such as windows, chair legs, and occluded objects, and greater robustness to noisy monocular depth (Wang et al., 24 May 2026, Guo et al., 14 Mar 2026). SGR-OCC, for example, shows a gain of 5.30 percentage points in SC-IoU over naive lifting, attributed to soft-gating, and VEOcc outperforms the prior SplatSSC by 3.66 mIoU points in the local setting, with absolute gains of nearly 10 points on Occ-ScanNet-Mini (Wang et al., 24 May 2026, Guo et al., 14 Mar 2026).

5. Comparative Performance and Benchmark Statistics

Scale and Diversity

| Dataset | Frames (train / val) | Voxel grid (per frame) | Classes |
|---|---|---|---|
| Occ-ScanNet | 45,755 / 19,764 | 60×60×36 | 12 |
| Occ-ScanNet-Mini | 4,639 / 2,007 | 60×60×36 | 12 |
| NYUv2 | 795 / 654 | 60×36×60 (downsampled) | 13 |

Occ-ScanNet comprises approximately 8.5 billion voxels overall, broadening scene diversity and depth coverage compared to previous indoor volumetric datasets (Yu et al., 2024).
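The voxel total follows from the frame counts and grid shape reported above; the short check below reproduces the arithmetic (variable names are illustrative).

```python
# Frame counts and grid shape as reported for the full dataset.
train_frames = 45_755
val_frames = 19_764
voxels_per_frame = 60 * 60 * 36  # 129,600

total_frames = train_frames + val_frames
total_voxels = total_frames * voxels_per_frame

print(total_frames)                   # 65519
print(total_voxels)                   # 8491262400
print(round(total_voxels / 1e9, 2))   # 8.49 (billion)
```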

Leaderboard (Local Task, Validation Split)

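| Method | Input | IoU (%) | mIoU (%) | SC-IoU (%) |
|---|---|---|---|---|
| SGR-OCC | RGB | n/a | 49.89 | 58.55 |
| VEOcc | RGB | 64.55 | 55.49 | n/a |
| RoboOcc | RGB | 56.48 | 44.78 | n/a |
| EmbodiedOcc++ | RGB | n/a | 46.20 | 54.90 |
| EmbodiedOcc | RGB | 53.95 | 42.90 | 53.55 |
| ISO | RGB | 42.16 | 24.61 | 42.16 |
| MonoScene | RGB | 41.60 | 19.84 | 41.60 |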

This table samples results reported in the literature. The highest mIoU is achieved by VEOcc (55.49), followed by SGR-OCC (49.89), EmbodiedOcc++ (46.20), and RoboOcc (44.78) (Wang et al., 24 May 2026, Guo et al., 14 Mar 2026, Zhang et al., 20 Apr 2025).

6. Relation to Other Datasets and Key Challenges

Occ-ScanNet differs from ScanNet in that it supplies spatially aligned dense occupancy labels rather than meshes or sparse point clouds and is formatted expressly for voxel-based monocular occupancy prediction. In contrast to EmbodiedOcc-ScanNet, which targets global spatiotemporal fusion, Occ-ScanNet focuses primarily on single-view, frustum-local occupancy estimation, though it also supports sequential (embodied) tasks (Zhang et al., 20 Apr 2025, Wang et al., 24 May 2026).

Unique challenges stem from class imbalance, high sparsity of occupied voxels, limited frustum extent, clutter, and occlusion. The inclusion of small, fine-grained object classes (e.g., windows, TVs, and miscellaneous objects) requires methods to maintain high geometric precision (Zhang et al., 20 Apr 2025).

7. Impact and Research Trajectory

Occ-ScanNet has become a widely used benchmark for indoor semantic occupancy, supporting advances in monocular online mapping, voxel-centric representation learning, and embodied AI for spatial understanding. Recent methods such as VEOcc and SGR-OCC are developed and validated on the frame-aligned occupancy supervision that Occ-ScanNet provides (Wang et al., 24 May 2026, Guo et al., 14 Mar 2026). A plausible implication is that future benchmarks may further expand embodied, interactive, and sensor-fusion tasks seeded by the Occ-ScanNet evaluation design and protocol.
