ScanNet++: High-Fidelity 3D Benchmark

Updated 25 May 2026

ScanNet++ is a high-fidelity 3D indoor scene dataset combining sub-millimeter laser scans, 33MP DSLR imagery, and RGB-D video streams for detailed geometry and semantic analysis.
It establishes comprehensive public benchmarks for 3D semantic and instance segmentation as well as novel view synthesis, evaluated with metrics like PSNR, SSIM, and mIoU.
The resource integrates automated semantic relabeling using consensus voting and neural rendering, significantly enhancing annotation quality and facilitating state-of-the-art research.

ScanNet++ is a suite of high-fidelity, large-scale datasets and benchmarks for 3D indoor scene understanding and novel view synthesis. It emphasizes the joint capture of geometry and multimodal imagery, annotated with detailed open-vocabulary semantics. The central resource is a 460-scene dataset with sub-millimeter laser scan geometry, 33-megapixel DSLR image sequences, commodity-level RGB-D video streams, and multi-label, ambiguity-aware 3D semantic annotation, all co-registered in a unified coordinate frame. ScanNet++ establishes public benchmarks for 3D semantic and instance segmentation under multi-label scenarios, and for neural rendering and novel view synthesis from real-world camera trajectories. The term "ScanNet++" also refers to fully automated semantic relabelings of legacy ScanNet scenes produced by consensus-driven neural rendering pipelines, which further inform benchmarking in the field.

1. Dataset Composition and Acquisition Methodology

The primary ScanNet++ dataset is constructed through a multimodal capture pipeline coupling high-resolution geometry and appearance modalities:

Laser scanning: Each scene is scanned with a Faro Focus Premium laser scanner, yielding ≈40 million points per scan at average point spacing ≈0.9 mm, with typically 1–10 scans per environment (average 4.85).
DSLR imagery: A Sony α7 IV camera with a fisheye lens (approx. 180° FOV) acquires 33-megapixel frames under fixed photometric settings (1/100 s exposure, fixed white balance). Each scene contains on average ≈200 training images and 15–25 distinct test views.
Commodity RGB-D: iPhone 13 Pro RGB (1920×1440) and LiDAR depth (256×192), captured in video at 60 Hz, producing >3.7 million RGB-D frames for the dataset.
Scene statistics: The release comprises 460 scenes (apartments, offices, classrooms, labs, workshops), 1,858 laser scans, ≈280,000 DSLR images, and dense mesh reconstructions at sub-millimeter resolution. Meshes are simplified for visualization (to 12.5%, 5%, and 1.5% of the original faces).

Calibration and registration comprehensively align all sensor modalities. Initial alignment uses COLMAP-based structure-from-motion, combining real and rendered “pseudo-images.” Pose and intrinsics are then refined by minimizing 2D–3D reprojection error, dense photometric terms, and cross-modality depth consistency (discrepancies >0.3 m lead to RGB-D frame rejection). Intrinsics and extrinsics are optimized jointly.
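The depth-consistency rejection step can be illustrated with a minimal sketch. It assumes that each iPhone depth map is compared against a depth map rendered from the laser-scan geometry at the same pose, and that the per-frame discrepancy is summarized by the median absolute difference over valid pixels. The function names and the choice of the median are assumptions of this sketch; the source only states that discrepancies above 0.3 m lead to frame rejection.

```python
import numpy as np

MAX_DEPTH_DISCREPANCY_M = 0.3  # threshold quoted in the text


def depth_discrepancy(sensor_depth, rendered_depth):
    """Median absolute depth difference (metres) over jointly valid pixels.

    Summarizing with the median is an assumption of this sketch.
    """
    sensor_depth = np.asarray(sensor_depth, dtype=np.float64)
    rendered_depth = np.asarray(rendered_depth, dtype=np.float64)
    valid = (
        np.isfinite(sensor_depth)
        & np.isfinite(rendered_depth)
        & (sensor_depth > 0)
        & (rendered_depth > 0)
    )
    if not valid.any():
        # Nothing to compare: treat the frame as inconsistent.
        return float("inf")
    return float(np.median(np.abs(sensor_depth[valid] - rendered_depth[valid])))


def keep_frame(sensor_depth, rendered_depth, max_discrepancy=MAX_DEPTH_DISCREPANCY_M):
    """Return True if the RGB-D frame passes the depth-consistency check."""
    return depth_discrepancy(sensor_depth, rendered_depth) <= max_discrepancy


# Toy example at the iPhone LiDAR resolution mentioned above (256x192).
rendered = np.full((192, 256), 2.0)
print(keep_frame(rendered + 0.05, rendered))  # small offset, frame kept
print(keep_frame(rendered + 0.50, rendered))  # large offset, frame rejected
```

A robust statistic such as the median keeps a few outlier pixels (for example at depth edges) from rejecting an otherwise well-aligned frame; a mean or a percentile would be equally plausible choices.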

2. Semantic Annotation: Open Vocabulary and Ambiguity

Semantic ground truth in ScanNet++ is defined by open-vocabulary 3D labeling:

Labeling scheme: Over-segmented mesh regions are annotated with free-text instance labels, not restricted to a fixed class set (>1,000 unique entries).
Multi-label regions: Segments can simultaneously belong to multiple concepts (e.g., a “jacket” draped over a “chair,” or a “window” inset in a “door”), supporting explicit capture of part–whole and occlusion ambiguity.
Ambiguity representation: Guidelines codify typical multi-labeling scenarios; quality control includes verification passes for consistency. The dataset reports a multi-label co-occurrence matrix, quantifying long-tail and ambiguous class pairings, though no single “ambiguity score” per region is introduced.
Comparative context: Unlike ScanNet’s fixed ≈20 category labels, or ARKitScenes’ 17 bounding-box categories, ScanNet++ provides far richer semantic diversity and annotation coverage.
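As an illustration of the multi-label co-occurrence statistics mentioned above, the following sketch counts how often two labels are attached to the same segment. The input layout (one set of free-text labels per mesh segment) and the function name are hypothetical and do not reflect the dataset's actual file format.

```python
from collections import Counter
from itertools import combinations


def label_cooccurrence(segment_labels):
    """Count, for every unordered label pair, the segments carrying both labels."""
    counts = Counter()
    for labels in segment_labels:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical segments using the examples from the text.
segments = [
    {"chair", "jacket"},
    {"door", "window"},
    {"chair"},
    {"chair", "jacket"},
]
pairs = label_cooccurrence(segments)
print(pairs.most_common())
```

Single-label segments contribute nothing to the counts, so the resulting table isolates exactly the ambiguous regions.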

3. Benchmarks and Evaluation Protocols

ScanNet++ supports two principal benchmark tracks with well-defined evaluation metrics:

A. Novel View Synthesis (NVS):

Both high-fidelity DSLR and commodity RGB (iPhone) frames are accepted as training sources.
Test views are strictly held out, with novel camera positions (mean translation/rotation difference 0.40 m/42.7° versus ScanNet’s 0.04 m/3.1°).
Main metrics: PSNR (peak signal-to-noise ratio), SSIM (Structural Similarity), LPIPS (Learned Perceptual Image Patch Similarity).
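Of the three metrics, PSNR is simple enough to state directly: $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MAX is the largest possible pixel value. The sketch below is a generic implementation that assumes images are floating-point arrays scaled to $[0, 1]$; it is not the benchmark's official evaluation code.

```python
import numpy as np


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value**2 / mse))


# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. about 20 dB.
target = np.zeros((4, 4, 3))
pred = target + 0.1
print(round(psnr(pred, target), 3))
```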

B. 3D Semantic and Instance Segmentation:

Methods predict per-vertex semantic labels (multiple labels per vertex are allowed) and instance masks on the reconstructed mesh.
Semantic segmentation is quantified by mean intersection-over-union (mIoU):

$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$

Instance segmentation uses average precision at IoU≥0.5 (AP@50), with evaluation on a hidden test set via online leaderboard.
Submissions may return multiple labels per vertex to accommodate the multi-label ground truth inherent in ScanNet++ (Yeshwanth et al., 2023).
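The mIoU formula above can be computed from a confusion matrix, as in the following sketch. It handles the single-label case only; how the benchmark scores multi-label predictions is not specified here, and skipping classes that appear in neither prediction nor ground truth is an assumption of this sketch.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, ignore_index=-1):
    """Mean IoU over classes from per-vertex integer labels in [0, num_classes)."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    keep = gt != ignore_index  # drop unlabeled vertices
    pred, gt = pred[keep], gt[keep]
    # confusion[i, j] = number of vertices with ground truth i predicted as j
    confusion = np.bincount(
        gt * num_classes + pred, minlength=num_classes**2
    ).reshape(num_classes, num_classes)
    tp = np.diag(confusion).astype(np.float64)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    denom = tp + fp + fn
    present = denom > 0  # classes that occur in prediction or ground truth
    if not present.any():
        return float("nan")
    return float(np.mean(tp[present] / denom[present]))


gt = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
# Class 0: TP=1, FP=0, FN=1 -> 1/2; class 1: TP=2, FP=1, FN=0 -> 2/3.
print(mean_iou(pred, gt, num_classes=2))
```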

Further variants exist (“ScanNet++ V1” with 330 scans and “ScanNet++ V2” with 956 scans and denser annotations), each with specific splits, 84-instance category taxonomies, and superpoint-driven annotation conventions (Yao et al., 11 Feb 2026).

4. Automated Label Generation and ScanNet++ Labelmaker

Sophisticated automatic semantic relabeling pipelines expand the ScanNet++ concept to legacy datasets:

Ensemble segmentation: Multiple 2D/3D models (InternImage, CMX, OVSeg, Mask3D) are run per frame, mapping all outputs to a unified 186-class (WordNet synkey) space. Each model contributes per-pixel predictions (logits or hard labels), optionally augmented by the human ScanNet label.
Consensus voting: For pixel $(x,i)$, label $\ell^*(x,i)$ is assigned if the most-voted label reaches a vote threshold (e.g., 4/8 or 3/13); otherwise, the pixel is marked unknown.
3D consistency via neural rendering: Predictions are lifted into 3D with a neural implicit surface (SDF-based “NeuS-acc” in SDFStudio), jointly training radiance and semantic heads with multi-term losses (photometric, depth, normal, eikonal, semantic).
Multi-view fusion: Consensus labels are rendered back into every frame and/or fused onto the point cloud by projecting each 3D vertex into visible frames and majority voting.
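The consensus-voting step can be sketched as a thresholded per-pixel vote over hard labels. The array layout, the `UNKNOWN` sentinel, and the function name are assumptions of this sketch rather than the LabelMaker code; ties between equally voted classes are resolved in favor of the lowest class index, which is also an arbitrary choice.

```python
import numpy as np

UNKNOWN = -1  # sentinel for pixels without consensus (assumed convention)


def consensus_vote(predictions, num_classes, min_votes):
    """Thresholded vote over hard labels.

    predictions: integer array of shape (num_models, H, W) in the shared
    label space; negative entries mean that the model abstains.
    """
    predictions = np.asarray(predictions)
    votes = np.zeros((num_classes,) + predictions.shape[1:], dtype=np.int32)
    for model_pred in predictions:
        valid = model_pred >= 0
        pixel_index = np.nonzero(valid)
        # Add one vote at (class, row, col) for every non-abstaining pixel.
        np.add.at(votes, (model_pred[valid],) + pixel_index, 1)
    winner = votes.argmax(axis=0)
    top_votes = votes.max(axis=0)
    return np.where(top_votes >= min_votes, winner, UNKNOWN)


# Three hypothetical models labeling a 1x3 image.
preds = np.array([
    [[2, 0, 1]],
    [[2, 1, -1]],
    [[2, 0, 0]],
])
print(consensus_vote(preds, num_classes=3, min_votes=2))
```

The same voting function applies to the multi-view fusion step if the first axis enumerates the frames in which a 3D vertex is visible instead of the models.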

Empirical results (on 5 ScanNet scenes) show label quality increases of up to 12 pp mIoU on NYU40 and 8 pp on WordNet186 relative to original human annotations. Consensus voting and 3D consistency both yield measurable improvements. For instance, fully automatic LabelMaker achieves 53.4%/65.0%/77.5% for 2D NYU40 (mIoU/mAcc/tAcc), while 3D fusion gives 44.1%/53.4%/76.1% for NYU40 (Weder et al., 2023).

5. Advanced Algorithms for Segmentation and Instance Reasoning

Recent methods benchmarked on ScanNet++ highlight the difficulty and the progression of the 3D instance segmentation task:

SAI3D (Segment Any Instance in 3D Scenes):

Begins by over-segmenting the point cloud into normal-coherent “superpoints” using graph-cut (Felzenszwalb-Huttenlocher).
Computes multi-view affinities between superpoints: projects them into each RGB view, accumulates histogram similarities of overlapping SAM masks, and aggregates affinities with visibility-weighted averaging.
A hierarchical, dynamically-thresholded region-growing algorithm sequentially merges superpoints into 3D instance masks, employing a multi-level affinity criterion ($A_{R,j}$) to ensure joint connectivity and a progressive schedule of thresholds (e.g., $\{0.9, 0.8, 0.7\}$).
On ScanNet++ validation, SAI3D outperforms both open-vocabulary and supervised methods: AP@50=31.1, AP@25=49.5, versus Mask3D (AP@50=17.3) and SAM3D (AP@50=14.2). Ablations confirm the necessity of superpoint-based aggregation, multi-level affinity, and progressive thresholding for state-of-the-art performance (Yin et al., 2023).
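The progressive-threshold idea can be illustrated with a deliberately simplified sketch. It assumes a precomputed symmetric superpoint affinity matrix and uses the mean pairwise affinity between two regions as the merge score. This is a stand-in for the multi-level criterion $A_{R,j}$ and omits spatial adjacency checks, so it should not be read as the SAI3D implementation.

```python
import numpy as np


def region_affinity(affinity, members_a, members_b):
    """Mean pairwise affinity between two groups of superpoints (simplification)."""
    return float(affinity[np.ix_(members_a, members_b)].mean())


def grow_regions(affinity, thresholds=(0.9, 0.8, 0.7)):
    """Greedily merge superpoints under a decreasing threshold schedule."""
    affinity = np.asarray(affinity, dtype=np.float64)
    regions = [[i] for i in range(affinity.shape[0])]
    for threshold in thresholds:
        while True:
            best_score, best_pair = threshold, None
            for a in range(len(regions)):
                for b in range(a + 1, len(regions)):
                    score = region_affinity(affinity, regions[a], regions[b])
                    if score >= best_score:
                        best_score, best_pair = score, (a, b)
            if best_pair is None:
                break  # nothing left to merge at this threshold
            a, b = best_pair
            regions[a] = regions[a] + regions[b]
            del regions[b]  # b > a, so index a stays valid
    labels = np.empty(affinity.shape[0], dtype=np.int64)
    for region_id, members in enumerate(regions):
        labels[members] = region_id
    return labels


# Four superpoints: 0 and 1 are strongly linked, 2 and 3 moderately linked.
A = np.array([
    [1.00, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.75],
    [0.10, 0.10, 0.75, 1.00],
])
print(grow_regions(A))
```

In this toy case the strongly linked pair merges at the 0.9 stage and the moderately linked pair only at the 0.7 stage, so high-confidence merges are committed before weaker evidence is considered.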

LaSSM (Efficient Semantic-Spatial Query Decoding):

Introduces a hierarchical semantic-spatial query initializer, selecting superpoints with high semantic activation and spatial coverage via farthest point sampling.
The coordinate-guided state space model (SSM) decoder applies local kNN-based aggregation and dual-path (causal and anti-causal, via Hilbert curve permutations) SSM convolution, enabling both geometric locality and global context without quadratic attention costs.
On ScanNet++ V2, LaSSM (smpro variant) achieves val mAP=29.1, AP@50=43.5, AP@25=51.6; on the hidden test set, it reaches mAP=32.4, AP@50=48.0, AP@25=54.8. The method uses roughly a third of the FLOPs of SGIFormer (4.8 GFLOPs vs. 13.5 GFLOPs), combining state-of-the-art accuracy with lower computational cost (Yao et al., 11 Feb 2026).
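Farthest point sampling, which the query initializer relies on for spatial coverage, is a standard greedy algorithm and can be sketched as follows. Only the sampling step is shown; how LaSSM combines it with semantic activation scores is not reproduced here, and the function signature is an assumption of this sketch.

```python
import numpy as np


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy farthest point sampling; returns indices into `points`."""
    points = np.asarray(points, dtype=np.float64)
    num_samples = min(num_samples, points.shape[0])
    if num_samples <= 0:
        return np.empty(0, dtype=np.int64)
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start_index
    # Distance from every point to its nearest already selected point.
    min_dist = np.linalg.norm(points - points[start_index], axis=1)
    for k in range(1, num_samples):
        selected[k] = int(np.argmax(min_dist))
        dist_to_new = np.linalg.norm(points - points[selected[k]], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


# Hypothetical superpoint centers along a line.
centers = np.array([[0.0, 0, 0], [1.0, 0, 0], [10.0, 0, 0], [5.0, 0, 0]])
print(farthest_point_sampling(centers, num_samples=3))
```

Each iteration costs time linear in the number of points, so selecting $k$ queries from $n$ superpoints takes $O(nk)$ operations.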

6. Comparative Context and Research Applications

ScanNet++ distinguishes itself from prior benchmarks in both geometric and semantic fidelity:

Dataset	# Scenes	Geometry	Image Modality	Semantic Schema	Distinct Features
ScanNet	1,503	Commodity RGB-D	640×480 (RGB-D)	≈20 fixed classes	Low-res geometry, limited semantic scope
Matterport3D	–	Panoramic RGB-D	Panoramic, lower-res	40 classes	Spherical capture, no open vocab.
ARKitScenes	–	Commodity RGB-D	iPhone RGB-D	17 bounding-box categories	Bounding-box-style annotation
ScanNet++	460	Sub-mm laser + DSLR + iPhone RGB-D	33MP DSLR, dense iPhone RGB-D	Open vocab, >1,000 classes, multi-label	Multi-modal registration, real-world NVS, open-vocab, ambiguity-aware semantics

ScanNet++ enables generalizable neural radiance fields (NeRF), learned radiance and semantic priors, and multi-modal learning regimes integrating geometry, appearance, and semantics. Benchmarks are designed for robustness under real-world image artifacts (motion blur, exposure variation, pose noise), and for rigorous study of ambiguity both in labeling and instance segmentation (Yeshwanth et al., 2023).

7. Practical Considerations

ScanNet++ provides a unified framework and public resources to facilitate reproducible research:

Dataset splits (standard: 360 train / 50 val / 50 test) preserve scene-type distributions.
Open-source pipelines (e.g., LabelMaker) are available covering the entire processing stack (data loading, consensus, neural rendering, evaluation), with pretrained models and all required dependencies documented (Weder et al., 2023).
Per-scene processing (e.g., for LabelMaker semantic relabeling) completes in ≈5 hours on an RTX3090 for ≈1,300 frames per scene.
Online leaderboards govern access to held-out ground truth for all principal ScanNet++ benchmarks, ensuring fair and blind evaluation.
Efficient decoders such as LaSSM keep training and inference tractable on dense point clouds with complex annotations, reducing the compute required for SOTA performance.

ScanNet++ thus provides the research community with a high-fidelity, multi-modal resource for advancing 3D scene understanding, novel view synthesis, semantic reasoning under ambiguity, and multi-label instance discrimination, with higher geometric resolution, annotation density, and semantic richness than the earlier benchmarks compared above.