ScanNetV2 Dataset: 3D Indoor Scene Benchmark

Updated 9 April 2026

ScanNetV2 is a large-scale 3D indoor scene dataset offering RGB-D scans from residential and office environments to benchmark semantic and zero-training 3D instance segmentation.
The refined ScanNetV2-INS version improves annotation completeness by re-segmenting partially labeled objects and adding missing instances, resulting in a 28% increase in instance counts.
Enhanced evaluation metrics and detailed point-level labels challenge open-world and class-agnostic segmentation methods while providing a rigorous basis for research comparisons.

ScanNetV2 is a large-scale 3D indoor scene dataset widely used as a benchmark for 3D instance segmentation. It consists of RGB-D scans of residential and office environments and provides point cloud representations of complex scenes. It is used primarily to evaluate semantic and, notably, class-agnostic (zero-training) 3D instance segmentation methods in diverse, realistic indoor settings. The validation split, however, has notable annotation problems. These prompted the introduction of ScanNetV2-INS, a version with enhanced point-level annotations that addresses incomplete labeling and supports more rigorous benchmarking of open-world and class-agnostic segmentation models (Yang et al., 2024).

1. Annotation Challenges in ScanNetV2

ScanNetV2's original instance labels suffer from two principal flaws: missing instances and incomplete masks. Many small but semantically distinct objects (such as papers on desks and posters on walls) were excluded from annotation entirely. In addition, many obvious instances, notably doors and boards, contain regions of the point cloud marked as "unlabeled," which yields partial or fragmented instance masks. These deficiencies bias evaluation: zero-training (class-agnostic) methods are prone to overestimating their performance, because real but unannotated instances become hidden false negatives that are never counted. As a result, false negatives are underestimated, performance metrics lose accuracy, and fair quantitative comparison becomes difficult, particularly for methods that do not leverage ground-truth semantics.
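
To make the hidden-false-negative effect concrete, here is a toy calculation. The scene, the object names, and the `recall` helper are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
# Hypothetical scene: three real objects, but the original labels omit the poster.
real_objects = {"door", "board", "poster"}
annotated_original = {"door", "board"}

# A hypothetical method that finds the door and the board but misses the poster.
predicted = {"door", "board"}


def recall(predicted: set, ground_truth: set) -> float:
    """Fraction of ground-truth instances recovered by the predictions."""
    return len(predicted & ground_truth) / len(ground_truth)


recall_incomplete = recall(predicted, annotated_original)  # poster is invisible to the metric
recall_complete = recall(predicted, real_objects)          # poster now counts as a miss
hidden_false_negatives = real_objects - annotated_original - predicted

print(recall_incomplete, round(recall_complete, 3), hidden_false_negatives)
```

Under the incomplete labels the method appears to have perfect recall, while the complete labels expose the missed poster.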

2. Construction Process of ScanNetV2-INS

ScanNetV2-INS is a comprehensive revision of the original ScanNetV2 validation set. It covers all 312 validation scenes and leaves the training and test splits unaltered. The authors used the interactive 3D annotation tool AGILE3D (Yue et al., ICLR 2024) for efficient point-level relabeling. The revision process consisted of two main steps:

Completing Partially Labeled Instances: Every door, wall board, and object displaying extensive unlabeled (black) regions was fully re-segmented and assigned a new, class-agnostic instance ID.
Adding Previously Missing Instances: All small, clearly recognizable objects, such as paper piles and desk accessories that were omitted in the original labels, were added with dedicated instance IDs.

All annotations in ScanNetV2-INS remain class-agnostic. The revisions strictly focused on improving the mask completeness and object coverage per scene.
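
The effect of the two steps on a per-point label array can be sketched as follows. This is only an illustration of the bookkeeping, not the authors' tooling: the masks would come from interactive annotation, and the function name `apply_revisions`, the array layout, and the convention that ID 0 means "unlabeled" are all assumptions.

```python
import numpy as np

UNLABELED = 0  # assumed ID for unlabeled points


def apply_revisions(instance_ids, completed_masks, new_masks):
    """Return a revised copy of per-point instance IDs.

    completed_masks: boolean masks covering the full extent of objects that
        were only partially labeled (step 1).
    new_masks: boolean masks for objects missing from the original labels (step 2).
    Every mask receives a fresh, class-agnostic instance ID.
    """
    revised = np.asarray(instance_ids).copy()
    next_id = int(revised.max()) + 1
    for mask in list(completed_masks) + list(new_masks):
        revised[np.asarray(mask, dtype=bool)] = next_id
        next_id += 1
    return revised


# Toy scene with 8 points: a partially labeled door (ID 1), a chair (ID 2),
# and an unannotated paper pile at point 6.
original = np.array([1, 1, 0, 0, 2, 2, 0, 0])
door_full = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
paper_pile = np.array([0, 0, 0, 0, 0, 0, 1, 0], dtype=bool)

revised = apply_revisions(original, [door_full], [paper_pile])
print(revised.tolist())
```

In this toy case the door is re-segmented under a new ID that covers its formerly unlabeled points, the paper pile gains its own ID, and the chair is untouched.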

3. Dataset Statistics

The enhancement yielded systematic changes in dataset coverage and granularity:

Dataset	Min per Scene	Max per Scene	Avg per Scene	Total Instances
ScanNetV2 (val)	2	47	14.0	4,364
ScanNetV2-INS (val)	2	54	17.9	5,596
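
As a quick consistency check, the totals in the table can be related to the 312 validation scenes with a few lines of arithmetic. The variable names are illustrative; the numbers are those reported in the table.

```python
NUM_VAL_SCENES = 312   # validation scenes covered by the revision
total_original = 4364  # ScanNetV2 (val) total instances
total_ins = 5596       # ScanNetV2-INS (val) total instances

added = total_ins - total_original
relative_increase = added / total_original
avg_original = total_original / NUM_VAL_SCENES
avg_ins = total_ins / NUM_VAL_SCENES

print(f"added instances: {added}")
print(f"relative increase: {relative_increase:.1%}")
print(f"average per scene: {avg_original:.1f} -> {avg_ins:.1f}")
```

The derived averages agree with the "Avg per Scene" column, and the relative increase rounds to the 28% figure.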

A breakdown by object size (number of points per instance) shows a marked increase in annotated small objects (<500 points), which almost tripled from 252 to 692. The total instance count in the validation set rose by 1,232 (a 28% increase), the maximum number of instances per scene increased from 47 to 54, and the average rose from 14.0 to 17.9. Because ScanNetV2-INS revises only the validation partition, it has no training or test splits of its own and is intended for evaluation only.

Point-count Range	ScanNetV2	ScanNetV2-INS
< 500	252	692
500–1,000	452	748
1,000–2,000	1,119	1,366
2,000–5,000	1,690	1,873
5,000–10,000	567	626
>10,000	284	291
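
A size breakdown of this kind can be computed from per-point instance IDs roughly as sketched below. The function name `size_histogram`, the ID 0 for unlabeled points, and the handling of bin boundaries (lower edge inclusive) are assumptions, since the source does not specify how boundary cases are assigned.

```python
import numpy as np

BIN_EDGES = [500, 1000, 2000, 5000, 10000]
BIN_LABELS = ["< 500", "500-1,000", "1,000-2,000",
              "2,000-5,000", "5,000-10,000", "> 10,000"]


def size_histogram(instance_ids, unlabeled_id=0):
    """Count instances per point-count range for one scene."""
    ids, counts = np.unique(np.asarray(instance_ids), return_counts=True)
    counts = counts[ids != unlabeled_id]    # ignore unlabeled points
    bins = np.digitize(counts, BIN_EDGES)   # assumed: lower edge inclusive
    hist = np.bincount(bins, minlength=len(BIN_LABELS))
    return dict(zip(BIN_LABELS, hist.tolist()))


# Toy scene: 100 unlabeled points and three instances of 300, 700, and 1,500 points.
toy_ids = np.concatenate([
    np.zeros(100), np.full(300, 1), np.full(700, 2), np.full(1500, 3)
]).astype(int)
print(size_histogram(toy_ids))
```

Summing such per-scene histograms over all validation scenes would give a table of the form shown above. The two columns of that table sum to 4,364 and 5,596, matching the reported totals.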

4. Evaluation Metrics and Benchmarking Protocol

Standard 3D instance segmentation metrics, independent of semantic labeling, are employed for evaluation:

Intersection over Union (IoU) between a predicted instance mask $P$ and ground truth $G$ :

$\text{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$
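
For point-level masks, the IoU formula reduces to counting points. A minimal sketch, assuming each mask is a boolean array over the points of one scene (the function name `mask_iou` is illustrative):

```python
import numpy as np


def mask_iou(pred, gt) -> float:
    """IoU between two boolean point masks of equal length."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty; treated as no overlap by assumption
    return float(np.logical_and(pred, gt).sum() / union)


# Two masks over 5 points that share 2 points and cover 4 points in total.
print(mask_iou([1, 1, 1, 0, 0], [0, 1, 1, 1, 0]))
```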

Average Precision (AP) at IoU threshold $\tau$ :

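$\mathrm{AP}(\tau) = \int_0^1 p_\tau(r)\,\mathrm{d}r$, where $p_\tau(r)$ is the precision at recall $r$ when a prediction counts as a true positive only if it matches a ground-truth instance with $\mathrm{IoU} \geq \tau$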
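
In common practice, AP at a threshold is obtained by ranking predictions by confidence, matching each one to at most one ground-truth instance, and taking the area under the resulting precision-recall curve. The sketch below follows that recipe in simplified form. It is not the official ScanNet evaluation script; the greedy matching rule, the monotone precision envelope, and the function name `average_precision` are assumptions.

```python
import numpy as np


def average_precision(pred_masks, scores, gt_masks, tau=0.5) -> float:
    """Area under the precision-recall curve at IoU threshold tau."""
    pred_masks = np.asarray(pred_masks, dtype=bool)
    gt_masks = np.asarray(gt_masks, dtype=bool)
    if len(pred_masks) == 0 or len(gt_masks) == 0:
        return 0.0

    # Pairwise IoU matrix of shape (num_predictions, num_ground_truth).
    inter = (pred_masks[:, None, :] & gt_masks[None, :, :]).sum(axis=2)
    union = (pred_masks[:, None, :] | gt_masks[None, :, :]).sum(axis=2)
    iou = inter / np.maximum(union, 1)

    # Greedy matching in order of decreasing confidence.
    order = np.argsort(-np.asarray(scores, dtype=float))
    gt_taken = np.zeros(len(gt_masks), dtype=bool)
    is_tp = np.zeros(len(order), dtype=bool)
    for rank, p in enumerate(order):
        candidates = np.where(gt_taken, -1.0, iou[p])
        g = int(np.argmax(candidates))
        if candidates[g] >= tau:
            gt_taken[g] = True
            is_tp[rank] = True

    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    precision = tp / (tp + fp)
    recall = tp / len(gt_masks)

    # Monotone precision envelope, then sum precision over recall increments.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall_steps = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(recall_steps * precision))


# Toy scene with 6 points, two ground-truth instances, three predictions.
gt = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
preds = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 1, 1, 1]]
conf = [0.9, 0.8, 0.7]
print(average_precision(preds, conf, gt, tau=0.5))
print(average_precision(preds, conf, gt, tau=0.25))
```

In the toy scene the second prediction overlaps its ground-truth instance with an IoU of 1/3, so it is a false positive at the 50% threshold but a true positive at 25%, which is why the two AP values differ.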

Mean Average Precision (mAP), averaged across thresholds $\tau = 50\%, 55\%, ..., 95\%$ :

$\mathrm{mAP} = \frac{1}{10}\sum_{k=0}^{9} \mathrm{AP}(50\% + 5\% \cdot k)$
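
The averaging over thresholds can be written directly from the formula. In this sketch, `ap_at` is a hypothetical callable that returns AP for a given IoU threshold, and the example AP profile is invented for illustration.

```python
def mean_ap(ap_at) -> float:
    """Average AP over IoU thresholds 50%, 55%, ..., 95%."""
    thresholds = [(50 + 5 * k) / 100 for k in range(10)]
    return sum(ap_at(tau) for tau in thresholds) / len(thresholds)


# Invented AP profile: perfect up to the 75% threshold, zero above it.
example_map = mean_ap(lambda tau: 1.0 if tau <= 0.75 else 0.0)
print(example_map)
```

Six of the ten thresholds contribute an AP of 1 in this example, so the mean is 0.6. This shows how mAP rewards masks that remain accurate at strict IoU thresholds.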

Reporting conventions follow established practice: AP at 50% (AP₅₀), AP at 25% (AP₂₅), and mAP ($50:5:95$). The increased completeness in ScanNetV2-INS is expected to raise true positives and lower false negatives, with a general effect of boosting AP values for models capable of fine-grained instance segmentation.

5. Comparative Improvements and Benchmark Impact

The introduction of ScanNetV2-INS yields both quantitative and qualitative improvements over the original validation annotations:

Instance Coverage: Grew by 28%, from 4,364 to 5,596, directly addressing underrepresented object instances.
Small Object Annotations: Instances with fewer than 500 points nearly tripled (from 252 to 692), increasing the dataset's granularity and making the test conditions for segmentation methods more demanding.
Per-Scene Complexity: Average and maximum instance counts per scene rose, challenging models to handle denser, more intricate scenarios.
Annotation Accuracy: Large unlabeled swaths in doors, boards, and accessories were corrected, and previously omitted small objects were systematically annotated.

When evaluated on the stricter ScanNetV2-INS ground truth, zero-training (class-agnostic) methods generally observe a decrease in raw AP, reflecting the higher bar set by more thorough positive instance labeling. Only methods that substantially over-segment may observe unchanged or improved scores, as the likelihood of covering additional, previously unlabeled regions increases. The revised benchmark thus provides a more demanding and equitable basis for comparing approaches, particularly those relying on mask transfer without ground-truth semantics.

6. Significance for Open-World and Class-Agnostic 3D Segmentation

ScanNetV2-INS specifically targets the shortcomings that hinder transparent evaluation of open-vocabulary and class-agnostic 3D instance segmentation. By supplying more complete point-level instance masks and a substantially larger pool of small objects, it serves as a drop-in replacement for the validation split that better reflects the actual capabilities and limitations of zero-training methods. It preserves compatibility with established data formats while supplementing the ground-truth labels, enabling researchers to evaluate new models with greater rigor and fairness (Yang et al., 2024).

A plausible implication is that future benchmarks for 3D perception will adopt similar annotation enhancements to mitigate the systemic biases caused by incomplete ground truth, especially as open-world and foundation model-driven paradigms become more prevalent.


References

Yang et al. (2024). SA3DIP: Segment Any 3D Instance with Potential 3D Priors.
