
Replica & ScanNet++: 3D Scene Datasets

Updated 2 December 2025
  • Replica and ScanNet++ are comprehensive datasets for 3D indoor scene reconstruction, offering synthetic perfection and real-world sensor challenges.
  • Replica features artist-curated scenes with high-resolution HDR textures and closed-vocabulary semantic annotations for controlled experiments.
  • ScanNet++ delivers real-world high-fidelity scans with open-vocabulary, multi-label annotations, enabling robust evaluation of novel view synthesis and segmentation.

Replica and ScanNet++ are leading large-scale datasets for 3D indoor scene reconstruction, novel view synthesis (NVS), and semantic scene understanding. Replica provides highly photo-realistic synthetic environments, while ScanNet++ delivers real-world high-fidelity scans with comprehensive geometric and semantic annotations. Together, they enable in-depth benchmarking, algorithmic development, and evaluation for computer vision, robotics, and embodied AI.

1. Dataset Composition, Scale, and Sensor Modalities

Replica consists of 18 artist-curated synthetic indoor scenes. Scales range from single rooms (offices, apartments, a hotel room) to multi-room apartments (including six variants of an FRL apartment with different furniture arrangements), up to a two-floor house. Each scene is provided as a dense quad mesh generated by Marching Cubes followed by Instant Meshes, with a median geometry resolution of approximately 6,000 primitives per square meter. Associated with these meshes are high-resolution HDR textures in PTex format (per-texel 16-bit float RGB), with a median texture resolution of approximately 92,000 pixels per square meter. Scene composition statistics can be summarized as:

  • Aggregate triangle count: $T = \sum_{i=1}^{18} T_i$
  • Vertex count: $V = \sum_{i=1}^{18} V_i$
  • Texture pixels: $P = \sum_{i=1}^{18} (H_i \times W_i)$
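
Such aggregate statistics are straightforward to recompute from the released meshes. The sketch below sums face and vertex counts over the per-scene meshes with trimesh; the dataset root and per-scene file names are assumptions, and trimesh triangulates the quad meshes on load.

```python
# Sketch: aggregate Replica mesh statistics (paths and file names are assumed).
# Requires: pip install trimesh
from pathlib import Path
import trimesh

replica_root = Path("/data/Replica")              # hypothetical dataset root
scene_dirs = sorted(p for p in replica_root.iterdir() if p.is_dir())

total_faces, total_vertices = 0, 0
for scene in scene_dirs:
    mesh_path = scene / "mesh.ply"                # assumed per-scene mesh file name
    mesh = trimesh.load(mesh_path, process=False) # quads are triangulated on load
    total_faces += len(mesh.faces)
    total_vertices += len(mesh.vertices)

print(f"T = {total_faces} faces, V = {total_vertices} vertices over {len(scene_dirs)} scenes")
```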

ScanNet++ is a real-world benchmark comprising 460 scenes spanning roughly 15,000 m², captured with multiple sensors:

  • Faro Focus Premium laser scanner (sub-millimeter point spacing, approx. 40 million points per scan)
  • Sony α7 IV DSLR (33 MP images at 7000×5000 px; ≈200–2000 training frames per scene, with 15–25 held-out views for testing)
  • iPhone 13 Pro RGB-D (1920×1440 px RGB, 256×192 px LiDAR depth, 3.7 M frames across all scenes)

All modalities are metrically registered via SfM (COLMAP, augmented with laser-mesh proxies), followed by photometric pose refinement (Yeshwanth et al., 2023).
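
The SfM registration step can be approximated with off-the-shelf tooling. The sketch below runs COLMAP's standard pipeline through the pycolmap bindings on a folder of DSLR frames; the paths are placeholders, and the laser-mesh proxies and photometric pose refinement used by ScanNet++ are not reproduced.

```python
# Sketch: camera registration via COLMAP's standard SfM pipeline (pycolmap
# bindings). Paths are placeholders; ScanNet++'s laser-mesh proxies and
# photometric pose refinement are not included here.
from pathlib import Path
import pycolmap

image_dir = "scene_0001/dslr/images"            # hypothetical input folder
database = "scene_0001/colmap/database.db"      # hypothetical output locations
output = "scene_0001/colmap/sparse"
Path(output).mkdir(parents=True, exist_ok=True)

pycolmap.extract_features(database, image_dir)                    # SIFT features
pycolmap.match_exhaustive(database)                               # pairwise matching
maps = pycolmap.incremental_mapping(database, image_dir, output)  # bundle-adjusted poses
print(f"Reconstructed {len(maps)} model(s)")
```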

| Dataset | # Scenes | RGB Frames / Scene | Depth Modality | Mesh Resolution | Image Resolution |
|---|---|---|---|---|---|
| Replica | 18 | ∼5,000 | Perfect, synthetic | ∼6,000 prim./m² (median) | 640×480 px |
| ScanNet++ | 460 | ∼600 DSLR, ∼8,000 iPhone | Real (LiDAR / laser mesh) | 0.9 mm point spacing | 33 MP DSLR, 1920×1440 px iPhone |

Replica’s scenes are fully synthetic, supporting perfect sensor simulation and dense camera trajectories (∼10 Hz sample rate). In contrast, ScanNet++ comprises real sensor captures fused into metric-scale meshes, supporting evaluation of robustness to sensor noise and registration uncertainty.

2. Semantic Annotation Schemes

Replica implements a closed-vocabulary schema with 88 semantic classes (common examples: floor, wall, chair, table). Semantic segmentation is represented as a “segmentation forest” in which:

  • Leaves are mesh primitives
  • Intermediate nodes are merged surface segments
  • Roots are object instances, each tagged with a class label

Each primitive is mapped into the forest and stored in JSON/BIN encodings. Only single-label assignment is supported; every mesh face belongs exclusively to one semantic category (Straub et al., 2019).
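
Conceptually, looking up a primitive's class means walking from its leaf to the root instance. The sketch below illustrates that traversal over a segmentation-forest-style annotation; the JSON layout and field names are hypothetical stand-ins for Replica's actual JSON/BIN encodings.

```python
# Sketch: resolve a mesh primitive to its semantic class through a
# segmentation-forest-style annotation. The JSON layout and field names
# ("nodes", "parent", "class_name", "primitive_to_leaf") are hypothetical
# stand-ins for Replica's actual encodings.
import json

def load_forest(path):
    with open(path) as f:
        return json.load(f)

def class_of_primitive(forest, prim_id):
    node = forest["primitive_to_leaf"][str(prim_id)]    # leaf node id for this face
    while forest["nodes"][node].get("parent") is not None:
        node = forest["nodes"][node]["parent"]           # climb segment -> instance
    return forest["nodes"][node]["class_name"]           # root carries the class label

forest = load_forest("office_0/semantic_forest.json")    # hypothetical file name
print(class_of_primitive(forest, prim_id=1234))           # e.g. "chair"
```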

ScanNet++ provides open-vocabulary, free-form semantic annotation for over 1,000 unique classes, explicitly supporting dense instance segmentation and multi-label assignment to reflect:

  • Occlusion (“jacket” + “chair”)
  • Part-whole ambiguity (“door” + “window” for glazed doors)
  • Functional ambiguity (“sofa” + “bed” for convertible furniture)

This annotation protocol enables evaluation of semantic models under ambiguous and overlapping labeling scenarios (Yeshwanth et al., 2023).
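
A rough illustration of how multi-label ground truth changes evaluation: in the sketch below, a top-1 prediction counts as correct if it matches any of a point's annotated labels. This is a simplified stand-in, not the official ScanNet++ evaluation code.

```python
# Sketch: top-1 accuracy under multi-label ground truth. A prediction is
# counted correct if it matches ANY annotated label for that point.
# Simplified illustration, not the official ScanNet++ evaluation.
from typing import List, Set

def multilabel_top1_accuracy(preds: List[str], gt_labels: List[Set[str]]) -> float:
    assert len(preds) == len(gt_labels)
    correct = sum(1 for p, gt in zip(preds, gt_labels) if p in gt)
    return correct / max(len(preds), 1)

# Example: a glazed door annotated as both "door" and "window".
preds = ["door", "chair", "sofa"]
gt = [{"door", "window"}, {"jacket", "chair"}, {"bed"}]
print(multilabel_top1_accuracy(preds, gt))  # 0.666...
```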

3. Benchmark Tasks and Evaluation Metrics

Both datasets enable key tasks:

  • 3D Semantic Segmentation: Replica supports per-vertex or per-pixel accuracy and mean IoU (mIoU) over 12–15 coarse classes. ScanNet++ advances this with per-class IoU, mean IoU, AP25/AP50 for instances, and explicit support for multi-label ground-truth.
  • Novel View Synthesis (NVS): Replica’s synthetic regime uses PSNR, SSIM, and LPIPS on held-out RGB frames. ScanNet++ sets the standard for real-world NVS, with:
    • Inputs: high-res DSLR or commodity iPhone frames
    • Queries: held-out DSLR views outside the training trajectory
    • Metrics: PSNR, SSIM, LPIPS on synthesized vs. ground-truth views
    • Sample performance (ScanNet++ validation, DSLR→DSLR): NeRF (PSNR=24.11, SSIM=0.833, LPIPS=0.262), TensoRF (PSNR=24.32, SSIM=0.843, LPIPS=0.240), Nerfacto (PSNR=25.02, SSIM=0.858, LPIPS=0.180) (Yeshwanth et al., 2023).
  • 3D Reconstruction Geometry: Assessed via per-point deviation $d_i = \|p_i^{\rm scan} - p_i^{\rm ref}\|$, where $p_i^{\rm ref}$ is the corresponding (typically nearest) point on the reference geometry, with the mean deviation $\overline{d} = \frac{1}{N}\sum_i d_i$; a sketch of this and the NVS metrics follows the table below.
  • Scene and Instance Segmentation Metrics: For advanced neural field approaches, scene-level Panoptic Quality ($\mathrm{PQ}^{\rm scene}$), semantic mIoU, and PSNR are also used (Siddiqui et al., 2022).
| Task | Replica | ScanNet++ | Example Results |
|---|---|---|---|
| Semantic Seg. | mIoU, accuracy | mIoU, AP25/AP50 | PointNet++: 15% mIoU; KPConv: 30% mIoU; MinkowskiNet: 28% mIoU |
| Novel View Synth. | PSNR, SSIM, LPIPS | PSNR, SSIM, LPIPS | NeRF (24.11, 0.833, 0.262); TensoRF (24.32, 0.843, 0.240); Nerfacto (25.02, 0.858, 0.180) |
| Panoptic Lifting | $\mathrm{PQ}^{\rm scene}$ | $\mathrm{PQ}^{\rm scene}$ | Replica: 57.9% (+8.4 pp over SOTA); ScanNet: 58.9% (+10.6 pp over SOTA) (Siddiqui et al., 2022) |
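
A minimal sketch of two of these metrics is shown below, assuming scikit-image for PSNR/SSIM and SciPy for nearest-neighbour correspondences; benchmark implementations may define the reference correspondence and data ranges differently.

```python
# Sketch: common evaluation metrics for NVS and reconstruction geometry.
# Requires: pip install numpy scipy scikit-image
# Nearest-neighbour correspondence for the geometry metric is an assumption;
# benchmarks may define p_i^ref differently.
import numpy as np
from scipy.spatial import cKDTree
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def nvs_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred/gt: HxWx3 uint8 renderings of a held-out view."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim  # LPIPS would additionally require the `lpips` package

def mean_point_deviation(scan_pts: np.ndarray, ref_pts: np.ndarray) -> float:
    """Mean of d_i = ||p_i^scan - p_i^ref||, with p_i^ref the nearest reference point."""
    d, _ = cKDTree(ref_pts).query(scan_pts)  # nearest-neighbour distances
    return float(d.mean())
```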

4. Integration, Tooling, and Dataset Access

Replica delivers a minimal C++ SDK for loading meshes, HDR textures, and annotations, and provides rendering routines for RGB, depth, normals, and semantic labels. It is natively compatible with the AI Habitat simulator, supporting navigation, instruction following, and egocentric vision research, with meshes and semantic data plugged in directly as navigation and semantic maps. Integration involves setting up habitat-sim and habitat-api, pointing them at the Replica dataset, and configuring sensor modalities for high-FPS agent simulation (Straub et al., 2019), as sketched below.
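
A minimal habitat-sim setup along these lines might look like the following; the scene path is a placeholder and the sensor resolutions are illustrative rather than prescribed by the dataset.

```python
# Sketch: loading a Replica scene in habitat-sim with RGB, depth, and
# semantic sensors. The scene path is a placeholder; adjust to your install.
import habitat_sim

backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = "/data/Replica/apartment_0/habitat/mesh_semantic.ply"  # placeholder path

def make_sensor(uuid, sensor_type):
    spec = habitat_sim.CameraSensorSpec()
    spec.uuid = uuid
    spec.sensor_type = sensor_type
    spec.resolution = [480, 640]  # illustrative; Replica renders are commonly 640x480
    return spec

agent_cfg = habitat_sim.agent.AgentConfiguration()
agent_cfg.sensor_specifications = [
    make_sensor("color_sensor", habitat_sim.SensorType.COLOR),
    make_sensor("depth_sensor", habitat_sim.SensorType.DEPTH),
    make_sensor("semantic_sensor", habitat_sim.SensorType.SEMANTIC),
]

sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))
obs = sim.step("move_forward")  # default action space: move_forward / turn_left / turn_right
rgb, depth, semantic = obs["color_sensor"], obs["depth_sensor"], obs["semantic_sensor"]
```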

ScanNet++ distributes its data in formats compatible with major geometric learning frameworks and maintains hidden-test leaderboards for NVS and semantic segmentation, enforcing rigorous benchmark protocols.

Both datasets’ annotation formats (Replica’s segmentation forests, ScanNet++’s multi-label geometries) are designed to be directly usable for geometric deep learning, 2D/3D segmentation, rendering, and embodied agent tasks.

5. Use Cases, Strengths, and Complementarities

Replica’s 100% synthetic nature provides perfect, noise-free ground truth, error-free simulated depth, and controllable geometry, supporting rapid prototyping, ablations, and SLAM benchmarking in simulation. Its closed-vocabulary semantics and absence of sensor artifacts favor baseline development and synthetic pretraining.

ScanNet++ exposes methods to real sensor noise (motion blur, registration drift), variable illumination, and high-frequency geometric variation. The large scale (460 scenes) and open-vocabulary, multi-label semantics enforce evaluation on generalization, robustness, and the ability to handle long-tail, overlapping, and ambiguous labels.

Together, Replica and ScanNet++ form a complementary suite: Replica for algorithmic prototyping, synthetic validation, and fast iteration; ScanNet++ for robust real-data evaluation and next-generation, generalizable 3D vision research. Their joint adoption ensures both synthetic and real-world readiness of 3D scene understanding models (Yeshwanth et al., 2023).

6. Methodological Advances and Benchmark Results

Recent methods have utilized both Replica and ScanNet++ to advance 3D scene understanding:

  • Panoptic Lifting over Neural Fields: A neural-field approach that jointly models color and panoptic segmentation using a TensoRF backbone with MLP semantic and instance heads. It lifts machine-generated 2D panoptic segmentations into a consistent 3D representation by solving a linear assignment (Hungarian algorithm) between 2D segment IDs and persistent 3D instance IDs; this matching step is sketched after this list. Robustness to noisy 2D masks is achieved via test-time augmentations, a segment-consistency loss, bounded segmentation fields, and gradient stopping. Panoptic Lifting achieves $\mathrm{PQ}^{\rm scene}$ = 57.9% on Replica (+8.4 pp) and 58.9% on ScanNet (+10.6 pp) over SOTA (Siddiqui et al., 2022).
  • Online 3D Scene Reconstruction Using Neural Object Priors: This framework operates on both synthetic (Replica) and real (ScanNet) scenes, constructing an object-centric neural implicit model for each instance in an online manner. Feature-grid interpolation is used for incremental geometry updates, and a pre-mapped object library enables shape-prior transfer via registration (CLIP, FPFH, ICP). Incorporating priors (3D meshes or video-based) improves whole-object completion on Replica from 75.9% to 93.9% (completion ratio at a 5 mm threshold), outperforming vMAP, TSDF, and other object-agnostic baselines. Ablations isolate the effects of grid interpolation, keyframes synthesized from priors, and automatic registration quality (Chabal et al., 24 Mar 2025).
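
The linear-assignment step referenced above can be illustrated schematically: per frame, 2D segment IDs are matched to persistent 3D instance IDs by maximizing mask overlap with SciPy's Hungarian solver. This is a sketch of the matching idea under that assumption, not the authors' implementation.

```python
# Sketch: matching per-frame 2D segment IDs to persistent 3D instance IDs by
# maximizing mask IoU with the Hungarian algorithm. Schematic only; the actual
# Panoptic Lifting method optimizes this jointly with the neural field.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(seg2d: np.ndarray, inst3d: np.ndarray, n2d: int, n3d: int):
    """seg2d/inst3d: HxW integer ID maps; returns a {2D id: 3D id} mapping."""
    iou = np.zeros((n2d, n3d))
    for a in range(n2d):
        mask_a = seg2d == a
        for b in range(n3d):
            mask_b = inst3d == b
            inter = np.logical_and(mask_a, mask_b).sum()
            union = np.logical_or(mask_a, mask_b).sum()
            iou[a, b] = inter / union if union > 0 else 0.0
    rows, cols = linear_sum_assignment(-iou)  # maximize total IoU
    return {int(a): int(b) for a, b in zip(rows, cols)}
```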

These benchmarks validate the datasets’ utility for challenging tasks in object-centric mapping, panoptic scene understanding, and NVS.

7. Limitations and Prospective Directions

  • Replica is limited in scene diversity (18 unique scans) and by its synthetic origin, which excludes real-world sensor noise and long-tail object variation.
  • ScanNet++’s real-world complexity introduces challenges for method robustness to registration and mask artifacts, and its open-vocabulary semantics can hinder reproducibility and systematic evaluation across all segmentation tasks.
  • For object-level neural field reconstruction, persistent limitations include sensitivity to mask/depth noise and difficulty in prior retrieval and registration when view conditions diverge substantially from library priors.

Advances are anticipated in multi-view mask refinement, learned prior retrieval, end-to-end object-aware SLAM, expansive object priors (e.g., ShapeNet-scale), and improved scene-level neural architectures (Chabal et al., 24 Mar 2025).

