3RScan Dataset for 3D Object Re-localization
- 3RScan is a large-scale RGB-D dataset designed to benchmark algorithms in 3D object re-localization, persistent scene understanding, and change detection.
- The dataset offers precise 6DoF pose annotations and dense instance-wise segmentations for evaluating object tracking and mapping methods.
- 3RScan supports applications like robotic search and dynamic SLAM by providing reproducible splits, standardized metrics, and robust handling of real-world challenges.
3RScan is a large-scale RGB-D dataset specifically constructed for benchmarking and developing algorithms in 3D object instance re-localization, persistent long-term scene understanding, and change detection in dynamic indoor environments. Spanning 1,482 scans across 478 scenes captured over multiple time intervals, 3RScan provides ground-truth instance correspondences and 6DoF pose annotations for thousands of rigid objects subjected to natural scene changes such as additions, removals, and displacements. The primary motivation for 3RScan is to address the demands of tasks such as robotic search, object re-localization, and change-aware SLAM that require tracking and reasoning about object-level changes in realistic, temporally evolving indoor spaces (Wald et al., 2019).
1. Dataset Composition and Key Statistics
The 3RScan corpus captures 478 unique indoor environments (offices, living rooms, hotel rooms, etc.) using RGB-D sensors on Google Tango devices, covering sites in 13 countries. Each scene is acquired at multiple time points—between 2 and 12 rescans per environment (approximately 2.1 on average), resulting in 1,004 rescans in addition to the initial captures. In total, the dataset contains roughly 363,000 RGB-D frames (resolutions typically 1080×720 or 640×480, temporally resampled for consistency) paired with calibrated camera trajectories (per-frame intrinsics and extrinsics) and globally aligned coordinate systems (a rigid transform registering each rescan to its reference scan).
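As a concrete illustration, the sketch below shows how such a global alignment transform, stored as a 4×4 matrix, maps rescan geometry into the reference frame. This is a minimal sketch assuming NumPy; the transform's direction and storage convention (row- vs. column-major) should be verified against the official metadata.

```python
# Minimal sketch: apply a 4x4 rigid alignment transform to (N, 3) rescan
# vertices so they live in the reference scan's coordinate frame.
# Assumes the transform maps rescan -> reference; verify the convention
# against the official metadata.
import numpy as np

def align_rescan_to_reference(vertices: np.ndarray, transform: np.ndarray) -> np.ndarray:
    """vertices: (N, 3) array; transform: (4, 4) rigid-body matrix."""
    verts_h = np.hstack([vertices, np.ones((len(vertices), 1))])  # homogeneous coordinates
    return (verts_h @ transform.T)[:, :3]                         # rotate + translate
```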
Key statistics:
| Statistic | Value |
|---|---|
| Unique environments | 478 |
| Total scans (reference + rescans) | 1,482 |
| Total object instances | ≈48,000 |
| 3D objects with annotated pose changes | 1,947 (3,289 rigid-body changes) |
| Semantic coverage (voxel mean) | 98.5% |
| Semantic labels | 534 |
Data modalities include per-frame RGB and depth images, 3D mesh reconstructions (PLY or OBJ with textures), dense instance-wise segmentations, symmetry annotations, and volumetric TSDF grids. Each mesh incorporates triangle-level semantic and instance labels, while local TSDF patches (32×32×32 voxels) are provided at two scales—large (1.2 m extent, 3.75 cm voxel resolution) and small (0.6 m extent, 1.875 cm voxel resolution)—optimized for learning robust 3D correspondences.
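The sketch below illustrates how fixed-size patches of this kind could be resampled from a dense TSDF grid around a 3D keypoint; the function, its arguments, and the nearest-neighbor sampling are illustrative rather than the dataset's official tooling.

```python
# Sketch: crop a 32x32x32 TSDF patch centered on a keypoint, matching the
# dataset's two patch scales (large: 1.2 m extent / 3.75 cm voxels; small:
# 0.6 m extent / 1.875 cm voxels). Names and sampling scheme are illustrative.
import numpy as np

def extract_tsdf_patch(tsdf, origin, voxel_size, keypoint, patch_extent, dim=32):
    """tsdf: (Z, Y, X) grid; origin: world position of voxel (0, 0, 0)."""
    step = patch_extent / dim                              # output voxel size
    offsets = (np.arange(dim) - dim / 2 + 0.5) * step      # centered sample offsets
    zz, yy, xx = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    coords = keypoint + np.stack([xx, yy, zz], axis=-1)    # world-space sample points
    idx = np.round((coords - origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(tsdf.shape)[::-1] - 1)  # clamp to grid bounds
    return tsdf[idx[..., 2], idx[..., 1], idx[..., 0]]     # nearest-neighbor lookup

# Two scales as used for correspondence learning (voxel_size of the source
# grid is assumed to be 1 cm here):
# patch_large = extract_tsdf_patch(tsdf, origin, 0.01, kp, 1.2)  # 3.75 cm voxels
# patch_small = extract_tsdf_patch(tsdf, origin, 0.01, kp, 0.6)  # 1.875 cm voxels
```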
2. Annotation Protocols and Ground Truth
Semantic and instance labels are initially assigned through manual annotation on a reference scan using a web-based 3D segmentation tool (adapted from ScanNet). These annotations are propagated to subsequent rescans using the global alignment transform between reference and rescan, with manual correction to address segmentation errors arising from genuine object motion or appearance changes. Mean semantic coverage is 98.5%, with most individual scans above 98%. Each object is also assigned a symmetry class describing its rotational symmetry (e.g., none, 2-fold, 4-fold, or infinite), which informs evaluation (e.g., rotation-invariant pose matching).
Object pose changes are annotated using a custom keypoint tool in which annotators place corresponding 3D points on the reference and rescan meshes. The Kabsch (Procrustes) algorithm then computes the optimal rigid-body transform

$$(R^*, t^*) = \underset{R \in SO(3),\; t \in \mathbb{R}^3}{\arg\min} \sum_i \lVert R\,p_i + t - q_i \rVert^2,$$

with $p_i$ the annotated keypoints on the reference mesh and $q_i$ their counterparts on the rescan, minimizing the squared keypoint error. For evaluation, pose predictions are compared under the minimal angular deviation over all symmetry-equivalent rotations to properly handle ambiguities in symmetric objects.
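A minimal NumPy implementation of this Kabsch step, assuming the matched keypoints are given as two (N, 3) arrays:

```python
# Kabsch/Procrustes: recover the rigid transform (R, t) that best maps
# reference keypoints p onto rescan keypoints q in the least-squares sense.
import numpy as np

def kabsch(p: np.ndarray, q: np.ndarray):
    """p, q: matched (N, 3) keypoint arrays; returns (R, t) with q ~ R p + t."""
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    H = (p - p_mean).T @ (q - q_mean)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # optimal rotation
    t = q_mean - R @ p_mean                    # optimal translation
    return R, t
```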
3. Data Splits, Benchmarking Tasks, and Evaluation Metrics
3RScan is partitioned for reproducible benchmarking: 385 scenes (793 rescans) form the training split, 47 scenes (110 rescans) the validation split, and 46 scenes (101 rescans) the held-out test split (server-side evaluation).
Primary benchmark task: Given a segmented object instance in a source (reference) scan, estimate the rigid 6DoF transformation mapping it into a rescan (target) captured at a later time.
Evaluation:
- For a predicted pose $\hat{T} = (\hat{R}, \hat{t})$ with ground truth $(R, t)$, compute the rotational error $e_R$ as the axis-angle magnitude of $\hat{R}R^{-1}$ (taking the minimum over symmetry-equivalent rotations) and the translation error $e_t = \lVert \hat{t} - t \rVert_2$.
- A prediction is counted as successful if $e_R \le \theta$ and $e_t \le d$, for chosen thresholds $(\theta, d)$.
- Metrics reported: recall (percentage of correct localizations), as well as median errors in rotation and translation.
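A sketch of this symmetry-aware success criterion follows; the threshold values and the representation of symmetry-equivalent ground-truth rotations as an explicit list are assumptions for illustration:

```python
# Symmetry-aware pose evaluation: rotation error is the minimum axis-angle
# deviation over the object's symmetry-equivalent ground-truth rotations.
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Axis-angle magnitude of the relative rotation, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def pose_success(R_pred, t_pred, sym_rotations, t_gt, theta=10.0, d=0.1):
    """sym_rotations: ground-truth rotations equivalent under the symmetry class."""
    e_rot = min(rotation_error_deg(R_pred, R) for R in sym_rotations)
    e_trans = float(np.linalg.norm(t_pred - t_gt))
    return e_rot <= theta and e_trans <= d    # thresholds here are illustrative
```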
4. Preprocessing, Loader APIs, and Provided Utilities
Official utilities include (a hypothetical loader sketch follows this list):
- Offline pipelines for aligning RGB-D frames into textured meshes.
- Scripts for TSDF volume computation and extraction of multi-scale TSDF patches around 3D keypoints.
- Loaders for accessing per-scan intrinsics/extrinsics, instance segmentation, symmetry flags, and 6DoF pose annotations.
- Server-side evaluation scripts where pose predictions (as matrices or rotation/translation vectors) yield standardized recall metrics.
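The sketch below shows how such a per-scan loader might look; the class, method, field, and file names are illustrative and do not reflect the official 3RScan tooling.

```python
# Hypothetical per-scan loader sketch; the on-disk layout ("meta.json" and its
# fields) is assumed for illustration -- consult the dataset docs for the real one.
import json
from pathlib import Path

class ScanLoader:
    """Accesses per-scan intrinsics, instances, symmetry flags, and pose changes."""
    def __init__(self, root: str, scan_id: str):
        self.scan_dir = Path(root) / scan_id
        self.meta = json.loads((self.scan_dir / "meta.json").read_text())

    def intrinsics(self):
        return self.meta["intrinsics"]                  # e.g., fx, fy, cx, cy

    def instances(self):
        return self.meta["instances"]                   # id -> label, symmetry class

    def pose_change(self, instance_id: str):
        return self.meta["changes"].get(instance_id)    # 4x4 transform or None
```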
The dataset encourages explicit handling of incomplete scan overlap, illumination variation, partial views, and segmentation noise (particularly in nonrigid or highly symmetrical categories), reflecting real-world difficulties in scene relocalization and mapping scenarios.
5. Applications and Research Utilization
3RScan supports a diverse set of research objectives:
- Object instance re-localization: Tasking robots or AR agents to find relocated or newly introduced rigid objects using 3D spatial reasoning.
- Change detection and monitoring: Benchmarks for methods capable of segmenting and localizing scene changes from geometry alone, as demonstrated in 3D change detection pipelines operating over supervoxels and graph cuts (Adam et al., 2022).
- Persistent and dynamic SLAM: Long-term mapping and camera relocalization where scene elements may appear, disappear, or move.
- Feature learning and correspondence: Multi-scale correspondence learning on TSDF patches that is robust to spatial context variation, occlusion, and partial scans.
Notable usage includes state-of-the-art methods in 3D change detection exploiting geometric transformation consistency, where the 3RScan validation split (47 scenes, 110 rescans) is used for measuring intersection-over-union (IoU), recall, and precision of detected changes. A recently proposed pipeline achieves a mean IoU of 68.4% and recall of 76.05% for changed objects, outperforming prior baselines by 14–20% IoU and 30–45% recall (Adam et al., 2022). The availability of ground-truth 6DoF transforms further enables ablation studies and upper-bound benchmarking for discovery by propagation.
6. Limitations and Coverage Considerations
Principal limitations of 3RScan include:
- Sensor domain: Restricted to Google Tango RGB-D acquisition; other sensor families such as RealSense, Kinect, or time-of-flight cameras are not covered.
- Scene types: Limited to indoor environments (no large-scale outdoor, industrial, or non-Western building typologies).
- Label noise and ambiguity: High symmetry or nonrigid object classes (curtains, blankets) complicate annotation and evaluation. Illumination changes and incomplete scan overlap can lead to imperfect segmentation propagation, with some residual errors after manual cleaning.
- Partial overlap and occlusion: Some object pose changes are only partially observable due to missing data in the rescans.
A plausible implication is that conclusions drawn from benchmark performance on 3RScan may not generalize directly to unseen sensors or environment types, and some subcategories (e.g., high-symmetry objects) remain evaluation challenges.
7. Impact and Significance in 3D Vision Research
3RScan constitutes a foundational benchmark for persistent scene understanding in the presence of real-world complexity, filling a gap left by static or synthetic datasets that lack temporally varying ground-truth correspondences. Its scale, repeat-capture design, and rigorous annotation pipelines enable progress in robot perception, AR scene manipulation, and dynamic mapping. The inclusion of standard splits and server-side evaluation protocols ensures reproducibility and comparability across methods. Persistent challenges—such as semantic ambiguity, partial views, and context change—position 3RScan as a long-term reference for quantifying advances in robust 3D perception and object-aware scene modeling (Wald et al., 2019, Adam et al., 2022).