TUM RGB-D & 3RScan Datasets Overview
- TUM RGB-D is a benchmark featuring synchronized RGB, depth, and IMU data with precise ground truth for indoor SLAM, visual odometry, and dense 3D reconstruction.
- 3RScan is a large-scale dataset offering high-resolution RGB-D scans and detailed semantic annotations that facilitate advanced scene segmentation and cross-modal learning.
- Both datasets support reproducible research in computer vision and robotics by providing complementary insights into geometric accuracy and semantic scene understanding.
The TUM RGB-D and 3RScan datasets are foundational resources in the computer vision and robotics communities, providing standardized RGB-D (color and depth) data for benchmarking tasks such as simultaneous localization and mapping (SLAM), visual odometry, 3D reconstruction, and semantic understanding of indoor environments. Both datasets are widely recognized for enabling algorithm development, performance comparison, and the creation of real-world applications that require precise geometric and semantic scene representations.
1. Dataset Overview and Purpose
TUM RGB-D
The TUM RGB-D dataset, captured with Microsoft Kinect structured-light sensors, consists of synchronized RGB and depth streams complemented by accelerometer data. Comprising roughly 39–50 GB of data across more than 39 recorded sequences, it is designed primarily to benchmark SLAM, visual odometry, and dense 3D reconstruction algorithms in indoor environments. Critically, it includes ground-truth camera trajectories obtained from an external motion-capture system, enabling precise quantitative evaluation of pose estimation accuracy (Berger, 2013, Lopes et al., 2022).
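To illustrate how such a sequence is typically consumed, the following is a minimal sketch of pairing RGB and depth frames by nearest timestamp, in the spirit of the association scripts distributed with the benchmark. The `rgb.txt`/`depth.txt` index files and the 0.02 s tolerance follow the dataset's documented conventions, but should be verified against the sequence actually downloaded.

```python
# Minimal sketch: associating RGB and depth frames of a TUM RGB-D sequence
# by nearest timestamp. Assumes the standard "rgb.txt" / "depth.txt" index
# files with "timestamp filename" per line and '#' comment lines.

def read_index(path):
    """Parse a TUM-style index file into a sorted list of (timestamp, filename)."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            stamp, name = line.split()[:2]
            entries.append((float(stamp), name))
    return sorted(entries)

def associate(rgb_entries, depth_entries, max_dt=0.02):
    """Greedily pair each RGB frame with the closest unused depth frame within max_dt seconds."""
    pairs, used = [], set()
    for t_rgb, rgb_name in rgb_entries:
        best = min(
            ((abs(t_rgb - t_d), t_d, d_name) for t_d, d_name in depth_entries
             if t_d not in used),
            default=None,
        )
        if best is not None and best[0] <= max_dt:
            pairs.append((rgb_name, best[2]))
            used.add(best[1])
    return pairs

if __name__ == "__main__":
    rgb = read_index("rgb.txt")
    depth = read_index("depth.txt")
    print(f"{len(associate(rgb, depth))} synchronized RGB-depth pairs")
```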
3RScan
While not detailed in (Berger, 2013), the 3RScan dataset is broadly recognized for its large-scale, high-resolution 3D reconstructions of real-world indoor environments and its rich semantic annotations, which support both geometry-centric and semantics-driven computer vision research. Captured with handheld Google Tango devices providing dense RGB-D data, it is positioned as a next-generation benchmark for tasks such as semantic scene segmentation, object recognition, visual grounding, and dense 3D reconstruction. Its annotation protocols and variety of scanning conditions make it a prominent dataset for scene understanding and cross-modal learning (Miyanishi et al., 2023, Liu et al., 2020, Lopes et al., 2022).
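As a hedged illustration of how such per-object annotations might be consumed, the sketch below tallies object instances per semantic label from a per-scan JSON file. The `semseg.v2.json` file name and the `segGroups`/`label` schema are assumptions (ScanNet-style annotation exports) and should be checked against the actual 3RScan release.

```python
# Hedged sketch: counting annotated object instances per semantic label for
# one 3RScan scan. The file name "semseg.v2.json" and the "segGroups"/"label"
# schema are assumptions; verify them against the downloaded data.

import json
from collections import Counter

def count_labels(semseg_path):
    """Return a Counter of semantic labels over annotated object instances."""
    with open(semseg_path) as f:
        data = json.load(f)
    # Each entry in "segGroups" is assumed to describe one annotated object
    # instance with a free-form semantic "label" string.
    return Counter(group["label"] for group in data.get("segGroups", []))

if __name__ == "__main__":
    counts = count_labels("semseg.v2.json")
    for label, n in counts.most_common(10):
        print(f"{label:20s} {n}")
```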
2. Modalities, Ground Truth, and Annotation Protocols
| Dataset | Sensor Type | Modalities | Ground Truth | Annotation Category |
|---|---|---|---|---|
| TUM RGB-D | Kinect (structured light) | RGB, depth, accelerometer | External MoCap pose | Sparse, trajectory-centric |
| 3RScan | Google Tango (structured light) | RGB, dense depth | Mesh, semantic labels | Per-frame, per-object |
TUM RGB-D provides time-synchronized RGB, depth, and accelerometer data with precise external ground truth for camera trajectories, predominantly targeting trajectory and mapping accuracy. The 3RScan dataset, in contrast, offers high-density volumetric reconstructions, semantic and instance segmentations, and ground-truth meshes, supporting a wider spectrum of geometry- and semantics-driven tasks. Its annotation protocols rely on crowd-sourced object instance labeling and verification to ensure consistency across varied reconstruction qualities (Miyanishi et al., 2023, Lopes et al., 2022).
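The trajectory-centric ground truth of TUM RGB-D is distributed as plain-text pose lists; a minimal loading sketch is shown below, assuming the documented `groundtruth.txt` layout of `timestamp tx ty tz qx qy qz qw` with `#` comment lines.

```python
# Minimal sketch: loading the external motion-capture ground truth of a TUM
# RGB-D sequence from "groundtruth.txt" (timestamp tx ty tz qx qy qz qw).

import numpy as np

def load_trajectory(path):
    """Return (N,) timestamps, (N, 3) positions, (N, 4) quaternions (qx qy qz qw)."""
    stamps, positions, quats = [], [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            values = [float(v) for v in line.split()]
            stamps.append(values[0])
            positions.append(values[1:4])
            quats.append(values[4:8])
    return np.array(stamps), np.array(positions), np.array(quats)

timestamps, t_xyz, q_xyzw = load_trajectory("groundtruth.txt")
print(f"{len(timestamps)} ground-truth poses spanning "
      f"{timestamps[-1] - timestamps[0]:.1f} s")
```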
3. Evaluation Metrics and Benchmarking Methodologies
TUM RGB-D sequences are used to benchmark SLAM and odometry algorithms by comparing estimated and ground-truth camera poses (translational and rotational errors). The most common metric is the root-mean-square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left\| \hat{\mathbf{p}}_i - \mathbf{p}_i \right\|^2}$$

where $\hat{\mathbf{p}}_i$ and $\mathbf{p}_i$ are the estimated and ground-truth pose positions.
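A minimal sketch of this metric follows; it assumes the estimated and ground-truth positions have already been time-associated and rigidly aligned, as the full benchmark tooling would do before computing the error.

```python
# Minimal sketch of the translational RMSE between corresponding estimated
# and ground-truth camera positions. Trajectory alignment is assumed to have
# been performed beforehand.

import numpy as np

def ate_rmse(p_est, p_gt):
    """p_est, p_gt: (N, 3) arrays of corresponding camera positions."""
    residuals = p_est - p_gt                      # per-pose translational error
    return float(np.sqrt(np.mean(np.sum(residuals**2, axis=1))))

# Toy usage with synthetic data
p_gt = np.random.rand(100, 3)
p_est = p_gt + 0.01 * np.random.randn(100, 3)     # simulated estimation noise
print(f"ATE RMSE: {ate_rmse(p_est, p_gt):.4f} m")
```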
In dense 3D reconstruction and semantic tasks, metrics such as intersection-over-union (IoU), surface reconstruction error, and completeness/accuracy are used. For instance, the surface reconstruction error is often written as:

$$E = \frac{1}{|P|} \sum_{p \in P} d\left(p, S_{\mathrm{gt}}\right)$$

where $d(p, S_{\mathrm{gt}})$ is the distance from a reconstructed point $p$ to the ground-truth surface $S_{\mathrm{gt}}$ (Berger, 2013). For 3RScan and related semantic benchmarks, mean IoU (mIoU) and per-class/object detection accuracy are standard (Liu et al., 2020, Miyanishi et al., 2023).
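For the semantic side, a minimal sketch of per-class IoU and mIoU over integer label maps is shown below; the class-id encoding and the choice to skip classes absent from both prediction and ground truth are assumptions of this illustration.

```python
# Minimal sketch of per-class IoU and mIoU for semantic segmentation labels.
# Label arrays hold one integer class id per point or pixel.

import numpy as np

def mean_iou(pred, gt, num_classes):
    """Compute per-class IoU and their mean over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue                               # class absent: skip, do not count as 0
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)), ious

# Toy usage with random labels
pred = np.random.randint(0, 5, size=10000)
gt = np.random.randint(0, 5, size=10000)
miou, per_class = mean_iou(pred, gt, num_classes=5)
print(f"mIoU = {miou:.3f}")
```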
4. Research Impact and Applications
TUM RGB-D established itself as a canonical indoor SLAM and odometry benchmark (Berger, 2013). It has facilitated the development and benchmarking of keyframe-based SLAM architectures (e.g., ORB-SLAM, KinectFusion), probabilistic visual odometry incorporating points, lines, and planes, and robust multi-objective optimization for sensor fusion (Han et al., 2014, Proenca et al., 2017). Its challenging sequences and precise ground truth directly influenced progress in drift correction, loop closure, and real-time operation.
3RScan, positioned as a large-scale, semantically annotated RGB-D benchmark, supports research in scene-level semantic understanding, cross-modal representation learning, and advanced visual grounding. Its breadth of annotated scenes allows for robust semantic segmentation, object detection, and research into domain adaptation of deep models—enabling tasks like cross-dataset evaluation (e.g., the Cross3DVG task) and multi-view fusion with CLIP-based modalities (Miyanishi et al., 2023). Many data-driven methods in indoor mapping, AR, and robotics rely on 3RScan’s mesh annotations and semantic segmentation for training and evaluation.
5. Limitations, Challenges, and Future Directions
The primary limitations of TUM RGB-D are its exclusive indoor focus, reliance on motion-capture for ground truth (limiting replicability), and data volume constraints (Berger, 2013). It does not provide semantic labels or dense object-level information, restricting its utility for tasks beyond localization and mapping. For 3RScan, the expanse and density of the scans result in very large file sizes and require significant computational resources. Additionally, diverse acquisition hardware introduces sensor-specific noise and density variation, presenting challenges for cross-dataset generalization.
Suggested improvements for future datasets include the integration of more varied scenes (including outdoor and multi-modal sensors), richer multi-sensory data (e.g., higher-resolution IMU, IR), laser-scan ground truth, and automated annotation to standardize labels and support more challenging, real-world conditions (Berger, 2013, Lopes et al., 2022).
6. Theoretical and Practical Contributions
Both datasets set the standards for evaluation protocols and catalyze innovation in RGB-D perception. Formulations for rigid-body transformations,

$$\mathbf{x}_w = \mathbf{R}\,\mathbf{x}_c + \mathbf{t}, \qquad \mathbf{R} \in SO(3),\ \mathbf{t} \in \mathbb{R}^3,$$

remain central in transforming depth maps to 3D point clouds and associating RGB-D measurements with world coordinates (Berger, 2013). TUM RGB-D’s synchronized multimodal data streams and 3RScan’s semantically rich and geometrically dense reconstructions form the basis for robust algorithmic developments in SLAM, scene understanding, and multi-modal representation learning (Liu et al., 2020, Miyanishi et al., 2023).
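A minimal sketch of this pipeline for a TUM-style depth image is given below; the pinhole intrinsics and the 5000-ticks-per-metre depth scale are the commonly quoted defaults for the Freiburg Kinect sequences and stand in for the per-sequence calibration, which should be used in practice.

```python
# Minimal sketch: back-projecting a depth map to a 3D point cloud with the
# pinhole model and mapping it into world coordinates via x_w = R x_c + t.
# Intrinsics (fx, fy, cx, cy) and the depth scale are assumed defaults.

import numpy as np

def depth_to_world(depth, fx, fy, cx, cy, R, t, depth_scale=5000.0):
    """depth: (H, W) uint16 depth image; R: (3, 3); t: (3,). Returns (M, 3) world points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64) / depth_scale     # metres
    valid = z > 0                                  # drop missing-depth pixels
    x = (u - cx) * z / fx                          # camera-frame coordinates
    y = (v - cy) * z / fy
    points_cam = np.stack([x[valid], y[valid], z[valid]], axis=1)
    return points_cam @ R.T + t                    # apply x_w = R x_c + t

# Toy usage with an identity pose and synthetic depth
depth = (np.random.rand(480, 640) * 5000).astype(np.uint16)
pts = depth_to_world(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5,
                     R=np.eye(3), t=np.zeros(3))
print(pts.shape)
```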
7. Comparative Role in the Dataset Landscape
TUM RGB-D is a pioneering SLAM and odometry benchmark, known for its rigor in camera pose evaluation and its influence on SLAM and depth prediction research (Berger, 2013, Lopes et al., 2022). 3RScan, within the broader ecosystem of contemporary RGB-D benchmarks, exemplifies the shift to high-resolution, semantically dense, and cross-modal dataset design, supporting a broad range of indoor scene understanding tasks (Miyanishi et al., 2023, Liu et al., 2020). Both are cornerstones for method development, evaluation, and cross-domain comparison, driving the evolution of monocular depth estimation, object-centric SLAM, domain adaptation, and scene reasoning.
These datasets collectively underpin quantitative, reproducible research in RGB-D vision, offering diverse modalities and annotation strategies that address both geometric and semantic challenges inherent to real-world indoor environments.