SUN RGB-D Dataset: Indoor Scene Benchmark

Updated 14 January 2026
  • SUN RGB-D is a comprehensive benchmark pairing 10,335 RGB images with aligned depth maps (commonly converted to a three-channel HHA encoding) across 40 indoor scene categories for robust scene recognition.
  • Its structured data splits and preprocessing pipelines, including weakly-supervised patch training and diverse CNN architectures, enable effective learning from both visual and geometric cues.
  • RGB-D fusion methods applied on the dataset achieve state-of-the-art performance, with experiments reporting up to 52.4% average accuracy in indoor scene classification.

The SUN RGB-D dataset is a comprehensive benchmark tailored for indoor scene understanding, providing paired RGB images and aligned depth maps for a diverse set of environments. In the context of RGB-D scene recognition, SUN RGB-D is extensively leveraged for training and evaluating models capable of integrating visual and geometric cues, with the dataset’s construction, preprocessing methodologies, evaluation splits, and usage protocols significantly influencing state-of-the-art advances in the field (Song et al., 2018).

1. Dataset Composition and Structure

SUN RGB-D comprises a total of 10,335 RGB-D frames, each sample containing an RGB image and an aligned depth map. The full dataset enumerates 40 scene categories; however, much of the RGB-D scene recognition literature, including key works, concentrates on the 19 most frequent indoor scene classes, such as bedroom, kitchen, and office. Each depth map is further processed into a three-channel HHA encoding (horizontal disparity, height above ground, and angle with gravity), which serves as the input representation for all depth-based networks in this line of work (Song et al., 2018).
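The HHA channels can be sketched in numpy under strong simplifying assumptions: a pinhole camera with illustrative intrinsics, gravity aligned with the camera's -y axis, and per-channel min-max normalization. Real HHA pipelines estimate the gravity direction and floor plane from the data, so this is only a rough illustration of what each channel encodes, not the reference implementation.

```python
import numpy as np

def hha_encode(depth, fy=519.0):
    """Simplified HHA-style encoding (sketch): three channels derived from a
    depth map, assuming gravity points along the camera's -y axis. Real
    pipelines estimate gravity and the floor plane instead of assuming them."""
    H, W = depth.shape
    eps = 1e-6
    # Channel 1: horizontal disparity (inverse depth), min-max scaled to [0, 1].
    disparity = 1.0 / np.maximum(depth, eps)
    disparity = (disparity - disparity.min()) / (np.ptp(disparity) + eps)
    # Back-project pixel rows to camera-space y (pinhole model), then take
    # Channel 2: height above the lowest observed point, scaled to [0, 1].
    v, _ = np.mgrid[0:H, 0:W]
    y = (v - H / 2.0) * depth / fy
    height = -y
    height = (height - height.min()) / (np.ptp(height) + eps)
    # Channel 3: angle between the surface normal (from depth gradients)
    # and the assumed gravity direction, scaled to [0, 1].
    dzdv, dzdu = np.gradient(depth)
    normal = np.stack([-dzdu, -dzdv, np.ones_like(depth)], axis=-1)
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True)
    gravity = np.array([0.0, -1.0, 0.0])
    angle = np.arccos(np.clip(normal @ gravity, -1.0, 1.0)) / np.pi
    return np.stack([disparity, height, angle], axis=-1)
```

The output is an H×W×3 array that can be fed to the same CNN input pipeline as an RGB image, which is the point of the encoding.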

2. Data Splits and Labeling Protocol

The dataset is partitioned into official training and test splits, comprising 4,845 and 4,659 images respectively, specifically over the 19 selected categories most relevant for indoor environmental understanding. No separate validation set or cross-validation protocol is employed, with all experiments adhering strictly to the fixed split defined by Song et al. (2015). Scene labels represent global scene types and are inherited by all local data representations (e.g., patches) used in weakly supervised training stages (Song et al., 2018).

3. Preprocessing and HHA Encoding Pipeline

RGB images undergo standard resizing to a canonical network input resolution (approximately 227×227) and mean subtraction, with the mean computed from either ImageNet or Places datasets. Color augmentation is not reported as part of the preprocessing pipeline. Depth data is converted from raw sensor readings to the HHA format before undergoing the same resizing and mean-subtraction pipeline as RGB. For weakly-supervised pretraining, dense patch extraction is performed: with the AlexNet-style variant, a 4×4 grid of 99×99 patches per image is used; with the final D-CNN architecture, a 7×7 grid of 35×35 patches per image is adopted, where each patch is assigned the global scene label, constituting a weak supervision regime (Song et al., 2018).
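The dense patch extraction can be sketched as follows: top-left corners are spaced evenly so that a g×g grid of p×p crops covers the image, and every crop inherits the image-level scene label. The helper name, the 227×227 input size, and the example label are illustrative assumptions, not details from the paper.

```python
import numpy as np

def extract_patch_grid(image, grid=7, patch=35):
    """Extract a grid x grid set of patch x patch crops whose top-left corners
    are evenly spaced so the last crop ends at the image border (sketch)."""
    H, W = image.shape[:2]
    ys = np.linspace(0, H - patch, grid).round().astype(int)
    xs = np.linspace(0, W - patch, grid).round().astype(int)
    patches = [image[y:y + patch, x:x + patch] for y in ys for x in xs]
    return np.stack(patches)

# A 7x7 grid of 35x35 patches, as used for WSP pretraining of the D-CNN.
img = np.zeros((227, 227, 3))
patches = extract_patch_grid(img)      # 49 patches of shape (35, 35, 3)
labels = np.full(len(patches), 4)      # weak supervision: every patch gets the
                                       # image's scene label (class 4 here)
```

With the AlexNet-style variant, the same helper would be called with `grid=4, patch=99`.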

4. CNN Architectures and Dataset Adaptation

Multiple CNN architectures are deployed and adapted to the dataset’s scale and modality properties:

  • AlexNet-style CNN: Standard AlexNet convolutional and fully connected layer topology, with 227×227 input and 11×11 convolution in the first layer, primarily used as a benchmark for evaluating fine-tuning from large-scale RGB models (e.g., Places-CNN) to the depth modality.
  • WSP-CNN (Weakly Supervised Patch CNN): A compact three-layer convolutional network engineered for 35×35 patches, featuring Conv1 (5×5 kernels, stride 2), Conv2 (3×3, stride 1), and Conv3 (3×3, stride 1), interleaved with max-pooling and concluding with two fully connected layers (fc6 and fc7, each 4096-dimensional). This reduces parameter count to better fit the modest dataset size.
  • D-CNN (Full-Image Depth Network): Conv1–Conv3 weights are initialized from the pretrained WSP-CNN. An additional Conv4 (3×3, 512 channels) and a Spatial Pyramid Pooling layer with 29×29, 15×15, and 10×10 grids are employed to aggregate the feature maps into a fixed-length vector, which is then processed by a 4096-unit fc6, 4096-unit fc7, and a 19-way softmax output. This architecture is optimized specifically for the scale and content of SUN RGB-D’s depth modality (Song et al., 2018).
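The role of the Spatial Pyramid Pooling layer can be made concrete with a small sketch: each grid partitions the Conv4 feature map spatially, pools within every cell, and the cell vectors are concatenated, so the output length depends only on the grid sizes and channel count, never on the input resolution. The choice of max-pooling and the exact bin alignment here are assumptions; the paper's operator may differ.

```python
import numpy as np

def spatial_pyramid_pool(fmap, grids=(29, 15, 10)):
    """Pool a C x H x W feature map over several fixed grids and concatenate,
    yielding a fixed-length vector regardless of H and W (SPP sketch)."""
    C, H, W = fmap.shape
    out = []
    for g in grids:
        # Bin edges partitioning the spatial extent into g x g cells.
        ye = np.linspace(0, H, g + 1).astype(int)
        xe = np.linspace(0, W, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                cell = fmap[:, ye[i]:max(ye[i + 1], ye[i] + 1),
                               xe[j]:max(xe[j + 1], xe[j] + 1)]
                out.append(cell.max(axis=(1, 2)))  # one C-vector per cell
    return np.concatenate(out)

# With Conv4's 512 channels and 29/15/10 grids, the pooled vector has
# (29**2 + 15**2 + 10**2) * 512 = 596,992 entries.
vec = spatial_pyramid_pool(np.random.rand(512, 29, 29))
```

The fixed output length is what lets the fully connected fc6 layer follow a convolutional trunk whose spatial extent could otherwise vary.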

5. Training Methodology and Optimization

Training is performed in two major phases:

  1. Weakly Supervised Patch Training (WSP): Dense patch extraction is executed, assigning global scene labels to each patch. The WSP-CNN is trained from scratch using softmax cross-entropy loss without leveraging any pretrained weights from RGB sources for the depth channel.
  2. Full-Image Fine-Tuning: The conv1–conv3 layers from the trained WSP-CNN are transferred to the D-CNN, while subsequent layers are randomly initialized. The network is then fine-tuned on the entire set of full input images and corresponding scene labels using stochastic gradient descent and softmax loss.
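The phase-2 initialization amounts to copying the lower convolutional weights and freshly initializing everything above them. A schematic sketch with toy tensor shapes (the dictionary layout, helper name, and all sizes are illustrative stand-ins, not the paper's real parameter store):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameter store for the pretrained WSP-CNN (shapes are illustrative).
wsp_cnn = {
    "conv1": rng.normal(size=(8, 3, 5, 5)),
    "conv2": rng.normal(size=(16, 8, 3, 3)),
    "conv3": rng.normal(size=(32, 16, 3, 3)),
    "fc6":   rng.normal(size=(64, 128)),
}

def init_dcnn(wsp, transfer=("conv1", "conv2", "conv3")):
    """Phase-2 initialization sketch: copy conv1-conv3 from the WSP-CNN and
    randomly initialize the layers added on top (Conv4, fc6/fc7, softmax)."""
    dcnn = {name: wsp[name].copy() for name in transfer}
    dcnn["conv4"] = rng.normal(scale=0.01, size=(64, 32, 3, 3))  # new layer
    dcnn["fc6"] = rng.normal(scale=0.01, size=(64, 256))         # re-initialized head
    return dcnn

dcnn = init_dcnn(wsp_cnn)
```

The copied trunk is then fine-tuned jointly with the fresh head, rather than being frozen.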

Ablation strategies also include FT-top (freezing all but classifier layers), FT-bottom (training only initial layers), and FT-keep (removing and retraining upper layers), highlighting the effects of different transfer and fine-tuning schemes on learning modality-specific features (Song et al., 2018).

6. RGB–Depth Feature Fusion Mechanism

Feature fusion is realized by projecting the penultimate-layer (4096-D) features of the RGB and depth streams into a shared space of dimension D_3 through modality-specific linear projections, as

F_rgbd = [W_rgb  W_depth] · [F_rgb; F_depth]

where W_rgb and W_depth are learnable mixing matrices. A subsequent fully connected layer predicts the 19-class scene labels, with the entire network (including both RGB and depth streams and their fusion layers) optimized end-to-end on the SUN RGB-D training split (Song et al., 2018).
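The block form of the fusion is equivalent to summing two modality-specific projections, which a few lines of numpy make explicit (the fusion dimension of 512 and the random weights are placeholders for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
D3 = 512                                    # shared fusion dimension (assumed value)
F_rgb = rng.normal(size=4096)               # penultimate RGB feature
F_depth = rng.normal(size=4096)             # penultimate depth feature
W_rgb = rng.normal(size=(D3, 4096)) * 0.01  # learnable mixing matrices
W_depth = rng.normal(size=(D3, 4096)) * 0.01

# Block form [W_rgb W_depth] . [F_rgb; F_depth] ...
F_rgbd = np.hstack([W_rgb, W_depth]) @ np.concatenate([F_rgb, F_depth])
# ... equals the sum of the two modality-specific projections.
assert np.allclose(F_rgbd, W_rgb @ F_rgb + W_depth @ F_depth)
```

In a network this is simply a fully connected layer over the concatenated features, trained end-to-end with the rest of the model.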

7. Evaluation Protocol and Comparative Results

SUN RGB-D experiments adopt average per-class accuracy as the evaluation metric—the mean accuracy across the 19 target scene categories. Depth-only results demonstrate that the D-CNN architecture (without additional SVM) achieves 41.2% average accuracy, surpassing fine-tuned Places-CNN baselines (37.5–38.9%), and further increasing to 42.4% with a weighted SVM classifier. For RGB-D fusion, the best configuration (end-to-end RGB-D-CNN fusion plus weighted SVM) reaches state-of-the-art performance at 52.4% average accuracy, outperforming alternatives and previous works (e.g., Zhu et al. (CVPR’16) and Wang et al. (CVPR’16)) by a substantial margin. This demonstrates the effectiveness of modality-specific feature learning and fusion within the SUN RGB-D data regime (Song et al., 2018).
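Average per-class accuracy is the mean of the per-class recalls, so every scene category contributes equally regardless of how many test images it has. A minimal implementation (the helper name is ours; the toy labels are for illustration only):

```python
import numpy as np

def average_per_class_accuracy(y_true, y_pred, n_classes):
    """Mean of per-class recalls: each class counts equally, independent of
    how many samples it has in the test set."""
    accs = [np.mean(y_pred[y_true == c] == c)
            for c in range(n_classes) if np.any(y_true == c)]
    return float(np.mean(accs))

# Toy example: class 0 has 4 samples (3 correct), class 1 has 2 (1 correct).
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
acc = average_per_class_accuracy(y_true, y_pred, 2)  # (0.75 + 0.5) / 2 = 0.625
```

Note that overall accuracy on the same toy data would be 4/6 ≈ 0.667, which illustrates why the class-balanced metric is preferred on a dataset with uneven category frequencies.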

Depth-Only and RGB-D Fusion Results on SUN RGB-D (average per-class accuracy, %)

Method                              RGB    Depth   RGB-D
FT-Places-CNN (HHA)                 41.5   37.5    45.4
RGB-D-CNN (fusion, Places + D-CNN)  41.5   41.2    50.9
RGB-D-CNN (fusion) + weighted SVM   42.7   42.4    52.4

The observed performance improvements underscore that architectures and training paradigms explicitly adapted to the specifics of RGB-D data, as exemplified by SUN RGB-D, provide clear advantage for scene recognition tasks (Song et al., 2018).
