DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation (2309.09668v2)

Published 18 Sep 2023 in cs.CV

Abstract: We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) Unlike previous works that encode RGB-D information with RGB pretrained backbone, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence the DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pretrained backbones, which widely lies in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer.


Summary

  • The paper introduces image-depth pair pretraining, integrating depth information from the beginning to naturally encode RGB-D features.
  • The paper leverages specialized RGB-D blocks that efficiently capture 3D geometry and address representation distribution shifts.
  • The paper demonstrates state-of-the-art results with 57.2% mIoU on NYU Depth v2 while reducing computational cost by over 50%.

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

In recent computer vision work on RGB-D (RGB + Depth) segmentation, the standard practice has been to pretrain backbones on RGB data alone and to introduce depth information only during finetuning. This paper presents DFormer, a framework for RGB-D representation learning tailored to semantic segmentation. The approach challenges this status quo by integrating depth data during the pretraining phase, shifting how RGB-D segmentation models are constructed and optimized.

Key Contributions

DFormer introduces two major innovations that distinguish it from existing RGB-D segmentation methods:

  1. Image-Depth Pair Pretraining: Unlike traditional techniques that pretrain backbones on RGB images only, DFormer is pretrained on image-depth pairs derived from ImageNet-1K. This shift ensures that the resulting model is inherently capable of encoding RGB-D representations, rather than relying on RGB-pretrained backbones that do not natively understand depth information.
  2. Specialized RGB-D Blocks: The DFormer backbone is built from a sequence of RGB-D blocks designed to encode both RGB and depth information. These blocks resolve the representation distribution shift caused by RGB-only pretraining and capture the 3D geometry relationships present in depth maps (a minimal sketch of such a two-branch block follows this list).
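
To make the two-branch idea concrete, below is a minimal PyTorch-style sketch of a block that processes RGB and depth features in parallel and lets each modality gate the other. The module names, channel sizes, and gating scheme are illustrative assumptions for exposition only; the actual DFormer block design differs in its details (see the paper and repository).

```python
import torch
import torch.nn as nn


class RGBDBlock(nn.Module):
    """Illustrative two-branch RGB-D block (hypothetical design, not the exact DFormer block).

    RGB and depth features are mixed spatially within each branch, then each
    branch is modulated by a gate computed from the other modality.
    """

    def __init__(self, rgb_dim: int = 64, depth_dim: int = 32):
        super().__init__()
        # Per-modality spatial mixing (depthwise convolutions)
        self.rgb_dw = nn.Conv2d(rgb_dim, rgb_dim, 3, padding=1, groups=rgb_dim)
        self.depth_dw = nn.Conv2d(depth_dim, depth_dim, 3, padding=1, groups=depth_dim)
        # Cross-modal projections used to build the gates
        self.depth_to_rgb = nn.Conv2d(depth_dim, rgb_dim, 1)
        self.rgb_to_depth = nn.Conv2d(rgb_dim, depth_dim, 1)
        # Output projections before the residual additions
        self.rgb_proj = nn.Conv2d(rgb_dim, rgb_dim, 1)
        self.depth_proj = nn.Conv2d(depth_dim, depth_dim, 1)
        self.act = nn.GELU()

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # Spatial mixing within each modality
        r = self.act(self.rgb_dw(rgb))
        d = self.act(self.depth_dw(depth))
        # Cross-modal interaction: depth gates RGB, RGB gates depth
        rgb = rgb + self.rgb_proj(r * torch.sigmoid(self.depth_to_rgb(d)))
        depth = depth + self.depth_proj(d * torch.sigmoid(self.rgb_to_depth(r)))
        return rgb, depth


if __name__ == "__main__":
    block = RGBDBlock()
    rgb = torch.randn(2, 64, 120, 160)    # RGB feature map
    depth = torch.randn(2, 32, 120, 160)  # depth feature map (fewer channels)
    rgb_out, depth_out = block(rgb, depth)
    print(rgb_out.shape, depth_out.shape)  # spatial sizes are preserved
```

Keeping the depth branch slimmer than the RGB branch, as in this sketch, is one common way such designs reduce computation while still injecting geometric cues at every stage.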

Experimental Validation

DFormer was evaluated on two prominent RGB-D tasks: semantic segmentation and salient object detection. In both, it established new state-of-the-art results across several datasets while using less than half the computational cost of the previous best methods.

For semantic segmentation, DFormer attains 57.2% mean Intersection-over-Union (mIoU) on the NYU Depth v2 dataset, highlighting the benefit of integrating depth during pretraining. It also delivers strong results in RGB-D salient object detection across five benchmark datasets.
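
For reference, mIoU averages the per-class ratio of the overlap between prediction and ground truth to their union. A small NumPy sketch of the metric (the toy label maps below are made up for illustration):

```python
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union; `pred` and `gt` are integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:  # class absent in both prediction and ground truth; skip it
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))


# Toy example: 4x4 label maps with 3 classes
pred = np.array([[0, 0, 1, 1], [0, 2, 1, 1], [2, 2, 2, 1], [0, 0, 2, 2]])
gt   = np.array([[0, 0, 1, 1], [0, 2, 2, 1], [2, 2, 2, 1], [0, 0, 0, 2]])
print(f"mIoU = {mean_iou(pred, gt, num_classes=3):.3f}")
```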

Implications and Future Work

The implications of DFormer are substantial. Integrating depth data during pretraining not only enriches the learned feature representations but also improves the model's efficiency and effectiveness in both training and inference. This architectural and methodological shift opens new possibilities for tasks where an understanding of spatial geometry is paramount.

Moving forward, exploring analogous pretraining strategies for other modalities, such as thermal or LiDAR data, is a compelling direction. Beyond scene understanding, real-time applications in robotics and autonomous driving may also benefit from such enhanced multimodal representations.

DFormer sets a precedent for unified pretraining frameworks and demonstrates the benefits of reconsidering which modalities are used for pretraining in representation learning. The results advocate for integrating multimodal data early in the training pipeline, yielding a deeper and more comprehensive understanding of complex scenes.
