DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation (2309.09668v2)

Published 18 Sep 2023 in cs.CV

Abstract: We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) Unlike previous works that encode RGB-D information with RGB pretrained backbone, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence the DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pretrained backbones, which widely lies in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer.


Summary

  • The paper introduces image-depth pair pretraining, integrating depth information from the beginning to naturally encode RGB-D features.
  • The paper leverages specialized RGB-D blocks that efficiently capture 3D geometry and address representation distribution shifts.
  • The paper demonstrates state-of-the-art results with 57.2% mIoU on NYU Depth v2 while reducing computational cost by over 50%.

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

In recent computer vision work on RGB-D (RGB + Depth) segmentation, the standard practice has been to pretrain backbones on RGB data alone and to introduce depth information only during finetuning. This paper presents DFormer, a framework for RGB-D representation learning tailored to semantic segmentation. The approach challenges this status quo by integrating depth data during the pretraining phase, shifting how RGB-D segmentation models are constructed and optimized.

Key Contributions

DFormer introduces two major innovations that distinguish it from existing RGB-D segmentation methods:

  1. Image-Depth Pair Pretraining: Unlike traditional techniques that pretrain backbones on RGB images only, DFormer is pretrained on image-depth pairs derived from ImageNet-1K. This shift ensures that the resulting model is inherently capable of encoding RGB-D representations, rather than relying on RGB-pretrained backbones that do not natively understand depth information.
  2. Specialized RGB-D Blocks: The DFormer backbone is built from a sequence of RGB-D blocks designed to encode both RGB and depth information. These blocks resolve the representation distribution shift caused by RGB-only pretraining and capture the 3D geometry relationships present in depth maps (a minimal sketch of such a two-branch block follows this list).
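
To make the two-branch idea concrete, below is a minimal PyTorch-style sketch of a block that processes RGB and depth features in parallel and lets each modality gate the other. The module names, channel sizes, and gating scheme are illustrative assumptions for exposition only; the actual DFormer block design differs in its details (see the paper and repository).

```python
import torch
import torch.nn as nn


class RGBDBlock(nn.Module):
    """Illustrative two-branch RGB-D block (hypothetical design, not the exact DFormer block).

    RGB and depth features are mixed spatially within each branch, then each
    branch is modulated by a gate computed from the other modality.
    """

    def __init__(self, rgb_dim: int = 64, depth_dim: int = 32):
        super().__init__()
        # Per-modality spatial mixing (depthwise convolutions)
        self.rgb_dw = nn.Conv2d(rgb_dim, rgb_dim, 3, padding=1, groups=rgb_dim)
        self.depth_dw = nn.Conv2d(depth_dim, depth_dim, 3, padding=1, groups=depth_dim)
        # Cross-modal projections used to build the gates
        self.depth_to_rgb = nn.Conv2d(depth_dim, rgb_dim, 1)
        self.rgb_to_depth = nn.Conv2d(rgb_dim, depth_dim, 1)
        # Output projections before the residual additions
        self.rgb_proj = nn.Conv2d(rgb_dim, rgb_dim, 1)
        self.depth_proj = nn.Conv2d(depth_dim, depth_dim, 1)
        self.act = nn.GELU()

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # Spatial mixing within each modality
        r = self.act(self.rgb_dw(rgb))
        d = self.act(self.depth_dw(depth))
        # Cross-modal interaction: depth gates RGB, RGB gates depth
        rgb = rgb + self.rgb_proj(r * torch.sigmoid(self.depth_to_rgb(d)))
        depth = depth + self.depth_proj(d * torch.sigmoid(self.rgb_to_depth(r)))
        return rgb, depth


if __name__ == "__main__":
    block = RGBDBlock()
    rgb = torch.randn(2, 64, 120, 160)    # RGB feature map
    depth = torch.randn(2, 32, 120, 160)  # depth feature map (fewer channels)
    rgb_out, depth_out = block(rgb, depth)
    print(rgb_out.shape, depth_out.shape)  # spatial sizes are preserved
```

Keeping the depth branch slimmer than the RGB branch, as in this sketch, is one common way such designs reduce computation while still injecting geometric cues at every stage.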

Experimental Validation

DFormer was evaluated on two prominent RGB-D tasks: semantic segmentation and salient object detection. In both, it established new state-of-the-art results across several datasets while using less than half the computational cost of the previous best methods.

For semantic segmentation, DFormer attains 57.2% mean Intersection-over-Union (mIoU) on the NYU Depth v2 dataset, highlighting the benefit of integrating depth during pretraining. It also delivers strong results in RGB-D salient object detection across five benchmark datasets.
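
For reference, mIoU averages the per-class ratio of the overlap between prediction and ground truth to their union. A small NumPy sketch of the metric (the toy label maps below are made up for illustration):

```python
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union; `pred` and `gt` are integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:  # class absent in both prediction and ground truth; skip it
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))


# Toy example: 4x4 label maps with 3 classes
pred = np.array([[0, 0, 1, 1], [0, 2, 1, 1], [2, 2, 2, 1], [0, 0, 2, 2]])
gt   = np.array([[0, 0, 1, 1], [0, 2, 2, 1], [2, 2, 2, 1], [0, 0, 0, 2]])
print(f"mIoU = {mean_iou(pred, gt, num_classes=3):.3f}")
```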

Implications and Future Work

The implications of DFormer are substantial. Integrating depth data during pretraining not only enriches the learned feature representations but also improves the model's efficiency and effectiveness in both training and inference. This architectural and methodological shift opens new possibilities for tasks where an understanding of spatial geometry is paramount.

Moving forward, exploring analogous pretraining strategies for other modalities, such as thermal or LiDAR data, is a compelling direction. Beyond scene understanding, real-time applications in robotics and autonomous driving may also benefit from such enhanced multimodal representations.

DFormer sets a precedent for unified pretraining frameworks and demonstrates the benefits of reconsidering which modalities are used for pretraining in representation learning. The results advocate for integrating multimodal data early in the training pipeline, yielding a deeper and more comprehensive understanding of complex scenes.
