DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation (2309.09668v2)
Abstract: We present DFormer, a novel RGB-D pretraining framework that learns transferable representations for RGB-D segmentation tasks. DFormer has two key innovations: 1) Unlike previous works that encode RGB-D information with an RGB-pretrained backbone, we pretrain the backbone on image-depth pairs from ImageNet-1K, so DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of 3D geometric relationships in depth maps by RGB-pretrained backbones, a problem that is widespread in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that DFormer achieves new state-of-the-art performance on both tasks, evaluated on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets, at less than half the computational cost of the current best methods. Our code is available at: https://github.com/VCIP-RGBD/DFormer.