Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation (2306.10364v1)
Abstract: Semantic segmentation plays an important role in widespread applications such as autonomous driving and robotic sensing. Traditional methods mostly rely on RGB images, which are heavily affected by lighting conditions, e.g., darkness. Recent studies show that thermal images are robust at night and can serve as a complementary modality for segmentation. However, existing works either simply fuse RGB-Thermal (RGB-T) images or adopt encoders with the same structure for both the RGB stream and the thermal stream, which neglects the modality difference in segmentation under varying lighting conditions. Therefore, this work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation. Specifically, we employ an asymmetric encoder to learn complementary features from the RGB and thermal images. To effectively fuse the dual-modality features, we generate pseudo-labels via saliency detection to supervise the feature learning, and develop the Residual Spatial Fusion (RSF) module, which uses structural re-parameterization to learn richer features by spatially fusing the cross-modality features. RSF employs hierarchical feature fusion to aggregate multi-level features, and applies spatial weights with a residual connection, adaptively controlling the multi-spectral fusion through a confidence gate. Extensive experiments on two benchmarks, i.e., the MFNet and PST900 databases, show that our method achieves state-of-the-art segmentation performance with a good balance between accuracy and speed.
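To make the fusion idea concrete, below is a minimal PyTorch sketch of a residual spatial fusion block in the spirit the abstract describes: a per-pixel spatial weight is predicted from the dual-modality features, suppressed by a confidence gate, and the fused result is added back through a residual connection. The class name, the weight-prediction head, and the gating threshold are all assumptions for illustration; the abstract does not specify the exact module design.

```python
import torch
import torch.nn as nn


class ResidualSpatialFusion(nn.Module):
    """Hypothetical sketch of a residual spatial fusion block.

    Fuses an RGB feature map and a thermal feature map by predicting a
    per-pixel spatial weight, gating it by a confidence threshold, and
    adding the fused result back via a residual connection. Details are
    assumptions, not the paper's verified design.
    """

    def __init__(self, channels: int, gate: float = 0.5):
        super().__init__()
        self.gate = gate
        # Predict a per-pixel weight for the thermal stream from the
        # concatenated dual-modality features.
        self.weight_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        w = self.weight_head(torch.cat([rgb, thermal], dim=1))  # (B,1,H,W) in [0,1]
        # Confidence gate: zero out thermal contributions whose
        # predicted weight falls below the gate threshold.
        w = w * (w > self.gate).float()
        fused = w * thermal + (1.0 - w) * rgb
        # Residual connection keeps the RGB stream as the default path.
        return rgb + fused


if __name__ == "__main__":
    rsf = ResidualSpatialFusion(channels=64)
    rgb_feat = torch.randn(2, 64, 60, 80)
    thermal_feat = torch.randn(2, 64, 60, 80)
    print(rsf(rgb_feat, thermal_feat).shape)  # torch.Size([2, 64, 60, 80])
```

In a full network, one such block would typically be applied at each encoder stage, with the gated outputs aggregated hierarchically across levels.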
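The abstract also mentions structural re-parameterization, the RepVGG-style technique of training with parallel branches and collapsing them into a single convolution for deployment. The sketch below shows the core trick with 3x3, 1x1, and identity branches; batch normalization is omitted for brevity (the cited RepVGG paper additionally folds BN statistics into the fused kernel), and this is a generic illustration rather than RSFNet's exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RepBlock(nn.Module):
    """Train with parallel 3x3, 1x1, and identity branches; deploy as
    a single equivalent 3x3 convolution (BN omitted for brevity)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.conv3(x) + self.conv1(x) + x)

    def fuse(self) -> nn.Conv2d:
        """Collapse the three branches into one 3x3 convolution."""
        c = self.conv3.out_channels
        # Zero-pad the 1x1 kernel to 3x3 so it can be summed with conv3.
        k1 = F.pad(self.conv1.weight, [1, 1, 1, 1])
        # The identity branch equals a 3x3 kernel with a 1 at the
        # center of each channel's own filter.
        k_id = torch.zeros_like(self.conv3.weight)
        for i in range(c):
            k_id[i, i, 1, 1] = 1.0
        fused = nn.Conv2d(c, c, 3, padding=1)
        fused.weight.data = self.conv3.weight + k1 + k_id
        fused.bias.data = self.conv3.bias + self.conv1.bias
        return fused


if __name__ == "__main__":
    block = RepBlock(8).eval()
    x = torch.randn(1, 8, 16, 16)
    with torch.no_grad():
        same = torch.allclose(block(x), F.relu(block.fuse()(x)), atol=1e-5)
    print(same)  # True: multi-branch and fused forms are equivalent
```

The payoff is that the multi-branch structure enriches training-time optimization while the deployed model keeps the speed of a plain single-branch network, which matches the paper's stated accuracy/speed balance.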
Authors: Ping Li, Junjie Chen, Binbin Lin, Xianghua Xu