Context-Aware Interaction Network for RGB-T Semantic Segmentation (2401.01624v1)
Abstract: RGB-T semantic segmentation is a key technique for understanding autonomous driving scenes. Existing RGB-T semantic segmentation methods, however, do not effectively exploit the complementary relationship between modalities when exchanging information across multiple levels. To address this issue, we propose the Context-Aware Interaction Network (CAINet) for RGB-T semantic segmentation, which constructs an interaction space that exploits auxiliary tasks and global context for explicitly guided learning. Specifically, we propose a Context-Aware Complementary Reasoning (CACR) module that establishes the complementary relationship between multimodal features using long-term context in both the spatial and channel dimensions. Further, given the importance of global contextual and detailed information, we propose a Global Context Modeling (GCM) module and a Detail Aggregation (DA) module, and introduce specific auxiliary supervision to explicitly guide the context interaction and refine the segmentation map. Extensive experiments on two benchmark datasets, MFNet and PST900, demonstrate that the proposed CAINet achieves state-of-the-art performance. The code is available at https://github.com/YingLv1106/CAINet.
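Since the released code is PyTorch, the following is a minimal sketch of the general idea the abstract describes: fusing RGB and thermal features while reweighting them along the channel and spatial dimensions. It is an illustration only, not the paper's actual CACR/GCM/DA implementation; the class name, layer choices, and shapes are all assumptions, and the real modules are in the linked repository.

```python
# Illustrative sketch (hypothetical, not the authors' CACR module): fuse RGB
# and thermal feature maps with channel- and spatial-attention gates.
import torch
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    """Fuse RGB and thermal feature maps of shape (B, C, H, W)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel gate: global average pooling -> bottleneck MLP -> sigmoid.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * 2, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels * 2, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: a single attention map over the concatenated features.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels * 2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Project the gated features back to C channels.
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, thermal], dim=1)  # (B, 2C, H, W)
        x = x * self.channel_gate(x)          # reweight channels
        x = x * self.spatial_gate(x)          # reweight spatial locations
        return self.fuse(x)                   # (B, C, H, W)


if __name__ == "__main__":
    block = CrossModalInteraction(channels=64)
    rgb = torch.randn(2, 64, 60, 80)
    thermal = torch.randn(2, 64, 60, 80)
    print(block(rgb, thermal).shape)  # torch.Size([2, 64, 60, 80])
```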
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE CVPR, Jun. 2015, pp. 3431–3440.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in Proc. ICLR, May 2015, pp. 1–14.
- V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
- E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in Proc. NeurIPS, Dec. 2021, pp. 12077–12090.
- G. Gao, G. Xu, J. Li, Y. Yu, H. Lu, and J. Yang, “FBSNet: A fast bilateral symmetrical network for real-time semantic segmentation,” IEEE Trans. Multimedia, pp. 1–1, 2022.
- T. Chen, G.-S. Xie, Y. Yao, Q. Wang, F. Shen, Z. Tang, and J. Zhang, “Semantically meaningful class prototype learning for one-shot image segmentation,” IEEE Trans. Multimedia, vol. 24, pp. 968–980, 2022.
- M. Zhang, Y. Zhou, B. Liu, J. Zhao, R. Yao, Z. Shao, and H. Zhu, “Weakly supervised few-shot semantic segmentation via pseudo mask enhancement and meta learning,” IEEE Trans. Multimedia, pp. 1–13, 2022.
- L. Ma, H. Xie, C. Liu, and Y. Zhang, “Learning cross-channel representations for semantic segmentation,” IEEE Trans. Multimedia, vol. 25, pp. 2774–2787, 2023.
- L. Zhao, H. Zhou, X. Zhu, X. Song, H. Li, and W. Tao, “LIF-Seg: LiDAR and camera image fusion for 3D LiDAR semantic segmentation,” IEEE Trans. Multimedia, pp. 1–11, 2023.
- Z. Song, L. Zhao, and J. Zhou, “Learning hybrid semantic affinity for point cloud segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4599–4612, 2022.
- Y. Tian and S. Zhu, “Partial domain adaptation on semantic segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 6, pp. 3798–3809, 2022.
- S. Ainetter and F. Fraundorfer, “End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB,” in Proc. IEEE ICRA, 2021, pp. 13452–13458.
- O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, Oct. 2015, pp. 234–241.
- W. Li, W. Jia, Y. Fan, and S. Ma, “Virtual reality image dataset VCAT helps research on semantic segmentation algorithms,” in Proc. IEEE ICIT, 2022, pp. 1–6.
- L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. ECCV, Sept. 2018, pp. 833–851.
- B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” in Proc. NeurIPS, vol. 34, pp. 17864–17875, Oct. 2021.
- G. Li, Z. Bai, Z. Liu, X. Zhang, and H. Ling, “Salient object detection in optical remote sensing images driven by transformer,” IEEE Trans. Image Process., vol. 32, pp. 5257–5269, Sep. 2023.
- Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in Proc. IEEE/RSJ IROS, Sept. 2017, pp. 5108–5115.
- Y. Sun, W. Zuo, and M. Liu, “RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes,” IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 2576–2583, Jul. 2019.
- S. S. Shivakumar, N. Rodrigues, A. Zhou, I. D. Miller, V. Kumar, and C. J. Taylor, “PST900: RGB-thermal calibration, dataset and segmentation network,” in Proc. IEEE ICRA, 2020, pp. 9441–9447.
- Z. Guo, X. Li, Q. Xu, and Z. Sun, “Robust semantic segmentation based on RGB-thermal in variable lighting scenes,” Meas., vol. 186, p. 110176, Dec. 2021.
- Y. Sun, W. Zuo, P. Yun, H. Wang, and M. Liu, “FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion,” IEEE Trans. Autom. Sci. Eng., vol. 18, no. 3, pp. 1000–1011, Jul. 2021.
- Q. Zhang, S. Zhao, Y. Luo, D. Zhang, N. Huang, and J. Han, “ABMDRNet: Adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation,” in Proc. IEEE CVPR, Jun. 2021, pp. 2633–2642.
- W. Zhou, S. Dong, C. Xu, and Y. Qian, “Edge-aware guidance fusion network for RGB-thermal scene parsing,” in Proc. AAAI, Feb. 2022.
- F. Deng, H. Feng, M. Liang, H. Wang, Y. Yang, Y. Gao, J. Chen, J. Hu, X. Guo, and T. L. Lam, “FEANet: Feature-enhanced attention network for RGB-thermal real-time semantic segmentation,” in Proc. IEEE/RSJ IROS, Sept. 2021, pp. 4467–4473.
- W. Zhou, S. Dong, J. Lei, and L. Yu, “MTANet: Multitask-aware network with hierarchical multimodal fusion for RGB-T urban scene understanding,” IEEE Trans. Intell. Veh., 2022, doi: 10.1109/TIV.2022.3164899.
- W. Zhou, J. Liu, J. Lei, L. Yu, and J.-N. Hwang, “GMNet: Graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation,” IEEE Trans. Image Process., vol. 30, pp. 7790–7802, 2021.
- G. Li, Y. Wang, Z. Liu, X. Zhang, and D. Zeng, “RGB-T semantic segmentation with location, activation, and sharpening,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 3, pp. 1223–1235, Mar. 2023.
- W. Zhou, Y. Lv, J. Lei, and L. Yu, “Embedded control gate fusion and attention residual learning for RGB-thermal urban scene parsing,” IEEE Trans. Intell. Transp. Syst., Feb. 2023.
- W. Zhou, X. Lin, J. Lei, L. Yu, and J.-N. Hwang, “MFFENet: Multiscale feature fusion and enhancement network for RGB-thermal urban road scene parsing,” IEEE Trans. Multimedia, vol. 24, pp. 2526–2538, 2022.
- H. Zhou, C. Tian, Z. Zhang, Q. Huo, Y. Xie, and Z. Li, “Multispectral fusion transformer network for RGB-thermal urban scene semantic segmentation,” IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, Jun. 2022.
- W. Wu, T. Chu, and Q. Liu, “Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation,” Pattern Recognit., vol. 131, p. 108881, 2022.
- H. Liu, J. Zhang, K. Yang, X. Hu, and R. Stiefelhagen, “CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers,” arXiv preprint arXiv:2203.04838, 2022.
- A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” in Proc. ICLR, Feb. 2018.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.
- H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE CVPR, Jul. 2017, pp. 6230–6239.
- X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE CVPR, June 2018.
- J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011–2023, Aug. 2020.
- S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in Proc. ECCV, Sept. 2018, pp. 3–19.
- Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “GCNet: Non-local networks meet squeeze-excitation networks and beyond,” in Proc. IEEE ICCVW, 2019.
- X. Zhao, L. Zhang, Y. Pang, H. Lu, and L. Zhang, “A single stream network for robust and real-time RGB-D salient object detection,” in Proc. ECCV, Aug. 2020, pp. 646–662.
- S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang, “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proc. IEEE CVPR, June 2021, pp. 6881–6890.
- B. Zhang, Z. Tian, Q. Tang, X. Chu, X. Wei, C. Shen et al., “SegViT: Semantic segmentation with plain vision transformers,” in Proc. NeurIPS, vol. 35, pp. 4971–4982, Dec. 2022.
- D. Kim, J. Kim, S. Cho, C. Luo, and S. Hong, “Universal few-shot learning of dense prediction tasks with visual token matching,” in Proc. ICLR, 2023. [Online]. Available: https://openreview.net/forum?id=88nT0j5jAn
- X. Lan, X. Gu, and X. Gu, “MMNet: Multi-modal multi-stage network for RGB-T image semantic segmentation,” Appl. Intell., vol. 52, no. 5, pp. 5817–5829, Mar. 2022.
- Y. Wang, T. Lu, Y. Yao, Y. Zhang, and Z. Xiong, “Learning to hallucinate face in the dark,” IEEE Trans. Multimedia, pp. 1–13, Jul. 2023.
- J. Liu, W. Zhou, Y. Cui, L. Yu, and T. Luo, “GCNet: Grid-like context-aware network for RGB-thermal semantic segmentation,” Neurocomputing, vol. 508, pp. 60–67, Sept. 2022.
- T. Gong, W. Zhou, X. Qian, J. Lei, and L. Yu, “Global contextually guided lightweight network for RGB-thermal urban scene understanding,” Eng. Appl. Artif. Intell., vol. 117, pp. 1–11, Jan. 2023.
- S. Yi, J. Li, X. Liu, and X. Yuan, “CCAFFMNet: Dual-spectral semantic segmentation network with channel-coordinate attention feature fusion module,” Neurocomputing, vol. 482, pp. 236–251, Apr. 2022.
- C. Xu, Q. Li, X. Jiang, D. Yu, and Y. Zhou, “Dual-space graph-based interaction network for RGB-thermal semantic segmentation in electric power scene,” IEEE Trans. Circuits Syst. Video Technol., Oct. 2022.
- S. Zhao and Q. Zhang, “A feature divide-and-conquer network for RGB-T semantic segmentation,” IEEE Trans. Circuits Syst. Video Technol., pp. 1–14, 2022, doi: 10.1109/TCSVT.2022.3229359.
- O. Frigo, L. Martin-Gaffe, and C. Wacongne, “DooDLeNet: Double DeepLab enhanced feature fusion for thermal-color semantic segmentation,” in Proc. IEEE CVPRW, Jun. 2022, pp. 3020–3028.
- Y. Wang, Z. Cui, and Y. Li, “Distribution-consistent modal recovering for incomplete multimodal learning,” in Proc. IEEE ICCV, Oct. 2023, pp. 22 025–22 034.
- Y. Wang, Y. Li, and Z. Cui, “Incomplete multimodality-diffused emotion recognition,” in Proc. NeurIPS, Sep. 2023.
- Y. Fu, Q. Chen, and H. Zhao, “CGFNet: Cross-guided fusion network for RGB-thermal semantic segmentation,” Vis. Comput., vol. 38, no. 9-10, pp. 3243–3252, July 2022.
- Z. Feng, Y. Guo, and Y. Sun, “CEKD: Cross-modal edge-privileged knowledge distillation for semantic scene understanding using only thermal images,” IEEE Robot. Autom. Lett., vol. 8, no. 4, pp. 2205–2212, Jan. 2023.
- Y. Cai, W. Zhou, L. Zhang, L. Yu, and T. Luo, “DHFNet: Dual-decoding hierarchical fusion network for RGB-thermal semantic segmentation,” Jan. 2023.
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE CVPR, Jun. 2018, pp. 4510–4520.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE CVPR, 2009, pp. 248–255.
- X. Ma, Y. Zhou, H. Wang, C. Qin, B. Sun, C. Liu, and Y. Fu, “Image as set of points,” in Proc. ICLR, 2023.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.
- Y. Chen, M. Rohrbach, Z. Yan, S. Yan, J. Feng, and Y. Kalantidis, “Graph-based global reasoning networks,” in Proc. IEEE CVPR, June 2019, pp. 433–442.
- M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “DenseASPP for semantic segmentation in street scenes,” in Proc. IEEE CVPR, Jun. 2018, pp. 3684–3692.
- W. Zhou, Y. Lv, J. Lei, and L. Yu, “Global and local-contrast guides content-aware fusion for RGB-D saliency prediction,” IEEE Trans. Syst. Man Cybern. Syst., vol. 51, no. 6, pp. 3641–3649, Dec. 2021.
- M. Berman, A. Rannen Triki, and M. B. Blaschko, “The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proc. IEEE CVPR, 2018, pp. 4413–4421.
- A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Proc. NeurIPS, Dec. 2019, pp. 8024–8035.
- W. Wang and U. Neumann, “Depth-aware CNN for RGB-D segmentation,” in Proc. ECCV, Sept. 2018, pp. 144–161.
- X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3D graph neural networks for RGB-D semantic segmentation,” in Proc. IEEE ICCV, 2017, pp. 5199–5208.
- Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, “LSTM-CF: Unifying context modeling and fusion with LSTMs for RGB-D scene labeling,” in Proc. ECCV, 2016, pp. 541–557.
- G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proc. IEEE CVPR, Jul. 2017, pp. 5168–5177.
- S. Lee, S.-J. Park, and K.-S. Hong, “RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation,” in Proc. IEEE ICCV, Oct. 2017, pp. 4990–4999.
- X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng, “Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation,” in Proc. ECCV, Aug. 2020, pp. 561–577.
- N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGB-D images,” in Proc. ECCV, Oct. 2012, pp. 746–760.