Context-Aware Interaction Network for RGB-T Semantic Segmentation (2401.01624v1)

Published 3 Jan 2024 in cs.CV

Abstract: RGB-T semantic segmentation is a key technique for autonomous driving scene understanding. Existing RGB-T semantic segmentation methods, however, do not effectively exploit the complementary relationship between modalities during the information interaction across multiple levels. To address this issue, the Context-Aware Interaction Network (CAINet) is proposed for RGB-T semantic segmentation, which constructs an interaction space to exploit auxiliary tasks and global context for explicitly guided learning. Specifically, we propose a Context-Aware Complementary Reasoning (CACR) module aimed at establishing the complementary relationship between multimodal features with long-term context in both spatial and channel dimensions. Further, considering the importance of global contextual and detailed information, we propose the Global Context Modeling (GCM) module and Detail Aggregation (DA) module, and we introduce specific auxiliary supervision to explicitly guide the context interaction and refine the segmentation map. Extensive experiments on the two benchmark datasets MFNet and PST900 demonstrate that the proposed CAINet achieves state-of-the-art performance. The code is available at https://github.com/YingLv1106/CAINet.

Summary

  • The paper introduces CAINet, a model that leverages a Context-Aware Complementary Reasoning module to effectively integrate RGB and thermal features.
  • Its Global Context Modeling and Detail Aggregation modules enhance feature representation and precision in semantic segmentation.
  • Experiments on MFNet and PST900 show that CAINet achieves high mIoU with only 12.16 million parameters, highlighting both accuracy and efficiency.

Context-Aware Interaction Network for RGB-T Semantic Segmentation: An In-depth Analysis

The paper "Context-Aware Interaction Network for RGB-T Semantic Segmentation" presents a significant contribution to the field of computer vision, specifically focusing on the semantic segmentation of RGB-T images. This research addresses the critical challenge of effectively integrating multimodal information from RGB and thermal (T) images, which is essential for improving scene understanding in various applications, such as autonomous driving.

Overview and Methodology

The authors introduce the Context-Aware Interaction Network (CAINet), a model designed to improve RGB-T semantic segmentation by leveraging the complementary nature of RGB and thermal data. Its primary innovation is a set of modules that exploit contextual and feature interactions across multiple levels of the network, strengthening both local and global feature representations.

  1. Context-Aware Complementary Reasoning (CACR) Module: This module establishes complementary relationships between multimodal features by reasoning over long-term context in both spatial and channel dimensions, which lets the network exploit the unique information each modality provides (a simplified fusion sketch appears after this list).
  2. Global Context Modeling (GCM) Module: This component is responsible for capturing global context to guide the interaction of multi-level features, thus improving the semantic consistency of the resulting feature maps.
  3. Detail Aggregation (DA) Module: Recognizing the importance of boundary detail for refining segmentation maps, the DA module aggregates fine details from lower-level features, facilitating more precise segmentation outputs.
  4. Residual Learning with Auxiliary Supervision: The model further incorporates an auxiliary supervision mechanism, which guides the network towards more robust feature representations through residual learning and explicit supervision at multiple levels.
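
To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of attention-gated RGB-T feature fusion along channel and spatial dimensions. It is not the authors' CACR module (which additionally reasons over long-term context and is available in the linked repository); the class and variable names are illustrative only.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Toy sketch of attention-gated RGB-T fusion (not the official CACR)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel gate: global pooling -> bottleneck MLP -> sigmoid weights.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: 7x7 conv over per-pixel channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, thermal], dim=1)           # (B, 2C, H, W)
        x = x * self.channel_gate(x)                   # re-weight each modality's channels
        avg_map = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)        # (B, 1, H, W)
        x = x * self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return self.project(x)                         # fused feature map, (B, C, H, W)


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 30, 40)
    thermal = torch.randn(2, 64, 30, 40)
    fused = CrossModalAttentionFusion(channels=64)(rgb, thermal)
    print(fused.shape)  # torch.Size([2, 64, 30, 40])
```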

Experimental Results

The effectiveness of CAINet is demonstrated on two benchmark datasets, MFNet and PST900. Notably, CAINet achieves state-of-the-art results, with a mean Intersection over Union (mIoU) of 58.6% on the MFNet dataset, outperforming contemporary methods and underscoring its ability to harness the complementary strengths of RGB and thermal inputs.
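
For reference, mIoU is the per-class intersection-over-union averaged over all classes. The following NumPy sketch implements the conventional definition of the metric; it is not taken from the CAINet evaluation code, and the 9-class setting merely mirrors the MFNet label set (8 object classes plus unlabeled).

```python
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Standard mIoU from a confusion matrix (rows = ground truth, cols = prediction)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    valid = (gt >= 0) & (gt < num_classes)          # drop ignore/void labels
    np.add.at(conf, (gt[valid], pred[valid]), 1)    # accumulate pixel counts
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)              # classes with empty union count as 0
    return float(iou.mean())


# Example with random label maps at a typical MFNet resolution.
pred = np.random.randint(0, 9, size=(480, 640))
gt = np.random.randint(0, 9, size=(480, 640))
print(f"mIoU = {100 * mean_iou(pred, gt, num_classes=9):.1f}%")
```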

The model is also computationally lightweight, with only 12.16 million parameters and 123.62 GFLOPs, considerably less than many competing models. This efficiency, combined with high segmentation accuracy, makes CAINet attractive for applications that demand both strong performance and computational feasibility.
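
Figures like the 12.16 M parameter count are straightforward to verify for any PyTorch model by summing the element counts of its parameter tensors; FLOPs are usually estimated with profiling tools such as fvcore or thop. The snippet below shows the parameter count on a placeholder torchvision model, since CAINet itself would have to be built from the authors' repository.

```python
import torch
import torchvision


def count_parameters(model: torch.nn.Module) -> int:
    """Total number of learnable parameter elements in a model."""
    return sum(p.numel() for p in model.parameters())


# Placeholder model; swap in CAINet built from the official repository to
# reproduce the 12.16 M figure reported in the paper.
model = torchvision.models.mobilenet_v2()
print(f"{count_parameters(model) / 1e6:.2f} M parameters")  # ~3.50 M for this stand-in
```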

Implications and Future Directions

The proposed CAINet model has substantial implications for multimodal semantic segmentation. By addressing the challenge of effectively integrating RGB and thermal data, it paves the way for more accurate and reliable scene understanding, which is crucial for safe and efficient autonomous systems. The combination of context-aware complementary reasoning and global context modeling offers a promising template for how multimodal information can be processed and fused.

Future work could focus on optimizing the CAINet architecture for real-time deployment. Exploring its adaptability to other multimodal or multi-sensor data could also open new avenues in fields such as robotics, surveillance, and healthcare.

In conclusion, the CAINet framework offers a robust and efficient approach for RGB-T semantic segmentation, setting a new benchmark in the field and inspiring further innovations in multimodal scene understanding.
