CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Published 9 Mar 2022 in cs.CV, cs.RO, and eess.IV | (2203.04838v5)

Abstract: Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from the supplementary modality (X-modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem due to variations in sensor characteristics among different modalities. Unlike previous modality-specific methods, in this work, we propose a unified fusion framework, CMX, for RGB-X semantic segmentation. To generalize well across different modalities, that often include supplements as well as uncertainties, a unified cross-modal interaction is crucial for modality fusion. Specifically, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate bi-modal features by leveraging the features from one modality to rectify the features of the other modality. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to perform sufficient exchange of long-range contexts before mixing. To verify CMX, for the first time, we unify five modalities complementary to RGB, i.e., depth, thermal, polarization, event, and LiDAR. Extensive experiments show that CMX generalizes well to diverse multi-modal fusion, achieving state-of-the-art performances on five RGB-Depth benchmarks, as well as RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. Besides, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets the new state-of-the-art. The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation.

References (104)
  1. W. Zhou, J. S. Berrio, S. Worrall, and E. Nebot, “Automated evaluation of semantic segmentation robustness for autonomous driving,” T-ITS, vol. 21, no. 5, pp. 1951–1963, 2020.
  2. K. Yang, X. Hu, Y. Fang, K. Wang, and R. Stiefelhagen, “Omnisupervised omnidirectional semantic segmentation,” T-ITS, vol. 23, no. 2, pp. 1184–1199, 2022.
  3. L. Sun, K. Yang, X. Hu, W. Hu, and K. Wang, “Real-time fusion network for RGB-D semantic segmentation incorporating unexpected obstacle detection for road-driving images,” RA-L, vol. 5, no. 4, pp. 5558–5565, 2020.
  4. J. Zhang, K. Yang, A. Constantinescu, K. Peng, K. Müller, and R. Stiefelhagen, “Trans4Trans: Efficient transformer for transparent object and semantic scene segmentation in real-world navigation assistance,” T-ITS, vol. 23, no. 10, pp. 19173–19186, 2022.
  5. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” TPAMI, vol. 40, no. 4, pp. 834–848, 2018.
  6. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017.
  7. J. Fu et al., “Dual attention network for scene segmentation,” in CVPR, 2019.
  8. X. Hu, K. Yang, L. Fei, and K. Wang, “ACNet: Attention based network to exploit complementary features for RGBD semantic segmentation,” in ICIP, 2019.
  9. X. Chen et al., “Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation,” in ECCV, 2020.
  10. Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in IROS, 2017.
  11. Q. Zhang, S. Zhao, Y. Luo, D. Zhang, N. Huang, and J. Han, “ABMDRNet: Adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation,” in CVPR, 2021.
  12. K. Xiang, K. Yang, and K. Wang, “Polarization-driven semantic segmentation via efficient attention-bridged fusion,” OE, vol. 29, no. 4, pp. 4802–4820, 2021.
  13. J. Zhang, K. Yang, and R. Stiefelhagen, “ISSAFE: Improving semantic segmentation in accidents by fusing event-based data,” in IROS, 2021.
  14. Z. Zhuang, R. Li, K. Jia, Q. Wang, Y. Li, and M. Tan, “Perception-aware multi-sensor fusion for 3D LiDAR semantic segmentation,” in ICCV, 2021.
  15. J. Cao, H. Leng, D. Lischinski, D. Cohen-Or, C. Tu, and Y. Li, “ShapeConv: Shape-aware convolutional layer for indoor RGB-D semantic segmentation,” in ICCV, 2021.
  16. L.-Z. Chen, Z. Lin, Z. Wang, Y.-L. Yang, and M.-M. Cheng, “Spatial information guided convolution for real-time RGBD semantic segmentation,” TIP, vol. 30, pp. 2313–2324, 2021.
  17. F. Deng et al., “FEANet: Feature-enhanced attention network for RGB-thermal real-time semantic segmentation,” in IROS, 2021.
  18. D. Sun, X. Huang, and K. Yang, “A multimodal vision sensor for autonomous driving,” in SPIE, 2019.
  19. R. Girdhar, M. Singh, N. Ravi, L. van der Maaten, A. Joulin, and I. Misra, “Omnivore: A single model for many visual modalities,” in CVPR, 2022.
  20. A. Vaswani et al., “Attention is all you need,” in NeurIPS, 2017.
  21. A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
  22. H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in ICML, 2021.
  23. Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
  24. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in ECCV, 2012.
  25. Y. Liao, J. Xie, and A. Geiger, “KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D,” TPAMI, vol. 45, no. 3, pp. 3292–3310, 2023.
  26. D. Gehrig, M. Rüegg, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, “Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction,” RA-L, vol. 6, no. 2, pp. 2822–2829, 2021.
  27. W. Wang, T. Zhou, F. Yu, J. Dai, E. Konukoglu, and L. Van Gool, “Exploring cross-image pixel contrast for semantic segmentation,” ICCV, 2021.
  28. T. Zhou, W. Wang, E. Konukoglu, and L. Van Gool, “Rethinking semantic segmentation: A prototype view,” in CVPR, 2022.
  29. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in CVPR, 2018.
  30. Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet: Criss-cross attention for semantic segmentation,” in ICCV, 2019.
  31. S. Zheng et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in CVPR, 2021.
  32. R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in ICCV, 2021.
  33. E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in NeurIPS, 2021.
  34. W. Wang et al., “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in ICCV, 2021.
  35. Y. Yuan et al., “HRFormer: High-resolution transformer for dense prediction,” in NeurIPS, 2021.
  36. Y. Zhang, B. Pang, and C. Lu, “Semantic segmentation by early region proxy,” in CVPR, 2022.
  37. F. Lin, Z. Liang, J. He, M. Zheng, S. Tian, and K. Chen, “StructToken: Rethinking semantic segmentation with structural prior,” TCSVT, 2023.
  38. Y. Qian, L. Deng, T. Li, C. Wang, and M. Yang, “Gated-residual block for semantic segmentation using RGB-D data,” T-ITS, vol. 23, no. 8, pp. 11836–11844, 2022.
  39. H. Zhou, L. Qi, H. Huang, X. Yang, Z. Wan, and X. Wen, “CANet: Co-attention network for RGB-D semantic segmentation,” PR, vol. 124, p. 108468, 2022.
  40. Y. Sun, W. Zuo, and M. Liu, “RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes,” RA-L, vol. 4, no. 3, pp. 2576–2583, 2019.
  41. Y. Sun, W. Zuo, P. Yun, H. Wang, and M. Liu, “FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion,” T-ASE, vol. 18, no. 3, pp. 1000–1011, 2021.
  42. W. Zhou, J. Liu, J. Lei, L. Yu, and J.-N. Hwang, “GMNet: Graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation,” TIP, vol. 30, pp. 7790–7802, 2021.
  43. A. Kalra, V. Taamazyan, S. K. Rao, K. Venkataraman, R. Raskar, and A. Kadambi, “Deep polarization cues for transparent object segmentation,” in CVPR, 2020.
  44. J. Zhang, K. Yang, and R. Stiefelhagen, “Exploring event-driven dynamic context for accident scene segmentation,” T-ITS, vol. 23, no. 3, pp. 2606–2622, 2022.
  45. W. Wang and U. Neumann, “Depth-aware CNN for RGB-D segmentation,” in ECCV, 2018.
  46. Y. Xing, J. Wang, and G. Zeng, “Malleable 2.5D convolution: Learning receptive fields along the depth-axis for RGB-D scene parsing,” in ECCV, 2020.
  47. Z. Wu, G. Allibert, C. Stolz, and C. Demonceaux, “Depth-adapted CNN for RGB-D cameras,” in ACCV, 2020.
  48. Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang, “Pattern-affinitive propagation across depth, surface normal and semantic segmentation,” in CVPR, 2019.
  49. R. Bachmann, D. Mizrahi, A. Atanov, and A. Zamir, “MultiMAE: Multi-modal multi-task masked autoencoders,” in ECCV, 2022.
  50. P. Zhang, W. Liu, Y. Lei, and H. Lu, “Hyperfusion-net: Hyper-densely reflective feature fusion for salient object detection,” PR, vol. 93, pp. 521–533, 2019.
  51. Y. Pang, X. Zhao, L. Zhang, and H. Lu, “CAVER: Cross-modal view-mixed transformer for bi-modal salient object detection,” TIP, 2023.
  52. L. Chen et al., “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” in CVPR, 2017.
  53. Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: Attention with linear complexities,” in WACV, 2021.
  54. J. Li, A. Hassani, S. Walton, and H. Shi, “ConvMLP: Hierarchical convolutional MLPs for vision,” arXiv preprint arXiv:2109.04454, 2021.
  55. S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV, 2014.
  56. R. Yan, K. Yang, and K. Wang, “NLFNet: Non-local fusion towards generalized multimodal semantic segmentation across RGB-depth, polarization, and thermal images,” in ROBIO, 2021.
  57. I. Alonso and A. C. Murillo, “EV-SegNet: Semantic segmentation for event-based cameras,” in CVPRW, 2019.
  58. E. Mohammadbagher, N. P. Bhatt, E. Hashemi, B. Fidan, and A. Khajepour, “Real-time pedestrian localization and state estimation using moving horizon estimation,” in ITSC, 2020.
  59. S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” in CVPR, 2015.
  60. G. Zhang, J.-H. Xue, P. Xie, S. Yang, and G. Wang, “Non-local aggregation for RGB-D semantic segmentation,” SPL, vol. 28, pp. 658–662, 2021.
  61. I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-semantic data for indoor scene understanding,” arXiv preprint arXiv:1702.01105, 2017.
  62. A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D reconstructions of indoor scenes,” in CVPR, 2017.
  63. M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.
  64. A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in CoRL, 2017.
  65. Z. Sun, N. Messikommer, D. Gehrig, and D. Scaramuzza, “ESS: Learning event-based semantic segmentation from still images,” in ECCV, 2022.
  66. O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
  67. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  68. X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3D graph neural networks for RGBD semantic segmentation,” in ICCV, 2017.
  69. S. Kong and C. C. Fowlkes, “Recurrent scene parsing with perspective understanding in the loop,” in CVPR, 2018.
  70. Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation,” in CVPR, 2017.
  71. D. Lin, G. Chen, D. Cohen-Or, P.-A. Heng, and H. Huang, “Cascaded feature network for semantic segmentation of RGB-D images,” in ICCV, 2017.
  72. S.-J. Park, K.-S. Hong, and S. Lee, “RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation,” in ICCV, 2017.
  73. F. Fooladgar and S. Kasaei, “Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images,” arXiv preprint arXiv:1912.11691, 2019.
  74. Y. Yue, W. Zhou, J. Lei, and L. Yu, “Two-stage cascaded decoder for semantic segmentation of RGB-D images,” SPL, vol. 28, pp. 1115–1119, 2021.
  75. A. Valada, R. Mohan, and W. Burgard, “Self-supervised model adaptation for multimodal semantic segmentation,” IJCV, vol. 128, no. 5, pp. 1239–1285, 2019.
  76. A. Dai and M. Nießner, “3DMV: Joint 3D-multi-view prediction for 3D semantic scene segmentation,” in ECCV, 2018.
  77. C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture,” in ACCV, 2016.
  78. W. Shi et al., “Multilevel cross-aware RGBD indoor semantic segmentation for bionic binocular robot,” T-MRB, vol. 2, no. 3, pp. 382–390, 2020.
  79. W. Shi et al., “RGB-D semantic segmentation and label-oriented voxelgrid fusion for accurate 3D semantic mapping,” TCSVT, vol. 32, no. 1, pp. 183–197, 2022.
  80. M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, “In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images,” in CVPR, 2019.
  81. D. Seichter, M. Köhler, B. Lewandowski, T. Wengefeld, and H.-M. Gross, “Efficient RGB-D semantic segmentation for indoor scene analysis,” in ICRA, 2021.
  82. T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-SCNN: Gated shape CNNs for semantic segmentation,” in ICCV, 2019.
  83. F. Zhang et al., “ACFNet: Attentional class feature network for semantic segmentation,” in ICCV, 2019.
  84. D. Xu, W. Ouyang, X. Wang, and N. Sebe, “PAD-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,” in CVPR, 2018.
  85. Y. Wang, F. Sun, M. Lu, and A. Yao, “Learning deep multimodal feature representation with asymmetric multi-layer fusion,” in MM, 2020.
  86. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  87. X. Zhang, S. Zhang, Z. Cui, Z. Li, J. Xie, and J. Yang, “Tube-embedded transformer for pixel prediction,” TMM, vol. 25, pp. 2503–2514, 2023.
  88. W. Hu, H. Zhao, L. Jiang, J. Jia, and T.-T. Wong, “Bidirectional projection network for cross dimension scene understanding,” in CVPR, 2021.
  89. E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation,” T-ITS, vol. 19, no. 1, pp. 263–272, 2018.
  90. J. Wang et al., “Deep high-resolution representation learning for visual recognition,” TPAMI, vol. 43, no. 10, pp. 3349–3364, 2021.
  91. S. S. Shivakumar, N. Rodrigues, A. Zhou, I. D. Miller, V. Kumar, and C. J. Taylor, “PST900: RGB-thermal calibration, dataset and segmentation network,” in ICRA, 2020.
  92. J. Xu, K. Lu, and H. Wang, “Attention fusion network for multi-spectral semantic segmentation,” PRL, vol. 146, pp. 179–184, 2021.
  93. Y. Cai, W. Zhou, L. Zhang, L. Yu, and T. Luo, “DHFNet: Dual-decoding hierarchical fusion network for RGB-thermal semantic segmentation,” The Visual Computer, pp. 1–11, 2023.
  94. T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Full-resolution residual networks for semantic segmentation in street scenes,” in CVPR, 2017.
  95. C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Learning a discriminative feature network for semantic segmentation,” in CVPR, 2018.
  96. C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “BiSeNet: Bilateral segmentation network for real-time semantic segmentation,” in ECCV, 2018.
  97. R. P. K. Poudel, S. Liwicki, and R. Cipolla, “Fast-SCNN: Fast semantic segmentation network,” in BMVC, 2019.
  98. T. Wu, S. Tang, R. Zhang, and Y. Zhang, “CGNet: A light-weight context guided network for semantic segmentation,” TIP, vol. 30, pp. 1169–1179, 2021.
  99. J. Zhang, K. Yang, A. Constantinescu, K. Peng, K. Müller, and R. Stiefelhagen, “Trans4Trans: Efficient transformer for transparent object segmentation to help visually impaired people navigate in the real world,” in ICCVW, 2021.
  100. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in ECCV, 2018.
  101. T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in ECCV, 2018.
  102. T. Broedermann, C. Sakaridis, D. Dai, and L. Van Gool, “HRFuser: A multi-resolution sensor fusion architecture for 2D object detection,” in ITSC, 2023.
  103. Y. Wang, X. Chen, L. Cao, W. Huang, F. Sun, and Y. Wang, “Multimodal token fusion for vision transformers,” in CVPR, 2022.
  104. A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in CVPR, 2021.

Summary

  • The paper presents a transformer-based framework that fuses RGB and complementary modalities through cross-modal feature rectification and bidirectional feature fusion.
  • It demonstrates state-of-the-art performance on RGB-Depth, RGB-Thermal, and RGB-Event benchmarks with significant mIoU improvements.
  • The CMX architecture, featuring the CM-FRM and FFM, enables effective multi-modal interaction and reliable segmentation for autonomous vehicle applications.

CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Introduction

The complexity of semantic segmentation for Autonomous Vehicles (AVs) has drawn considerable attention, primarily due to its critical role in providing detailed scene understanding for Advanced Driver-Assistance Systems (ADAS). Segmenting RGB images can be significantly enhanced by integrating complementary features from additional sensing modalities such as depth, thermal, polarization, event, and LiDAR data. The CMX framework is proposed as a unified solution for RGB-X semantic segmentation, overcoming the limitations that arise from the varying characteristics of different sensors by leveraging a transformer-based approach. CMX's architecture includes a Cross-Modal Feature Rectification Module (CM-FRM) and a Feature Fusion Module (FFM) to achieve comprehensive cross-modal interaction and effective feature calibration, ultimately aiming for robust segmentation across diverse modality combinations (Figure 1).

Figure 1: RGB-X semantic segmentation unifies diverse sensing modality combinations: RGB-Depth, -Thermal, -Polarization, -Event, and -LiDAR segmentation.

Methodology

Framework Overview

CMX is structured as a two-stream architecture, where separate branches process RGB and X-modal data, facilitating parallel but interactive feature extraction (Figure 2). Key modules within this architecture are the Cross-Modal Feature Rectification Module (CM-FRM) and the Feature Fusion Module (FFM).

Figure 2: a) Overview of CMX for RGB-X semantic segmentation. b) CM-FRM with information flows of two modalities. c) FFM with stages for information exchange and fusion.
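The per-stage interleaving of extraction, rectification, and fusion can be sketched with toy stand-in operations (2×2 average pooling for the backbone blocks, simple cross-calibration for CM-FRM, and averaging for FFM; none of these are the paper's actual layers):

```python
import numpy as np

def two_stream_forward(rgb, x_mod, num_stages=4):
    """Skeleton of the two-stream pipeline with toy stand-in ops.

    rgb, x_mod: 2D feature maps (H, W), H and W divisible by 2**num_stages.
    Returns one fused map per stage, as consumed by the decoder.
    """
    fused_maps = []
    f_rgb, f_x = rgb, x_mod
    for _ in range(num_stages):
        # Stand-in backbone block: 2x2 average pooling in each branch.
        pool = lambda f: f.reshape(f.shape[0] // 2, 2,
                                   f.shape[1] // 2, 2).mean(axis=(1, 3))
        f_rgb, f_x = pool(f_rgb), pool(f_x)
        # Stand-in CM-FRM: each branch is calibrated by the other
        # (tuple assignment keeps the update simultaneous).
        f_rgb, f_x = f_rgb + 0.5 * f_x, f_x + 0.5 * f_rgb
        # Stand-in FFM: emit one fused map per stage for the decoder.
        fused_maps.append(0.5 * (f_rgb + f_x))
    return fused_maps

maps = two_stream_forward(np.ones((32, 32)), np.ones((32, 32)))
print([m.shape for m in maps])  # maps at 1/2, 1/4, 1/8, 1/16 resolution
```

The point of the skeleton is the data flow, not the operators: both branches stay separate across all four stages, while CM-FRM couples them at every stage and FFM taps a fused representation per resolution.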

Cross-Modal Feature Rectification Module (CM-FRM)

CM-FRM addresses the challenge of noisy measurements intrinsic to different sensor modalities. It enhances RGB-X feature extraction using two sub-modules: channel-wise and spatial-wise feature rectification. The intent is to enable mutually beneficial interactions between modalities:

  • Channel-wise Rectification: Derives per-channel weights by applying global max and average pooling over the spatial dimensions of each modality's features; each modality's weights then calibrate the other's channels, yielding bidirectional feature calibration.
  • Spatial-wise Rectification: Embeds the bi-modal features into spatial weight maps using consecutive convolutional layers; each modality's map then rectifies the other modality's features pixel by pixel, complementing the channel-wise path.

At the output, rectified modal features are fused into the next stage, allowing for further enhancement of the learned features.
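As a rough illustration, the two rectification paths can be sketched in NumPy. Sigmoid gates stand in for the paper's learned embeddings (MLPs and convolutions), and the residual weight `w` is a free parameter; this is a simplification, not the authors' implementation:

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_rectify(x_rgb, x_mod, w=0.5):
    """Channel-wise path: global max+avg pooling over spatial dims gives
    per-channel weights from one modality, which re-scale the other's
    channels. Features are (C, H, W)."""
    def channel_weights(x):
        pooled = x.max(axis=(1, 2)) + x.mean(axis=(1, 2))  # (C,)
        return _sigmoid(pooled)
    w_rgb, w_mod = channel_weights(x_rgb), channel_weights(x_mod)
    # Bidirectional calibration: each modality rectified by the other.
    x_rgb_rect = x_rgb + w * w_mod[:, None, None] * x_rgb
    x_mod_rect = x_mod + w * w_rgb[:, None, None] * x_mod
    return x_rgb_rect, x_mod_rect

def spatial_rectify(x_rgb, x_mod, w=0.5):
    """Spatial-wise path: a per-pixel weight map from one modality (a
    channel mean + sigmoid here, standing in for the paper's conv
    embedding) gates the other modality."""
    def spatial_map(x):
        return _sigmoid(x.mean(axis=0))                    # (H, W)
    m_rgb, m_mod = spatial_map(x_rgb), spatial_map(x_mod)
    x_rgb_rect = x_rgb + w * m_mod[None] * x_rgb
    x_mod_rect = x_mod + w * m_rgb[None] * x_mod
    return x_rgb_rect, x_mod_rect

rgb = np.random.default_rng(0).normal(size=(3, 8, 8))
x = np.random.default_rng(1).normal(size=(3, 8, 8))
rgb, x = channel_rectify(rgb, x)
rgb, x = spatial_rectify(rgb, x)
print(rgb.shape, x.shape)  # (3, 8, 8) (3, 8, 8)
```

Note the symmetry: both paths are residual (the original features are kept), so rectification can only modulate, never erase, a modality's signal.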

Feature Fusion with FFM

Within the Feature Fusion Module (FFM), feature interactions and fusions occur at multiple levels. The module is split into two primary stages:

  • Information Exchange Stage: Uses a dual-path structure in which each modality's tokens attend to the other's through a cross-attention mechanism, exchanging long-range context in both directions before mixing.
  • Fusion Stage: Implements a mixed channel embedding to combine rectified features, resulting in comprehensive multi-modal feature maps (Figure 3).


Figure 4: Comparison of different fusion methods. (a) Input fusion merges inputs with modality-specific operations. (b) Feature fusion applies channel attention in a unidirectional manner. (c) Our interactive fusion employs bidirectional cross-modal feature rectification and sequence-to-sequence cross-attention.
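The two FFM stages can be illustrated with a toy single-head version. The paper's FFM uses an efficient attention variant and learned projections, both omitted here; `w_mix` stands in for the mixed channel embedding:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Tokens of one modality attend over tokens of the other."""
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d))  # (N, N)
    return attn @ context                             # (N, d)

def ffm(tokens_rgb, tokens_x, w_mix):
    """Toy FFM: bidirectional exchange, then mixed channel embedding."""
    # Stage 1: information exchange via cross-attention, both directions.
    rgb_exch = tokens_rgb + cross_attention(tokens_rgb, tokens_x)
    x_exch = tokens_x + cross_attention(tokens_x, tokens_rgb)
    # Stage 2: fusion via concatenation + channel-mixing projection.
    return np.concatenate([rgb_exch, x_exch], axis=-1) @ w_mix

rng = np.random.default_rng(0)
n, c = 16, 8                      # 16 tokens, 8 channels per modality
fused = ffm(rng.normal(size=(n, c)), rng.normal(size=(n, c)),
            rng.normal(size=(2 * c, c)))
print(fused.shape)                # one fused token per position
```

The key contrast with unidirectional channel attention (panel b of the figure) is that both modalities act as queries here, so context flows in both directions before any channels are mixed.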

Experimental Results and Analysis

Results on RGB-Depth Datasets

Experiments conducted on multiple datasets demonstrated CMX's superiority in RGB-Depth semantic segmentation. In the NYU Depth V2 benchmark, CMX achieves an mIoU of up to 56.9%, surpassing existing methods designed specifically for RGB-D data.

Key Numerical Results:

  • NYU Depth V2 Benchmark:
    • CMX with MiT-B5 achieved an mIoU of 56.9% with a pixel accuracy of 80.1%.

Figure 1: Performance comparison on different RGB-X semantic segmentation benchmarks, demonstrating the superior versatility of CMX over other methods.
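For reference, the mIoU and pixel-accuracy figures reported throughout are standard metrics computed from a class confusion matrix; a minimal implementation:

```python
import numpy as np

def miou_and_pixel_acc(pred, gt, num_classes):
    """Mean IoU and pixel accuracy from flat integer label arrays."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1                          # rows: ground truth, cols: prediction
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    present = union > 0                        # average only over classes that occur
    return iou[present].mean(), inter.sum() / cm.sum()

pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
miou, acc = miou_and_pixel_acc(pred, gt, num_classes=2)
print(round(miou, 4), acc)  # 0.5833 0.75
```

In the tiny example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so mIoU is 7/12 while pixel accuracy is 3/4; mIoU weights every class equally, which is why it is the headline metric on class-imbalanced driving scenes.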

Results on Additional RGB-X Modalities

  • RGB-Thermal (MFNet Dataset): CMX reaches a mean Intersection over Union (mIoU) of 59.7%, outperforming specialized models such as ABMDRNet and GMNet. Notably, pedestrian segmentation accuracy improves by over 11% thanks to more effective use of thermal-data characteristics.


Table 1: Per-class results on MFNet dataset for RGB-Thermal segmentation.

Results on RGB-Event Dataset

To investigate dense-sparse data fusion, the work establishes an RGB-Event semantic segmentation benchmark based on the EventScape dataset. On it, CMX sets a new state of the art with substantial improvements over previous methods, confirming its effectiveness on this RGB-X task as well.

Numerical Results:

  • On the EventScape dataset, CMX with MiT-B4 achieved the highest mIoU of 64.28% with a pixel accuracy of 92.60%.

    Figure 5: Per-class IoU results of the RGB-only baseline and our RGB-Event model on our RGB-Event benchmark.

Conclusion

The CMX framework advances RGB-X semantic segmentation by offering a unified solution for robust multi-modal scene understanding across a wide range of sensor combinations. It demonstrates significant improvements over modality-specific methods on multiple benchmarks, notably in scenarios where the supplementary modality contributes both complementary cues and sensor uncertainty. Future research should focus on refining cross-modal fusion strategies, potentially extending to broader sensor combinations and domain variants.

Figure 6: Visualization of failure cases. From top to bottom: RGB-Depth, RGB-Thermal, RGB-Polarization (AoLP), and RGB-Event semantic segmentation.
