CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers (2203.04838v5)
Abstract: Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from a supplementary modality (X-modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem due to variations in sensor characteristics among different modalities. Unlike previous modality-specific methods, in this work we propose a unified fusion framework, CMX, for RGB-X semantic segmentation. To generalize well across different modalities, which often provide complementary information as well as uncertainties, a unified cross-modal interaction is crucial for modality fusion. Specifically, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate bi-modal features by leveraging the features from one modality to rectify the features of the other modality. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to perform a sufficient exchange of long-range contexts before mixing. To verify CMX, for the first time, we unify five modalities complementary to RGB, i.e., depth, thermal, polarization, event, and LiDAR. Extensive experiments show that CMX generalizes well to diverse multi-modal fusion, achieving state-of-the-art performance on five RGB-Depth benchmarks, as well as RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. Furthermore, to investigate generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets a new state of the art. The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation.
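The abstract describes a two-stage bi-modal fusion: a rectification step in which each modality's features are calibrated using the other modality, followed by a fusion step that exchanges long-range context between the two streams before mixing. The snippet below is a minimal PyTorch sketch of these two ideas only; the module names, the channel-gating form of the rectification, and the use of multi-head cross-attention for context exchange are illustrative assumptions, not the paper's CM-FRM/FFM reference implementation (see the linked repository for that).

```python
# Minimal sketch of cross-modal rectification and context-exchanging fusion.
# All names, channel sizes, and the exact gating/attention forms are
# illustrative assumptions, not the paper's reference implementation.
import torch
import torch.nn as nn


class FeatureRectification(nn.Module):
    """Channel-wise cross rectification: each modality gates the other."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # One small MLP per stream maps the globally pooled features of the
        # *other* stream to per-channel gates in (0, 1).
        self.gate_rgb = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid())
        self.gate_x = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, x: torch.Tensor):
        # rgb, x: (B, C, H, W) feature maps from the two encoder streams.
        b, c, _, _ = rgb.shape
        rgb_ctx = rgb.mean(dim=(2, 3))  # (B, C) global context of RGB
        x_ctx = x.mean(dim=(2, 3))      # (B, C) global context of X
        # Rectify each stream with gates computed from the other stream.
        rgb_rect = rgb + rgb * self.gate_rgb(x_ctx).view(b, c, 1, 1)
        x_rect = x + x * self.gate_x(rgb_ctx).view(b, c, 1, 1)
        return rgb_rect, x_rect


class FeatureFusion(nn.Module):
    """Cross-attention exchange of long-range context, then channel mixing."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.cross_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.cross_x = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        rgb_seq = rgb.flatten(2).transpose(1, 2)  # (B, HW, C)
        x_seq = x.flatten(2).transpose(1, 2)      # (B, HW, C)
        # Each stream queries the other for long-range context.
        rgb_out, _ = self.cross_rgb(rgb_seq, x_seq, x_seq)
        x_out, _ = self.cross_x(x_seq, rgb_seq, rgb_seq)
        rgb_out = (rgb_seq + rgb_out).transpose(1, 2).reshape(b, c, h, w)
        x_out = (x_seq + x_out).transpose(1, 2).reshape(b, c, h, w)
        # Mix the two exchanged streams into one fused map for the decoder.
        return self.mix(torch.cat([rgb_out, x_out], dim=1))


if __name__ == "__main__":
    rect, fuse = FeatureRectification(64), FeatureFusion(64)
    rgb = torch.randn(2, 64, 30, 40)
    depth = torch.randn(2, 64, 30, 40)  # stand-in for any X modality
    fused = fuse(*rect(rgb, depth))
    print(fused.shape)                  # torch.Size([2, 64, 30, 40])
```

In a full model such a block would be applied at each stage of a two-stream encoder, with the fused map passed to the segmentation decoder.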
Authors: Jiaming Zhang, Huayao Liu, Kailun Yang, Xinxin Hu, Ruiping Liu, Rainer Stiefelhagen