
AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation (2309.14065v7)

Published 25 Sep 2023 in cs.CV

Abstract: Understanding indoor scenes is crucial for urban studies. Considering the dynamic nature of indoor environments, effective semantic segmentation requires both real-time operation and high accuracy. To address this, we propose AsymFormer, a novel network that improves real-time semantic segmentation accuracy using RGB-D multi-modal information without substantially increasing network complexity. AsymFormer uses an asymmetrical backbone for multimodal feature extraction, reducing redundant parameters by optimizing computational resource distribution. To fuse asymmetric multimodal features, a Local Attention-Guided Feature Selection (LAFS) module is used to selectively fuse features from different modalities by leveraging their dependencies. Subsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module is introduced to further extract cross-modal representations. The AsymFormer demonstrates competitive results with 54.1% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer achieves an inference speed of 65 FPS (79 FPS after implementing mixed precision quantization) on RTX3090, demonstrating that AsymFormer can strike a balance between high accuracy and efficiency.

References (51)
  1. Multimae: Multi-modal multi-task masked autoencoders. In European Conference on Computer Vision, pages 348–367. Springer, 2022.
  2. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021.
  3. Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7088–7097, 2021.
  4. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  5. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 561–577. Springer, 2020.
  6. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  7. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1911–1920, 2019.
  8. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  9. Pscnet: Efficient rgb-d semantic segmentation parallel network based on spatial and channel attention. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, (1), 2022.
  10. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.
  11. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13, pages 213–228. Springer, 2017.
  12. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  13. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13713–13722, 2021.
  14. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  15. Temporally distributed networks for fast video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8818–8827, 2020.
  16. Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In 2019 IEEE International Conference on Image Processing (ICIP), pages 1440–1444. IEEE, 2019.
  17. Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation. arXiv preprint arXiv:1806.01054, 2018.
  18. Multi-scale fusion for rgb-d indoor semantic segmentation. Scientific Reports, 12(1):20305, 2022.
  19. Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios. arXiv preprint arXiv:2207.05501, 2022.
  20. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. arXiv preprint arXiv:2203.04838, 2022a.
  21. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022b.
  22. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  23. Light-weight refinenet for real-time semantic segmentation. arXiv preprint arXiv:1810.03272, 2018.
  24. Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In 2019 International Conference on Robotics and Automation (ICRA), pages 7101–7107. IEEE, 2019.
  25. Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 4980–4989, 2017.
  26. Efficient rgb-d semantic segmentation for indoor scene analysis. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13525–13531. IEEE, 2021.
  27. Efficient multi-task rgb-d scene analysis for indoor environments. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2022.
  28. Indoor segmentation and support inference from rgbd images. ECCV (5), 7576:746–760, 2012.
  29. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  30. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015.
  31. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  32. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  33. Multimodal token fusion for vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12186–12195, 2022.
  34. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  35. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  36. Semantic scene segmentation for indoor robot navigation via deep learning. In Proceedings of the 3rd International Conference on Robotics, Control and Automation, pages 112–118, 2018.
  37. Dformer: Rethinking rgbd representation learning for semantic segmentation. arXiv preprint arXiv:2309.09668, 2023.
  38. Indoor ar navigation and emergency evacuation system based on machine learning and iot technologies. IEEE Internet of Things Journal, 9(21):20853–20868, 2022.
  39. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 325–341, 2018.
  40. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision, 129:3051–3068, 2021.
  41. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10819–10829, 2022.
  42. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems, 2023.
  43. Physically-based rendering for indoor scene understanding using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5287–5295, 2017.
  44. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  45. Beyond point clouds: Scene understanding by reasoning geometry and physics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3127–3134, 2013.
  46. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
  47. Canet: Co-attention network for rgb-d semantic segmentation. Pattern Recognition, 124:108468, 2022a.
  48. Tsnet: Three-stream self-attention network for rgb-d indoor semantic segmentation. IEEE Intelligent Systems, 36(4):73–78, 2020.
  49. Pgdenet: Progressive guided fusion and depth enhancement network for rgb-d indoor scene parsing. IEEE Transactions on Multimedia, 2022b.
  50. Frnet: Feature reconstruction network for rgb-d indoor scene parsing. IEEE Journal of Selected Topics in Signal Processing, 16(4):677–687, 2022c.
  51. Bcinet: Bilateral cross-modal interaction network for indoor scene understanding in rgb-d images. Information Fusion, 78:84–94, 2023.

Summary

  • The paper introduces AsymFormer, an efficient network for real-time RGB-D segmentation using an asymmetrical backbone and attention-guided modules.
  • The methodology leverages a Local Attention-Guided Feature Selection module and a Cross-Modal Attention module for dynamic, multi-modal feature fusion.
  • Experimental results on NYUv2 and SUNRGBD demonstrate competitive accuracy and inference speed, enabling effective mobile platform deployment.

AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

The paper "AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation" (2309.14065) introduces AsymFormer, a novel and efficient network designed for real-time semantic segmentation using RGB-D multi-modal information. The network employs an asymmetrical backbone for feature extraction, a Local Attention-Guided Feature Selection (LAFS) module for selective feature fusion, and a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module for cross-modal representation learning. The authors demonstrate the efficacy of AsymFormer on the NYUv2 and SUNRGBD datasets, achieving a balance between accuracy and inference speed suitable for mobile platform deployment.

Motivation and Background

Real-time semantic segmentation of indoor scenes is critical for various applications, including emergency evacuation [yoo2022indoor], robotic navigation [yeboah2018semantic], and virtual reality [zhang2017physically]. While existing methods often struggle to balance accuracy and efficiency, this work addresses the challenge by leveraging RGB-D data and optimizing computational resource allocation. The authors highlight the redundancy introduced by symmetric backbones in existing RGB-D segmentation networks [du2022pscnet] and propose an asymmetric design to mitigate this issue.

AsymFormer Architecture and Key Components

The AsymFormer architecture consists of three main components: an asymmetric backbone, a Local Attention-Guided Feature Selection (LAFS) module, and a Cross-Modal Attention (CMA) module (Figure 1).

Figure 1: An overview of the AsymFormer architecture, illustrating the flow of information from the asymmetric backbone through the LAFS and CMA modules.

Asymmetric Backbone

The asymmetric backbone is a core innovation, employing a larger, more parameter-rich CNN (ConvNeXt [liu2022convnet]) for RGB feature extraction and a lightweight Transformer (Mix-Transformer [xie2021segformer]) for Depth feature extraction. This design choice reflects the observation that RGB information typically plays a more prominent role in semantic segmentation tasks [du2022pscnet]. By allocating more computational resources to the RGB branch, the network reduces redundant parameters and improves efficiency.
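
As a rough illustration of the asymmetric design, the minimal PyTorch sketch below gives the RGB branch wide ConvNeXt-style convolutional blocks and the Depth branch a narrow, lightweight attention block. The block definitions, channel widths, and depths are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an asymmetric RGB-D encoder (not the authors' code).
# The RGB branch gets wider channels and more blocks than the Depth branch,
# mirroring the paper's choice to spend more parameters on RGB features.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Simplified ConvNeXt-style block: depthwise conv + pointwise MLP."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        return x + self.pw(self.norm(self.dw(x)))


class LightAttnBlock(nn.Module):
    """Lightweight transformer-style block for the Depth branch."""
    def __init__(self, dim, heads=2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)       # (B, HW, C)
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.transpose(1, 2).reshape(b, c, h, w)


class AsymmetricEncoder(nn.Module):
    """Two-branch encoder with unequal capacity for RGB and Depth."""
    def __init__(self, rgb_dim=96, depth_dim=32):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, rgb_dim, 4, stride=4)
        self.depth_stem = nn.Conv2d(1, depth_dim, 4, stride=4)
        self.rgb_branch = nn.Sequential(*[ConvBlock(rgb_dim) for _ in range(3)])
        self.depth_branch = nn.Sequential(LightAttnBlock(depth_dim))

    def forward(self, rgb, depth):
        return (self.rgb_branch(self.rgb_stem(rgb)),
                self.depth_branch(self.depth_stem(depth)))


# Example on a small 120x160 RGB-D crop; features are at 1/4 resolution.
enc = AsymmetricEncoder()
f_rgb, f_d = enc(torch.randn(1, 3, 120, 160), torch.randn(1, 1, 120, 160))
print(f_rgb.shape, f_d.shape)  # (1, 96, 30, 40) and (1, 32, 30, 40)
```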

Local Attention-Guided Feature Selection (LAFS)

The LAFS module is designed to selectively fuse features from different modalities by leveraging their dependencies. Unlike existing attention mechanisms that use fixed strategies for feature compression [woo2018cbam], LAFS employs a learnable method for spatial information compression. This is achieved through a feedforward neural network that learns dynamic spatial information compression rules, enabling the network to adaptively select and fuse features based on their relevance. The LAFS module computes attention weights using Adaptive Average Pooling to extract a global information vector Avg. Spatial attention weights W_S are then calculated using the following formula:

W_{S}=\text{Sigmoid}\left(\frac{\text{Dot}\left(\text{Input}.\text{Reshape}(C, H \times W)^{T},\, R_{Avg}\right)}{C^{2}}\right)

This allows the network to focus on the most informative regions of the feature maps. Figure 2 shows the details of the LAFS module.

Figure 2: An illustration of the LAFS module, detailing the learnable method for spatial information compression.
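
The quoted spatial-weight formula can be sketched in a few lines of PyTorch, shown below. This toy version keeps only the Adaptive Average Pooling path and the sigmoid-normalized dot product; the learnable compression network and the channel-selection part of the full LAFS module are omitted, and all tensor shapes are assumptions.

```python
# Toy re-implementation of the spatial weighting formula above (not the
# authors' code): each pixel's feature vector is compared with the global
# average vector, scaled by 1/C^2, and squashed into a per-pixel weight.
import torch
import torch.nn as nn


class SpatialSelect(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global information vector Avg

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        r_avg = self.pool(x).reshape(b, c, 1)                # (B, C, 1)
        flat = x.reshape(b, c, h * w).transpose(1, 2)        # (B, HW, C) = Input.Reshape(C, HxW)^T
        w_s = torch.sigmoid(flat @ r_avg / c ** 2)           # (B, HW, 1) spatial weights W_S
        return x * w_s.transpose(1, 2).reshape(b, 1, h, w)   # re-weight every pixel


feats = torch.randn(2, 64, 60, 80)
print(SpatialSelect()(feats).shape)  # torch.Size([2, 64, 60, 80])
```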

Cross-Modal Attention (CMA)

The CMA module is introduced to further extract cross-modal representations by embedding cross-modal information into pixel-wise fused features. The key to CMA is defining cross-modal self-similarity using a linear sum and embedding its result into the fused features. For a pixel (i_0, j_0), its cross-modal self-similarity with other pixels (i, j) is defined as:

W(i,j)=\sum_{n=1}^{N}\left(Kr_{n,i,j}\cdot Qr_{n,i_0,j_0}\right)+\sum_{n=1}^{N}\left(Kd_{n,i,j}\cdot Qd_{n,i_0,j_0}\right)

The CMA module has three input features: RGB features, Depth features, and the fused features selected by LAFS. These features are embedded into a vector space, and the embedded features are split into two independent vectors. A shuffle mechanism is introduced to ensure that each vector contains information from both modalities, enabling the network to learn features from multiple subspaces. Figure 3 shows the feature embedding process.

Figure 3: Feature embedding in the CMA module, showcasing the splitting of the output Value into two independent vectors.

Figure 4 illustrates the splitting and mixing of multimodal information.

Figure 4: Splitting and mixing of multimodal information within the CMA module, ensuring that each subspace contains information from both RGB and Depth modalities.
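
As a sketch of the cross-modal self-similarity above, the hypothetical module below forms the affinity matrix as the sum of an RGB query-key product and a Depth query-key product and uses it to re-weight the LAFS-fused features. The multi-head split and shuffle mechanism described in the paper are omitted for brevity, and the projection sizes are assumptions.

```python
# Hedged sketch of cross-modal attention (not the authors' exact layout):
# the pixel-to-pixel affinity W(i, j) adds an RGB term and a Depth term,
# and the resulting attention re-weights values from the fused features.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    def __init__(self, dim, key_dim=32):
        super().__init__()
        self.q_rgb = nn.Conv2d(dim, key_dim, 1)
        self.k_rgb = nn.Conv2d(dim, key_dim, 1)
        self.q_d = nn.Conv2d(dim, key_dim, 1)
        self.k_d = nn.Conv2d(dim, key_dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)        # values from the LAFS-fused features

    def forward(self, f_rgb, f_d, f_fused):    # all (B, C, H, W)
        b, c, h, w = f_fused.shape
        qr = self.q_rgb(f_rgb).flatten(2)      # (B, k, HW)
        kr = self.k_rgb(f_rgb).flatten(2)
        qd = self.q_d(f_d).flatten(2)
        kd = self.k_d(f_d).flatten(2)
        # W(i, j) = sum_n Kr * Qr + sum_n Kd * Qd  (RGB and Depth affinities added)
        sim = qr.transpose(1, 2) @ kr + qd.transpose(1, 2) @ kd   # (B, HW, HW)
        attn = sim.softmax(dim=-1)
        out = self.v(f_fused).flatten(2) @ attn.transpose(1, 2)   # (B, C, HW)
        return out.reshape(b, c, h, w)


cma = CrossModalAttention(dim=64)
x = torch.randn(1, 64, 30, 40)
print(cma(x, x, x).shape)  # torch.Size([1, 64, 30, 40])
```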

Experimental Results

The authors evaluated AsymFormer on the NYUv2 and SUNRGBD datasets, demonstrating its competitive performance in terms of accuracy and inference speed. On NYUv2, AsymFormer achieved 54.1% mIoU with an inference speed of 65 FPS on an RTX 3090 GPU; with mixed precision quantization, the inference speed further increased to 79 FPS. On SUNRGBD, AsymFormer achieved 49.1% mIoU. Ablation studies validate the effectiveness of the LAFS and CMA modules. The spatial attention map of the LAFS module is visualized and compared against CBAM in Figure 5.

Figure 5: A comparison of the spatial attention weights generated by CBAM and LAFS, highlighting the improved coverage and consistency of LAFS.

Figure 6 visualizes the semantic segmentation results of AsymFormer on the NYUv2 dataset.

Figure 6: Visualization of semantic segmentation results on the NYUv2 dataset, showcasing the accuracy of AsymFormer in identifying various objects and regions within indoor scenes.
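
Since the reported speedup from mixed precision is central to the efficiency claim, a hedged sketch of how such an FPS measurement could be set up in PyTorch is shown below. It is a generic template assuming a CUDA device and a two-input (RGB, Depth) model, not the authors' benchmarking code or quantization pipeline.

```python
# Hypothetical timing template for half-precision inference in PyTorch; the
# paper's 65/79 FPS figures come from the authors' own RTX 3090 setup, which
# this sketch does not reproduce.
import time
import torch


@torch.no_grad()
def benchmark_fps(model, inputs, iters=100, warmup=10):
    """Measure frames per second for model(*inputs) under float16 autocast."""
    model = model.eval().cuda()
    inputs = [x.cuda() for x in inputs]
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        for _ in range(warmup):                 # warm-up to stabilize kernels
            model(*inputs)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(*inputs)
        torch.cuda.synchronize()
    return iters / (time.time() - start)


# Usage (any RGB-D segmentation model taking an RGB and a Depth tensor):
# fps = benchmark_fps(model, [torch.randn(1, 3, 480, 640),
#                             torch.randn(1, 1, 480, 640)])
```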

Conclusion

The AsymFormer architecture presents a compelling approach to real-time RGB-D semantic segmentation by balancing accuracy and efficiency. The asymmetric backbone, LAFS module, and CMA module collectively contribute to reducing redundant parameters and improving feature representation. Experimental results on NYUv2 and SUNRGBD demonstrate the potential of AsymFormer for deployment on mobile platforms and real-time applications. Future work may focus on self-supervised pre-training and further optimization of the network architecture to achieve even greater improvements in performance.