CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation (2405.10530v1)
Abstract: Due to large image sizes and object variations, current CNN-based and Transformer-based approaches for remote sensing image semantic segmentation either struggle to capture long-range dependencies or incur high computational complexity. In this paper, we propose CM-UNet, comprising a CNN-based encoder for extracting local image features and a Mamba-based decoder for aggregating and integrating global information, facilitating efficient semantic segmentation of remote sensing images. Specifically, a CSMamba block is introduced to build the core segmentation decoder, which employs channel and spatial attention as the gate activation condition of the vanilla Mamba to enhance feature interaction and global-local information fusion. Moreover, to further refine the output features from the CNN encoder, a Multi-Scale Attention Aggregation (MSAA) module is employed to merge features at different scales. By integrating the CSMamba block and the MSAA module, CM-UNet effectively captures the long-range dependencies and multi-scale global contextual information of large-scale remote sensing images. Experimental results on three benchmarks indicate that the proposed CM-UNet outperforms existing methods across various performance metrics. The code is available at https://github.com/XiaoBuL/CM-UNet.
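The abstract's core gating idea, using channel and spatial attention as the gate activation of the Mamba branch, can be sketched in pure Python. This is a hypothetical simplification for intuition only: the actual CSMamba block operates on learned projections and selective state-space scans, whereas here the "features" are plain nested lists and the attention maps are parameter-free pooling statistics.

```python
import math

def sigmoid(x):
    """Logistic squashing used to turn pooled statistics into [0, 1] gates."""
    return 1.0 / (1.0 + math.exp(-x))

def channel_gate(feat):
    """Channel attention sketch: global-average-pool each channel,
    squash with sigmoid, and rescale that channel by the resulting gate.
    feat is a C-list of HxW grids (lists of lists of floats)."""
    gates = []
    for ch in feat:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gates.append(sigmoid(mean))
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feat, gates)]

def spatial_gate(feat):
    """Spatial attention sketch: average across channels at each pixel,
    squash with sigmoid, and gate every channel at that location."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[[feat[c][i][j] * sigmoid(sum(feat[k][i][j] for k in range(C)) / C)
              for j in range(W)] for i in range(H)] for c in range(C)]

def csmamba_gate(feat):
    """Stand-in for the CSMamba gate activation: channel attention
    followed by spatial attention, modulating the input features."""
    return spatial_gate(channel_gate(feat))
```

A usage example on a tiny 2-channel, 2x2 feature map: an all-zero channel stays zero (its gate cannot create signal), while active channels are attenuated by their pooled-attention gates, which is the modulation role the gate plays in the decoder.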
Authors: Mushui Liu, Jun Dan, Ziqian Lu, Yunlong Yu, Yingming Li, Xi Li