
CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation (2405.10530v1)

Published 17 May 2024 in cs.CV

Abstract: Due to the large-scale image size and object variations, current CNN-based and Transformer-based approaches for remote sensing image semantic segmentation are either suboptimal at capturing long-range dependencies or limited by high computational complexity. In this paper, we propose CM-UNet, comprising a CNN-based encoder for extracting local image features and a Mamba-based decoder for aggregating and integrating global information, facilitating efficient semantic segmentation of remote sensing images. Specifically, a CSMamba block is introduced to build the core segmentation decoder, which employs channel and spatial attention as the gate activation condition of the vanilla Mamba to enhance feature interaction and global-local information fusion. Moreover, to further refine the output features from the CNN encoder, a Multi-Scale Attention Aggregation (MSAA) module is employed to merge features at different scales. By integrating the CSMamba block and the MSAA module, CM-UNet effectively captures the long-range dependencies and multi-scale global contextual information of large-scale remote sensing images. Experimental results on three benchmarks indicate that the proposed CM-UNet outperforms existing methods across various performance metrics. The code is available at https://github.com/XiaoBuL/CM-UNet.

CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation

The paper presents CM-UNet, a hybrid model developed for semantic segmentation of remote sensing images. CM-UNet combines a convolutional neural network (CNN) encoder with a selective state space model architecture known as Mamba, effectively balancing the typical trade-off between capturing local and global dependencies in image data.

Methodological Contributions

CM-UNet's architecture consists of a CNN-based encoder and a Mamba-based decoder. The encoder extracts detailed local features using an established CNN framework, specifically a ResNet backbone. The decoder leverages the Mamba architecture to aggregate and integrate global contextual information, thereby enabling effective modeling of long-range dependencies and multi-scale context in remote sensing images.
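To make the data flow concrete, here is a minimal PyTorch sketch of such a hybrid UNet, assuming a torchvision ResNet-18 backbone; the lateral convolutions, fusion rule, and class names are illustrative stand-ins chosen for this sketch, not the authors' exact implementation (see their repository for that):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class HybridSegNet(nn.Module):
    """Hypothetical CNN-encoder / global-decoder UNet in the spirit of CM-UNet."""

    def __init__(self, num_classes: int, make_decoder_block):
        super().__init__()
        b = resnet18(weights=None)
        self.stem = nn.Sequential(b.conv1, b.bn1, b.relu, b.maxpool)
        self.enc = nn.ModuleList([b.layer1, b.layer2, b.layer3, b.layer4])
        chans = [64, 128, 256, 512]  # ResNet-18 stage widths
        # Lateral 1x1 convs unify channel widths before fusion
        # (a stand-in for the MSAA refinement of encoder features).
        self.lateral = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in chans)
        # One decoder block per pyramid level; CSMamba would slot in here.
        self.dec = nn.ModuleList(make_decoder_block(64) for _ in chans)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        x = self.stem(x)
        for stage in self.enc:            # collect a multi-scale feature pyramid
            x = stage(x)
            feats.append(x)
        y = None
        for f, lat, blk in zip(reversed(feats), reversed(self.lateral),
                               reversed(self.dec)):
            f = lat(f)
            if y is not None:             # upsample-and-add skip fusion
                f = f + F.interpolate(y, size=f.shape[-2:],
                                      mode="bilinear", align_corners=False)
            y = blk(f)
        logits = self.head(y)
        return F.interpolate(logits, size=(h, w),
                             mode="bilinear", align_corners=False)
```

The caller supplies the decoder block factory, e.g. `HybridSegNet(6, lambda c: nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU()))`; in CM-UNet the CSMamba block described next would take that slot.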

The core innovation is the Channel and Spatial Mamba (CSMamba) block, which augments the vanilla Mamba block with attention mechanisms: channel and spatial attention serve as the gate activation condition, improving feature interaction and robust local-global information fusion. This design addresses the limitations of prior architectures, which either struggle to capture global context adequately or incur excessive computational cost, making the block well suited to the complex and varied content typical of remote sensing datasets.
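A rough sketch of how such an attention-gated state-space block could look, assuming a squeeze-excitation-style channel gate and a pooled-statistics spatial gate; here the gates modulate the sequence mixer's output to keep the sketch self-contained, whereas the paper wires them into Mamba's internal gating branch:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate over channels."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // reduction), nn.ReLU(),
                                 nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):                     # x: (B, L, C)
        return self.mlp(x.mean(dim=1, keepdim=True))   # (B, 1, C) gate

class SpatialAttention(nn.Module):
    """Per-token gate computed from channel-pooled statistics."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())

    def forward(self, x):                     # x: (B, L, C)
        stats = torch.stack([x.mean(-1), x.amax(-1)], dim=-1)  # (B, L, 2)
        return self.proj(stats)               # (B, L, 1) gate

class CSMambaBlock(nn.Module):
    """Sketch of an attention-gated state-space block (names are illustrative)."""
    def __init__(self, dim: int, mixer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = mixer                    # e.g. mamba_ssm.Mamba(d_model=dim)
        self.ca = ChannelAttention(dim)
        self.sa = SpatialAttention()

    def forward(self, x):                     # x: (B, L, C) flattened feature map
        h = self.mixer(self.norm(x))
        # Channel and spatial gates modulate the mixer output before the
        # residual; the paper instead uses them as Mamba's gate activation.
        return x + h * self.ca(h) * self.sa(h)
```

Any (B, L, C) to (B, L, C) sequence mixer fits the `mixer` slot: `nn.Linear(dim, dim)` suffices to run the sketch, while the real model would use a selective-scan module such as `mamba_ssm.Mamba`.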

Additionally, a Multi-Scale Attention Aggregation (MSAA) module is introduced to refine features extracted by the CNN encoder. This module effectively merges multi-scale features via dual pathways, leveraging both spatial and channel attention mechanisms to enrich the information provided to the decoder.
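A hedged sketch of what such a dual-pathway aggregation module might look like; the projection widths, the 7x7 spatial kernel, and the multiplicative fusion rule are assumptions chosen for illustration rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSAA(nn.Module):
    """Illustrative multi-scale aggregation with channel and spatial pathways."""

    def __init__(self, in_chans, out_dim: int):
        super().__init__()
        # Project every encoder level to a common width.
        self.proj = nn.ModuleList(nn.Conv2d(c, out_dim, 1) for c in in_chans)
        # Channel pathway: global-average descriptor -> per-channel weights.
        self.channel = nn.Sequential(nn.Conv2d(out_dim, out_dim, 1), nn.Sigmoid())
        # Spatial pathway: cheap conv over pooled maps -> per-pixel weights.
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        self.fuse = nn.Conv2d(out_dim, out_dim, 3, padding=1)

    def forward(self, feats):            # feats: list of (B, C_i, H_i, W_i)
        size = feats[0].shape[-2:]       # resample to the finest resolution
        x = sum(F.interpolate(p(f), size=size, mode="bilinear",
                              align_corners=False)
                for p, f in zip(self.proj, feats))
        cw = self.channel(F.adaptive_avg_pool2d(x, 1))          # (B, D, 1, 1)
        sw = self.spatial(torch.cat([x.mean(1, keepdim=True),
                                     x.amax(1, keepdim=True)], dim=1))
        return self.fuse(x * cw * sw)    # dual-gated multi-scale fusion
```

With the ResNet-18 widths from the earlier sketch, `MSAA(in_chans=(64, 128, 256, 512), out_dim=64)` would accept the four encoder feature maps and emit a single refined map for the decoder.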

Experimental Evaluation

CM-UNet was empirically validated on three benchmark remote sensing datasets: ISPRS Potsdam, ISPRS Vaihingen, and LoveDA. The model outperformed several competitive baselines across multiple key metrics:

  • On the ISPRS Potsdam dataset, CM-UNet achieved an mF1 of 93.05%, an mIoU of 87.21%, and an OA of 91.86%.
  • On the ISPRS Vaihingen dataset, it recorded an mF1 of 92.01%, an mIoU of 85.48%, and an OA of 93.81%.
  • On the LoveDA dataset, it attained an mIoU of 52.17%, affirming its robustness across diverse land-cover categories.

This strong comparative performance underscores the model's ability to capture the spatial contextual features needed to accurately segment the varied land features found in large-scale remote sensing images.

Computational Efficiency

The integration of the Mamba architecture yields computational efficiency, owing to its linear time complexity when modeling long-range dependencies. Experimentally, CM-UNet exhibits favorable trade-offs in FLOPs, parameter count, and memory usage compared with existing segmentation models. This efficiency is crucial when processing high-resolution remote sensing images and makes CM-UNet practical for real-world applications.
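For intuition, a back-of-the-envelope comparison using standard complexity results (not measurements from the paper): a tile of H x W pixels flattens to N = HW tokens, so the per-layer cost of mixing those tokens differs sharply between self-attention and a state-space scan.

```latex
% Cost per layer of mixing a feature map flattened to N = HW tokens of
% width d. Self-attention forms all pairwise token interactions:
C_{\mathrm{attn}} = \mathcal{O}(N^{2} d).
% A selective state-space scan with a small, constant state size n is a
% single linear recurrence over the sequence:
C_{\mathrm{ssm}} = \mathcal{O}(N \, d \, n) = \mathcal{O}(N).
```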

Practical and Theoretical Implications

CM-UNet's ability to balance the demands of local feature extraction and global context modeling offers significant potential across several domains, including urban planning, environmental monitoring, and autonomous navigation. Its design may influence future developments in remote sensing and beyond, wherever managing multi-scale features and long-range dependencies is critical. Moreover, the integration of state space models into visual tasks could inspire new architectural innovations across wider computer vision applications.

In summary, CM-UNet demonstrates a sophisticated approach to remote sensing image segmentation, achieving high performance while maintaining computational efficiency. This progress marks a promising direction for future research in geospatial data analysis and beyond.

Authors
  1. Mushui Liu
  2. Jun Dan
  3. Ziqian Lu
  4. Yunlong Yu
  5. Yingming Li
  6. Xi Li