
RSMamba: Remote Sensing Image Classification with State Space Model (2403.19654v1)

Published 28 Mar 2024 in cs.CV

Abstract: Remote sensing image classification forms the foundation of various understanding tasks, serving a crucial function in remote sensing image interpretation. The recent advancements of Convolutional Neural Networks (CNNs) and Transformers have markedly enhanced classification accuracy. Nonetheless, remote sensing scene classification remains a significant challenge, especially given the complexity and diversity of remote sensing scenarios and the variability of spatiotemporal resolutions. The capacity for whole-image understanding can provide more precise semantic cues for scene discrimination. In this paper, we introduce RSMamba, a novel architecture for remote sensing image classification. RSMamba is based on the State Space Model (SSM) and incorporates an efficient, hardware-aware design known as the Mamba. It integrates the advantages of both a global receptive field and linear modeling complexity. To overcome the limitation of the vanilla Mamba, which can only model causal sequences and is not adaptable to two-dimensional image data, we propose a dynamic multi-path activation mechanism to augment Mamba's capacity to model non-causal data. Notably, RSMamba maintains the inherent modeling mechanism of the vanilla Mamba, yet exhibits superior performance across multiple remote sensing image classification datasets. This indicates that RSMamba holds significant potential to function as the backbone of future visual foundation models. The code will be available at \url{https://github.com/KyanChen/RSMamba}.

Authors (6)
  1. Keyan Chen (34 papers)
  2. Bowen Chen (50 papers)
  3. Chenyang Liu (26 papers)
  4. Wenyuan Li (47 papers)
  5. Zhengxia Zou (52 papers)
  6. Zhenwei Shi (77 papers)
Citations (62)

Summary

RSMamba: Remote Sensing Image Classification with State Space Model

The paper presents RSMamba, a novel architecture for remote sensing image classification built on the State Space Model (SSM). Remote sensing image classification underpins earth observation applications such as land mapping and urban planning, yet it is complicated by variations in spatio-temporal resolution and the diversity of scenes. While CNNs and Transformers have advanced the field, their respective limitations motivate the exploration of alternative architectures. RSMamba aims to combine the linear computational complexity typical of CNNs with the global receptive field characteristic of Transformers.

Key Contributions

  1. Introduction of RSMamba: RSMamba is built on the State Space Model and adopts Mamba's efficient, hardware-aware design. Its linear modeling complexity keeps it computationally efficient while remaining capable of handling large-scale remote sensing image classification.
  2. Dynamic Multi-Path Activation: To overcome the limitation of the original Mamba, which is designed for unidirectional, causal sequence modeling, RSMamba integrates a dynamic multi-path activation mechanism. This lets RSMamba model non-causal, position-sensitive data and thus process two-dimensional image data effectively (a minimal sketch of the mechanism follows this list).
  3. Performance and Efficiency: In comprehensive evaluations on the UC Merced, AID, and RESISC45 datasets, RSMamba outperformed contemporary CNN- and Transformer-based methods. The approach balances accuracy with resource efficiency, making it a promising candidate backbone for future visual foundation models.
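
To make the multi-path idea in contribution 2 concrete, the block below processes a token sequence along forward, reverse, and shuffled paths and merges the results with token-wise learned gates. This is a minimal sketch under stated assumptions, not the released implementation: a GRU stands in for the Mamba SSM mixer so the code runs without the mamba_ssm package, and the class name MultiPathBlock and the softmax gating rule are illustrative choices.

```python
import torch
import torch.nn as nn


class MultiPathBlock(nn.Module):
    """Mixes tokens along forward, reverse, and shuffled paths, then gates the merge."""

    def __init__(self, dim: int):
        super().__init__()
        # Stand-in causal sequence mixer; RSMamba would use a Mamba/SSM layer here.
        self.mixer = nn.GRU(dim, dim, batch_first=True)
        self.gate = nn.Linear(dim, 3)   # one learned weight per path, per token
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        perm = torch.randperm(n, device=x.device)
        inv = torch.argsort(perm)

        fwd, _ = self.mixer(x)                               # forward scan
        rev, _ = self.mixer(torch.flip(x, dims=[1]))         # reverse scan
        rev = torch.flip(rev, dims=[1])
        shuf, _ = self.mixer(x[:, perm, :])                  # shuffled scan
        shuf = shuf[:, inv, :]                               # restore original order

        # Token-wise softmax gates decide each path's contribution (an assumed
        # merging rule, not necessarily the paper's exact formulation).
        w = torch.softmax(self.gate(x), dim=-1)              # (batch, tokens, 3)
        merged = w[..., 0:1] * fwd + w[..., 1:2] * rev + w[..., 2:3] * shuf
        return self.norm(x + merged)                         # residual connection


tokens = torch.randn(2, 196, 192)          # e.g. 14x14 patches, 192-dim embeddings
print(MultiPathBlock(192)(tokens).shape)   # torch.Size([2, 196, 192])
```

Because the shuffled path is re-ordered back before merging, every token can aggregate information from positions both before and after it, which is what lifts the causal restriction of a single forward scan.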

Methodology Overview

RSMamba processes remote sensing images by flattening them into one-dimensional token sequences and capturing long-range dependencies with a Multi-Path SSM Encoder. Images are first split into patches and combined with positional encodings; the resulting token sequence is then scanned along forward, reverse, and shuffled orderings, and the outputs of the three paths are merged. This multi-path mechanism strengthens the model's ability to recover two-dimensional spatial structure from a one-dimensional sequence.
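
For concreteness, the sketch below strings the pipeline together end to end: patch embedding via a strided convolution, learnable positional embeddings, a stack of sequence-mixing blocks, mean pooling, and a linear classification head. PlaceholderMixer is a stand-in for RSMamba's multi-path SSM block (see the earlier sketch), and the hyperparameters are illustrative assumptions rather than the released RSMamba configurations.

```python
import torch
import torch.nn as nn


class PlaceholderMixer(nn.Module):
    """Stand-in for RSMamba's multi-path SSM block (see the earlier sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)
        return self.norm(x + out)


class TinySceneClassifier(nn.Module):
    """Patchify -> add positional embeddings -> stacked mixers -> pool -> classify."""

    def __init__(self, img_size=256, patch=16, dim=192, depth=4, num_classes=45):
        super().__init__()
        n_tokens = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.blocks = nn.ModuleList(PlaceholderMixer(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> tokens: (batch, n_tokens, dim)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        tokens = tokens + self.pos_embed
        for blk in self.blocks:
            tokens = blk(tokens)
        return self.head(tokens.mean(dim=1))   # mean-pool tokens, then classify


logits = TinySceneClassifier()(torch.randn(2, 3, 256, 256))
print(logits.shape)  # torch.Size([2, 45])
```

The 256x256 input and 45 classes mirror the RESISC45 setting mentioned above; swapping PlaceholderMixer for a multi-path SSM block is where the architecture described in the paper would diverge from this toy version.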

Numerical Results and Analysis

RSMamba's empirical performance is detailed across several datasets, where it consistently outperforms traditional architectures. For instance, the base version of RSMamba, with significantly fewer parameters, outperformed ResNet, DeiT, and ViT models, achieving F1 scores of 93.88, 91.66, and 94.84 on the UC Merced, AID, and RESISC45 datasets, respectively. These results support the hypothesis that RSMamba's efficient, global modeling strategy offers substantial advantages for remote sensing image classification.

Implications and Future Directions

The findings of RSMamba have significant practical and theoretical implications. Practically, the model's ability to pair CNN-like efficiency with the Transformer's global receptive field opens avenues for deployment in computationally constrained environments while still achieving high accuracy. Theoretically, it suggests that further exploration of state space models in computer vision could yield architectures with improved performance and efficiency.

Future research could extend RSMamba by exploring additional modalities of remote sensing data, such as hyperspectral and SAR imagery, or by incorporating it into broader remote sensing tasks like object detection and change detection. There is also potential for integrating RSMamba with existing large-scale modeling paradigms, thereby enhancing its versatility and impact on practical applications.

Thus, RSMamba represents a significant advancement in remote sensing image classification by effectively blending state space models with contemporary hardware-aware design, paving the way for more accurate and efficient processing of complex visual data.
