MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection (2403.02148v4)
Abstract: Infrared small target detection (ISTD) has recently made significant progress, thanks to the development of basic models. In particular, models that combine CNNs with transformers can extract both local and global features. However, they also inherit the transformer's main disadvantage: computational complexity that is quadratic in the sequence length. Inspired by Mamba, a recent basic model with linear complexity for long-distance modeling, we explore the potential of this state space model for the ISTD task in terms of both effectiveness and efficiency. Directly applying Mamba yields suboptimal performance because it does not sufficiently exploit local features, which are imperative for detecting small targets. Instead, we tailor a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient ISTD. It consists of Outer and Inner Mamba blocks that capture global and local features, respectively. Specifically, we treat the local patches as "visual sentences" and use the Outer Mamba to explore the global information. We then decompose each visual sentence into sub-patches as "visual words" and use the Inner Mamba to further explore the local information among the words in each visual sentence at negligible computational cost. By aggregating the visual word and visual sentence features, MiM-ISTD can effectively explore both global and local information. Experiments on NUAA-SIRST and IRSTD-1k show the superior accuracy and efficiency of our method. Specifically, MiM-ISTD is $8 \times$ faster than the SOTA method and reduces GPU memory usage by 62.2$\%$ when testing on $2048 \times 2048$ images, overcoming the computation and memory constraints on high-resolution infrared images.
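The sentence/word decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `to_sentences_and_words`, the patch size of 16, and the sub-patch size of 4 are all assumptions chosen for illustration, and the Outer and Inner Mamba blocks themselves are omitted. The sketch only shows how one image could yield a short sequence of patches (for a global model) and, per patch, a short sequence of sub-patches (for a local model).

```python
import numpy as np

def to_sentences_and_words(image, patch=16, sub=4):
    """Split an (H, W) image into patches ("visual sentences") and each patch
    into sub-patches ("visual words"). Sizes are illustrative assumptions."""
    h, w = image.shape
    if h % patch or w % patch or patch % sub:
        raise ValueError("image, patch, and sub-patch sizes must divide evenly")
    gh, gw = h // patch, w // patch   # patch grid size
    k = patch // sub                  # sub-patch grid size inside one patch

    # (H, W) -> (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (N, patch, patch)
    patches = (image.reshape(gh, patch, gw, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(gh * gw, patch, patch))

    # (N, patch, patch) -> (N, k, sub, k, sub) -> (N, k, k, sub, sub) -> (N, k*k, sub*sub)
    words = (patches.reshape(-1, k, sub, k, sub)
                    .transpose(0, 1, 3, 2, 4)
                    .reshape(gh * gw, k * k, sub * sub))

    # Flatten each patch into one "sentence" vector: (N, patch*patch)
    sentences = patches.reshape(gh * gw, patch * patch)
    return sentences, words

image = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
sentences, words = to_sentences_and_words(image)
print(sentences.shape)  # sequence seen by a global (outer) model
print(words.shape)      # per-sentence sequences seen by a local (inner) model
```

In this sketch the outer sequence length is the number of patches, while each inner sequence contains only the sub-patches of a single patch, which is why the local pass stays cheap relative to the global one.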