
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model (2401.09417v3)

Published 17 Jan 2024 in cs.CV and cs.LG

Abstract: Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.

Introduction

The field of computer vision has seen remarkable advances, driven first by convolutional neural networks (CNNs) and, more recently, by vision transformers (ViTs). Both paradigms, however, face challenges when processing high-resolution images, a capability critical to many applications. One promising way to address these computational challenges is the family of state space models (SSMs), in particular the Mamba model, whose hardware-aware design captures long-range dependencies efficiently. This paper introduces Vision Mamba (Vim), a pure SSM-based vision backbone that offers competitive performance on visual tasks without relying on self-attention.

Methodology

The proposed Vim is built from bidirectional Mamba blocks that combine SSMs with awareness of global visual context and spatial position. An image is flattened into a sequence of patch tokens, marked with position embeddings, and compressed by bidirectional selective state space models. This design enables feature extraction at significantly higher speed and lower memory cost than current transformer-based models; a simplified sketch of such a block follows.
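To make the bidirectional block concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation: SelectiveSSM below is a hypothetical stand-in (a simple input-gated linear recurrence) for Mamba's selective scan, and the real Vim block additionally applies 1-D convolutions before the scans and uses a hardware-aware kernel. The sketch only shows the structural idea: normalize, scan the patch-token sequence in both directions, merge, and add a residual connection.

```python
import torch
import torch.nn as nn

class SelectiveSSM(nn.Module):
    """Hypothetical stand-in for Mamba's selective scan: a simple
    input-gated linear recurrence over the token sequence."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.zeros(dim))   # per-channel forget rate

    def forward(self, x):                  # x: (B, M, D), scanned left to right
        h = torch.zeros_like(x[:, 0])      # hidden state, one vector per sample
        a = torch.sigmoid(self.decay)
        outs = []
        for t in range(x.size(1)):
            u = self.in_proj(x[:, t])
            h = a * h + (1 - a) * u        # linear state update: O(M * D) overall
            outs.append(h * torch.sigmoid(self.gate(x[:, t])))
        return torch.stack(outs, dim=1)

class BidirectionalVimBlock(nn.Module):
    """Structural sketch of a Vim-style block: normalize, scan the patch
    tokens forward and backward, merge the two directions, residual-add."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd_ssm = SelectiveSSM(dim)
        self.bwd_ssm = SelectiveSSM(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, tokens):             # tokens: (B, M, D), position embeddings already added
        x = self.norm(tokens)
        fwd = self.fwd_ssm(x)                   # left-to-right scan
        bwd = self.bwd_ssm(x.flip(1)).flip(1)   # right-to-left scan
        return tokens + self.out_proj(fwd + bwd)

# A 224x224 image with 16x16 patches gives 196 patch tokens (plus a class token).
tokens = torch.randn(2, 197, 192)                 # (batch, tokens, embedding dim)
print(BidirectionalVimBlock(192)(tokens).shape)   # torch.Size([2, 197, 192])
```

Because each scan touches every token once with a fixed-size state, the cost of the block grows linearly with the number of patch tokens, which is the property the efficiency analysis below relies on.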

Vim's approach is validated through extensive evaluation against existing models on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. It outperforms DeiT, a widely used vision transformer, in both accuracy and computational efficiency.

Efficiency Analysis

The researchers analyze Vim's efficiency on hardware accelerators such as GPUs, focusing on input/output (IO) traffic and memory use. Vim requires notably less IO than attention-based designs and relies on recomputation to minimize the memory footprint of intermediate activations when computing gradients; a minimal illustration of the recomputation idea follows.
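The snippet below is a generic PyTorch illustration of that recomputation principle using activation checkpointing, not the paper's implementation (Vim recomputes activations inside its hardware-aware scan kernel). The module and tensor shapes are made up for the example.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Generic recomputation: skip caching a block's intermediate activations on the
# forward pass and re-run the block when the backward pass needs them.
block = nn.Sequential(nn.LayerNorm(192), nn.Linear(192, 768), nn.GELU(), nn.Linear(768, 192))
tokens = torch.randn(2, 197, 192, requires_grad=True)

out = checkpoint(block, tokens, use_reentrant=False)  # forward pass, internals not cached
loss = out.sum()
loss.backward()                                       # block's forward is recomputed here
print(tokens.grad.shape)                              # torch.Size([2, 197, 192])
```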

Moreover, Vim's computational advantage over self-attention is most apparent at scale: its cost grows linearly rather than quadratically with sequence length, so it can handle much longer sequences and therefore image resolutions previously considered impractical for transformer-style models.
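To make the scaling argument explicit, here is a standard back-of-the-envelope comparison (generic complexity accounting, not figures taken from the paper) for a sequence of M tokens with embedding dimension D and a fixed SSM state size N:

```latex
% Per-layer token-mixing cost, constants and pointwise projections omitted:
% self-attention materializes an M-by-M interaction matrix, while the SSM
% scans the sequence once with a fixed-size state N per channel.
\Omega(\text{self-attention}) = \mathcal{O}(M^{2} D), \qquad
\Omega(\text{SSM}) = \mathcal{O}(M N D), \qquad N \ll M .
```

Doubling the input resolution quadruples M, so the self-attention term grows roughly 16-fold while the SSM term grows roughly 4-fold, which is why Vim remains tractable at 1248×1248 inputs.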

Experimental Results

Empirical evidence supports the practicality and robustness of Vim. On ImageNet-1K image classification, Vim reaches a higher top-1 accuracy than DeiT at a comparable parameter count. Semantic segmentation on ADE20K echoes these results, with Vim matching the performance of ResNet-101 while requiring significantly fewer computational resources.

The performance gains extend to object detection and instance segmentation tasks on the COCO dataset. Vim demonstrates a stronger ability to capture long-range context compared to DeiT, as illustrated by its superior performance in detecting medium and large-sized objects.

Conclusion

In summary, Vim is a compelling alternative to traditional CNNs and ViTs, offering an efficient and effective solution to the challenge of visual representation learning. With its capability to process long sequences more efficiently and its exceptional handling of high-resolution images, Vim presents itself as a potential backbone for the next generation of vision foundation models. Future research may leverage Vim for large-scale unsupervised visual data pretraining, multimodal task processing, and the analysis of complex images in various domains such as medical imaging and remote sensing.

Authors (6)
  1. Lianghui Zhu (12 papers)
  2. Bencheng Liao (20 papers)
  3. Qian Zhang (308 papers)
  4. Xinlong Wang (56 papers)
  5. Wenyu Liu (146 papers)
  6. Xinggang Wang (163 papers)
Citations (450)

GitHub

  1. GitHub - hustvl/Vim (2,446 stars)