
VMamba: Visual State Space Model (2401.10166v3)

Published 18 Jan 2024 in cs.CV

Abstract: Designing computationally efficient network architectures persists as an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space LLM, into VMamba, a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba's promising performance across diverse visual perception tasks, highlighting its advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.

Authors (8)
  1. Yue Liu (257 papers)
  2. Yunjie Tian (17 papers)
  3. Yuzhong Zhao (18 papers)
  4. Hongtian Yu (5 papers)
  5. Lingxi Xie (137 papers)
  6. Yaowei Wang (149 papers)
  7. Qixiang Ye (110 papers)
  8. Yunfan Liu (24 papers)
Citations (376)

Summary

Overview of Visual State Space Model (VMamba)

In visual representation learning, two families of foundation models have dominated the field: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNNs scale well, with computational complexity that grows linearly in the number of image pixels. ViTs, by contrast, offer superior fitting capability but face a computational complexity that grows quadratically in the number of image tokens. What gives ViTs their edge are the global receptive fields and dynamic (input-dependent) weights of the self-attention mechanism.
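As a rough back-of-the-envelope comparison (not a calculation from the paper itself), the per-layer costs can be written in terms of the token count N = H x W and the channel dimension d:

```latex
\[
\underbrace{C_{\text{attn}} = O(N^2 d)}_{\text{global self-attention}}
\qquad
\underbrace{C_{\text{ssm}} \approx C_{\text{conv}} = O(N d)}_{\text{selective scan / convolution}},
\qquad N = H \times W .
\]
% Doubling the image side length quadruples N: the attention cost grows
% roughly 16x, while the linear-complexity alternatives grow roughly 4x.
```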

Introducing VMamba

A novel architecture, the Visual State Space Model (VMamba), is introduced to combine the strengths of CNNs and ViTs while addressing their respective computational-efficiency issues. VMamba retains the ViT advantages of global receptive fields and dynamic weights, yet does so with linear computational complexity. To overcome the direction-sensitivity that arises when a 1D, causal scan is applied to non-causal visual data, VMamba employs a new module called the Cross-Scan Module (CSM), which traverses the spatial domain along multiple scanning routes in a way that preserves these global properties without the computational expense typically incurred by ViTs.
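To make the cross-scan idea concrete, here is a minimal sketch of unfolding a feature map along four routes (row-major, column-major, and their reverses) and folding the results back. This is an illustration under assumed tensor layouts, not the authors' implementation; the names cross_scan and cross_merge are hypothetical, and in the full SS2D module each of the four sequences would pass through its own 1D selective scan before merging.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a feature map (B, C, H, W) into four 1D sequences (B, 4, C, H*W):
    row-major, column-major, and the reverse of each."""
    B, C, H, W = x.shape
    row = x.flatten(2)                    # (B, C, L), row-major traversal
    col = x.transpose(2, 3).flatten(2)    # (B, C, L), column-major traversal
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)

def cross_merge(y: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Invert cross_scan: map the four scanned sequences (B, 4, C, L)
    back to spatial order and sum them into one (B, C, H, W) map."""
    B, K, C, L = y.shape
    row, col, row_r, col_r = y.unbind(dim=1)
    row = row + row_r.flip(-1)            # undo the reversed row-major scan
    col = (col + col_r.flip(-1)).view(B, C, W, H).transpose(2, 3).flatten(2)
    return (row + col).view(B, C, H, W)
```

Because every pixel appears in all four sequences, each position can aggregate information from every other position after the scans are merged, which is what restores a global receptive field from purely 1D scans.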

The Backbone of VMamba

At the heart of VMamba is a mechanism inspired by state space models, in particular the selective scan state space model (S6) originally developed for language modeling. The selective scan mechanism of S6 is what enables VMamba to maintain a global receptive field while circumventing quadratic complexity. The CSM also plays a crucial role, ensuring that every element within the spatial domain of an image can integrate information from all other locations. This is achieved via a four-way scanning strategy that preserves linear computational complexity.
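The following is a minimal, sequential reference of a selective-scan recurrence of the kind S6 uses, shown only to clarify the mechanism: the step size, input matrix, and output matrix are functions of the input, and the hidden state is updated by a linear recurrence. The function name, tensor shapes, and the simple zero-order-hold style discretisation are illustrative assumptions; practical implementations use a fused, parallelised kernel rather than a Python loop.

```python
import torch

def selective_scan_1d(x, delta, A, B, C):
    """Reference (non-parallel) selective scan over one sequence.
    x:     (L, D)  input tokens
    delta: (L, D)  input-dependent step sizes (the "selection")
    A:     (D, N)  state-transition parameters
    B, C:  (L, N)  input-dependent input/output projections
    Returns y: (L, D).
    """
    L, D = x.shape
    N = A.shape[1]
    h = torch.zeros(D, N)                                  # hidden state
    ys = []
    for t in range(L):
        A_bar = torch.exp(delta[t].unsqueeze(-1) * A)      # discretised A, (D, N)
        B_bar = delta[t].unsqueeze(-1) * B[t].unsqueeze(0) # discretised B, (D, N)
        h = A_bar * h + B_bar * x[t].unsqueeze(-1)         # linear recurrence
        ys.append((h * C[t].unsqueeze(0)).sum(-1))         # readout, (D,)
    return torch.stack(ys)                                 # (L, D)
```

In SS2D, one such scan is run along each of the four cross-scan routes, and the per-route outputs are merged back into the 2D feature map.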

Benchmarking VMamba's Performance

VMamba was put through extensive testing across a variety of visual perception tasks. The results are consistent: VMamba exhibits strong performance, and its advantages become more pronounced as the input resolution increases. Compared with established baselines such as ResNet, ViT, and the Swin Transformer, VMamba holds its own, especially on larger image inputs where other models see a significant rise in computational demand. Importantly, VMamba shows that an architecture can combine the desirable qualities of a global receptive field and dynamic weights without becoming computationally prohibitive.
