
VMamba: Visual State Space Model (2401.10166v3)

Published 18 Jan 2024 in cs.CV

Abstract: Designing computationally efficient network architectures persists as an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space LLM, into VMamba, a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba's promising performance across diverse visual perception tasks, highlighting its advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.

Authors (8)
  1. Yue Liu (257 papers)
  2. Yunjie Tian (17 papers)
  3. Yuzhong Zhao (18 papers)
  4. Hongtian Yu (5 papers)
  5. Lingxi Xie (137 papers)
  6. Yaowei Wang (149 papers)
  7. Qixiang Ye (110 papers)
  8. Yunfan Liu (24 papers)
Citations (376)

Summary

Overview of Visual State Space Model (VMamba)

In visual representation learning, two families of foundation models have dominated the field: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNNs scale well, with computational complexity that grows linearly in the number of image pixels. ViTs, by contrast, offer superior fitting capability but face a computational complexity that grows quadratically in the number of image tokens. What gives ViTs their edge are the global receptive fields and dynamic (input-dependent) weights of the self-attention mechanism.
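As a rough back-of-the-envelope comparison (not a calculation from the paper itself), the per-layer costs can be written in terms of the token count N = H x W and the channel dimension d:

```latex
\[
\underbrace{C_{\text{attn}} = O(N^2 d)}_{\text{global self-attention}}
\qquad
\underbrace{C_{\text{ssm}} \approx C_{\text{conv}} = O(N d)}_{\text{selective scan / convolution}},
\qquad N = H \times W .
\]
% Doubling the image side length quadruples N: the attention cost grows
% roughly 16x, while the linear-complexity alternatives grow roughly 4x.
```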

Introducing VMamba

A novel architecture, the Visual State Space Model (VMamba), is introduced to combine the strengths of CNNs and ViTs while addressing their respective computational-efficiency issues. VMamba retains the ViT advantages of global receptive fields and dynamic weights, yet does so with linear computational complexity. To overcome the direction-sensitivity that arises when a 1D, causal scan is applied to non-causal visual data, VMamba employs a new module called the Cross-Scan Module (CSM), which traverses the spatial domain along multiple scanning routes in a way that preserves these global properties without the computational expense typically incurred by ViTs.
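To make the cross-scan idea concrete, here is a minimal sketch of unfolding a feature map along four routes (row-major, column-major, and their reverses) and folding the results back. This is an illustration under assumed tensor layouts, not the authors' implementation; the names cross_scan and cross_merge are hypothetical, and in the full SS2D module each of the four sequences would pass through its own 1D selective scan before merging.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a feature map (B, C, H, W) into four 1D sequences (B, 4, C, H*W):
    row-major, column-major, and the reverse of each."""
    B, C, H, W = x.shape
    row = x.flatten(2)                    # (B, C, L), row-major traversal
    col = x.transpose(2, 3).flatten(2)    # (B, C, L), column-major traversal
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)

def cross_merge(y: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Invert cross_scan: map the four scanned sequences (B, 4, C, L)
    back to spatial order and sum them into one (B, C, H, W) map."""
    B, K, C, L = y.shape
    row, col, row_r, col_r = y.unbind(dim=1)
    row = row + row_r.flip(-1)            # undo the reversed row-major scan
    col = (col + col_r.flip(-1)).view(B, C, W, H).transpose(2, 3).flatten(2)
    return (row + col).view(B, C, H, W)
```

Because every pixel appears in all four sequences, each position can aggregate information from every other position after the scans are merged, which is what restores a global receptive field from purely 1D scans.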

The Backbone of VMamba

At the heart of VMamba is a mechanism inspired by state space models, in particular the selective scan state space model (S6) originally developed for language modeling. The selective scan mechanism of S6 is what enables VMamba to maintain a global receptive field while circumventing quadratic complexity. The CSM also plays a crucial role, ensuring that every element within the spatial domain of an image can integrate information from all other locations. This is achieved via a four-way scanning strategy that preserves linear computational complexity.
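The following is a minimal, sequential reference of a selective-scan recurrence of the kind S6 uses, shown only to clarify the mechanism: the step size, input matrix, and output matrix are functions of the input, and the hidden state is updated by a linear recurrence. The function name, tensor shapes, and the simple zero-order-hold style discretisation are illustrative assumptions; practical implementations use a fused, parallelised kernel rather than a Python loop.

```python
import torch

def selective_scan_1d(x, delta, A, B, C):
    """Reference (non-parallel) selective scan over one sequence.
    x:     (L, D)  input tokens
    delta: (L, D)  input-dependent step sizes (the "selection")
    A:     (D, N)  state-transition parameters
    B, C:  (L, N)  input-dependent input/output projections
    Returns y: (L, D).
    """
    L, D = x.shape
    N = A.shape[1]
    h = torch.zeros(D, N)                                  # hidden state
    ys = []
    for t in range(L):
        A_bar = torch.exp(delta[t].unsqueeze(-1) * A)      # discretised A, (D, N)
        B_bar = delta[t].unsqueeze(-1) * B[t].unsqueeze(0) # discretised B, (D, N)
        h = A_bar * h + B_bar * x[t].unsqueeze(-1)         # linear recurrence
        ys.append((h * C[t].unsqueeze(0)).sum(-1))         # readout, (D,)
    return torch.stack(ys)                                 # (L, D)
```

In SS2D, one such scan is run along each of the four cross-scan routes, and the per-route outputs are merged back into the 2D feature map.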

Benchmarking VMamba's Performance

VMamba was put through extensive testing across a variety of visual perception tasks. The results are consistent: VMamba exhibits strong performance, and its advantages become more pronounced as the input resolution increases. Compared with established baselines such as ResNet, ViT, and the Swin Transformer, VMamba holds its own, especially on larger image inputs where other models see a significant rise in computational demand. Importantly, VMamba shows that an architecture can combine the desirable qualities of a global receptive field and dynamic weights without becoming computationally prohibitive.
