
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba (2403.09977v1)

Published 15 Mar 2024 in cs.CV and cs.AI

Abstract: Prior efforts in light-weight model development mainly centered on CNN and Transformer-based designs yet faced persistent challenges. CNNs adept at local feature extraction compromise resolution while Transformers offer global reach but escalate computational demands $\mathcal{O}(N^2)$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduce a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates an atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates the model performance. Experimental results show that EfficientVMamba scales down the computational complexity while yielding competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with $1.3$G FLOPs improves Vim-Ti with $1.5$G FLOPs by a large margin of $5.6\%$ accuracy on ImageNet. Code is available at: \url{https://github.com/TerryPei/EfficientVMamba}.

An Analytical Overview of "EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba"

The paper presents "EfficientVMamba", an innovative lightweight visual representation model, emphasizing efficiency and accuracy. The focal point is the Atrous Selective Scan (ASS) method embedded within the Visual Mamba framework. The paper primarily explores the adaptability of EfficientVMamba as a lightweight backbone within established object detection frameworks such as RetinaNet.
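To make the skip-sampling idea concrete, the sketch below partitions a flattened token sequence into interleaved, dilated sub-sequences and scans each independently before re-interleaving the results. This is a minimal illustration, not the authors' implementation: the function name `skip_sample_scan` is hypothetical, and a cumulative sum stands in for the actual selective-scan (SSM) recurrence.

```python
import numpy as np

def skip_sample_scan(tokens, step=2):
    """Sketch of an atrous (skip-sampling) scan.

    Split a 1-D token sequence into `step` interleaved sub-sequences,
    run a linear-time scan over each (np.cumsum is a stand-in for the
    selective SSM recurrence), then re-interleave into original order.
    """
    out = np.empty(len(tokens), dtype=float)
    for offset in range(step):
        sub = tokens[offset::step]          # dilated sub-sequence
        out[offset::step] = np.cumsum(sub)  # placeholder recurrence
    return out

# Each sub-sequence is scanned over len(tokens)/step elements,
# which is where the efficiency gain over a dense scan comes from.
print(skip_sample_scan(np.arange(6), step=2))  # [0. 1. 2. 4. 6. 9.]
```

With `step=2`, the even-index tokens `[0, 2, 4]` and odd-index tokens `[1, 3, 5]` are scanned separately, so each scan touches only half the sequence.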

Performance Evaluation on RetinaNet

EfficientVMamba's performance is examined through rigorous comparisons against popular architectures including ResNet and Pyramid Vision Transformer (PVT) series. In the COCO dataset evaluations with RetinaNet, EfficientVMamba-T achieved a significant performance boost with a 0.8% and 0.9% increase in AP and AP$_{50}$ respectively over PVTv1-Tiny, while reducing parameters from 23M to 13M. Concurrently, EfficientVMamba-B enhanced performance by 0.9% in AP compared to PVTv1-Medium, decreasing parameter count from 53.9M to 44M. These improvements underscore the model's potential to retain high detection accuracy alongside compact model size, crucial for deployment in resource-constrained environments.

Comparative Analysis with MobileNetV2

The paper further compares EfficientVMamba's Efficient Visual State Space (EVSS) blocks against the traditional Inverted Residual (InRes) blocks of MobileNetV2 architectures. The results demonstrate that a hybrid approach, where EVSS blocks are applied in the initial network stages followed by InRes blocks, achieves superior performance. This strategy yields an accuracy of 76.5% for the tiny variant and 81.8% for the base variant on the ImageNet dataset. The hybrid model effectively combines the computational efficiency of EVSS in earlier stages with the enhanced representational capabilities of InRes in later stages.
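The hybrid stage layout described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: the helper `build_stage_plan` is a hypothetical name, and the inverted residual block is reduced to plain matrix operations on a single channel vector (real blocks also include a depthwise convolution and normalization).

```python
import numpy as np

def inverted_residual(x, w_expand, w_project):
    """Toy MobileNetV2-style inverted residual on a channel vector:
    pointwise expand -> ReLU -> pointwise project, plus a skip connection."""
    h = np.maximum(w_expand @ x, 0.0)  # expansion + nonlinearity
    return x + w_project @ h           # projection + residual add

def build_stage_plan(num_stages=4, evss_stages=2):
    """Hypothetical helper mirroring the hybrid design: EVSS (SSM) blocks
    in the early stages, inverted residual (InRes) blocks in the later ones."""
    return ["EVSS" if i < evss_stages else "InRes" for i in range(num_stages)]

print(build_stage_plan())  # ['EVSS', 'EVSS', 'InRes', 'InRes']
```

With zero projection weights the residual path dominates and the block reduces to the identity, which is one reason such blocks are easy to stack deeply.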

Limitations and Future Research Directions

Despite achieving favorable results, the authors identify certain limitations with EfficientVMamba. Visual state space models, while beneficial for high-resolution tasks, exhibit increased computational complexity when compared to CNNs and Transformers. This complexity challenges parallel processing efficiency. Future research is recommended to enhance the computational scalability and efficiency of visual state space models, potentially modifying SSMs to better align with parallel processing requirements while retaining their structural advantages.

Implications and Future Prospects

EfficientVMamba presents a promising advancement in the landscape of lightweight deep learning models. The successful integration of Atrous Selective Scan within the Visual Mamba framework highlights the potential for more effective resource management in deploying deep learning models. In view of the increasing demand for models that operate efficiently on edge devices, EfficientVMamba's approach could inspire new architectures that strike a balance between performance and model size. As AI continues to expand into different application domains, architectures embodying these principles may become indispensable in fields requiring robust yet efficient computational frameworks.

The model's benchmarks against prevalent backbone architectures offer useful insight into the trade-offs between computational load and visual task accuracy. As research progresses, EfficientVMamba could serve as a foundational model that informs future developments in lightweight neural network design.

Authors (3)
  1. Xiaohuan Pei
  2. Tao Huang
  3. Chang Xu
Citations (57)