Simplified Mamba-based Architecture for Vision and Multivariate Time Series Analysis
Introduction
Recent advances in deep learning have highlighted the effectiveness of Transformer models in handling sequential data across domains including NLP and computer vision. However, the quadratic complexity of multi-head self-attention (MHSA) makes these models difficult to scale, particularly for long sequences. In response, State Space Models (SSMs) such as S4 and, more recently, Mamba have emerged as potent alternatives. This paper introduces SiMBA, a simplified Mamba-based architecture that combines Einstein FFT (EinFFT) for channel modeling with the Mamba block for sequence modeling. Extensive evaluation shows that SiMBA outperforms both existing SSMs and state-of-the-art transformers across a range of benchmarks.
Motivation and Background
Transformers and SSMs have become pivotal for processing sequential data because of their ability to capture long-range dependencies. However, MHSA's quadratic computational cost with respect to sequence length motivates the exploration of SSMs. Mamba, a recent SSM, addresses the efficiency and inductive-bias limitations of transformers by introducing a selective state space mechanism that propagates information in an input-dependent manner. Despite these innovations, Mamba suffers from training instability when scaled to large network sizes, especially in computer vision tasks. To overcome these obstacles, we propose SiMBA, which incorporates a novel channel modeling technique, EinFFT, to improve stability and performance.
Model Architecture
SiMBA innovates by introducing EinFFT for spectral channel mixing, complementing the Mamba block's selective state space approach for sequence modeling. This combination addresses both the quadratic complexity problem and the stability issues observed with Mamba, positioning SiMBA as a leading architecture for processing long sequences.
- Sequence Modeling with Mamba: SiMBA adopts the Mamba block as a modular sequence mixer, using its selective state space mechanism to capture long-range dependencies efficiently; a block-level sketch that combines the two mixers appears after this list.
- Channel Modeling with EinFFT: EinFFT is the pivotal innovation in SiMBA, designed specifically to tackle the channel-modeling challenges that arise with Mamba. By pairing Fourier transforms with complex-valued eigenvalue computations, EinFFT improves the model's stability and its ability to capture channel information; a minimal sketch of the channel mixer follows this list.
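To make the channel-mixing idea concrete, the following is a minimal PyTorch sketch of an EinFFT-style channel mixer: a real FFT over the token dimension, block-diagonal ("Einstein") complex-valued matrix multiplications in the frequency domain with a simple nonlinearity, and an inverse FFT. The module and parameter names, the FFT axis, the block layout, and the nonlinearity are illustrative assumptions, not the paper's exact implementation.

    # Minimal sketch of an EinFFT-style spectral channel mixer (illustrative only).
    # Assumptions: rFFT over the token dimension, block-diagonal ("Einstein")
    # complex-valued weights, and a ReLU applied separately to real/imaginary parts.
    import torch
    import torch.nn as nn


    class EinFFTChannelMixer(nn.Module):
        def __init__(self, dim: int, num_blocks: int = 4):
            super().__init__()
            assert dim % num_blocks == 0, "dim must be divisible by num_blocks"
            self.num_blocks = num_blocks
            self.block_size = dim // num_blocks
            # Complex block-diagonal weights stored as stacked real/imaginary parts.
            self.w1 = nn.Parameter(0.02 * torch.randn(2, num_blocks, self.block_size, self.block_size))
            self.w2 = nn.Parameter(0.02 * torch.randn(2, num_blocks, self.block_size, self.block_size))

        def _emm(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
            # Einstein matrix multiplication: each channel block gets its own complex weight.
            return torch.einsum("bfnd,nde->bfne", x, torch.complex(w[0], w[1]))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, tokens, channels), real-valued
            b, n, c = x.shape
            x_f = torch.fft.rfft(x, dim=1, norm="ortho")                     # to frequency domain
            x_f = x_f.reshape(b, x_f.shape[1], self.num_blocks, self.block_size)
            x_f = self._emm(x_f, self.w1)
            x_f = torch.complex(torch.relu(x_f.real), torch.relu(x_f.imag))  # simple complex nonlinearity
            x_f = self._emm(x_f, self.w2)
            x_f = x_f.reshape(b, x_f.shape[1], c)
            return torch.fft.irfft(x_f, n=n, dim=1, norm="ortho")            # back to token domain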
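Building on that component, a SiMBA block can be sketched as a pre-norm residual pair of mixers: a Mamba module for sequence (token) mixing followed by EinFFT for channel mixing. The Mamba import assumes the third-party mamba_ssm package, and the normalization and ordering choices here are illustrative assumptions rather than the paper's exact configuration; the channel mixer reuses the EinFFTChannelMixer sketched above.

    # Sketch of a SiMBA block: Mamba for sequence mixing, EinFFT for channel mixing,
    # each in a pre-norm residual branch. The mamba_ssm dependency and the exact
    # norm/ordering choices are assumptions for illustration.
    import torch.nn as nn
    from mamba_ssm import Mamba  # assumed third-party package (pip install mamba-ssm)


    class SiMBABlock(nn.Module):
        def __init__(self, dim: int, num_blocks: int = 4):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.seq_mixer = Mamba(d_model=dim)                         # sequence modeling
            self.norm2 = nn.LayerNorm(dim)
            self.channel_mixer = EinFFTChannelMixer(dim, num_blocks)    # channel modeling (sketched above)

        def forward(self, x):
            # x: (batch, tokens, channels)
            x = x + self.seq_mixer(self.norm1(x))        # token mixing via selective SSM
            x = x + self.channel_mixer(self.norm2(x))    # spectral channel mixing
            return x

A full backbone would then stack such blocks behind a patch-embedding (vision) or series-embedding (forecasting) layer and attach a task-specific head.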
Main Contributions
- Stability and Performance: SiMBA's primary contribution lies in its ability to maintain model stability while scaling to large networks, surpassing Mamba and other SSMs in performance benchmarks.
- Cross-Domain Efficacy: Extensive testing on both image and time-series datasets establishes SiMBA's versatility, demonstrating its superior performance compared to leading attention-based transformers and traditional SSMs across multiple domains.
- EinFFT for Channel Modeling: The introduction of EinFFT represents a significant advancement in the field of state space modeling, offering a robust solution for spectral channel mixing that solves the previously noted stability issues.
Experimental Results
SiMBA achieves new state-of-the-art results across numerous benchmarks, including notable gains on ImageNet classification and on several time series forecasting tasks. Compared with attention-based models and other SSMs, SiMBA not only delivers superior accuracy but also exhibits strong efficiency and generalization on transfer learning tasks.
Future Directions
This paper paves the way for future investigations into alternative sequence and channel modeling techniques within the SiMBA framework. The adaptability of SiMBA suggests potential enhancements through the exploration of different structural configurations, promising further advancements in both theoretical and practical applications of deep learning for sequential data processing.
Conclusion
SiMBA addresses the critical challenges faced by both traditional transformers and recent SSMs, offering a robust solution that combines the strengths of Mamba's selective state space mechanism and the novel EinFFT channel modeling technique. The model's exceptional performance across a diverse set of benchmarks underscores its potential as a transformative approach in the domain of sequential data processing.