The paper introduces Mamba, a novel sequence modeling architecture based on selective state space models (SSMs) that aims to address the limitations of Transformers in terms of computational efficiency and long-range dependency modeling. The core innovation lies in incorporating an input-dependent selection mechanism into structured SSMs, enabling the model to selectively attend to or filter out information along the sequence dimension. This selectivity allows Mamba to achieve Transformer-level performance while maintaining linear scaling in sequence length.
The authors identify the inability to perform content-based reasoning as a key weakness of existing subquadratic-time architectures such as linear attention, gated convolution, recurrent models, and structured SSMs. To address this, they make the SSM parameters functions of the input, allowing the model to selectively propagate or forget information along the sequence depending on the current token. This modification precludes the use of efficient convolutions, so the authors design a hardware-aware parallel algorithm that computes the model in recurrent mode while preserving computational efficiency.
The resulting Mamba architecture integrates these selective SSMs into a simplified end-to-end neural network design, eliminating the need for attention mechanisms or even MLP blocks. This streamlined architecture offers several advantages:
- Fast inference: Mamba achieves 5x higher inference throughput than Transformers.
- Linear scaling: Computation and memory scale linearly with sequence length.
- Long context: Performance improves on real data at sequence lengths up to one million tokens.
The authors validate Mamba's effectiveness across various modalities and settings:
- Synthetic tasks: Mamba excels at tasks such as Selective Copying and Induction Heads, demonstrating its ability to extrapolate solutions well beyond its training lengths (a data-generation sketch of the Selective Copying setup follows this list). On the Selective Copying task, S6-based models achieved >97% accuracy, while baseline LTI models struggled (18.3% accuracy for S4 without gating). On the Induction Heads task, Mamba generalized perfectly to million-length sequences, 4000x longer than its training sequence length.
- Audio and genomics: Mamba outperforms prior state-of-the-art models such as SaShiMi, Hyena, and Transformers in modeling audio waveforms and DNA sequences, with improvements both in pretraining quality and in downstream metrics such as FID for speech generation.
- Language modeling: Mamba achieves Transformer-quality performance on language modeling, with the Mamba-3B model outperforming Transformers of the same size and matching Transformers twice its size in both pretraining and downstream evaluation. For example, Mamba-3B exceeds Pythia-3B by about 4 points on average on common sense reasoning benchmarks and also outperforms Pythia-7B.
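As referenced above, the following is a minimal, hypothetical generator for a Selective Copying-style dataset: a handful of content tokens are scattered at random positions among noise tokens, and the model must reproduce them in order, which content-independent (LTI) dynamics cannot do. All names and parameters are illustrative, not the paper's exact task specification:

```python
import torch

def selective_copying_batch(batch: int, seq_len: int = 256, n_tokens: int = 16,
                            vocab: int = 10, noise_id: int = 0):
    """Sketch of a Selective Copying-style task: scatter `n_tokens` content tokens
    (ids 1..vocab-1) into a noise-filled sequence; the target is the content in order."""
    x = torch.full((batch, seq_len), noise_id, dtype=torch.long)
    targets = torch.randint(1, vocab, (batch, n_tokens))
    for b in range(batch):
        positions = torch.randperm(seq_len)[:n_tokens].sort().values  # random, increasing positions
        x[b, positions] = targets[b]
    return x, targets
```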
The paper provides a detailed explanation of structured state space models (S4) and their relation to RNNs, CNNs, and classical state space models. S4 models are defined by four parameters $(\Delta, A, B, C)$, which define a sequence-to-sequence transformation. The "continuous parameters" $(\Delta, A, B)$ are transformed to "discrete parameters" $(\bar{A}, \bar{B})$ through fixed formulas given by a discretization rule. The model can then be computed either as a linear recurrence or as a global convolution.
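Concretely, with the zero-order hold discretization used in the paper, the discrete parameters and the two equivalent computation modes are

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr) \cdot \Delta B,$$

$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,$$

$$\bar{K} = \bigl(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{k}\bar{B},\; \dots\bigr), \qquad y = x * \bar{K}.$$

The recurrent form gives constant-memory, linear-time inference, while the convolutional form enables parallel training but is only available when the parameters are time-invariant, which is precisely what the selection mechanism below gives up.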
A key contribution of the paper is the introduction of a selection mechanism that addresses the limitations of LTI models. By making the SSM parameters input-dependent, the model can selectively filter out irrelevant information and remember relevant information indefinitely. This is achieved by parameterizing $\Delta$, $B$, and $C$ as functions of the input, resulting in time-varying SSMs. The authors specifically choose $s_B(x) = \text{Linear}_N(x)$, $s_C(x) = \text{Linear}_N(x)$, $s_\Delta(x) = \text{Broadcast}_D(\text{Linear}_1(x))$, and $\tau_\Delta = \text{softplus}$.
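A minimal PyTorch sketch of these projections is shown below; the module and dimension names (`d_model`, `d_state`) are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Sketch: compute input-dependent SSM parameters (Delta, B, C) from the input x."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.B_proj = nn.Linear(d_model, d_state)   # s_B(x) = Linear_N(x)
        self.C_proj = nn.Linear(d_model, d_state)   # s_C(x) = Linear_N(x)
        self.dt_proj = nn.Linear(d_model, 1)        # s_Delta(x) = Broadcast_D(Linear_1(x))

    def forward(self, x):                            # x: (batch, length, d_model)
        B = self.B_proj(x)                           # (batch, length, d_state)
        C = self.C_proj(x)                           # (batch, length, d_state)
        delta = F.softplus(self.dt_proj(x))          # tau_Delta = softplus, shape (batch, length, 1)
        delta = delta.expand(-1, -1, x.shape[-1])    # broadcast over the D channels
        return delta, B, C
```

Because $\Delta$, $B$, and $C$ now vary per time step, the SSM can no longer be expressed as a single global convolution kernel.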
To address the computational challenges posed by time-varying SSMs, the authors develop a hardware-aware algorithm called Selective Scan that computes the model recurrently using a scan instead of convolution. This algorithm leverages kernel fusion, parallel scan, and recomputation techniques to optimize performance on modern GPUs. The fused selective scan layer has the same memory requirements as an optimized Transformer implementation with FlashAttention.
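The fused GPU kernel itself is hardware-specific, but the recurrence it evaluates can be written as a plain sequential scan. The reference sketch below (illustrative shapes, a simplified Euler-style discretization of $B$, and no kernel fusion or parallel scan) shows what the selective scan computes:

```python
import torch

def selective_scan_reference(x, delta, A, B, C):
    """Sequential reference for the selective scan recurrence:
        h_t = exp(delta_t * A) * h_{t-1} + (delta_t * B_t) * x_t,   y_t = <C_t, h_t>
    x, delta: (batch, length, d_model); A: (d_model, d_state); B, C: (batch, length, d_state)."""
    batch, length, d_model = x.shape
    h = torch.zeros(batch, d_model, A.shape[-1], device=x.device, dtype=x.dtype)
    ys = []
    for t in range(length):
        dt = delta[:, t].unsqueeze(-1)                            # (batch, d_model, 1)
        A_bar = torch.exp(dt * A)                                 # input-dependent decay
        Bx = dt * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)    # input-dependent write
        h = A_bar * h + Bx                                        # update hidden state
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))             # read out: (batch, d_model)
    return torch.stack(ys, dim=1)                                 # (batch, length, d_model)
```

In the actual kernel, this loop is realized as a work-efficient parallel scan with the state materialized only in fast SRAM, and intermediate states are recomputed during the backward pass instead of being stored.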
The paper also presents a simplified SSM architecture called Mamba, which combines the design of prior SSM architectures with the MLP block of Transformers into a single block. This architecture involves expanding the model dimension $D$ by a controllable expansion factor $E$.
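A structural sketch of such a block is given below: project up by the expansion factor $E$, apply a short causal depthwise convolution and a SiLU nonlinearity, run the selective SSM, gate multiplicatively, and project back down. The `ssm` argument is a placeholder for the selective SSM (e.g., the scan sketched above), and the specific widths and activations are simplifications rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Structural sketch of a Mamba-style block: expand, causal conv, SSM, gate, project."""
    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4, ssm=None):
        super().__init__()
        d_inner = expand * d_model                        # model dimension expanded by factor E
        self.in_proj = nn.Linear(d_model, 2 * d_inner)    # produces the SSM path and the gate path
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv, groups=d_inner, padding=d_conv - 1)
        self.ssm = ssm if ssm is not None else nn.Identity()   # placeholder for the selective SSM
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                 # x: (batch, length, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)        # each: (batch, length, d_inner)
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal depthwise conv
        u = F.silu(u)
        y = self.ssm(u)                                   # selective SSM over the sequence
        y = y * F.silu(gate)                              # multiplicative gating branch
        return self.out_proj(y)                           # project back to d_model
```

Stacking such blocks, interleaved with standard normalization and residual connections, replaces the usual attention-plus-MLP pair of a Transformer layer.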
The authors discuss the connection between the selection mechanism and classical gating mechanisms in RNNs, highlighting that RNN gating is an instance of the selection mechanism for SSMs.
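Concretely, the paper shows that in the scalar special case $N = 1$, $A = -1$, $B = 1$, with $s_\Delta = \text{Linear}(x)$ and $\tau_\Delta = \text{softplus}$, the selective SSM recurrence reduces to the familiar gated update

$$g_t = \sigma(\text{Linear}(x_t)), \qquad h_t = (1 - g_t)\, h_{t-1} + g_t\, x_t,$$

so the sigmoid gate of an RNN emerges from discretizing the input-dependent step size $\Delta$ rather than being posited separately.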
The paper includes an extensive empirical evaluation of Mamba on various tasks. On language modeling, Mamba achieves Transformer-quality performance, matching strong Transformer baselines while scaling linearly in sequence length. The authors also demonstrate Mamba's effectiveness on DNA modeling and audio generation tasks.
Additional ablations investigated the effects of the architecture and inner SSM layer. Among previous non-selective (LTI) SSMs, performance was similar. Replacing the complex-valued S4 variant with a real-valued one did not affect performance much. Replacing any of these with a selective SSM (S6) significantly improved performance. The Mamba architecture performed similarly to the H3 architecture.
The paper concludes by discussing related work, limitations, and future directions. The authors express excitement about the broad applications of selective state space models in building foundation models for different domains, particularly in emerging modalities requiring long context.