The paper introduces Mamba, a novel sequence modeling architecture based on selective state space models (SSMs) that aims to address the limitations of Transformers in terms of computational efficiency and long-range dependency modeling. The core innovation lies in incorporating an input-dependent selection mechanism into structured SSMs, enabling the model to selectively attend to or filter out information along the sequence dimension. This selectivity allows Mamba to achieve Transformer-level performance while maintaining linear scaling in sequence length.
The authors identify the inability to perform content-based reasoning as a key weakness of existing subquadratic-time architectures such as linear attention, gated convolution, recurrent models, and structured SSMs. To address this, they make the SSM parameters functions of the input, allowing the model to selectively propagate or forget information along the sequence depending on the current token. This modification precludes the use of efficient convolutions, so the authors design a hardware-aware parallel algorithm that computes the model in recurrent mode while preserving computational efficiency.
The resulting Mamba architecture integrates these selective SSMs into a simplified end-to-end neural network design, eliminating the need for attention mechanisms or even MLP blocks. This streamlined architecture offers several advantages:
- Fast inference: Mamba achieves 5x higher inference throughput than Transformers.
- Linear scaling: Computation and memory scale linearly with sequence length.
- Long context: Performance improves on real data at sequence lengths up to one million tokens.
The authors validate Mamba's effectiveness across various modalities and settings:
- Synthetic tasks: Mamba excels at tasks such as Selective Copying and Induction Heads, demonstrating its ability to extrapolate solutions well beyond its training lengths (a data-generation sketch of the Selective Copying setup follows this list). On the Selective Copying task, S6-based models achieved >97% accuracy, while baseline LTI models struggled (18.3% accuracy for S4 without gating). On the Induction Heads task, Mamba generalized perfectly to million-length sequences, 4000x longer than its training sequence length.
- Audio and genomics: Mamba outperforms prior state-of-the-art models such as SaShiMi, Hyena, and Transformers in modeling audio waveforms and DNA sequences, with improvements both in pretraining quality and in downstream metrics such as FID for speech generation.
- Language modeling: Mamba achieves Transformer-quality performance on language modeling, with the Mamba-3B model outperforming Transformers of the same size and matching Transformers twice its size in both pretraining and downstream evaluation. For example, Mamba-3B exceeds Pythia-3B by about 4 points on average on common sense reasoning benchmarks and also outperforms Pythia-7B.
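As referenced above, the following is a minimal, hypothetical generator for a Selective Copying-style dataset: a handful of content tokens are scattered at random positions among noise tokens, and the model must reproduce them in order, which content-independent (LTI) dynamics cannot do. All names and parameters are illustrative, not the paper's exact task specification:

```python
import torch

def selective_copying_batch(batch: int, seq_len: int = 256, n_tokens: int = 16,
                            vocab: int = 10, noise_id: int = 0):
    """Sketch of a Selective Copying-style task: scatter `n_tokens` content tokens
    (ids 1..vocab-1) into a noise-filled sequence; the target is the content in order."""
    x = torch.full((batch, seq_len), noise_id, dtype=torch.long)
    targets = torch.randint(1, vocab, (batch, n_tokens))
    for b in range(batch):
        positions = torch.randperm(seq_len)[:n_tokens].sort().values  # random, increasing positions
        x[b, positions] = targets[b]
    return x, targets
```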
The paper provides a detailed explanation of structured state space models (S4) and their relation to RNNs, CNNs, and classical state space models. S4 models are defined by four parameters $(\Delta, A, B, C)$, which define a sequence-to-sequence transformation. The "continuous parameters" $(\Delta, A, B)$ are transformed to "discrete parameters" $(\bar{A}, \bar{B})$ through fixed formulas given by a discretization rule. The model can then be computed either as a linear recurrence or as a global convolution.
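Concretely, with the zero-order hold discretization used in the paper, the discrete parameters and the two equivalent computation modes are

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr) \cdot \Delta B,$$

$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,$$

$$\bar{K} = \bigl(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{k}\bar{B},\; \dots\bigr), \qquad y = x * \bar{K}.$$

The recurrent form gives constant-memory, linear-time inference, while the convolutional form enables parallel training but is only available when the parameters are time-invariant, which is precisely what the selection mechanism below gives up.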
A key contribution of the paper is the introduction of a selection mechanism that addresses the limitations of LTI models. By making the SSM parameters input-dependent, the model can selectively filter out irrelevant information and remember relevant information indefinitely. This is achieved by parameterizing $\Delta$, $B$, and $C$ as functions of the input, resulting in time-varying SSMs. The authors specifically choose $s_B(x) = \text{Linear}_N(x)$, $s_C(x) = \text{Linear}_N(x)$, $s_\Delta(x) = \text{Broadcast}_D(\text{Linear}_1(x))$, and $\tau_\Delta = \text{softplus}$.
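A minimal PyTorch sketch of these projections is shown below; the module and dimension names (`d_model`, `d_state`) are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Sketch: compute input-dependent SSM parameters (Delta, B, C) from the input x."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.B_proj = nn.Linear(d_model, d_state)   # s_B(x) = Linear_N(x)
        self.C_proj = nn.Linear(d_model, d_state)   # s_C(x) = Linear_N(x)
        self.dt_proj = nn.Linear(d_model, 1)        # s_Delta(x) = Broadcast_D(Linear_1(x))

    def forward(self, x):                            # x: (batch, length, d_model)
        B = self.B_proj(x)                           # (batch, length, d_state)
        C = self.C_proj(x)                           # (batch, length, d_state)
        delta = F.softplus(self.dt_proj(x))          # tau_Delta = softplus, shape (batch, length, 1)
        delta = delta.expand(-1, -1, x.shape[-1])    # broadcast over the D channels
        return delta, B, C
```

Because $\Delta$, $B$, and $C$ now vary per time step, the SSM can no longer be expressed as a single global convolution kernel.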
To address the computational challenges posed by time-varying SSMs, the authors develop a hardware-aware algorithm called Selective Scan that computes the model recurrently using a scan instead of convolution. This algorithm leverages kernel fusion, parallel scan, and recomputation techniques to optimize performance on modern GPUs. The fused selective scan layer has the same memory requirements as an optimized Transformer implementation with FlashAttention.
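The fused GPU kernel itself is hardware-specific, but the recurrence it evaluates can be written as a plain sequential scan. The reference sketch below (illustrative shapes, a simplified Euler-style discretization of $B$, and no kernel fusion or parallel scan) shows what the selective scan computes:

```python
import torch

def selective_scan_reference(x, delta, A, B, C):
    """Sequential reference for the selective scan recurrence:
        h_t = exp(delta_t * A) * h_{t-1} + (delta_t * B_t) * x_t,   y_t = <C_t, h_t>
    x, delta: (batch, length, d_model); A: (d_model, d_state); B, C: (batch, length, d_state)."""
    batch, length, d_model = x.shape
    h = torch.zeros(batch, d_model, A.shape[-1], device=x.device, dtype=x.dtype)
    ys = []
    for t in range(length):
        dt = delta[:, t].unsqueeze(-1)                            # (batch, d_model, 1)
        A_bar = torch.exp(dt * A)                                 # input-dependent decay
        Bx = dt * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)    # input-dependent write
        h = A_bar * h + Bx                                        # update hidden state
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))             # read out: (batch, d_model)
    return torch.stack(ys, dim=1)                                 # (batch, length, d_model)
```

In the actual kernel, this loop is realized as a work-efficient parallel scan with the state materialized only in fast SRAM, and intermediate states are recomputed during the backward pass instead of being stored.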
The paper also presents a simplified SSM architecture called Mamba, which combines the design of prior SSM architectures with the MLP block of Transformers into a single block. This architecture involves expanding the model dimension $D$ by a controllable expansion factor $E$.
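A structural sketch of such a block is given below: project up by the expansion factor $E$, apply a short causal depthwise convolution and a SiLU nonlinearity, run the selective SSM, gate multiplicatively, and project back down. The `ssm` argument is a placeholder for the selective SSM (e.g., the scan sketched above), and the specific widths and activations are simplifications rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Structural sketch of a Mamba-style block: expand, causal conv, SSM, gate, project."""
    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4, ssm=None):
        super().__init__()
        d_inner = expand * d_model                        # model dimension expanded by factor E
        self.in_proj = nn.Linear(d_model, 2 * d_inner)    # produces the SSM path and the gate path
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv, groups=d_inner, padding=d_conv - 1)
        self.ssm = ssm if ssm is not None else nn.Identity()   # placeholder for the selective SSM
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                 # x: (batch, length, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)        # each: (batch, length, d_inner)
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal depthwise conv
        u = F.silu(u)
        y = self.ssm(u)                                   # selective SSM over the sequence
        y = y * F.silu(gate)                              # multiplicative gating branch
        return self.out_proj(y)                           # project back to d_model
```

Stacking such blocks, interleaved with standard normalization and residual connections, replaces the usual attention-plus-MLP pair of a Transformer layer.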
The authors discuss the connection between the selection mechanism and classical gating mechanisms in RNNs, highlighting that RNN gating is an instance of the selection mechanism for SSMs.
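Concretely, the paper shows that in the scalar special case $N = 1$, $A = -1$, $B = 1$, with $s_\Delta = \text{Linear}(x)$ and $\tau_\Delta = \text{softplus}$, the selective SSM recurrence reduces to the familiar gated update

$$g_t = \sigma(\text{Linear}(x_t)), \qquad h_t = (1 - g_t)\, h_{t-1} + g_t\, x_t,$$

so the sigmoid gate of an RNN emerges from discretizing the input-dependent step size $\Delta$ rather than being posited separately.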
The paper includes an extensive empirical evaluation of Mamba on various tasks. On language modeling, Mamba achieves Transformer-quality performance, matching strong Transformer baselines while scaling linearly in sequence length. The authors also demonstrate Mamba's effectiveness on DNA modeling and audio generation tasks.
Additional ablations investigated the effects of the architecture and inner SSM layer. Among previous non-selective (LTI) SSMs, performance was similar. Replacing the complex-valued S4 variant with a real-valued one did not affect performance much. Replacing any of these with a selective SSM (S6) significantly improved performance. The Mamba architecture performed similarly to the H3 architecture.
The paper concludes by discussing related work, limitations, and future directions. The authors express excitement about the broad applications of selective state space models in building foundation models for different domains, particularly in emerging modalities requiring long context.