
Mamba Selective State-Space Models

Updated 4 June 2026
  • The paper introduces a neural architecture that replaces quadratic-cost self-attention with a data-dependent, input-selective linear state-space module for near-linear scaling.
  • It details a five-stage design—including RMSNorm, Gated MLP, local convolution, selective SSM, and a projection with skip connection—each engineered for computational efficiency.
  • Empirical profiling and structured pruning demonstrate significant reductions in FLOPs, memory usage, and latency, making these models ideal for resource-constrained applications.

Mamba-based Selective State-Space Models (SSMs) are a class of neural architectures that replace the quadratic-cost self-attention mechanism typical of Transformers with a data-dependent, input-selective linear state-space module. These models enable near-linear scaling with sequence length while retaining the ability to model long-range dependencies and achieve state-of-the-art performance in language, vision, audio, and structured data modeling. Rigorous empirical and theoretical analysis has elucidated their computational advantages, resource utilization patterns, and pruning opportunities, establishing them as a core backbone for efficient sequence processing (Asif et al., 28 Nov 2025).

1. Architectural Anatomy of Mamba-Based Selective State-Space Models

A canonical Mamba block comprises five sequential stages, each designed for both expressivity and computational efficiency:

  1. RMSNorm: Applies per-channel normalization to the hidden activations.
  2. Gated MLP: Implements a position-wise affine transformation, projecting the normalized hidden state into four streams:

$$[z_x \,\Vert\, B_t \,\Vert\, C_t \,\Vert\, \Delta_t] = W_{\text{proj}} x_t$$

Here, $z_x\in\mathbb{R}^D$ and $B_t, C_t, \Delta_t\in\mathbb{R}^N$.

  3. Local Convolution: Captures short-range dependencies. In Mamba-1, this is implemented as a local $k$-window convolution ($O(kL^2)$), while Mamba-2 adopts a 2D scan that reduces prefill cost to $O(kL)$.
  4. Selective SSM (core long-range module): Evolves the state via a time-discretized update

$$h_t = A_t h_{t-1} + B_t x_t,\quad y_t = C_t^\top h_t$$

with input-adaptive gates:

$$A_t = \exp(\Delta_t \odot A),\qquad B_t = (\Delta_t A)^{-1}(e^{\Delta_t A} - I)\,\Delta_t B$$

$\Delta_t$ interpolates between retain-state ($\Delta_t\ll1$) and overwrite-with-input ($\Delta_t\gg1$).

  5. Final Projection & Skip Connection: Projects back to the model dimension and adds a residual path.
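
The selective SSM update in the list above can be sketched in NumPy as follows. This is a minimal illustration, not the reference kernel: it assumes a diagonal state matrix (stored as a vector `a` with negative entries), a single scalar input channel, and a per-token `B[t]` used on the right-hand side of the discretization. The function names `discretize` and `selective_scan` are hypothetical.

```python
import numpy as np

def discretize(delta, a, b):
    """Zero-order-hold discretization, assuming a diagonal state matrix.

    delta: scalar step size (> 0)
    a:     (N,) diagonal of A, assumed negative so the state decays
    b:     (N,) input vector B
    Returns (a_bar, b_bar), each of shape (N,).
    """
    da = delta * a
    a_bar = np.exp(da)
    # Diagonal case of (dA)^{-1} (exp(dA) - I) * (delta * B);
    # expm1 keeps the result accurate when delta * a is close to zero.
    b_bar = np.expm1(da) / da * (delta * b)
    return a_bar, b_bar

def selective_scan(x, delta, a, B, C):
    """Run h_t = A_t h_{t-1} + B_t x_t, y_t = C_t^T h_t sequentially.

    x: (L,) inputs, delta: (L,) step sizes, a: (N,), B: (L, N), C: (L, N)
    """
    L, N = B.shape
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        a_bar, b_bar = discretize(delta[t], a, B[t])
        h = a_bar * h + b_bar * x[t]   # state update
        y[t] = C[t] @ h                # readout
    return y
```

Under these assumptions, a small step (`delta = 1e-4`) gives `a_bar` close to 1 and `b_bar` close to 0, so the previous state is retained, while a large step (`delta = 50`) drives `a_bar` toward 0 and `b_bar` toward `-b / a`, so the state is overwritten by the current input.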

Mamba-2 further introduces state-space duality, which permits an attention-like quadratic form […] during training while maintaining a per-token recurrent cost of […] in decoding (Asif et al., 28 Nov 2025).

2. Computational Profiling and Performance Bottlenecks

Empirical analysis on NVIDIA A100 GPUs has established that the SSM component dominates resource usage across all significant sequence lengths ([…]–[…]):

  • FLOPs: At […], the SSM requires […]T (Mamba-1) and […]T (Mamba-2) FLOPs.
  • Memory: SSM state buffers consume […]GB (Mamba-1) and […]GB (Mamba-2) at […].
  • Latency Profiling: In decode (autoregressive) mode, more than 60% of runtime is attributable to the SSM recurrence. In prefill (full-sequence) mode, the bottleneck shifts from the convolution (Mamba-1) to the Gated MLP (Mamba-2), but the SSM remains a top contributor to wall-clock cost.

Asymptotic complexity:

  • SSM recurrence per token: […] (decode mode)
  • Convolution (prefill): $O(kL^2)$ in Mamba-1, $O(kL)$ in Mamba-2
  • RMSNorm & final layers: […]
  • SSM memory and I/O bandwidth: superlinear in sequence length $L$ in Mamba-1, near-linear in Mamba-2 (block-wise materialization and improved cache locality lead to a […] bandwidth gain at […]) (Asif et al., 28 Nov 2025).

3. Structured Pruning of State Channels

Channel-level activity analysis demonstrates that many SSM state dimensions exhibit persistently low gating activity ([…]), implying that they contribute negligibly for most inputs. This observation enables a practical pruning regime:

  • Profiling step: For each layer and state channel, collect the average gating activity.
  • Ranking & Selection: Retain the top […] channels with the highest activity, for pruning ratio […].
  • Channel Removal: Physically eliminate inactive rows/columns from […] and insert a lightweight bridge layer […] to preserve shape compatibility.
  • No retraining required: The algorithm can be applied post hoc without fine-tuning; channels with persistently low activity have negligible impact.
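
The profiling, ranking, and removal steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PerfMamba implementation: "gating activity" is taken here to be the mean absolute value of a per-channel signal recorded over tokens, the state matrix is assumed diagonal, and the bridge layer is not modeled. The names `select_active_channels` and `prune_state` are hypothetical.

```python
import numpy as np

def select_active_channels(activity, prune_ratio):
    """Return sorted indices of the state channels to keep.

    activity:    (T, N) recorded gating activity per token and state channel
                 (assumed definition; the source does not give the formula)
    prune_ratio: fraction of channels to drop, in [0, 1)
    """
    # Profiling step: average activity per channel over all tokens.
    mean_activity = np.abs(activity).mean(axis=0)
    n_keep = max(1, int(round(activity.shape[1] * (1.0 - prune_ratio))))
    # Ranking and selection: indices of the n_keep most active channels.
    top = np.argsort(mean_activity)[::-1][:n_keep]
    return np.sort(top)

def prune_state(a, B, C, keep):
    """Channel removal: slice the state dimension of diagonal A, B, and C.

    a: (N,), B: (L, N), C: (L, N), keep: integer index array
    """
    return a[keep], B[:, keep], C[:, keep]
```

For example, with four channels whose mean activities are roughly `[0.85, 0.015, 0.45, 0.015]` and `prune_ratio=0.5`, the sketch keeps channels 0 and 2. Because selection uses only recorded statistics, no gradient step is involved, which matches the post hoc character of the procedure.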

Empirical results:

  • Up to […] ([…] of states): […] mean accuracy loss.
  • […]: […] mean drop, with task-specific variability (ARC-Easy most sensitive).
  • […]: […] loss.
  • At […], […]: […] latency speedup and […] memory reduction (Asif et al., 28 Nov 2025).

4. Hardware Co-Design and Efficiency Analysis

Optimization guidelines established by profiling and ablation include:

  • SSM-centric optimization: Focus algorithmic (block-wise state materialization) and hardware (cache-efficient recurrence kernels) co-design efforts on the SSM module, as it dominates compute and memory.
  • Fused kernel design: Gated MLP fusion (projecting all streams $[z_x \,\Vert\, B_t \,\Vert\, C_t \,\Vert\, \Delta_t]$ in a single pass) lowers memory and kernel launch overhead.
  • Prefill bottlenecks: For long sequences or large $k$, convolution can bottleneck in Mamba-1; Mamba-2's 2D scan ($O(kL)$) is preferable.
  • Structured pruning: Enables trading FLOPs and memory against accuracy with minimal architectural modifications, focusing on […] for most deployments.

Practical trade-offs:

  • Pruning up to […] of SSM state channels incurs negligible accuracy cost, which is highly beneficial for latency- or memory-constrained scenarios.
  • Aggressive pruning ([…]) degrades accuracy nonlinearly and is only recommended for throughput-first applications (Asif et al., 28 Nov 2025).

5. Theoretical Properties and Dynamics of Selective SSMs

Recent theoretical analysis addresses both the asymptotic token dynamics within selective SSMs and learned information pathways:

  • 1D selective SSMs admit only two dynamical scenarios: convergence to zero if […], or divergence to infinity if […]; convergence is empirically found to reduce model performance.
  • Tokens in the divergent regime contribute unequally to learning, motivating differential treatment and reordering of token presentation (Vo et al., 2024).
  • Practical refinements include (i) ensuring that the input–output matrix […] is positive (or positive-definite) at initialization, and (ii) token reordering based on computed importance scores. Both improve perplexity and classification accuracy.

6. Broader Impact and Applicability

The measurement-driven findings in PerfMamba and related works have several implications:

  • Design of long-context, resource-efficient models for language, vision, structured data, and specialized applications requiring both long-range context and scalable inference.
  • Structured state pruning delivers a clean and scalable mechanism for deployment in latency- and memory-sensitive inference environments, without significant architectural overhaul or retraining.
  • Mamba-based selective SSMs lay a foundation for simultaneous advances in both algorithmic expressiveness and systems-level performance, challenging the quadratic scaling and resource footprint of Transformer architectures.

References:

  • PerfMamba: "PerfMamba: Performance Analysis and Pruning of Selective State Space Models" (Asif et al., 28 Nov 2025)
  • Demystifying Token Dynamics: "Demystifying the Token Dynamics of Deep Selective State Space Models" (Vo et al., 2024)
