Selective State Space Models (sSSMs)

Updated 26 June 2025

Selective State Space Models (sSSMs) are a class of neural sequence architectures in which the parameters governing state transitions and input–state mappings are determined dynamically by the input itself. This selective, input-dependent mechanism lets the model adaptively modulate information flow, providing the expressive power and computational efficiency needed to capture complex dependencies in data ranging from one-dimensional sequences (text, audio) to high-dimensional signals (images, videos, spatio-temporal graphs, and scientific tensors). Recent architectural advances, most notably Mamba and its multi-dimensional extension Mamba-ND, demonstrate that selective state space modeling can achieve state-of-the-art performance on a wide variety of real-world tasks while maintaining linear computational and memory complexity with respect to sequence or data size.

1. Mathematical Structure and Selective Parameterization

Selective State Space Models (sSSMs) evolve a hidden state $h(t)$ according to input-dependent state space equations:

$$
\begin{aligned}
h'(t) &= A(x(t))\,h(t) + B(x(t))\,x(t) \\
y(t) &= C(x(t))\,h(t) + D(x(t))\,x(t)
\end{aligned}
$$

where $A$, $B$, $C$, $D$, and the discretization step $\Delta$ may all be functions of the current input $x(t)$, in contrast to classical linear time-invariant (LTI) SSMs with fixed parameters.

In practice, for discrete time with a zero-order hold, the update is formulated (omitting $D$ for simplicity) as:

$$
\begin{aligned}
\bar{A}_t &= \exp(\Delta_t A_t) \\
\bar{B}_t &= (\Delta_t A_t)^{-1}\bigl(\exp(\Delta_t A_t) - I\bigr)\,\Delta_t B_t \\
h_t &= \bar{A}_t h_{t-1} + \bar{B}_t x_t \\
y_t &= C_t h_t
\end{aligned}
$$

The selectivity arises because $A_t$, $B_t$, $C_t$, and $\Delta_t$ are computed via learnable mappings from $x_t$, allowing sequence- and context-adaptive gating and information routing.
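The following minimal NumPy sketch makes this recurrence concrete for a single input channel. It assumes a diagonal $A$ shared across time steps and takes the per-step $\Delta_t$, $B_t$, $C_t$ as given (in a real model these are produced from $x_t$ by learned projections); it is a sequential reference, not the hardware-aware parallel scan used in Mamba implementations.

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Sequential reference for the discretized sSSM recurrence.

    x:     (T,)    input sequence (single channel)
    delta: (T,)    input-dependent step sizes Delta_t
    A:     (N,)    diagonal state matrix (shared across steps here)
    B, C:  (T, N)  input-dependent input/readout projections
    Returns y: (T,) output sequence.
    """
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        A_bar = np.exp(delta[t] * A)          # zero-order-hold discretization
        B_bar = (A_bar - 1.0) / A * B[t]      # simplified ZOH form for diagonal A
        h = A_bar * h + B_bar * x[t]          # state update
        y[t] = C[t] @ h                       # readout
    return y

# Toy usage with random (hypothetical) input-dependent parameters
T, N = 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(T)
delta = np.exp(rng.standard_normal(T) * 0.1)   # positive step sizes
A = -np.abs(rng.standard_normal(N))            # stable (negative) diagonal A
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
print(selective_scan(x, delta, A, B, C).shape)  # (16,)
```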

2. Multi-Dimensional Data Processing and Alternating Scan Orders

A principal challenge in extending sSSMs to multi-dimensional data lies in adapting inherently sequential one-dimensional operations to N-dimensional arrays, such as images (2D), videos (3D), or high-dimensional scientific datasets.

Mamba-ND introduces a solution by systematically interleaving 1D sSSM (Mamba) layers along different data axes within the overall architecture. For example, in 2D data, the method alternates between flattening along rows (row-major order) and columns (column-major order) at successive layers. In 3D, cycling through various scan orderings such as height, width, and depth—each in forward or reverse direction—enables the sequential 1D sSSM operator to propagate information across every spatial and temporal axis.

For an $N$-dimensional tensor $X$ with axes $(k_1, \dots, k_N)$, sSSM layers process sequences obtained by flattening $X$ along permuted and optionally reversed axis orders, i.e. scan orders of the form $s = (k_1, k_2, \dots, k_N)^{\pm}$. After each layer, the data is reshaped back to its original dimensions, ensuring that, with sufficient depth, all tokens (pixels, voxels) receive globally mixed context through repeated, alternating one-dimensional scans.
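As a rough illustration of this pattern (a sketch, not the Mamba-ND implementation), the snippet below alternates row-major and column-major flattening of a 2D feature map across layers, reverses the scan direction on every second pass over an axis, applies a placeholder 1D sequence operator, and reshapes back to the grid; `seq_layer` is a stand-in for a 1D selective SSM (Mamba) block.

```python
import numpy as np

def seq_layer(tokens):
    """Placeholder for a 1D selective SSM block: a causal cumulative mean."""
    return np.cumsum(tokens, axis=0) / np.arange(1, tokens.shape[0] + 1)[:, None]

def alternating_scan_2d(x, num_layers=4):
    """Alternate scan orders across layers for a 2D feature map.

    x: (H, W, D) array of D-dimensional tokens on an H x W grid.
    Even layers scan in row-major order, odd layers in column-major order;
    the scan direction is reversed on every second pass over the same axis.
    """
    H, W, D = x.shape
    for layer in range(num_layers):
        if layer % 2 == 0:                         # row-major flattening
            seq = x.reshape(H * W, D)
        else:                                      # column-major flattening
            seq = x.transpose(1, 0, 2).reshape(W * H, D)
        reverse = (layer // 2) % 2 == 1
        if reverse:
            seq = seq[::-1]
        seq = seq_layer(seq)
        if reverse:
            seq = seq[::-1]
        if layer % 2 == 0:                         # undo the flattening
            x = seq.reshape(H, W, D)
        else:
            x = seq.reshape(W, H, D).transpose(1, 0, 2)
    return x

print(alternating_scan_2d(np.random.rand(8, 8, 16)).shape)  # (8, 8, 16)
```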

3. Comparison with Bi-directional and Parallel SSM Designs

Selective state space approaches are contrasted with several alternative architectures:

  • Bi-directional LSTM/SSM: These apply forward and reverse passes along each axis. In high dimensions, however, the effective information distance between nearby patches can become large, and the increase in compute and memory overhead is substantial. In practice, simply adding bi-directionality at each layer does not outperform the combine-and-alternate strategy of Mamba-ND.
  • S4ND (multi-dimensional SSM): Extensions of LTI SSMs to N dimensions admit highly parallelized global convolutions. However, these are not applicable to input-selective, linear time-varying (LTV) SSMs like Mamba, as selectivity precludes efficient kernel unrolling (see the kernel form sketched below).
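Worked out from the discrete update in Section 1, the contrast can be made explicit: with fixed $\bar{A}$, $\bar{B}$, $C$, the recurrence unrolls into a single convolution whose kernel can be precomputed, which is what S4-style models parallelize; with input-dependent parameters, each term involves step-specific matrices, so no single precomputable kernel exists.

$$
y_t^{\text{LTI}} = \sum_{k=0}^{t} C\,\bar{A}^{k}\bar{B}\,x_{t-k} = (K * x)_t,
\qquad K = \bigl(C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^{2}\bar{B},\; \dots\bigr)
$$

$$
y_t^{\text{selective}} = \sum_{k=0}^{t} C_t \Bigl(\textstyle\prod_{j=t-k+1}^{t} \bar{A}_j\Bigr)\bar{B}_{t-k}\,x_{t-k}
$$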

Notably, Mamba-ND achieves high performance with a single scan per layer, requiring only alternation of the scan order across layers, which keeps memory usage and compute low relative to multi-scan and bi-directional designs.

4. Empirical Performance across Domains

Mamba-ND's selective state space modeling yields highly competitive results versus Transformer and SSM variants on a range of tasks:

| Task | Model | Params | Metric | Performance |
|------|-------|--------|--------|-------------|
| ImageNet-1K Classification | Mamba-2D-S | 24M | Top-1 Acc (%) | 81.7 |
| ImageNet-1K Classification | ViT-B | 86M | Top-1 Acc (%) | 77.9 |
| Video (HMDB-51/UCF-101) | Mamba-3D | 36M | HMDB-51 (%) | 60.9 |
| Video (HMDB-51/UCF-101) | Video Swin-T | 30M | HMDB-51 (%) | 53.0 |
| ERA5 Weather Forecasting | Mamba-3D | 50M | ACC | 90.1 |
| 3D Medical Segmentation | Mamba-3D-UNETR | 107M | Dice | 84.7 |

These results indicate substantial parameter efficiency: Mamba-ND variants match or exceed strong Transformer baselines with a fraction of the parameter count while maintaining or improving predictive accuracy across diverse, high-dimensional benchmarks.

5. Selective State Space Modeling: Mechanisms and Benefits

Selective state space modeling leverages content-dependent parameterization to realize:

  • Content-aware, adaptive memory: Parameters can selectively incorporate, ignore, or amplify inputs based on their local content, a critical feature for tasks with variable or non-stationary information flow (a minimal sketch of such an input-dependent parameterization is given after this list).
  • Global Receptive Field via Depth and Alternation: Alternating scan orders allow local operations to achieve full spatial/temporal context over sufficient depth, enabling models to capture long-range and complex dependencies in a parameter-efficient manner.
  • Linear Complexity: The approach retains the key advantage of linear compute and memory scaling with sequence size, enabling practical training and inference on high-resolution and long-context data.
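The sketch below illustrates one plausible way the content-dependent parameterization can be realized, with $\Delta$, $B$, and $C$ produced by linear projections of the input tokens; the dimension names and initialization are illustrative assumptions, not the exact Mamba layer.

```python
import numpy as np

def softplus(z):
    """Ensures positive step sizes."""
    return np.log1p(np.exp(z))

class SelectiveParams:
    """Minimal sketch of input-dependent parameterization: Delta, B, C from x.

    d_model: token dimension, d_state: SSM state size. Weights are random here;
    in a trained model they are learned.
    """
    def __init__(self, d_model, d_state, seed=0):
        rng = np.random.default_rng(seed)
        self.W_delta = rng.standard_normal((d_model, 1)) * 0.1
        self.W_B = rng.standard_normal((d_model, d_state)) * 0.1
        self.W_C = rng.standard_normal((d_model, d_state)) * 0.1

    def __call__(self, x):
        """x: (T, d_model) -> delta (T,), B (T, d_state), C (T, d_state)."""
        delta = softplus(x @ self.W_delta).squeeze(-1)  # content-dependent step size
        B = x @ self.W_B                                # content-dependent input projection
        C = x @ self.W_C                                # content-dependent readout
        return delta, B, C

delta, B, C = SelectiveParams(d_model=8, d_state=4)(np.ones((16, 8)))
print(delta.shape, B.shape, C.shape)  # (16,) (16, 4) (16, 4)
```

These per-step parameters are what a selective scan (as sketched in Section 1) consumes. A useful intuition from the Mamba line of work is that $\Delta_t$ acts as a content-dependent gate: a small step size lets the state effectively ignore the current token, while a large one resets the state toward it.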

6. Future Research Directions

Several open questions and research directions are emphasized for selective state space modeling in multi-dimensional data:

  • Scan Ordering Optimization: Beyond row/column orderings (and their N-dimensional generalizations), exploring alternative orderings (e.g., diagonal, zig-zag) may yield improved mixing and faster contextualization.
  • Efficient Factorizations: Reducing the intermediate memory required by sequential factorizations, for example by fusing sub-sequence processing, offers room for further throughput improvements.
  • Task- and Domain-Adaptive Variations: While Mamba-ND is generic, adapting selective state space structures for specific modalities (e.g., motion in video, physical priors in scientific data) may close remaining gaps to specialized Transformer architectures.
  • Hardware Acceleration: As SSMs have unique dataflow patterns (e.g., sequential scan), further aligning these operators with hardware kernels may unlock additional efficiency.

7. Relation to Broader Selective SSM Literature

The selective state space paradigm extends beyond single-modality sequence modeling and encompasses vision, multimodal, graph, and scientific learning problems. Mamba-based designs, with their principles of alternating scans and selective gating, represent a general, domain-agnostic pattern for scaling efficient, content-driven modeling to complex, high-dimensional real-world data. The architectural strategy is flexible and computationally scalable, and variants or hybrids with convolutional or attention models remain an area of active research, particularly as cross-domain adoption increases (e.g., in video models, medical imaging, and multi-modal learning).


Selective State Space Models, as operationalized by Mamba-ND and its family, offer a scalable, adaptive, and domain-agnostic approach to multi-dimensional sequence modeling, delivering empirical advantages over traditional attention- and convolution-based approaches while aligning with hardware, theoretical, and application-centric requirements.