S6: Selective Scan Sequence Model
- S6 is a neural sequence model that generalizes classical state space models by incorporating input-conditioned dynamic transitions for selective state evolution.
- The architecture combines hardware-efficient parallel scans with dynamic gating to aggregate both long-range and local context across modalities.
- Empirical results demonstrate that S6-based models achieve competitive accuracy with reduced memory usage in language, time series, and vision applications.
The Selective Scan State Space Sequence Model (S6) is a neural sequence modeling architecture that generalizes classical state space models (SSMs) by incorporating content-dependent, dynamically parameterized transitions and input gates, combined with efficient scan-based context aggregation. S6 serves as the mathematical and algorithmic core for a growing family of high-performance models (notably Mamba and its derivatives) spanning language, time series, vision, point cloud, and multi-modal domains. The distinctive hallmark of S6 is its combination of input-conditioned dynamic state evolution, hardware-efficient parallel scan implementation, and linear scaling in both sequence and feature dimensions, supporting both long-range and local dependency modeling.
1. Mathematical Foundations of S6
At its base, S6 extends the linear time-invariant (LTI) SSM $h'(t) = A\, h(t) + B\, x(t)$, $y(t) = C\, h(t)$, with hidden state $h(t) \in \mathbb{R}^N$ and input $x(t) \in \mathbb{R}$ (Gu et al., 2023).
Discrete time evolution—using a step size $\Delta$—yields: $\begin{aligned} h_t &= \overline{A}\, h_{t-1} + \overline{B}\, x_t \\ y_t &= C\, h_t \end{aligned}$ where $\overline{A} = \exp(\Delta A)$ and $\overline{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\, \Delta B$ (zero-order hold).
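For a diagonal $A$ the matrix exponential reduces to an elementwise one, so the discretization above can be sketched in a few lines of NumPy (the function name and the diagonal restriction are illustrative, not from the source):

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    A, B : (N,) arrays -- the diagonal of A and the input vector B
    delta: scalar step size
    Returns (A_bar, B_bar) with
        A_bar = exp(delta * A)
        B_bar = (delta * A)^{-1} (exp(delta * A) - I) * delta * B
    computed elementwise, since A is diagonal.
    """
    dA = delta * A
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar
```

As `delta` shrinks, `A_bar` approaches `1 + delta * A` and `B_bar` approaches `delta * B`, recovering the Euler discretization.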
The unique S6 innovation is to make these parameters data-dependent: $B_t = \mathrm{Linear}_B(x_t)$, $C_t = \mathrm{Linear}_C(x_t)$, and $\Delta_t = \mathrm{softplus}(\mathrm{Linear}_\Delta(x_t))$, so that the discretized $\overline{A}_t = \exp(\Delta_t A)$ and $\overline{B}_t$ vary per token.
The hidden state is then updated as $h_t = \overline{A}_t\, h_{t-1} + \overline{B}_t\, x_t$ with output $y_t = C_t\, h_t$. The selection mechanism allows dynamic gating of how much of the past state is carried forward versus how much new information is injected, enabling selective propagation or erasure of information (Gu et al., 2023, Shi, 2024).
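A sequential reference implementation of this selective recurrence might look as follows. This is a NumPy sketch: the projection matrices `W_B`, `W_C`, `W_delta` and the Euler-style `delta * B` approximation of $\overline{B}_t$ are illustrative simplifications in the spirit of Mamba, not the source's exact parameterization.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def s6_scan(x, A, W_B, W_C, W_delta):
    """Sequential reference for the selective (S6) recurrence.

    x       : (L, D) input sequence
    A       : (D, N) continuous-time state matrix (diagonal per channel)
    W_B, W_C: (N, D) projections producing per-token B_t, C_t
    W_delta : (D, D) projection producing per-token step sizes
    Returns y: (L, D) output sequence.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))          # one length-N state per channel
    y = np.zeros((L, D))
    for t in range(L):
        xt = x[t]                               # (D,)
        B_t = W_B @ xt                          # input-dependent B
        C_t = W_C @ xt                          # input-dependent C
        delta_t = softplus(W_delta @ xt)        # input-dependent step (D,)
        A_bar = np.exp(delta_t[:, None] * A)    # per-token decay (D, N)
        B_bar = delta_t[:, None] * B_t[None, :] # Euler approx of ZOH B_bar
        h = A_bar * h + B_bar * xt[:, None]     # selective state update
        y[t] = h @ C_t                          # per-token readout
    return y
```

In practice this loop is replaced by the hardware-parallel scan described in Section 3; the sequential form is useful as a correctness oracle.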
2. Content-Dependent Selection Mechanisms
Whereas classical SSMs or recurrent models use fixed transition dynamics, S6 introduces input-conditioned parameter generation via shallow MLPs or linear projections. Specifically, at each time step, the input (sometimes augmented by a local context window) is fed into small neural nets to yield per-token parameters $(\Delta_t, B_t, C_t)$ (Shi, 2024, Shi, 2024).
This mechanism allows S6 to adaptively "select" which latent subspaces are updated, reset, or ignored—crucial for non-stationary, structured, or multimodal data. In simplified cases (diagonal $A$, scalar state with $N = 1$, e.g. $A = -1$, $B = 1$), S6 reduces to a class of gated RNNs: $h_t = (1 - g_t)\, h_{t-1} + g_t\, x_t$ with $g_t = \sigma(\mathrm{Linear}(x_t))$ a function of the current input [(Gu et al., 2023), Appendix C].
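This reduction can be checked numerically: with $A = -1$, $B = 1$ and zero-order-hold discretization, the injection weight $1 - \exp(-\mathrm{softplus}(z))$ equals $\sigma(z)$ exactly, so the S6 step and the gated-RNN step coincide (small self-contained check, assuming the ZOH form given above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

# With N = 1, A = -1, B = 1 the ZOH-discretized S6 step is
#   h_t = exp(-delta_t) h_{t-1} + (1 - exp(-delta_t)) x_t,
# which matches h_t = (1 - g_t) h_{t-1} + g_t x_t with
#   g_t = 1 - exp(-softplus(z_t)) = sigmoid(z_t).
z = np.array([-2.0, 0.0, 1.5])
delta = softplus(z)
g_from_ssm = 1.0 - np.exp(-delta)
g_gate = sigmoid(z)
assert np.allclose(g_from_ssm, g_gate)
```

The identity follows from $1 - e^{-\log(1+e^z)} = 1 - \tfrac{1}{1+e^z} = \sigma(z)$.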
For sequence tasks with non-linearly evolving regimes (e.g., distinct phases in battery cycling or financial markets), the S6 gating adapts the memory timescale and selection mask dynamically (Shi, 2024, Shi, 2024).
3. Scan Modules and Efficient Sequence Processing
The S6 framework universally combines the selective state evolution step with a scan (windowed context) module. After the core recurrence, a learnable operation (e.g., sliding 1D convolution, local MLP, or simple recurrence) traverses the sequence, aggregating local context in a hardware-parallelizable fashion (Shi, 2024, Shi, 2024).
This scan augments the modeling of short-term patterns and local structure that pure long-range state propagation may miss. S6 implementations leverage the associative (Blelloch) scan for parallel computation, yielding linear scaling in sequence length and state size, which is critical for practical deployment at scale (Gu et al., 2023).
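The key enabler is that the first-order recurrence $h_t = a_t h_{t-1} + b_t$ is associative under pair composition: applying $(a_1, b_1)$ then $(a_2, b_2)$ is equivalent to the single step $(a_2 a_1,\ a_2 b_1 + b_2)$, so prefixes can be combined in a tree of depth $O(\log L)$. A pure-Python sketch of such a scan (illustrative recursion, not the hardware-aware kernel):

```python
def combine(e1, e2):
    """Compose two recurrence steps (a, b) of h_t = a*h_{t-1} + b."""
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

def parallel_scan(elems):
    """Inclusive associative scan over a list of (a, b) pairs.

    With h_{-1} = 0, the b-component of the i-th result equals h_i.
    Recursion depth is O(log L); each level does O(L) combines,
    which is what a parallel backend executes concurrently.
    """
    n = len(elems)
    if n == 1:
        return list(elems)
    # combine adjacent pairs, scan the half-length sequence, expand back
    paired = [combine(elems[2 * i], elems[2 * i + 1]) for i in range(n // 2)]
    scanned = parallel_scan(paired)
    out = []
    for i in range(n):
        if i == 0:
            out.append(elems[0])
        elif i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(combine(scanned[i // 2 - 1], elems[i]))
    return out
```

In S6, each token contributes the pair $(\overline{A}_t, \overline{B}_t x_t)$, so the scan recovers all hidden states $h_t$ in parallel.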
The generic S6 block, expressed as a computation graph, typically composes an input linear projection, a short depthwise convolution for local mixing, the selective recurrence above, a multiplicative gate from a parallel branch, and an output linear projection (the arrangement used in Mamba).
4. Empirical Performance and Application Domains
S6-derived models deliver state-of-the-art or highly competitive accuracy in a spectrum of domains. Examples include:
- Language modeling: S6/Mamba matches or surpasses much larger Transformer baselines on The Pile and other long-context datasets, and exhibits robust extrapolation ((Gu et al., 2023), Table 1).
- Time-series prediction: S6 architectures outperform Kalman Filter, ARIMA, LSTM, and Transformer models on stock forecasting and battery state-of-health estimation tasks (Shi, 2024, Shi, 2024).
- Vision tasks: Extensions of S6 to two-dimensional and multi-dimensional scan modules underpin high-performance image, video, and point cloud architectures (Ji, 2024, Zhang et al., 2024, Xiao et al., 2024, Qu et al., 11 Nov 2025, Qu et al., 26 Jul 2025).
Representative results (selected from (Shi, 2024, Shi, 2024)):
| Task | S6 Model | Baseline | Metric: S6 vs. baseline |
|---|---|---|---|
| Battery RUL | MambaLithium | PINN | RMSE: 39.7 vs. 45.9 |
| Stock prediction | MambaStock | LSTM/Transformer/etc. | RMSE: 0.0450 vs. 0.0454; R²: 0.9434 vs. 0.9383 |
S6 also shows strictly linear complexity in theory and hardware practice (Gu et al., 2023):
| Model | Time Complexity | Memory Complexity |
|---|---|---|
| S6 | $O(L)$ | $O(L)$ |
| Transformer | $O(L^2)$ | $O(L^2)$ |
5. Information-Theoretic Properties and Convergence Guarantees
Recent theoretical work provides a rigorous foundation for S6 memory compression and stability (Bhat, 2024). Key insights include:
- Memory compression: The gating mechanism can be viewed as a form of input-dependent sparsification, leading to lower effective state dimensionality and reduced memory usage.
- Rate-distortion tradeoff: Selective gating introduces an information bottleneck whose strength can be balanced via constraints on mutual information and formal rate-distortion bounds.
- Contraction and stability: Provided the gating is Lipschitz and the state transition matrix norm is bounded, S6 guarantees mean-square convergence to a unique stationary distribution.
- Empirical resource gains: Selective SSMs show 1–3% higher accuracy and up to 40–50% lower memory usage compared to LSTMs and GRUs in sequence modeling tasks.
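The contraction claim can be illustrated with a toy experiment: when every per-step gain $|\overline{A}_t|$ is bounded below 1, two trajectories driven by the same inputs but started from different initial states collapse onto each other at a geometric rate (hypothetical numbers, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 200
# per-step gains all bounded by rho = 0.9 in magnitude
a = 0.9 * rng.uniform(0.5, 1.0, L)
b = rng.normal(size=L)

def run(h0):
    """Roll the scalar recurrence h_t = a_t * h_{t-1} + b_t from h0."""
    h = h0
    for t in range(L):
        h = a[t] * h + b[t]
    return h

# the gap between trajectories shrinks like prod(a) <= rho^L
gap = abs(run(10.0) - run(-10.0))
assert gap < 1e-8
```

The exact gap is $\prod_t a_t \cdot |h_0^{(1)} - h_0^{(2)}|$, which is what the Lipschitz/bounded-norm condition in the bullet above controls.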
6. Extensions, Specializations, and Research Directions
S6 provides the architectural scaffold for a number of specialized models and research directions:
- Multi-modal and spatial SSMs: S6 has been enriched to support multidirectional recurrence (e.g., Octopus (Mahatha et al., 31 Jan 2026), Spatial-Mamba (Xiao et al., 2024)), bidirectional and multi-head variants (HydraMamba (Qu et al., 26 Jul 2025), MHS-VM (Ji, 2024)), and content-aware scan patterns for images, videos, and point clouds.
- Grouped and parameter-shared SSMs: Parameter grouping (GS6) enables better generalization and prevents overfitting in high-dimensional domains (CloudMamba (Qu et al., 11 Nov 2025)).
- State-feedback enhancements: Extensions such as COFFEE replace input-based gating with closed-loop state feedback to further improve context selectivity and parameter efficiency (Zattra et al., 15 Oct 2025).
- Layerwise aggregation: S6LA employs selective state space dynamics for adaptive aggregation across CNN and transformer layers, boosting feature flow in deep vision architectures (Liu et al., 12 Feb 2025).
- Memory-constrained and information-theoretic control: Future work is exploring sparsity-inducing regularization, adaptive depth/scan schedules, and information bottleneck formulations (Bhat, 2024).
7. Limitations, Open Problems, and Future Work
Key challenges and limitations identified in current S6 research are:
- Precise design of the selector (gating) networks, including trade-offs between expressivity, stability, and computational cost (Shi, 2024, Bhat, 2024).
- Optimal parameter sharing, grouping, or localization for spatial/multimodal tasks, to control overfitting and avoid redundant computation (Qu et al., 11 Nov 2025).
- Integration with self-supervised or masked modeling strategies for efficient pretraining in high-dimensional signal domains (Qu et al., 11 Nov 2025).
- Further scaling analysis and architectural regularization for deployment in very deep or multimodal architectures, and for memory constrained inference (Liu et al., 12 Feb 2025).
- Comprehensive ablation studies on the interaction between selection, scan, and backbone architecture remain underexplored due to limited reporting in current literature (Shi, 2024).
The S6 framework, as formalized in foundational works (Gu et al., 2023, Shi, 2024, Bhat, 2024), marks a significant convergence of control-theoretic models, neural sequence modeling, and efficient architectural design. The exploration of content-selective dynamics via hardware-aware scan operations continues to drive rapid progress in both theoretical understanding and empirical capability across the sequence modeling landscape.