
Simplified Structured State Space Models (S5)

Updated 5 February 2026
  • S5 is a state-space layer that models long-range dependencies using a simplified MIMO diagonalizable system, unifying and streamlining previous SSMs like S4.
  • It achieves computational efficiency and parallelization by leveraging eigenvalue diagonalization and an associative scan, reducing both parameters and runtime complexity.
  • Empirical results on benchmarks such as LRA and Speech Commands confirm S5’s strong performance; the PTD initialization variant further improves robustness to frequency perturbations.

Simplified Structured State Space Sequence Models (S5) are a class of state-space layers designed for long-range sequence modeling in deep learning architectures. S5 generalizes and simplifies the structure of previous Structured State Space Models (SSMs) such as S4, opting for a single multi-input, multi-output (MIMO) diagonalizable state-space system. This shift combines computational efficiency, parallelization capabilities, and strong empirical performance on tasks requiring long-sequence memory. S5 also introduces new theoretical considerations in model initialization and robustness to frequency-domain perturbations.

1. Mathematical Formulation of S5

S5 is defined by a continuous-time linear time-invariant MIMO SSM with input $u(t) \in \mathbb{R}^H$, latent state $x(t) \in \mathbb{R}^P$, and output $y(t) \in \mathbb{R}^H$, governed by

$$\frac{d\,x(t)}{dt} = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t),$$

where:

  • $A \in \mathbb{R}^{P \times P}$ is the state-transition matrix,
  • $B \in \mathbb{R}^{P \times H}$ maps inputs to the state,
  • $C \in \mathbb{R}^{H \times P}$ maps the state to outputs,
  • $D \in \mathbb{R}^{H \times H}$ is a feed-through term (often diagonal).

Diagonalization of $A$ is performed via the eigendecomposition $A = V \Lambda V^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_P)$. The system is recast in the eigenbasis and discretized for sequence data using a zero-order hold:

$$\bar\Lambda = \exp(\Lambda \Delta), \quad \bar B = \Lambda^{-1} (\bar\Lambda - I) \tilde B, \quad \bar C = \tilde C, \quad \bar D = D,$$

yielding the recurrence

$$x_k = \bar\Lambda x_{k-1} + \bar B u_k, \qquad y_k = \bar C x_k + \bar D u_k,$$

where all operations involving $\bar\Lambda$ exploit its diagonality for parallelism.
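The zero-order-hold discretization and the resulting recurrence can be sketched in a few lines of NumPy (an illustrative toy with made-up sizes and random parameters, not the trained quantities; the reference implementation uses JAX):

```python
import numpy as np

# Toy diagonalized SSM: all sizes and parameters here are illustrative.
P, H, L = 4, 2, 16                       # state size, channels, sequence length
rng = np.random.default_rng(0)

Lambda = -np.abs(rng.normal(size=P)) + 1j * rng.normal(size=P)  # stable eigenvalues
B_tilde = rng.normal(size=(P, H)).astype(complex)
C_tilde = rng.normal(size=(H, P)).astype(complex)
D = np.diag(rng.normal(size=H))
delta = 0.1                               # step size (scalar here for simplicity)

# Zero-order-hold discretization, elementwise thanks to diagonality:
Lambda_bar = np.exp(Lambda * delta)                        # exp(Lambda * Delta)
B_bar = ((Lambda_bar - 1.0) / Lambda)[:, None] * B_tilde   # Lambda^{-1}(exp(Lambda Delta) - I) B~

# Sequential form of the recurrence x_k = Lambda_bar x_{k-1} + B_bar u_k:
u = rng.normal(size=(L, H))
x = np.zeros(P, dtype=complex)
ys = []
for k in range(L):
    x = Lambda_bar * x + B_bar @ u[k]
    ys.append((C_tilde @ x + D @ u[k]).real)               # y_k = C~ x_k + D u_k
y = np.stack(ys)                                           # shape (L, H)
```

In practice the loop is replaced by the parallel scan described in Section 4; the loop form above is only the reference semantics.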

2. Architectural Simplification and Relationship to S4

S4 layers use $H$ independent single-input, single-output (SISO) SSMs, equivalently a block-diagonal SSM of size $HN$. Each S4 SSM processes one channel separately, followed by a mixing layer. S5 replaces this structure with a single dense $P$-dimensional MIMO SSM with $H$-dimensional inputs and outputs, typically with $P \ll HN$. This consolidation of the dynamics reduces both parameter count and computational cost, with channels communicating intrinsically through $B$, $C$, and $D$.

Theoretical analysis under certain assumptions (a shared $A$ and $\Delta$ across the S4 SSMs, with the S5 $B$ assembled from the constituent S4 input vectors) shows that the S5 state equals the sum of the corresponding S4 states, with outputs projected equivalently by $C$ (Smith et al., 2022). Thus, S5 can be viewed as a reparameterization and consolidation of the block-diagonal S4 with a different readout.
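The sum relationship can be checked numerically with a small toy example (all sizes and values are made up; it assumes the shared-$A$, shared-$\Delta$ setting described above):

```python
import numpy as np

# H SISO recurrences (S4-style, shared diagonal transition) vs. one MIMO
# recurrence (S5-style) whose B stacks the SISO input vectors as columns.
N, H, L = 3, 2, 10
rng = np.random.default_rng(1)
Lam = np.exp(-np.abs(rng.normal(size=N)))    # shared discrete diagonal transition
b = rng.normal(size=(N, H))                  # column h: input vector of SISO system h
u = rng.normal(size=(L, H))

x_siso = np.zeros((H, N))                    # one N-dim state per channel
x_mimo = np.zeros(N)                         # single consolidated state
for k in range(L):
    for h in range(H):
        x_siso[h] = Lam * x_siso[h] + b[:, h] * u[k, h]
    x_mimo = Lam * x_mimo + b @ u[k]

# The S5 state is the sum of the corresponding S4 states:
assert np.allclose(x_mimo, x_siso.sum(axis=0))
```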

3. Initialization, Parameterization, and the HiPPO Framework

The success of S4 is attributable in part to its initialization with the HiPPO–LegS matrix, an optimal online projection operator that is not stably diagonalizable but admits a normal-plus-low-rank (NPLR) decomposition. S5, following S4D and DSS, initializes $A$ with the normal (diagonalizable) component, yielding a diagonally structured system.

Initialization proceeds with:

  • $A \gets A_{\text{HiPPO-N}}$ (the normal part of HiPPO–LegS),
  • diagonalization $A = V \Lambda V^{-1}$,
  • $B$ and $C$ mapped into the eigenbasis.

S5 enforces conjugate symmetry (so that the underlying $A$ is real), learning only $P/2$ complex conjugate eigenpairs. Block-diagonal variants with $J$ HiPPO–N blocks of size $R$ (with $P = JR$) further increase eigenvalue diversity.
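The conjugate-symmetry trick can be illustrated with a toy recurrence (random values and our own variable names): storing one eigenvalue of each conjugate pair and doubling the real part of its output contribution reproduces the full real system.

```python
import numpy as np

rng = np.random.default_rng(2)
Pk = 2                                   # stored conjugate pairs (full state size 2*Pk)
dt = 0.1
lam = -0.1 + 1j * rng.normal(size=Pk)    # one eigenvalue per conjugate pair
b = rng.normal(size=Pk) + 1j * rng.normal(size=Pk)
c = rng.normal(size=Pk) + 1j * rng.normal(size=Pk)

# Full system: every stored mode plus its conjugate partner.
lam_f = np.concatenate([lam, lam.conj()])
b_f = np.concatenate([b, b.conj()])
c_f = np.concatenate([c, c.conj()])

u = rng.normal(size=8)
x_half = np.zeros(Pk, dtype=complex)
x_full = np.zeros(2 * Pk, dtype=complex)
y_half, y_full = [], []
for uk in u:
    x_half = np.exp(lam * dt) * x_half + b * uk
    x_full = np.exp(lam_f * dt) * x_full + b_f * uk
    y_half.append(2 * (c @ x_half).real)   # c x + conj(c x) = 2 Re(c x)
    y_full.append((c_f @ x_full).real)

assert np.allclose(y_half, y_full)         # half the storage, same real output
```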

Each state variable's decay timescale is parameterized independently via a learned vector $\Delta \in \mathbb{R}^P_{>0}$, increasing flexibility and aiding optimization.

4. Computational Efficiency and Parallelization

S5 leverages the diagonalization of the state matrix for efficient parallel computation. The hidden state sequence $(x_1, \ldots, x_L)$ can be computed via a parallel prefix sum (scan) in $\mathcal{O}(PL)$ work and $\mathcal{O}(\log L)$ depth given $L$ processors. The scan's binary operator,

$$(A_i, B_i u_i) \bullet (A_j, B_j u_j) = (A_j A_i,\; A_j B_i u_i + B_j u_j),$$

is associative, enabling this efficient computation.
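A minimal NumPy sketch of this operator (mirroring what jax.lax.associative_scan would combine; the names here are ours) verifies associativity and shows that folding the elements reproduces the sequential recurrence:

```python
import numpy as np
from functools import reduce

# Each scan element is the pair (a_k, bu_k) = (Lambda_bar, B_bar u_k), with
# Lambda_bar stored elementwise since it is diagonal.
rng = np.random.default_rng(3)
P, L = 4, 8
Lam_bar = np.exp(-np.abs(rng.normal(size=P)))
Bu = rng.normal(size=(L, P))             # precomputed B_bar u_k terms

def combine(e_i, e_j):
    a_i, s_i = e_i
    a_j, s_j = e_j
    return (a_j * a_i, a_j * s_i + s_j)  # (A_j A_i, A_j B_i u_i + B_j u_j)

elems = [(Lam_bar, Bu[k]) for k in range(L)]

# The operator is associative:
lhs = combine(combine(elems[0], elems[1]), elems[2])
rhs = combine(elems[0], combine(elems[1], elems[2]))
assert np.allclose(lhs[0], rhs[0]) and np.allclose(lhs[1], rhs[1])

# Folding all elements yields the final state of the sequential recurrence:
x = np.zeros(P)
for k in range(L):
    x = Lam_bar * x + Bu[k]
assert np.allclose(reduce(combine, elems)[1], x)
```

Because the operator is associative, a parallel scan may combine elements in any bracketing, which is what allows the $\mathcal{O}(\log L)$ depth.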

Complexity comparisons:

  • Offline (full sequence): S4 uses FFTs and costs $\mathcal{O}(HL \log L + H^2 L)$. S5 achieves $\mathcal{O}(PHL + PL)$, which matches S4 for $P = O(H)$.
  • Online (stepwise): both achieve $\mathcal{O}(H^2)$ per step for $P = O(H)$ (Smith et al., 2022).

GPU parallelism is enabled at both sequence and channel levels, with no need for custom CUDA kernels due to reliance on standard associative scan primitives.

5. Robustness: Ill-Posed Diagonalization and PTD Methodology

Diagonalizing the HiPPO–LegS matrix is ill-conditioned; the eigenvector matrix $V_H$ exhibits exponential norm growth in the state dimension $n$, leading to large numerical errors. Direct diagonalization (as in S5/S4D) discards the low-rank correction, giving a well-conditioned computation whose transfer function, however, diverges from that of the original HiPPO system.

The perturb-then-diagonalize (PTD) methodology (Yu et al., 2023) addresses this by:

  1. Adding a small random or optimized perturbation $E$ to $A_H$,
  2. Diagonalizing $A_H + E = \tilde V \tilde\Lambda \tilde V^{-1}$ to obtain well-conditioned eigenvectors,
  3. Initializing S5 with $A_{\text{PTD}} = \tilde\Lambda$, $B_{\text{PTD}} = \tilde V^{-1} B_H$, $C_{\text{PTD}} = C \tilde V$.
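The three steps above can be sketched in NumPy. This is an illustration of the mechanics only: a generic non-normal matrix with a repeated eigenvalue stands in for $A_H$, and the perturbation size eps is a hypothetical choice, not the tuned value from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 16
# Stand-in for A_H: non-normal and defective (repeated eigenvalue -1), so its
# own eigendecomposition is badly conditioned.
A_H = -np.eye(n) + 5.0 * np.tril(rng.normal(size=(n, n)), -1)
B_H = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))

# 1. Add a small random perturbation E.
eps = 1e-6
E = eps * rng.normal(size=(n, n))

# 2. Diagonalize the perturbed matrix: A_H + E = V Lambda V^{-1}.
Lam, V = np.linalg.eig(A_H + E)

# 3. Initialize the diagonal system in the perturbed eigenbasis.
Lambda_ptd = Lam
B_ptd = np.linalg.solve(V, B_H)          # V^{-1} B_H
C_ptd = C @ V                            # C V

# The diagonal system reproduces A_H + E exactly, hence A_H up to O(eps):
A_rec = (V * Lambda_ptd) @ np.linalg.inv(V)
err = np.linalg.norm(A_rec.real - A_H)
```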

Forward error bounds show that PTD provides frequency-domain robustness proportional to $\varepsilon \ln n$ and ensures strong (uniform) convergence to the HiPPO-plus-low-rank (S4) transfer function. In contrast, S5 with purely diagonal initialization achieves only weak convergence, manifesting as isolated large spikes in the frequency response. PTD removes this pathology and enhances robustness to Fourier-mode perturbations, as empirically validated on noise-injected sequence modeling tasks.

6. Empirical Performance and Ablation Studies

S5 demonstrates state-of-the-art results on several long-range sequence learning benchmarks:

  • Long Range Arena (LRA): Average 87.46%, best on Path-X at 98.58% (S4: 86.09%/96.35%). S5-PTD further improves average to 87.61% (Smith et al., 2022, Yu et al., 2023).
  • Speech Commands (35-way): S5 achieves 96.52% (vs S4D-LegS 95.83%); zero-shot at 8 kHz yields 94.53% (vs 91.08%).
  • Pendulum regression: S5 matches or exceeds CRU models with an 86× runtime improvement.
  • 1-D pixel-level image tasks: S5 matches or surpasses S4/S4D and beats most RNN baselines.

Ablation studies confirm the necessity of the continuous-time parameterization (learned $\Delta$) and HiPPO-based initialization; discrete-time or random initialization fails on challenging tasks (e.g., Path-X). PTD improves robustness and accuracy without runtime overhead.

7. Implementation Considerations

The standard implementation uses JAX with its associative scan primitive (jax.lax.associative_scan). Only half of the complex eigenmodes are stored for efficiency, with conjugate pairs reconstructed as needed. Numerical stability is managed by clipping the eigenvalue exponentials, and the post-SSM nonlinearity uses GELU or a lightweight GLU. Default architectures use 6–8 S5 layers with $H = 128$–$512$ and $P \approx H$, trained with AdamW (with separate learning rates for SSM and non-SSM weights). Both bidirectional and causal variants are supported, depending on the application.
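As one concrete (hypothetical) form of such a stability guard, the real parts of the learned eigenvalues can be constrained to be non-positive, so that the discrete transition magnitudes never exceed one; the actual S5 code may implement its clipping differently.

```python
import numpy as np

rng = np.random.default_rng(5)
P = 8
lam_re = rng.normal(size=P)              # unconstrained learned parameters
lam_im = rng.normal(size=P)

# Clip the continuous-time eigenvalues into the left half-plane...
lam = -np.abs(lam_re) + 1j * lam_im
delta = np.exp(rng.normal(size=P))       # per-state positive timescales Delta
Lambda_bar = np.exp(lam * delta)         # discrete transitions

# ...so that |exp(lam * delta)| = exp(Re(lam) * delta) <= 1 for every state.
assert np.all(np.abs(Lambda_bar) <= 1.0)
```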

PTD initialization involves a single $O(n^3)$ eigendecomposition at initialization; all subsequent training and inference proceed using the efficient diagonal SSM recurrences.

8. Significance and Outlook

S5 provides an efficient and highly parallelizable alternative to earlier SSM-based sequence models, streamlining architecture while preserving empirical performance and long-range dependency modeling. The PTD refinement addresses the diagonalization ill-posedness intrinsic to the HiPPO framework, ensuring robustness and strong theoretical convergence guarantees (Smith et al., 2022, Yu et al., 2023). S5 and its variants represent a foundational advance in state-space-based sequence modeling, with ongoing research extending these capabilities to more diverse data modalities and hybrid architectures.
