
Diagonal State-Space Models

Updated 19 February 2026
  • Diagonal State-Space (DSS) models are linear state-space architectures with a strictly diagonal state matrix, enabling efficient capture of long-range dependencies through per-mode memory control.
  • They leverage initialization schemes such as HiPPO-inspired, Fourier-based, and DFT-based methods to ensure stable, alias-free frequency coverage and robust performance across tasks.
  • DSS architectures offer computational simplicity via exact kernel computation with Vandermonde matrices, efficient O(N) online inference, and effective compression techniques for deployment.

Diagonal State-Space (DSS) models are a class of sequence modeling architectures that parameterize the state matrix of a state-space model (SSM) as strictly diagonal. Rooted in classical control theory and motivated by the need for simple, efficient, and expressive models for long-range dependencies, DSS architectures dramatically simplify and generalize recent Structured State-Space (S4) models, matching their empirical performance while streamlining implementation and analysis. The diagonal construction, which enables per-mode memory control and exact kernel computation, underpins a rich spectrum of theoretical properties, practical initializations, and connections to other efficient sequence models.

1. Mathematical Formulation and Theoretical Expressivity

DSS models are discrete- or continuous-time linear SSMs with a diagonal state transition matrix. For continuous-time systems:

$$\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t),$$

where $A = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$ with $\lambda_i \in \mathbb{C}$, and $B$, $C$ define the input/output couplings. The discrete-time equivalent with sampling step $\Delta$ is:

$$x_{k+1} = \bar{A} x_k + \bar{B} u_k, \qquad y_k = C x_k,$$

where $\bar{A} = e^{A\Delta}$ and $\bar{B} = (e^{A\Delta} - I) A^{-1} B$.
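Because $A$ is diagonal, the matrix exponential reduces to elementwise operations on the eigenvalues. A minimal sketch of this zero-order-hold discretization (the function name and array layout are illustrative, not from the cited papers):

```python
import numpy as np

def discretize_diagonal(lam, B, dt):
    """Zero-order-hold discretization of a diagonal SSM.

    lam: (N,) complex eigenvalues of the diagonal A.
    B:   (N,) input coupling.
    dt:  sampling step Delta.

    Since A = diag(lam), exp(A*dt) = diag(exp(lam*dt)) and the
    matrix formula (exp(A*dt) - I) A^{-1} B becomes elementwise.
    """
    lam_bar = np.exp(lam * dt)               # diagonal of A-bar
    B_bar = (lam_bar - 1.0) / lam * B        # elementwise (A-bar - I) A^{-1} B
    return lam_bar, B_bar
```

This avoids any general matrix exponential: the cost is $O(N)$ per channel rather than $O(N^3)$.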

The fundamental property is that any diagonalizable SSM is equivalent to a diagonal SSM via a change of basis. The convolutional kernel generated by a DSS layer is:

$$K_j = C \bar{A}^j \bar{B},$$

which decomposes as a sum of per-mode geometric progressions:

$$K_j = \sum_{i=1}^{N} C_i\, \bar{\lambda}_i^{\,j}\, \bar{B}_i.$$

This Vandermonde structure enables closed-form kernel computation and exact frequency analysis (Gu et al., 2022).
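The per-mode decomposition means the whole length-$L$ kernel is one matrix product against a Vandermonde matrix of eigenvalue powers. A hedged sketch (names are illustrative):

```python
import numpy as np

def dss_kernel(lam_bar, B_bar, C, L):
    """Materialize the length-L kernel K_j = sum_i C_i * lam_bar_i^j * B_bar_i.

    The powers lam_bar^j form a Vandermonde matrix V of shape (N, L),
    so the entire kernel is a single matrix-vector product.
    """
    j = np.arange(L)
    V = lam_bar[:, None] ** j[None, :]   # Vandermonde matrix, shape (N, L)
    return (C * B_bar) @ V               # kernel, shape (L,)
```

In practice the powers are computed in log-space (`exp(j * log(lam_bar))`) for numerical stability over long sequences, but the algebra is the same.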

Theoretical completeness is established by the result that almost all SSMs are diagonalizable; thus, DSS models are maximally expressive for linear sequence modeling given sufficient state dimension (Gupta et al., 2022).

2. Initialization Schemes and Spectral Representation

The performance and long-range memory properties of DSS models are highly sensitive to the placement of the poles $\lambda_i$. HiPPO-inspired and Fourier-inspired initializations dominate:

  • S4D-LegS: Uses the diagonal of the normal HiPPO matrix as eigenvalues; asymptotically recovers the Legendre polynomial structure and the S4 kernel as $N \to \infty$.
  • S4D-Inv: Approximates the HiPPO spectrum with $\lambda_n = -\frac{1}{2} + i\,\frac{N}{\pi}\left(\frac{N}{2n+1} - 1\right)$.
  • S4D-Lin: Uniform imaginary part, i.e., $\lambda_n = -\frac{1}{2} + i\,\pi n$, giving a truncated Fourier basis.

The S4D-DFouT initialization places poles directly in the discrete Fourier domain to decouple decay and frequency:

$$\bar{\lambda}_n = \exp\left(-\frac{\xi}{2} + i\,\frac{2\pi n}{N}\right),$$

with (shared or learnable) decay $\xi > 0$ and uniformly distributed phases (Solozabal et al., 28 Aug 2025).
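A sketch of the DFouT pole placement alongside S4D-Lin for comparison (function names are illustrative, not from the papers' codebases):

```python
import numpy as np

def s4d_dfout_init(N, xi=1.0):
    """S4D-DFouT: discrete-time poles on a circle of radius exp(-xi/2),
    with phases at the N-th roots of unity (uniform DFT frequencies).
    Decay (xi) and frequency (2*pi*n/N) are decoupled by construction."""
    n = np.arange(N)
    return np.exp(-xi / 2 + 1j * 2 * np.pi * n / N)

def s4d_lin_init(N):
    """S4D-Lin: continuous-time poles with fixed real part -1/2 and
    linearly spaced imaginary parts (a truncated Fourier basis)."""
    return -0.5 + 1j * np.pi * np.arange(N)
```

Note the distinction: DFouT poles are placed directly in the discrete domain (no discretization step entangles decay with frequency), whereas S4D-Lin poles are continuous-time and pass through the $e^{\lambda\Delta}$ map.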

These schemes ensure stable, alias-free coverage of the relevant frequencies and enable robust, memory-efficient learning. Accurate initialization is critical: empirical ablations consistently demonstrate catastrophic performance degradation when the real or imaginary parts are mistuned (Gu et al., 2022, Gupta et al., 2022).

3. Spectral Properties and “Spectral Bias”

The diagonal parameterization causes the DSS kernel’s discrete-time frequency response to decompose as:

$$H(z) = \sum_{k=1}^{N} \frac{C_k B_k}{1 - \lambda_k z^{-1}}.$$

Each mode acts as a single-pole IIR filter, where the pole’s modulus determines the decay (memory span) and its argument the frequency. Therefore, DSS layers exhibit an intrinsic “spectral bias”: they are maximally efficient for tasks that decompose into a small number of fixed, narrowband frequency components (e.g., periodic or slowly varying temporal phenomena) (Solozabal et al., 28 Aug 2025).
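The per-mode frequency response can be evaluated directly from this sum; a minimal sketch (names illustrative), evaluating $H$ on the unit circle $z = e^{i\omega}$:

```python
import numpy as np

def dss_frequency_response(lam_bar, B_bar, C, omegas):
    """Evaluate H(e^{i*omega}) = sum_k C_k B_k / (1 - lam_bar_k e^{-i*omega}).

    Each term is a single-pole IIR filter: |lam_bar_k| controls the
    bandwidth/decay, arg(lam_bar_k) the center frequency.
    """
    z_inv = np.exp(-1j * omegas)[None, :]                          # (1, M)
    modes = (C * B_bar)[:, None] / (1 - lam_bar[:, None] * z_inv)  # (N, M)
    return modes.sum(axis=0)                                       # (M,)
```

Sweeping `omegas` over $[0, 2\pi)$ makes the spectral bias visible: the magnitude response peaks sharply near each pole's phase, with peak width shrinking as the modulus approaches 1.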

Placement of poles (especially uniform DFT initialization) ensures efficient, alias-free tiling of the frequency plane. If decay and frequency are entangled (e.g., via inappropriate discretization), modes may collapse to low-frequency regions or alias, hampering long-range expressivity.

4. Computational and Architectural Simplicity

DSS models deliver strong computational advantages:

  • Kernel Construction: Owing to the diagonal structure, the entire sequence kernel can be computed via a single Vandermonde matrix multiplication (two lines of code), without recourse to matrix inversion, Woodbury identities, or partial fraction expansions as in S4 (Gu et al., 2022).
  • Online Inference: At each time step, the hidden state and output can be advanced in $O(N)$ per channel.
  • Parameterization: A DSS layer typically maintains only $O(N)$ eigenvalue, input, and output parameters per channel, versus $O(N^2)$ or more for general SSMs or full attention.
  • GPU Friendliness: Kernel construction does not require complex linear algebra, enabling practical deep stacking and memory efficiency (Saon et al., 2023).
  • Architecture Integration: DSS layers flexibly replace convolutions in Transformer/Conformer blocks (see DSSformer), yielding models with long-range capacity and linear scaling with sequence length (Saon et al., 2023).
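The $O(N)$ online-inference point can be made concrete: one step of the diagonal recurrence is two elementwise operations and a dot product. A hedged sketch (single channel, names illustrative):

```python
import numpy as np

def dss_step(x, u, lam_bar, B_bar, C):
    """One step of the diagonal recurrence in O(N):
    x' = lam_bar * x + B_bar * u   (elementwise, no matrix multiply)
    y  = Re(sum_i C_i * x'_i)
    """
    x = lam_bar * x + B_bar * u
    y = (C * x).sum().real
    return x, y
```

Feeding an impulse through the stepper reproduces the convolution kernel $K_j$ term by term, which is a useful sanity check that the recurrent and convolutional views agree.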

5. Duality to Masked Attention and Expressivity Limits

A diagonal SSM is mathematically equivalent to a masked self-attention mechanism with a 1-semiseparable causal mask: the kernel matrix is rank-one in every lower triangular submatrix, and the transformation can be realized either as a linear-time recurrence or quadratic-time masked attention operation (Hu et al., 6 Oct 2025).

This duality tightens the relationship between recurrent and attention-based sequence models, showing that DSS architectures occupy an efficient subset of possible attention operators. However, they cannot realize the full expressive power of generic softmax attention because the entry-wise nonlinear activation (exponential, softmax) destroys the low-rank/separable structure. DSS layers thus excel for tasks decomposable into fixed kernels, but provably cannot emulate context-dependent or input-conditioned sequence transformations achievable by attention (Gupta et al., 2022, Hu et al., 6 Oct 2025).
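The duality can be checked numerically for a small system: materializing the semiseparable causal mask (quadratic time, one rank-one lower-triangular mask per mode) gives the same output as running the linear-time recurrence. A sketch under these assumptions (function names illustrative):

```python
import numpy as np

def dss_as_masked_attention(u, lam_bar, B_bar, C):
    """Quadratic-time view: y = M u, with M[k, j] = sum_i C_i lam_i^{k-j} B_i
    for k >= j and 0 above the diagonal (semiseparable causal mask)."""
    L = len(u)
    k = np.arange(L)
    diff = k[:, None] - k[None, :]                 # exponent k - j
    mask = diff >= 0
    M = np.zeros((L, L), dtype=complex)
    for lam, b, c in zip(lam_bar, B_bar, C):
        powers = lam ** np.where(mask, diff, 0)    # avoid negative powers
        M += np.where(mask, c * b * powers, 0.0)
    return (M @ u).real

def dss_recurrence(u, lam_bar, B_bar, C):
    """Linear-time view: the identical map as a diagonal recurrence."""
    x = np.zeros(len(lam_bar), dtype=complex)
    ys = []
    for uk in u:
        x = lam_bar * x + B_bar * uk
        ys.append((C * x).sum().real)
    return np.array(ys)
```

The mask matrix `M` is fixed once the parameters are fixed, which is exactly the expressivity gap noted above: unlike softmax attention, it cannot depend on the input content.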

6. Robustness, Compression, and Practical Optimization

Diagonalization of non-normal matrices (e.g., arising from HiPPO) can be ill-conditioned, leading to non-uniform kernel approximation and poor resilience to high-frequency input noise. The PTD (Perturb-Then-Diagonalize) methodology regularizes the non-normal matrix before diagonalization, guaranteeing backward-stable transfer function approximation and uniform closeness to structured SSMs (Yu et al., 2023). This significantly improves robustness to adversarial inputs and maintains expressivity across the frequency spectrum.

Compression of DSS layers for deployment uses control-theoretic model reduction techniques:

  • Balanced Truncation: Recovers a reduced-size DSS by truncating states with low Hankel singular values, preserving stability and providing strong $H_\infty$ error bounds (Ezoe et al., 2024).
  • $H_2$ Optimal Reduction: Further improves finite-time approximation by explicitly minimizing the integrated impulse-response error; empirically achieves up to 32x parameter reduction with no loss (and sometimes a gain) in task accuracy after retraining (Sakamoto et al., 14 Jul 2025).
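For a stable diagonal discrete-time system the Gramians needed by balanced truncation have a closed form, so the procedure can be sketched compactly. This is a generic square-root balanced-truncation sketch, not the specific algorithm of the cited papers; names are illustrative:

```python
import numpy as np

def balanced_truncate_dss(lam_bar, B_bar, C, r):
    """Square-root balanced truncation of a stable diagonal discrete SSM.

    Gramians of a diagonal system solve their Lyapunov equations in closed form:
      P[i,j] = B_i conj(B_j) / (1 - lam_i conj(lam_j))   (controllability)
      Q[i,j] = conj(C_i) C_j / (1 - conj(lam_i) lam_j)   (observability)
    States with small Hankel singular values are discarded. Returns the
    reduced (A_r, B_r, C_r) of order r (A_r is not diagonal in general;
    re-diagonalize it if a DSS form is required) and the singular values.
    """
    P = np.outer(B_bar, B_bar.conj()) / (1 - np.outer(lam_bar, lam_bar.conj()))
    Q = np.outer(C.conj(), C) / (1 - np.outer(lam_bar.conj(), lam_bar))
    Lp = np.linalg.cholesky(P)
    Lq = np.linalg.cholesky(Q)
    U, s, Vh = np.linalg.svd(Lq.conj().T @ Lp)   # s = Hankel singular values
    T = Lp @ Vh.conj().T @ np.diag(s ** -0.5)    # balancing transform
    Tinv = np.diag(s ** -0.5) @ U.conj().T @ Lq.conj().T
    A_r = Tinv[:r] @ np.diag(lam_bar) @ T[:, :r]
    B_r = Tinv[:r] @ B_bar
    C_r = C @ T[:, :r]
    return A_r, B_r, C_r, s
```

With `r` equal to the full order the transform is a pure similarity and the impulse response is preserved exactly; truncating trades a bounded $H_\infty$ error for the parameter savings described above.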

7. Empirical Performance and Applications

DSS architectures match or surpass S4 and attention-based models across a wide range of benchmarks:

| Task | DSS Variant / Initialization | Typical Accuracy (%) | Reference |
|------|------------------------------|----------------------|-----------|
| Long Range Arena (avg) | S4D-Inv, S4D-LegS, S4D-DFouT | 84.5 – 87.9 | (Gu et al., 2022; Solozabal et al., 28 Aug 2025) |
| Speech Commands | DSS, DLR | 97.1 – 98.3 | (Gupta et al., 2022) |
| PathX-256 | S4D-DFouT | 87.9 | (Solozabal et al., 28 Aug 2025) |

DSS-augmented Transformers (DSSformer) set new baselines for end-to-end speech recognition on Switchboard, Fisher, and domain-specific corpora, with evidence that the learned spectrum self-organizes into damped, linearly-spaced Fourier modes (Saon et al., 2023).

Compression via balanced truncation and $H_2$ optimal reduction consistently yields significant parameter savings with negligible or even positive impact on downstream classification accuracy, facilitating efficient deployment in edge and embedded scenarios (Ezoe et al., 2024, Sakamoto et al., 14 Jul 2025).

