Diagonal State-Space Models
- Diagonal State-Space (DSS) models are linear state-space architectures with a strictly diagonal state matrix that efficiently capture long-range dependencies using per-mode memory control.
- They leverage initialization schemes such as HiPPO-inspired, Fourier-based, and DFT-based methods to ensure stable, alias-free frequency coverage and robust performance across tasks.
- DSS architectures offer computational simplicity via exact kernel computation with Vandermonde matrices, efficient O(N) online inference, and effective compression techniques for deployment.
Diagonal State-Space (DSS) models are a class of sequence modeling architectures that parameterize the state matrix of a state-space model (SSM) as strictly diagonal. Rooted in classical control theory and motivated by the need for simple, efficient, and expressive models for long-range dependencies, DSS architectures dramatically simplify and generalize recent Structured State-Space (S4) models, matching their empirical performance while streamlining implementation and analysis. The diagonal construction, which enables per-mode memory control and exact kernel computation, underpins a rich spectrum of theoretical properties, practical initializations, and connections to other efficient sequence models.
1. Mathematical Formulation and Theoretical Expressivity
DSS models are discrete- or continuous-time linear SSMs with a diagonal state transition matrix. For continuous-time systems:

$$\dot{x}(t) = \Lambda\,x(t) + B\,u(t), \qquad y(t) = C\,x(t),$$

where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N) \in \mathbb{C}^{N \times N}$, $B \in \mathbb{C}^{N \times 1}$, and $C \in \mathbb{C}^{1 \times N}$ define input/output couplings. The discrete-time equivalent with sampling step $\Delta$ is:

$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k.$$

Here $\bar{A} = e^{\Delta \Lambda}$ and $\bar{B} = \Lambda^{-1}(\bar{A} - I)\,B$ under zero-order hold.
The fundamental property is that any diagonalizable SSM is equivalent to a diagonal SSM via a change of basis. The convolutional kernel generated by a DSS layer over a length-$L$ sequence is:

$$\bar{K} = \left(C\bar{B},\ C\bar{A}\bar{B},\ \dots,\ C\bar{A}^{L-1}\bar{B}\right),$$

which decomposes as a sum of per-mode geometric progressions:

$$\bar{K}_k = \sum_{n=1}^{N} C_n \bar{B}_n\, \bar{\lambda}_n^{\,k}, \qquad \bar{\lambda}_n = e^{\Delta \lambda_n}.$$

This Vandermonde structure enables closed-form kernel computation and exact frequency analysis (Gu et al., 2022).
Theoretical completeness is established by the result that almost all SSMs are diagonalizable; thus, DSS models are maximally expressive for linear sequence modeling given sufficient state dimension (Gupta et al., 2022).
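As a concrete sketch, the zero-order-hold discretization and Vandermonde kernel computation fit in a few lines of NumPy (illustrative code; `dss_kernel` and the example poles are our own, not taken from the cited papers):

```python
import numpy as np

def dss_kernel(Lam, B, C, dt, L):
    """Length-L convolution kernel of a diagonal SSM x' = diag(Lam) x + B u, y = C x.

    Zero-order-hold discretization: Abar = exp(dt * Lam),
    Bbar = (Abar - 1) / Lam * B (elementwise, since the state matrix is diagonal).
    """
    Abar = np.exp(dt * Lam)                      # discrete-time poles, shape (N,)
    Bbar = (Abar - 1.0) / Lam * B                # discretized input coupling
    # Vandermonde matrix V[n, k] = Abar[n]**k; the kernel is a single matvec.
    V = Abar[:, None] ** np.arange(L)[None, :]   # shape (N, L)
    return ((C * Bbar) @ V).real                 # shape (L,)

# Example: two stable, damped oscillatory modes.
Lam = np.array([-0.5 + 1j * np.pi, -0.5 + 2j * np.pi])
B = np.ones(2, dtype=complex)
C = np.ones(2, dtype=complex)
K = dss_kernel(Lam, B, C, dt=0.1, L=16)
```

Because the state matrix is diagonal, no matrix inversion or dense matrix exponential is needed: discretization is elementwise and kernel construction is one Vandermonde product.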
2. Initialization Schemes and Spectral Representation
The performance and long-range memory properties of DSS models are highly sensitive to the placement of the pole locations $\lambda_n$. HiPPO-inspired and Fourier-inspired initializations dominate:
- S4D-LegS: Uses the diagonal of the normal HiPPO matrix as eigenvalues; asymptotically recovers the Legendre polynomial structure and the S4 kernel as the state dimension $N \to \infty$.
- S4D-Inv: Approximates the HiPPO spectrum with $\lambda_n = -\tfrac{1}{2} + i\,\tfrac{N}{\pi}\left(\tfrac{N}{2n+1} - 1\right)$.
- S4D-Lin: Uniformly spaced imaginary parts, i.e., $\lambda_n = -\tfrac{1}{2} + i \pi n$, giving a truncated Fourier basis.
The S4D-DFouT initialization places poles directly in the discrete Fourier domain to decouple decay and frequency: the discrete-time poles take the form $\bar{\lambda}_n = \rho\, e^{i \theta_n}$, with a (shared or learnable) decay $\rho$ and phases $\theta_n$ distributed uniformly over the DFT frequencies (Solozabal et al., 28 Aug 2025).
These schemes ensure stable, alias-free coverage of relevant frequencies and enable robust, memory-efficient learning. Accurate initialization is critical: empirical ablations consistently show catastrophic performance degradation when the real or imaginary parts are mistuned (Gu et al., 2022, Gupta et al., 2022).
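The S4D-Lin and S4D-Inv pole formulas, $\lambda_n = -\tfrac{1}{2} + i\pi n$ and $\lambda_n = -\tfrac{1}{2} + i\tfrac{N}{\pi}(\tfrac{N}{2n+1} - 1)$, translate directly into code (a minimal sketch; the helper names are ours):

```python
import numpy as np

def s4d_lin_init(N):
    """S4D-Lin: real part -1/2, linearly spaced imaginary parts (truncated Fourier basis)."""
    n = np.arange(N)
    return -0.5 + 1j * np.pi * n

def s4d_inv_init(N):
    """S4D-Inv: real part -1/2, imaginary parts approximating the HiPPO spectrum."""
    n = np.arange(N)
    return -0.5 + 1j * (N / np.pi) * (N / (2 * n + 1) - 1)
```

In both schemes the shared real part $-\tfrac{1}{2}$ fixes a common decay rate, while the imaginary parts spread the modes over distinct frequencies.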
3. Spectral Properties and “Spectral Bias”
The diagonal parameterization causes the DSS kernel’s discrete-time frequency response to decompose as a sum of first-order terms:

$$\hat{K}(\omega) = \sum_{n=1}^{N} \frac{C_n \bar{B}_n}{1 - \bar{\lambda}_n e^{-i\omega}}.$$

Each mode acts as a single-pole IIR filter, where the pole’s modulus $|\bar{\lambda}_n|$ determines decay (memory span) and its argument determines frequency. Therefore, DSS layers exhibit an intrinsic “spectral bias”: they are maximally efficient for tasks that decompose into a small number of fixed, narrowband frequency components (e.g., periodic or slowly-varying temporal phenomena) (Solozabal et al., 28 Aug 2025).
Placement of poles (especially uniform DFT initialization) ensures efficient, alias-free tiling of the frequency plane. If decay and frequency are entangled (e.g., via inappropriate discretization), modes may collapse to low-frequency regions or alias, hampering long-range expressivity.
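A single-mode example makes the pole-placement picture concrete: the magnitude response of the standard single-pole IIR filter peaks at the pole’s argument (toy code; the helper name is ours):

```python
import numpy as np

def mode_response(pole, omegas):
    """Frequency response H(w) = 1 / (1 - pole * exp(-i w)) of one DSS mode."""
    return 1.0 / (1.0 - pole * np.exp(-1j * omegas))

# A pole of modulus 0.99 (long memory) at angle pi/4 selects that frequency.
pole = 0.99 * np.exp(1j * np.pi / 4)
omegas = np.linspace(0.0, np.pi, 512)
H = np.abs(mode_response(pole, omegas))
peak = omegas[np.argmax(H)]   # close to pi/4
```

Shrinking the modulus toward 0 widens the band and shortens the memory span; pushing it toward 1 sharpens the band and lengthens the memory.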
4. Computational and Architectural Simplicity
DSS models deliver strong computational advantages:
- Kernel Construction: Owing to diagonal structure, the entire sequence kernel can be computed via a simple Vandermonde matrix multiplication (two lines of code), without recourse to matrix inversion, Woodbury identities, or partial fraction expansions as in S4 (Gu et al., 2022).
- Online Inference: At each time step, the hidden state and output can be advanced in $O(N)$ time per channel, since the state update is elementwise.
- Parameterization: A DSS layer typically maintains only $O(N)$ parameters per channel (eigenvalues $\lambda_n$, input couplings $B_n$, output couplings $C_n$), versus $O(N^2)$ or more for general SSMs or full attention.
- GPU Friendliness: Kernel construction does not require complex linear algebra, enabling practical deep stacking and memory efficiency (Saon et al., 2023).
- Architecture Integration: DSS layers flexibly replace convolutions in Transformer/Conformer blocks (see DSSformer), yielding models with long-range capacity and linear scaling with sequence length (Saon et al., 2023).
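The recurrent and convolutional views can be checked against each other in a few lines (illustrative code with random parameters; both evaluations compute the same linear map):

```python
import numpy as np

def dss_step(x, u, Abar, Bbar, C):
    """One O(N) online inference step: elementwise state update, no NxN matmul."""
    x = Abar * x + Bbar * u
    return x, (C @ x).real

rng = np.random.default_rng(0)
N, L = 4, 32
Abar = 0.9 * np.exp(1j * rng.uniform(0, np.pi, N))   # stable discrete-time poles
Bbar = rng.standard_normal(N) + 0j
C = rng.standard_normal(N) + 0j
u = rng.standard_normal(L)

# Linear-time recurrent evaluation.
x, y_rec = np.zeros(N, dtype=complex), []
for uk in u:
    x, yk = dss_step(x, uk, Abar, Bbar, C)
    y_rec.append(yk)
y_rec = np.array(y_rec)

# Convolutional evaluation with the Vandermonde kernel.
K = ((C * Bbar)[:, None] * Abar[:, None] ** np.arange(L)).sum(0).real
y_conv = np.convolve(u, K)[:L]
```

The two outputs agree exactly, which is what makes DSS layers trainable as convolutions and deployable as constant-memory recurrences.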
5. Duality to Masked Attention and Expressivity Limits
A diagonal SSM is mathematically equivalent to a masked self-attention mechanism with a 1-semiseparable causal mask: every submatrix drawn from the lower-triangular part of the kernel matrix has rank at most one, and the transformation can be realized either as a linear-time recurrence or as a quadratic-time masked attention operation (Hu et al., 6 Oct 2025).
This duality tightens the relationship between recurrent and attention-based sequence models, showing that DSS architectures occupy an efficient subset of possible attention operators. However, they cannot realize the full expressive power of generic softmax attention because the entry-wise nonlinear activation (exponential, softmax) destroys the low-rank/separable structure. DSS layers thus excel for tasks decomposable into fixed kernels, but provably cannot emulate context-dependent or input-conditioned sequence transformations achievable by attention (Gupta et al., 2022, Hu et al., 6 Oct 2025).
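This duality can be verified numerically on a toy system by materializing the quadratic-time masked operator $M_{ij} = C\,\bar{A}^{\,i-j}\bar{B}$ for $i \ge j$ and comparing it with the linear-time recurrence (a sanity check of ours, not the construction of Hu et al.):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 3, 16
Abar = 0.8 * np.exp(1j * rng.uniform(0, np.pi, N))
Bbar = rng.standard_normal(N) + 0j
C = rng.standard_normal(N) + 0j
u = rng.standard_normal(L)

# Quadratic-time "masked attention" view: causal lower-triangular operator.
i, j = np.indices((L, L))
powers = np.maximum(i - j, 0)
terms = (C * Bbar)[:, None, None] * Abar[:, None, None] ** powers
M = np.where(i >= j, terms.sum(0).real, 0.0)

# Linear-time recurrence computes the same sequence-to-sequence map.
x = np.zeros(N, dtype=complex)
y_rec = np.empty(L)
for k in range(L):
    x = Abar * x + Bbar * u[k]
    y_rec[k] = (C @ x).real
```

Applying `M` to the input reproduces the recurrence exactly; the fixed, input-independent structure of `M` is precisely what separates DSS layers from content-dependent softmax attention.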
6. Robustness, Compression, and Practical Optimization
Diagonalization of non-normal matrices (e.g., arising from HiPPO) can be ill-conditioned, leading to non-uniform kernel approximation and poor resilience to high-frequency input noise. The PTD (Perturb-Then-Diagonalize) methodology regularizes the non-normal matrix before diagonalization, guaranteeing backward-stable transfer function approximation and uniform closeness to structured SSMs (Yu et al., 2023). This significantly improves robustness to adversarial inputs and maintains expressivity across the frequency spectrum.
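The perturb-then-diagonalize idea can be caricatured in a few lines (a deliberately simplified sketch: the Gaussian perturbation and the scale `eps` are our assumptions, not the calibrated scheme of Yu et al., 2023):

```python
import numpy as np

def perturb_then_diagonalize(A, eps=1e-3, seed=0):
    """Add a small random perturbation before eigendecomposition so that a
    highly non-normal matrix gets a better-conditioned eigenvector basis."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal(A.shape)
    G /= np.linalg.norm(G)
    lam, V = np.linalg.eig(A + eps * G)   # diagonalize the perturbed matrix
    return lam, V

# A nearly defective (highly non-normal) matrix.
A = np.array([[1.0, 100.0],
              [0.0, 1.0]])
lam, V = perturb_then_diagonalize(A)
A_rec = (V @ np.diag(lam) @ np.linalg.inv(V)).real   # recovers A up to O(eps)
```

Diagonalizing `A` directly is numerically hopeless (its eigenvectors nearly coincide); the perturbed matrix trades an $O(\varepsilon)$ approximation error for a usable eigenbasis.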
Compression of DSS layers for deployment uses control-theoretic model reduction techniques:
- Balanced Truncation: Recovers a reduced-size DSS by truncating states of low Hankel singular value, preserving stability and ensuring strong error bounds (Ezoe et al., 2024).
- $H^2$ Optimal Reduction: Further improves finite-time approximation by explicitly minimizing the integrated impulse-response error; empirically achieves up to 32x parameter reduction without loss of task accuracy (and sometimes with gains) after retraining (Sakamoto et al., 14 Jul 2025).
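A heuristic sketch of Gramian-based mode selection for a stable continuous-time diagonal SSM (this ranks modes by the Gramian diagonals instead of performing the full balancing transform, so it is a simplification of the cited methods):

```python
import numpy as np

def truncate_diagonal_ssm(Lam, B, C, r):
    """Keep the r modes with the largest sqrt(P_nn * Q_nn), where P and Q are
    the controllability/observability Gramians. For a stable diagonal system
    their diagonals have the closed form |B_n|^2 / (-2 Re lam_n) and
    |C_n|^2 / (-2 Re lam_n)."""
    two_re = -2.0 * Lam.real
    p = np.abs(B) ** 2 / two_re      # controllability Gramian diagonal
    q = np.abs(C) ** 2 / two_re      # observability Gramian diagonal
    keep = np.argsort(np.sqrt(p * q))[::-1][:r]
    return Lam[keep], B[keep], C[keep]

# Example: one dominant slow mode and two weakly coupled fast modes.
Lam = np.array([-0.1 + 1j, -5.0 + 2j, -6.0 + 3j])
B = np.array([1.0, 0.01, 0.01], dtype=complex)
C = np.ones(3, dtype=complex)
Lr, Br, Cr = truncate_diagonal_ssm(Lam, B, C, r=1)   # keeps the slow mode
```

The diagonal parameterization is what makes the Gramian diagonals available in closed form, so the scoring step costs only $O(N)$.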
7. Empirical Performance and Applications
DSS architectures match or surpass S4 and attention-based models across a wide range of benchmarks:
| Task | DSS Variant / Initialization | Typical Accuracy (%) | Reference |
|---|---|---|---|
| Long Range Arena (avg) | S4D-Inv, S4D-LegS, S4D-DFouT | 84.5 – 87.9 | (Gu et al., 2022, Solozabal et al., 28 Aug 2025) |
| Speech Commands | DSS, DLR | 97.1 – 98.3 | (Gupta et al., 2022) |
| PathX-256 | S4D-DFouT | 87.9 | (Solozabal et al., 28 Aug 2025) |
DSS-augmented Transformers (DSSformer) set new baselines for end-to-end speech recognition on Switchboard, Fisher, and domain-specific corpora, with evidence that the learned spectrum self-organizes into damped, linearly-spaced Fourier modes (Saon et al., 2023).
Compression via balanced truncation and $H^2$ optimal reduction consistently yields significant parameter savings with negligible or even positive impact on downstream classification accuracy, facilitating efficient deployment in edge and embedded scenarios (Ezoe et al., 2024, Sakamoto et al., 14 Jul 2025).
References
- (Gu et al., 2022) On the Parameterization and Initialization of Diagonal State Space Models
- (Gupta et al., 2022) Diagonal State Spaces are as Effective as Structured State Spaces
- (Saon et al., 2023) Diagonal State Space Augmented Transformers for Speech Recognition
- (Yu et al., 2023) Robustifying State-space Models for Long Sequences via Approximate Diagonalization
- (Ezoe et al., 2024) Model Compression Method for S4 with Diagonal State Space Layers using Balanced Truncation
- (Sakamoto et al., 14 Jul 2025) Compression Method for Deep Diagonal State Space Model Based on $H^2$ Optimal Reduction
- (Solozabal et al., 28 Aug 2025) Uncovering the Spectral Bias in Diagonal State Space Models
- (Hu et al., 6 Oct 2025) On Structured State-Space Duality
- (Gupta et al., 2022) Simplifying and Understanding State Space Models with Diagonal Linear RNNs