Efficient SSMs with Diagonal Transition Matrices

Updated 17 August 2025
  • The paper demonstrates that diagonal SSMs, with each channel evolving independently, yield significant computational benefits while maintaining state-of-the-art performance.
  • It details a mathematical formulation using diagonal state transition matrices and introduces various initialization strategies (S4D-LegS, S4D-Inv, S4D-Lin) that preserve expressive power.
  • Empirical evaluations show robust performance on image, audio, medical, and long-range sequence tasks, making this approach attractive for scalable deep sequence modeling.

Efficient state space models (SSMs) with diagonal transition matrices represent a streamlined approach to deep sequence modeling. By restricting the SSM’s transition matrix to be diagonal—i.e. each channel evolves independently—these models achieve significant computational benefits and, with careful parameterization and initialization, retain state-of-the-art performance across image, audio, medical, and long-range sequence tasks. The diagonal form simplifies the internal recurrence, enables fast kernel computation, and, if properly initialized, can mathematically recover the expressive power of more complicated structures adopted in earlier SSM research.

1. Mathematical Formulation and Parameterization

Diagonal state space models (SSMs) are typically defined by the discrete-time recurrence

$$x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t$$

where the state transition matrix $A$ is diagonal with complex entries parameterized as $A_n = -\exp(\mathrm{Re}(A_n)) + i \cdot \mathrm{Im}(A_n)$. This form ensures strict stability (negative real parts of all eigenvalues) and encodes oscillatory dynamics via the imaginary part.
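In code, this stability-by-construction parameterization is a one-liner. The sketch below (NumPy; the function and variable names are illustrative, not the paper's reference code) materializes $A$ from unconstrained stored parameters:

```python
import numpy as np

def materialize_A(log_neg_real, imag):
    """Build diagonal entries A_n = -exp(Re) + i*Im from unconstrained parameters.

    Because -exp(.) is always negative, Re(A_n) < 0 holds by construction,
    so the system stays stable no matter what values the optimizer reaches.
    """
    return -np.exp(log_neg_real) + 1j * imag
```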

The corresponding continuous-time convolutional kernel—which determines the mapping from inputs to outputs—is given by

$$K(t) = \sum_n C_n \, e^{t A_n} \, B_n$$

and, because $A$ is diagonal, reduces to a Vandermonde matrix-vector product, which is computationally efficient. Parameters $B$ and $C$ are trained or, in some reductions, fixed ($B = 1$); empirically, performance is only mildly sensitive to how they are treated.
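To make the efficiency claim concrete, here is a minimal NumPy sketch (shapes and names are illustrative) that samples this kernel as a single Vandermonde matrix-vector product; a full implementation would first discretize the system, as discussed in Section 4:

```python
import numpy as np

def diagonal_ssm_kernel(A, B, C, L, dt=1.0):
    """Sample K(t) = sum_n C_n * exp(t * A_n) * B_n at t = 0, dt, ..., (L-1)*dt.

    A, B, C: complex arrays of shape (N,), the diagonal SSM parameters.
    exp(outer(t, A)) is a Vandermonde matrix in the entries exp(dt * A_n),
    so the whole kernel is one (L, N) x (N,) product.
    """
    t = np.arange(L) * dt
    V = np.exp(np.outer(t, A))  # (L, N) Vandermonde matrix
    return V @ (B * C)          # length-L (complex) kernel
```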

2. Initialization and Empirical Sensitivity

Initialization critically determines the effectiveness of diagonal SSMs. The diagonal entries must approximate the spectrum of a structured matrix (e.g., the HiPPO-LegS matrix), which is known to generate stable and expressive long-memory kernels. The paper establishes several initialization strategies:

  • S4D-LegS: Employs a diagonal version of the HiPPO-LegS matrix, fixing the real part of each eigenvalue at $-\frac{1}{2}$ and spacing the imaginary components to mimic the original Legendre polynomial projection.
  • S4D-Inv: Sets imaginary parts by an inverse-law spacing, approximating the original kernel’s dynamics.
  • S4D-Lin: Uses regularly spaced (Fourier-like) frequencies for the imaginary components.
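
The two closed-form variants are simple enough to state in code. The sketch below assumes the standard S4D formulas (real part fixed at $-\frac{1}{2}$; $n$ indexes the $N/2$ modes retained under conjugate symmetry); S4D-LegS has no closed form and is obtained numerically from the HiPPO-LegS spectrum:

```python
import numpy as np

def s4d_lin(N):
    """S4D-Lin: imaginary parts on a regular (Fourier-like) grid."""
    n = np.arange(N // 2)
    return -0.5 + 1j * np.pi * n

def s4d_inv(N):
    """S4D-Inv: imaginary parts on an inverse-law grid approximating HiPPO-LegS."""
    n = np.arange(N // 2)
    return -0.5 + 1j * (N / np.pi) * (N / (2 * n + 1) - 1)
```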

Empirical ablations reveal that small perturbations to these structured initializations (e.g., randomizing the imaginary part) substantially degrade performance, confirming the necessity of precise spectral matching.

3. Theoretical Justification for the Diagonal Form

A pivotal theorem in the paper ("thm:legsd") proves that, when the diagonal part of the original DPLR HiPPO matrix is isolated,

$$A_D = \mathrm{diag}(A_{\mathrm{HiPPO}})$$

the resulting kernel converges to the same form as the full S4 kernel in the limit of infinite state dimension. The kernel in that regime matches a scaled Legendre polynomial, so the diagonal approximation recovers the original expressive basis. The result is striking: removing the low-rank correction from an arbitrary DPLR matrix generally changes the kernel dramatically, and the HiPPO construction is unique in surviving this truncation.
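One piece of this structure can be checked numerically. The sketch below (assuming the standard HiPPO-LegS definition, $A_{nk} = -\sqrt{(2n+1)(2k+1)}$ for $n > k$ and $A_{nn} = -(n+1)$) verifies that the normal part of HiPPO-LegS has eigenvalues whose real parts are all exactly $-\frac{1}{2}$, the spectrum that S4D-LegS copies:

```python
import numpy as np

def hippo_legs_normal(N):
    """Normal part of HiPPO-LegS: A + P P^T with P_n = sqrt(n + 1/2).

    Adding back the rank-1 term leaves -1/2 on the diagonal plus a
    skew-symmetric remainder, so all eigenvalues have real part -1/2.
    """
    q = np.sqrt(2.0 * np.arange(N) + 1.0)
    A = -np.tril(np.outer(q, q), -1) - np.diag(np.arange(N) + 1.0)
    P = q / np.sqrt(2.0)
    return A + np.outer(P, P)

eigs = np.linalg.eigvals(hippo_legs_normal(64))
print(np.allclose(eigs.real, -0.5))  # True
```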

4. Design Choices and Implementation Trade-offs

Several practical design parameters impact model behavior:

| Component          | S4            | DSS                 | S4D             |
|--------------------|---------------|---------------------|-----------------|
| Structure          | DPLR          | Diagonal            | Diagonal        |
| Kernel computation | Cauchy matrix | Softmax/Vandermonde | Vandermonde     |
| Initialization     | HiPPO-LegS    | HiPPO-D             | HiPPO-D/Inv/Lin |
  • Discretization: Bilinear (Tustin) and zero-order hold (ZOH) are both effective.
  • Kernel Computation: Vandermonde multiplication is simple, efficient, and does not tie computation to a fixed sequence length (unlike row-normalized softmax).
  • Conjugate Symmetry: Ensures real outputs from complex parameterizations, enforced by grouping parameters in conjugate pairs.
  • Trainability of $B$: Training $B$ yields marginal improvements, but fixed $B$ often suffices.
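
Two of these choices fit in a few lines, as in the sketch below (NumPy; names and shapes are illustrative): ZOH discretization, which is exact channel-wise because $A$ is diagonal, and conjugate symmetry, under which only one mode of each conjugate pair is stored and the output takes twice the real part:

```python
import numpy as np

def zoh(A, B, dt):
    """Zero-order-hold discretization; exact per channel since A is diagonal."""
    dA = np.exp(dt * A)
    dB = (dA - 1.0) / A * B
    return dA, dB

def s4d_kernel(A, B, C, L, dt):
    """Discrete kernel with conjugate symmetry: A, B, C hold N/2 modes,
    and 2*Re(.) restores the contribution of the conjugate pairs."""
    dA, dB = zoh(A, B, dt)
    V = dA ** np.arange(L)[:, None]    # (L, N/2) discrete Vandermonde matrix
    return 2.0 * (V @ (dB * C)).real   # real-valued kernel of length L
```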

5. Performance and Domain Coverage

Diagonal SSMs with careful initialization (“S4D”) achieve competitive or state-of-the-art performance:

  • Image: On sequential CIFAR, accuracy reaches the low-to-mid 90% range.
  • Audio: Speech Commands dataset yields >96% accuracy for S4D variants (~300k parameters) on 35-class tasks.
  • Medical: On BIDMC Vital Signs, root-mean-square error matches or improves on transformer/RNN baselines.
  • Long-range dependencies: On Long Range Arena, S4D-Inv achieves ~85.5% average accuracy, on par with full S4 and transformer methods.

Kernel computation for S4D requires minimal code and is highly efficient, favoring deployment and scaling.
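
As one indication of that simplicity, applying the kernel at training or inference time is a single causal convolution, computable in $O(L \log L)$ with the FFT. A minimal sketch (not the authors' reference code):

```python
import numpy as np

def apply_kernel(u, K):
    """Causal convolution y = K * u via FFT, zero-padded to 2L to avoid wrap-around."""
    L = len(u)
    ft = np.fft.rfft(u, 2 * L) * np.fft.rfft(K, 2 * L)
    return np.fft.irfft(ft, 2 * L)[:L]
```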

6. Summary of Practical Implications

Efficient SSMs with diagonal transition matrices, when constructed with appropriate parameterization and initialization, provide a compelling foundation for sequence modeling. The transformation to a diagonal state matrix enables substantial simplification without sacrificing expressive power, provided the initialization approximates structured spectra, notably the HiPPO-LegS matrix. Design choices, such as discretization scheme, kernel computation strategy, conjugate pairing, and treatment of input matrices, yield only minor differences as long as spectral properties are preserved.

Empirical studies demonstrate robust performance across a diverse range of tasks, and implementation is trivial—often a Vandermonde matrix computation suffices. The diagonal strategy thus offers an attractive balance of rigorously justified expressivity, engineering simplicity, scalability, and broad empirical viability. Efficient SSMs in the diagonal form are established as a practical sequence modeling backbone for robotics, audio, medical time series, and large-scale deep learning pipelines.