Efficient SSMs with Diagonal Transition Matrices
- The paper demonstrates that diagonal SSMs, with each channel evolving independently, yield significant computational benefits while maintaining state-of-the-art performance.
- It details a mathematical formulation using diagonal state transition matrices and introduces various initialization strategies (S4D-LegS, S4D-Inv, S4D-Lin) that preserve expressive power.
- Empirical evaluations show robust performance on image, audio, medical, and long-range sequence tasks, making this approach attractive for scalable deep sequence modeling.
Efficient state space models (SSMs) with diagonal transition matrices represent a streamlined approach to deep sequence modeling. By restricting the SSM’s transition matrix to be diagonal—i.e. each channel evolves independently—these models achieve significant computational benefits and, with careful parameterization and initialization, retain state-of-the-art performance across image, audio, medical, and long-range sequence tasks. The diagonal form simplifies the internal recurrence, enables fast kernel computation, and, if properly initialized, can mathematically recover the expressive power of more complicated structures adopted in earlier SSM research.
1. Mathematical Formulation and Parameterization
Diagonal state space models (SSMs) are typically defined by discretizing the continuous-time system

$$x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)$$

with a step size $\Delta > 0$, yielding the discrete-time recurrence

$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = \bar{C}\,x_k,$$

where the state transition matrix $A = \operatorname{diag}(a_1, \dots, a_N)$ is diagonal with complex entries $a_n = -\lambda_n + i\,\theta_n$, $\lambda_n > 0$. This form ensures strict stability (negative real part in every eigenvalue) and encodes oscillatory dynamics via the imaginary parts $\theta_n$.
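Because $\bar{A}$ is diagonal, one step of the recurrence is a pure elementwise update, which is what makes autoregressive inference cheap. A minimal NumPy sketch follows; the function name and argument layout are illustrative, not taken from the paper's code:

```python
import numpy as np

def ssm_step(x, u_k, a_bar, b_bar, c):
    """One step of the diagonal SSM recurrence; O(N) per time step.

    x: complex state, shape (N,); u_k: scalar input at step k;
    a_bar, b_bar, c: discretized diagonal transition, input, and
    output vectors, each of shape (N,).
    """
    x = a_bar * x + b_bar * u_k   # diagonal A_bar => elementwise product, no matmul
    y_k = (c @ x).real            # output is real when parameters come in conjugate pairs
    return x, y_k
```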
The convolutional kernel, which determines the mapping from inputs to outputs, is given by

$$\bar{K}_\ell = \bar{C}\,\bar{A}^{\ell}\,\bar{B} = \sum_{n=1}^{N} \bar{C}_n\,\bar{A}_n^{\ell}\,\bar{B}_n, \qquad \ell = 0, 1, \dots, L-1,$$

and, because $\bar{A}$ is diagonal, reduces to a Vandermonde matrix-vector product $\bar{K} = (\bar{B} \odot \bar{C})^{\top} V$ with $V_{n,\ell} = \bar{A}_n^{\ell}$, which is computationally efficient. The parameters $B$ and $C$ are trained or, in some reductions, fixed (e.g., $B = 1$), and ablations show empirically minor sensitivity to their treatment.
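The Vandermonde reduction translates directly into a few lines of code. The sketch below materializes the kernel naively; the helper name `s4d_kernel` is hypothetical, and a production implementation would typically compute the powers in log space for numerical stability:

```python
import numpy as np

def s4d_kernel(a_bar, b_bar, c, L):
    """Length-L convolution kernel of a diagonal SSM.

    Computes K[l] = sum_n c[n] * a_bar[n]**l * b_bar[n] as a single
    matrix-vector product against the Vandermonde matrix
    V[n, l] = a_bar[n] ** l.
    """
    V = a_bar[:, None] ** np.arange(L)[None, :]   # shape (N, L)
    K = (c * b_bar) @ V                           # shape (L,)
    # If only one element of each conjugate pair is stored,
    # return 2 * K.real instead.
    return K.real
```

Storing $V$ explicitly costs $O(NL)$ memory; the naive product is already fast in practice, and the paper notes the Vandermonde structure admits the same asymptotically efficient algorithms as the Cauchy kernel used by S4.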
2. Initialization and Empirical Sensitivity
Initialization critically determines the effectiveness of diagonal SSMs. The diagonal entries must approximate the spectrum of a structured matrix (e.g., the HiPPO-LegS matrix), which is known to generate stable and expressive long-memory kernels. The paper establishes several initialization strategies:
- S4D-LegS: Employs a diagonal version of the HiPPO-LegS matrix, fixing the real part of every eigenvalue at $-\tfrac{1}{2}$ and spacing the imaginary components to mimic the original Legendre polynomial projection.
- S4D-Inv: Sets imaginary parts by an inverse-law spacing, approximating the original kernel’s dynamics.
- S4D-Lin: Uses regularly spaced (Fourier-like) frequencies for the imaginary components.
Empirical ablations reveal that small perturbations to these structured initializations (e.g., randomizing the imaginary part) substantially degrade performance, confirming the necessity of precise spectral matching.
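To make the spacing rules concrete, here is a minimal sketch of the S4D-Inv and S4D-Lin initializations, with the real part fixed at $-\tfrac{1}{2}$ as above; the exact constants follow my reading of the paper's formulas and should be checked against the original before reuse:

```python
import numpy as np

def s4d_inv(N):
    """S4D-Inv: real part -1/2, inverse-law spacing of imaginary parts."""
    n = np.arange(N)
    return -0.5 + 1j * (N / np.pi) * (N / (2 * n + 1) - 1)

def s4d_lin(N):
    """S4D-Lin: real part -1/2, linearly spaced (Fourier-like) frequencies."""
    n = np.arange(N)
    return -0.5 + 1j * np.pi * n
```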
3. Theoretical Justification for the Diagonal Form
A pivotal theorem in the paper proves that, when the diagonal part of the original DPLR (diagonal plus low-rank) HiPPO-LegS matrix is isolated, the resulting kernel converges to the same form as the full S4 kernel in the limit of infinite state dimension. The kernel solution in that regime matches a scaled Legendre polynomial basis, so the diagonal approximation recovers the original expressive basis. The result is striking: removing the low-rank correction from a generic DPLR matrix dramatically alters the kernel, yet the HiPPO-LegS construction is essentially unique in surviving this truncation.
4. Design Choices and Implementation Trade-offs
Several practical design parameters impact model behavior:
| Component | S4 | DSS | S4D |
| --- | --- | --- | --- |
| Structure | DPLR | Diagonal | Diagonal |
| Kernel computation | Cauchy matrix | Softmax/Vandermonde | Vandermonde |
| Initialization | HiPPO-LegS | HiPPO-D | HiPPO-D/Inv/Lin |
- Discretization: Bilinear (Tustin) and zero-order hold (ZOH) are both effective; a ZOH sketch follows this list.
- Kernel Computation: Vandermonde multiplication is simple, efficient, and does not tie computation to a fixed sequence length (unlike row-normalized softmax).
- Conjugate Symmetry: Ensures real outputs from complex parameterizations, enforced by grouping parameters in conjugate pairs.
- Trainability of $B$: Training $B$ yields marginal improvements, and fixing $B = 1$ often suffices.
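As an illustration of the discretization and conjugate-symmetry points above, ZOH for a diagonal system reduces to exact elementwise formulas. This is a generic sketch of the standard technique, with illustrative names:

```python
import numpy as np

def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    a: diagonal entries of A (complex, negative real parts), shape (N,)
    b: input vector B, shape (N,); dt: step size.
    For diagonal A, A_bar = exp(dt * A) and B_bar = A^{-1} (A_bar - I) B
    are exact and purely elementwise.
    """
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar
```

Conjugate symmetry then lets an implementation store parameters for only half the state dimension and recover real outputs by doubling the real part, as in `2 * (...).real`.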
5. Performance and Domain Coverage
Diagonal SSMs with careful initialization (“S4D”) achieve competitive or state-of-the-art performance:
- Image: On sequential CIFAR, accuracy reaches the low-to-mid 90% range.
- Audio: On the 35-class Speech Commands dataset, S4D variants (~300k parameters) exceed 96% accuracy.
- Medical: On BIDMC Vital Signs regression, root-mean-square error matches or improves upon transformer/RNN baselines.
- Long-range dependencies: On Long Range Arena, S4D-Inv achieves ~85.5% average accuracy, on par with full S4 and transformer methods.
Kernel computation for S4D requires minimal code and is highly efficient, favoring deployment and scaling.
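To illustrate that claim, once the kernel $\bar{K}$ is materialized, applying the model over a length-$L$ sequence is a single FFT-based causal convolution. This is the standard technique rather than paper-specific code:

```python
import numpy as np

def causal_conv(u, K):
    """Apply kernel K to input u (both of length L) as a causal convolution.

    Zero-padding to 2L turns the circular FFT convolution into the
    linear (causal) convolution that the SSM recurrence defines.
    """
    L = u.shape[-1]
    Uf = np.fft.rfft(u, n=2 * L)
    Kf = np.fft.rfft(K, n=2 * L)
    return np.fft.irfft(Uf * Kf, n=2 * L)[..., :L]
```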
6. Summary of Practical Implications
Efficient SSMs with diagonal transition matrices, when constructed with appropriate parameterization and initialization, provide a compelling foundation for sequence modeling. The transformation to a diagonal state matrix enables substantial simplification without sacrificing expressive power, provided the initialization approximates structured spectra, notably the HiPPO-LegS matrix. Design choices, such as discretization scheme, kernel computation strategy, conjugate pairing, and treatment of input matrices, yield only minor differences as long as spectral properties are preserved.
Empirical studies demonstrate robust performance across a diverse range of tasks, and implementation is trivial—often a Vandermonde matrix computation suffices. The diagonal strategy thus offers an attractive balance of rigorously justified expressivity, engineering simplicity, scalability, and broad empirical viability. Efficient SSMs in the diagonal form are established as a practical sequence modeling backbone for robotics, audio, medical time series, and large-scale deep learning pipelines.