
Frequency-Enhanced State Space Model (FSSM)

Updated 24 January 2026
  • Frequency-Enhanced State Space Model (FSSM) is a class of structured sequence models that explicitly incorporates frequency-domain operations to improve expressivity and efficiency.
  • It employs mechanisms like grouped FIR filtering and FFT-based processing to capture narrowband and broadband signal features effectively.
  • This design yields strong performance in long-context language modeling, image restoration, and spatiotemporal prediction while maintaining computational efficiency.

A Frequency-Enhanced State Space Model (FSSM) is a class of structured sequence models that augment classic state space architectures with explicit frequency-domain operations or parameterizations, enabling superior modeling of signals where discriminative frequency representations are crucial. FSSMs have emerged as a scalable, stable, and computationally efficient approach that bridges recurrent, convolutional, and frequency-centric methodologies in both 1D and 2D domains, notably outperforming existing architectures in tasks like long-context language modeling, image restoration, spatiotemporal dynamics, and signal processing (Meng et al., 2024, Yu et al., 2024, Babaei et al., 13 May 2025).

1. Mathematical Foundations and Classical SSM Structure

The foundation of FSSMs is the continuous and discrete-time State Space Model:

  • Continuous-time:

\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)

  • Discrete-time:

x[k+1] = A\,x[k] + B\,u[k], \qquad y[k] = C\,x[k] + D\,u[k]

Here, x is the state vector, u the input, y the output, and A, B, C, D are learnable system matrices whose structure determines model expressivity and memory (Meng et al., 2024, Yu et al., 2024).
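As an illustration (not tied to any particular paper's parameterization), the discrete-time recurrence can be simulated directly in NumPy with arbitrary toy matrices:

```python
import numpy as np

def ssm_step(A, B, C, D, x, u):
    """One discrete-time SSM update:
    y[k] = C x[k] + D u[k],  x[k+1] = A x[k] + B u[k]."""
    y = C @ x + D @ u
    x_next = A @ x + B @ u
    return x_next, y

# Toy 2-state, single-input, single-output system (illustrative values only).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])       # contractive, so the recurrence is stable
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

x = np.zeros((2, 1))
outputs = []
for k in range(5):
    u = np.array([[1.0]])        # constant unit input
    x, y = ssm_step(A, B, C, D, x, u)
    outputs.append(y.item())
```

With a contractive A, the output converges toward a fixed steady-state value under constant input, which is the memory behavior the initialization schemes discussed below are designed to shape.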

Classically, SSMs are initialized with polynomial (Legendre, HiPPO) or orthogonal bases, resulting in implicit modeling biases: for example, standard initializations cluster the eigenvalues of A near the origin, causing these models to preferentially capture low-frequency components and neglect high-frequency signals (Yu et al., 2024, Babaei et al., 13 May 2025). This effect limits their performance in frequency-rich or broadband settings.

2. Frequency Enhancement Mechanisms

Motivated by the spectral limitations of classic SSMs, FSSMs employ several explicit mechanisms for frequency modeling:

  • Grouped FIR Filtering: By splitting the hidden state into groups, each updated via a finite impulse response (FIR) filter, FSSMs emulate parallel band-pass filters. Each group receives a separate convolution of the input:

y_g[k] = \sum_{\tau=0}^{L-1} k_\tau \,\bigl(B_{k-\tau}\, x_{k-\tau}\bigr), \qquad h^g[k] = \begin{cases} A\,h^g[k-1] + y_g[k], & \text{if}\; g = k \bmod G \\ h^g[k-1], & \text{otherwise} \end{cases}

This enables direct modeling of narrowband and broadband components (Meng et al., 2024).

  • Direct Frequency-Domain Processing: FSSMs in image and spatiotemporal domains often employ FFT or wavelet decompositions (e.g., DWT for 1D, 2D FFT for images), allowing them to apply learned, channel-wise frequency filters and reconstruct outputs via inverse transforms, capturing both amplitude and phase (Zhu et al., 8 Oct 2025, Li et al., 2024, Yamashita et al., 2024, Zhao et al., 2024, Zhang et al., 2024).
  • Frequency-Biased Parameterization: In both SaFARi and gradient-based frequency tuning, the system matrix A is constructed directly in a Fourier (sinusoidal) basis or its spectrum is broadened via scaling:

A^{(\alpha)} = \text{diag}(x_j + i\,\alpha\, y_j^0)

and/or loss/gradient reweighting:

\|f\|_{H^\beta}^2 = \int_{-\infty}^{\infty} (1+|s|)^{2\beta}\,|f(is)|^2\,ds

which allows targeted regularization or amplification of specific frequency bands (Yu et al., 2024, Babaei et al., 13 May 2025).
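As a concrete (and deliberately simplified) sketch of the grouped FIR round-robin update from the first bullet above, with scalar per-group states and hand-picked taps standing in for learned band-pass kernels, one might write:

```python
import numpy as np

def grouped_fir_ssm(u, kernels, A=0.9):
    """Round-robin grouped update: at step k only group g = k mod G integrates
    its FIR-filtered input; all other groups hold their previous state.
    `kernels` is a (G, L) array of per-group FIR taps (toy stand-ins for
    learned band-pass filters); A is a scalar decay for illustration."""
    G, L = kernels.shape
    h = np.zeros(G)                           # one scalar state per group
    states = np.zeros((len(u), G))
    for k in range(len(u)):
        g = k % G
        # y_g[k] = sum_tau kernels[g, tau] * u[k - tau]  (causal FIR)
        window = u[max(0, k - L + 1): k + 1][::-1]
        y_g = float(np.dot(kernels[g, :len(window)], window))
        h[g] = A * h[g] + y_g                 # only the active group updates
        states[k] = h
    return states
```

Because only one group updates per step, each group effectively subsamples the sequence, and its FIR taps determine which frequency band it responds to; the cited papers learn these taps end-to-end rather than fixing them as here.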

3. Efficient Architectures and Computational Gains

Beyond frequency specialization, FSSMs achieve high efficiency and stability through architectural innovations:

  • Semiseparable Matrix Factorizations: Exploiting the structure of convolution kernels reduces the naive O(T^2) cost of SSM convolution to O(G\,r\,T), with r denoting the maximum off-diagonal rank of the semiseparable factors. In practice, this enables linear-time training and inference even for very long sequences (Meng et al., 2024).
  • Attention Sink Mechanisms: Inspired by attention concentration on stream-initial tokens in LLMs, FSSMs introduce learnable prompt states and sink matrices to anchor group memory and mitigate dispersion of long-range sequence information. This directly addresses issues of information loss and instability in long recurrent chains (Meng et al., 2024).
  • Low-Parameter Overhead: Frequency modules (e.g., small convs on FFT/phase channels) incur modest memory and parameter costs, with empirical reductions in total model size compared to SSM or IR baselines at improved accuracy (Zhu et al., 8 Oct 2025, Yamashita et al., 2024).
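The O(T log T) complexity of FFT-enhanced variants comes from the standard FFT convolution identity; a minimal demonstration (generic, not any specific paper's kernel) is:

```python
import numpy as np

def fft_causal_conv(u, kernel):
    """Apply a long convolution kernel in O(T log T) via the FFT,
    matching the O(T^2) direct sum y[k] = sum_tau kernel[tau] * u[k - tau]."""
    T = len(u)
    n = 2 * T                        # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(kernel, n), n)
    return y[:T]
```

Equivalence with the quadratic-cost direct form can be checked against `np.convolve`; for long sequences the FFT route is the only practical option.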

A direct comparison of architectural classes is presented below:

| Mechanism | Complexity | Frequency Sensitivity | Stability |
| --- | --- | --- | --- |
| Classic SSM | O(T^2) | Low (default inits) | Varies |
| Semiseparable SSM (S4/S6) | O(T\,d\,r) | Medium | Improved |
| FSSM (GFSSM) | O(T\,d\,(r + G/L)) | High (learnable) | Stable at fp16 |
| FFT-enhanced SSM | O(T \log T) | Tunable | Stable |

4. Empirical Performance and Benchmarks

FSSMs demonstrate state-of-the-art results across a variety of important benchmarks:

  • Sequence Modeling (Long Range Arena, Path-X, Audio): GFSSM (FSSM) achieves up to 30% MSE reduction on synthetic sinusoid-mix and matches or exceeds Mamba-2 in language modeling at sequence lengths up to 64k, benefiting from grouped FIRs and attention sinks (Meng et al., 2024, Yu et al., 2024).
  • Image Restoration (De-Raining, Denoising): FASSM and DFSSM models, leveraging FFT-based frequency modules in tandem with SSMs, outperform transformer and CNN baselines by up to +0.73 dB PSNR (Rain200L), preserving structural details and suppressing high-amplitude rain artifacts. The models consistently improve SSIM as well. Use of explicit frequency losses further ensures fidelity in amplitude/phase (Zhu et al., 8 Oct 2025, Yamashita et al., 2024).
  • Spatiotemporal Dynamics (Video, Weather): Physical-guided FSSMs incorporate adaptive frequency modules and ARK2 updates, achieving best-in-class spatiotemporal prediction at lower parameter counts (Zhao et al., 2024).
  • Motion Generation: Frequency-adaptive SSMs (FreqSSM) parameterize state transitions conditioned on low/high frequency decompositions, yielding superior text-to-motion FID (0.181 vs. 0.421 for the SSM baseline) (Li et al., 2024).
  • Visual Backbone (ViM-F): S6/Mamba-based backbones incorporating FFT feature fusion match Hybrid ViM/ViT/CNNs in classification and detection with highly efficient, attention-free models (Zhang et al., 2024).

5. Theoretical and Practical Significance

The adoption of explicit frequency enhancement mechanisms confers several core advantages:

  • Expressivity: By learning frequency-aware kernels or bandpass characteristics, FSSMs represent periodic, quasi-periodic, or broadband features that elude polynomial-centric or uninformed recurrent models (Meng et al., 2024, Babaei et al., 13 May 2025).
  • Adaptive Bias Control: Frequency bias is accessible via parameter scaling and gradient regularization, offering modelers direct control over spectral regions of interest—with measurable impact on denoising, pattern discrimination, and extrapolation (Yu et al., 2024).
  • Stability and Robustness: Grouped updates, FIR smoothing, and spectral conditioning mitigate numerical instability (condition number blowup, vanishing activations), enabling fp16 training—whereas classic SSMs may require higher precision (Meng et al., 2024).

6. Relation to Broader Sequence and Signal Modeling Paradigms

FSSMs unify and generalize several streams in machine learning:

  • They subsume classical SSMs, polynomial HiPPO, and convolution-based architectures by embedding frequency-domain selectivity into their recurrence or convolution structure (Babaei et al., 13 May 2025, Meng et al., 2024).
  • Experience from Vision, NLP, and physics-informed modeling confirms FSSMs' flexibility: their inductive biases can be tuned to the dominant signal characteristics (band-limited, periodic, directional).
  • The semiseparable and block-diagonal constructions of modern FSSMs allow plug-and-play integration into streaming, online, and multi-dimensional applications (Meng et al., 2024, Zhang et al., 2024). This encompasses both frame-agnostic pipelines and frequency-specialized designs.

7. Representative Implementations and Usage Guidelines

A recurring FSSM implementation pipeline synthesizes these insights:

  1. Initialization: Choose the SSM parameterization (classic, frame-agnostic, or Fourier basis), set frequency bias parameters (α, β), and the architecture (S4/S6, grouped, diagonal).
  2. Forward Pass:
    • Frequency split or FFT (if employed).
    • State update: grouped FIR, frequency modulation of AA, and application of semiseparable kernels.
    • Fusion and projection, optionally concatenating outputs from SSM, frequency modules, and raw inputs via learnable gates.
  3. Loss and Training: End-to-end with task-specific objectives, optionally including frequency-matching (e.g., L1 on FFT components) or gradient regularization (Sobolev norm).
  4. Inference and Stability: Leverage attention sinks or prompt tokens for long sequence stability; monitor per-frequency response with spectral diagnostics (Meng et al., 2024, Yu et al., 2024, Zhu et al., 8 Oct 2025).
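Step 3's optional frequency-matching term can be sketched as an L1 penalty on FFT amplitude and phase; the exact weighting and components penalized vary across the cited papers, so this is only a generic form:

```python
import numpy as np

def frequency_l1_loss(pred, target):
    """Generic frequency-matching loss: L1 distance between FFT amplitudes
    plus L1 distance between FFT phases (phase differences left unwrapped
    for simplicity; real implementations may treat wrapping differently)."""
    P, G = np.fft.rfft(pred), np.fft.rfft(target)
    amp = np.mean(np.abs(np.abs(P) - np.abs(G)))
    phase = np.mean(np.abs(np.angle(P) - np.angle(G)))
    return amp + phase
```

Such a term is typically added to the task loss with a small weight, so the model is penalized for spectral mismatch even when its time-domain error is already low.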

For developers, selection of the frequency basis (Fourier vs. polynomial), grouping strategy, and explicit frequency loss terms should match the underlying signal characteristics of the problem domain.


FSSMs define a rigorously principled, empirically validated family of models combining the interpretability and lineage of state space theory with modern deep learning practice. By making frequency modeling native to their architecture, they set new baselines for sequence comprehension, restoration, and structured prediction across fields (Meng et al., 2024, Yu et al., 2024, Zhu et al., 8 Oct 2025, Yamashita et al., 2024, Zhang et al., 2024, Li et al., 2024, Babaei et al., 13 May 2025, Zhao et al., 2024).
