The paper introduces FFTNet, an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to achieve global token mixing in $\mathcal{O}(n \log n)$ time. The method transforms inputs into the frequency domain and uses a learnable spectral filter and modReLU activation to emphasize salient frequency components. Experiments on the Long Range Arena (LRA) and ImageNet benchmarks demonstrate performance improvements over fixed Fourier and standard attention models.
The paper addresses the quadratic complexity of conventional self-attention mechanisms, which limits their scalability on long sequences. The authors propose an alternative approach that uses the FFT to perform global token mixing with reduced computational complexity. The method transforms the input sequence into the frequency domain, where orthogonal frequency components encode long-range dependencies. This transformation reduces the computational complexity to $\mathcal{O}(n \log n)$ and preserves the energy of the original signal, as ensured by Parseval's theorem.
A key component of the framework is the integration of a learnable spectral filter. This adaptive component modulates the Fourier coefficients based on a global context vector, enabling the model to dynamically emphasize salient frequency bands. Nonlinear activations are applied to both the real and imaginary parts of the filtered signal to enhance the model's expressivity.
The paper reviews existing methods for improving the efficiency of sequence models, including Fourier-based approaches, linear and sparse approximations, and orthogonal matrix decomposition methods. It discusses the complexity issues of self-attention and highlights the limitations of using a static transform, as in FNet, which cannot adapt to varying inputs or highlight task-specific frequency components.
The adaptive spectral filtering framework combines the computational efficiency of FFT-based transformations with adaptive, context-sensitive filtering and nonlinear processing. This approach offers a balance between scalability and adaptability, which is crucial for complex sequence modeling tasks.
The adaptive spectral filtering method comprises four steps (a code sketch of the full pipeline follows the list):
- Fourier Transform: The discrete Fourier transform (DFT) is applied along the token dimension:
$\mathbf{F} = \operatorname{FFT}(\mathbf{X}) \;\in\; \mathbb{C}^{n \times d}$.
* $\mathbf{X} \in \mathbb{R}^{n \times d}$: Input sequence of length $n$ and embedding dimension $d$
* $\mathbf{F}$: Representation of each embedding across orthogonal frequency components
- Adaptive Spectral Filtering: A learnable filter is used to selectively emphasize important frequencies. A global context vector is computed as:
$\mathbf{c} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{X}_i \;\in\; \mathbb{R}^{d}$,
* $\mathbf{c}$: Global context vector
and passed through a multi-layer perceptron (MLP) to obtain a modulation tensor:
$\Delta\mathbf{H} = \operatorname{MLP}(\mathbf{c})$.
* $\Delta\mathbf{H}$: Modulation tensor
The final filter is defined as
$\mathbf{H} = \mathbf{H}_{\text{base}} + \Delta\mathbf{H}$,
* $\mathbf{H}$: Final filter
* $\mathbf{H}_{\text{base}}$: Fixed base filter
and the adaptive filtering step is:
$\tilde{\mathbf{F}} = \mathbf{F} \odot \mathbf{H}$,
* $\tilde{\mathbf{F}}$: Reweighted Fourier coefficients
which reweights the Fourier coefficients element-wise according to the global context.
- Nonlinear Activation (modReLU): The modReLU activation is applied to capture higher-order relationships in the complex frequency domain, defined for a complex number $z = r e^{i\theta}$ as:
$\operatorname{modReLU}(z) \;=\; \begin{cases} (r + b)\,e^{i\theta}, & \text{if } r + b > 0,\\[6pt] 0, & \text{otherwise}, \end{cases}$
* $z$: A complex number
* $r$: Magnitude of $z$
* $\theta$: Argument of $z$
* $b$: Learnable bias
The element-wise modReLU is then:
$\tilde{\mathbf{F}} = \operatorname{modReLU}\bigl(\tilde{\mathbf{F}}\bigr)$.
- Inverse Fourier Transform: The inverse Fourier transform is applied to return to the token domain:
$\mathbf{Y} = \operatorname{IFFT}\bigl(\tilde{\mathbf{F}}\bigr) \;\in\; \mathbb{R}^{n \times d}$.
* $\mathbf{Y}$: Globally mixed representation that incorporates adaptive filtering and nonlinear transformations
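To make the four steps concrete, here is a minimal PyTorch sketch of the pipeline. The class name, MLP width, and parameter initializations are illustrative assumptions rather than the paper's reference implementation; the real part is taken after the inverse FFT so the output is real-valued.

```python
import torch
import torch.nn as nn

class AdaptiveSpectralFilter(nn.Module):
    """Illustrative sketch of the four-step adaptive spectral filtering pipeline."""

    def __init__(self, seq_len: int, dim: int, hidden: int = 128):
        super().__init__()
        # Fixed base filter H_base and modReLU bias b (both learnable parameters here).
        self.base_filter = nn.Parameter(torch.ones(seq_len, dim))
        self.bias = nn.Parameter(torch.full((seq_len, dim), -0.1))
        # MLP mapping the global context vector c to the modulation tensor Delta H.
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, seq_len * dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, n, d = x.shape
        # 1) Fourier transform along the token dimension: F = FFT(X).
        f = torch.fft.fft(x, dim=1)
        # 2) Adaptive filtering: c = mean over tokens, Delta H = MLP(c), H = H_base + Delta H.
        c = x.mean(dim=1)                              # (batch, d)
        delta_h = self.mlp(c).view(batch, n, d)
        h = self.base_filter + delta_h                 # broadcast to (batch, n, d)
        f_tilde = f * h                                # element-wise reweighting of coefficients
        # 3) modReLU: shift the magnitude by the bias, keep the phase, zero out if non-positive.
        r = f_tilde.abs()
        f_tilde = f_tilde * (torch.relu(r + self.bias) / (r + 1e-8))
        # 4) Inverse FFT back to the token domain (real part).
        return torch.fft.ifft(f_tilde, dim=1).real


# Example: a batch of 2 sequences, 64 tokens, embedding dimension 32.
layer = AdaptiveSpectralFilter(seq_len=64, dim=32)
out = layer(torch.randn(2, 64, 32))   # out.shape == (2, 64, 32)
```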
The paper provides a theoretical justification for using FFT over self-attention, highlighting the efficiency of global mixing, the implicit adaptive attention mechanism, the greater expressivity via nonlinearity, and the energy preservation and stability afforded by Parseval's theorem. The computational complexity of the method is $\mathcal{O}(n \log n)$, which is more scalable than the $\mathcal{O}(n^2)$ complexity of self-attention.
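As a rough back-of-the-envelope comparison of the two scaling behaviors (counting $n \log_2 n \cdot d$ operations for the FFT-based mixing and $n^2 \cdot d$ for the attention score matrix; the sizes below are illustrative and constants are ignored):

```python
import math

n, d = 8192, 512                     # illustrative sequence length and embedding dimension
fft_ops = n * math.log2(n) * d       # ~ n log n * d for FFT-based mixing
attn_ops = n * n * d                 # ~ n^2 * d for the attention score matrix
print(f"FFT mixing: {fft_ops:.2e}  attention: {attn_ops:.2e}  ratio: {attn_ops / fft_ops:.0f}x")
# For these sizes the ratio is n / log2(n), roughly 630x.
```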
The paper includes proofs and theoretical guarantees to justify the method as an efficient surrogate for self-attention. It shows that the DFT matrix satisfies the unitary property, which preserves the norm of the input across the frequency transform. The element-wise multiplication in the frequency domain corresponds to a convolution in the token domain, and suitable choices of the filter $\mathbf{H}$ allow the convolution to approximate self-attention. The use of modReLU on the complex coefficients enriches the effective convolution kernel beyond what a purely linear approach can achieve.
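Both facts are easy to verify numerically; the snippet below is an independent sanity check (not code from the paper) of the convolution theorem and of Parseval's theorem for the DFT:

```python
import numpy as np

rng = np.random.default_rng(0)
x, h = rng.standard_normal(16), rng.standard_normal(16)

# Convolution theorem: element-wise multiplication in the frequency domain
# equals circular convolution in the token domain.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real
circular = np.array([sum(x[j] * h[(i - j) % 16] for j in range(16)) for i in range(16)])
assert np.allclose(via_fft, circular)

# Parseval's theorem: the unitary DFT (norm="ortho") preserves the signal's energy.
assert np.isclose(np.sum(x ** 2), np.sum(np.abs(np.fft.fft(x, norm="ortho")) ** 2))
```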
A theorem is presented to demonstrate that, under mild regularity conditions on the filter $\mathbf{H}$ and the activation, there exists a parameterization of the adaptive spectral layer whose output approximates that of self-attention, and the presence of the nonlinear activation extends the expressive capacity beyond that of purely linear self-attention.
The paper compares FFTNet with FNet, noting that while FNet also employs the DFT to mix tokens, it lacks adaptation to specific input distributions. FFTNet introduces a learnable filter conditioned on a global context vector and incorporates a complex-domain activation (modReLU) to capture higher-order phenomena.
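For contrast, FNet's token-mixing sublayer is a fixed, parameter-free transform; a one-function sketch following the FNet paper's description, assuming a (batch, sequence, hidden) layout:

```python
import torch

def fnet_mixing(x: torch.Tensor) -> torch.Tensor:
    # FNet: 2D FFT (over the hidden dimension, then the sequence dimension), keeping the real part.
    # There are no learnable parameters, so the mixing cannot adapt to the input distribution.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=1).real
```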
The paper evaluates FFTNet on the LRA benchmark and the ImageNet classification task, comparing it to FNet and standard self-attention-based Transformers. The results on LRA show that FFTNet achieves higher accuracy on most tasks, including 37.65% accuracy on ListOps. On ImageNet, FFTNetViT often achieves lower FLOPs than ViT for comparable model sizes while maintaining strong accuracy. Ablation studies confirm the importance of each FFTNet component (spectral gating, adaptive module).