SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts (2502.18394v7)

Published 25 Feb 2025 in cs.LG

Abstract: Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications, from multi-turn dialogue to high-resolution vision, require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^{2})$ to $\mathcal{O}(L \log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to 7$\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements by adding fewer than 6% parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.

The paper introduces FFTNet, an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to achieve global token mixing in $\mathcal{O}(n \log n)$ time. The method transforms inputs into the frequency domain and uses a learnable spectral filter and modReLU activation to emphasize salient frequency components. Experiments on the Long Range Arena (LRA) and ImageNet benchmarks demonstrate performance improvements over fixed Fourier and standard attention models.

The paper addresses the quadratic complexity of conventional self-attention mechanisms, which limits their scalability on long sequences. The authors propose an alternative approach that uses the FFT to perform global token mixing with reduced computational complexity. The method transforms the input sequence into the frequency domain, where orthogonal frequency components encode long-range dependencies. This transformation reduces the computational complexity to $\mathcal{O}(n \log n)$ and preserves the energy of the original signal, as ensured by Parseval's theorem.
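Both properties are easy to check numerically. The snippet below is a minimal sketch (not from the paper's code) that applies an orthonormal FFT along the token dimension and verifies the Parseval energy identity; the sequence length and embedding dimension are arbitrary choices.

```python
import numpy as np

n, d = 1024, 64                             # illustrative sequence length and embedding dim
X = np.random.randn(n, d)                   # real-valued token embeddings

# DFT along the token dimension; norm="ortho" makes the transform unitary
F = np.fft.fft(X, axis=0, norm="ortho")

# Parseval's theorem: energy is identical in the token and frequency domains
energy_tokens = np.sum(np.abs(X) ** 2)
energy_freq = np.sum(np.abs(F) ** 2)
assert np.allclose(energy_tokens, energy_freq)

# The transform costs O(n log n) per channel, versus the O(n^2) token-token
# interactions computed by dense self-attention.
```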

A key component of the framework is the integration of a learnable spectral filter. This adaptive component modulates the Fourier coefficients based on a global context vector, enabling the model to dynamically emphasize salient frequency bands. Nonlinear activations are applied to both the real and imaginary parts of the filtered signal to enhance the model's expressivity.

The paper reviews existing methods for improving the efficiency of sequence models, including Fourier-based approaches, linear and sparse approximations, and orthogonal matrix decomposition methods. It discusses the complexity issues of self-attention and highlights the limitations of using a static transform, as in FNet, which cannot adapt to varying inputs or highlight task-specific frequency components.

The adaptive spectral filtering framework combines the computational efficiency of FFT-based transformations with adaptive, context-sensitive filtering and nonlinear processing. This approach offers a balance between scalability and adaptability, which is crucial for complex sequence modeling tasks.

The adaptive spectral filtering method comprises four steps (a code sketch of the full pipeline follows the list):

  1. Fourier Transform: The discrete Fourier transform (DFT) is applied along the token dimension:

    $\mathbf{F} = \operatorname{FFT}(\mathbf{X}) \;\in\; \mathbb{C}^{n \times d}$.

* $\mathbf{X} \in \mathbb{R}^{n \times d}$: Input sequence of length $n$ and embedding dimension $d$
* $\mathbf{F} \in \mathbb{C}^{n \times d}$: Representation of each embedding across orthogonal frequency components

  2. Adaptive Spectral Filtering: A learnable filter is used to selectively emphasize important frequencies. A global context vector is computed as:

    $\mathbf{c} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{X}_i$,

* $\mathbf{c}$: Global context vector

and passed through a multi-layer perceptron (MLP) to obtain a modulation tensor:

    $\Delta \mathbf{W} = \operatorname{MLP}(\mathbf{c}) \;\in\; \mathbb{R}^{n \times d}$.

* $\Delta \mathbf{W}$: Modulation tensor

The final filter is defined as

    $\mathbf{W} = \mathbf{W}_{\mathrm{base}} + \Delta \mathbf{W}$,

* $\mathbf{W}$: Final filter
* $\mathbf{W}_{\mathrm{base}}$: Fixed base filter

and the adaptive filtering step is:

    $\tilde{\mathbf{F}} = \mathbf{F} \odot \mathbf{W}$,

* $\tilde{\mathbf{F}}$: Reweighted Fourier coefficients

which reweights the Fourier coefficients element-wise according to the global context.

  3. Nonlinear Activation (modReLU): The modReLU activation is applied to capture higher-order relationships in the complex frequency domain, defined for a complex number $z = re^{i\theta}$ as:

    $\operatorname{modReLU}(z) \;=\; \begin{cases} (r + b)\,e^{i\theta}, & \text{if } r + b > 0, \\ 0, & \text{otherwise}, \end{cases}$

    • $z$: A complex number
    • $r$: Magnitude of $z$
    • $\theta$: Argument of $z$
    • $b$: Learnable bias

    The element-wise modReLU is then:

    $\tilde{\mathbf{F}} = \operatorname{modReLU}\bigl(\tilde{\mathbf{F}}\bigr)$.

  4. Inverse Fourier Transform: The inverse Fourier transform is applied to return to the token domain:

    $\mathbf{Y} = \operatorname{IFFT}\bigl(\tilde{\mathbf{F}}\bigr) \;\in\; \mathbb{R}^{n \times d}$.

* $\mathbf{Y}$: Globally mixed representation that incorporates adaptive filtering and nonlinear transformations
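The four steps map directly onto a few lines of tensor code. The following is a minimal PyTorch sketch rather than the authors' implementation: the use of `rfft`/`irfft` for real-valued inputs, the MLP hidden width, and all module and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveSpectralFilter(nn.Module):
    """Sketch of one FFTNet mixing layer: FFT -> adaptive gate -> modReLU -> IFFT."""

    def __init__(self, seq_len: int, dim: int, hidden: int = 128):
        super().__init__()
        n_freq = seq_len // 2 + 1                              # rfft keeps n/2+1 bins for real input
        self.w_base = nn.Parameter(torch.ones(n_freq, dim))    # fixed base filter W_base
        self.mlp = nn.Sequential(                              # maps context c to modulation Delta W
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, n_freq * dim),
        )
        self.bias = nn.Parameter(torch.zeros(n_freq, dim))     # modReLU bias b
        self.n_freq, self.dim = n_freq, dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), real-valued token embeddings
        b, n, _ = x.shape
        f = torch.fft.rfft(x, dim=1)                           # step 1: FFT along tokens
        c = x.mean(dim=1)                                      # global context vector c
        delta_w = self.mlp(c).view(b, self.n_freq, self.dim)   # Delta W = MLP(c)
        f = f * (self.w_base + delta_w)                        # step 2: W = W_base + Delta W, elementwise gate
        r = f.abs()                                            # step 3: modReLU on z = r e^{i theta}
        f = f * (torch.relu(r + self.bias) / (r + 1e-8))
        return torch.fft.irfft(f, n=n, dim=1)                  # step 4: back to the token domain

y = AdaptiveSpectralFilter(seq_len=256, dim=64)(torch.randn(2, 256, 64))
print(y.shape)  # torch.Size([2, 256, 64])
```

Using the real FFT halves the number of stored frequency bins for real-valued inputs; a full complex FFT with an $n \times d$ filter, as written in the equations above, would behave the same way.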

The paper provides a theoretical justification for using FFT over self-attention, highlighting the efficiency of global mixing, the implicit adaptive attention mechanism, the greater expressivity via nonlinearity, and the energy preservation and stability afforded by Parseval's theorem. The computational complexity of the method is $\mathcal{O}(n \log n)$, which is more scalable than the $\mathcal{O}(n^2)$ complexity of self-attention.

The paper includes proofs and theoretical guarantees to justify the method as an efficient surrogate for self-attention. It shows that the DFT matrix satisfies the unitary property, which preserves the norm of the input across the frequency transform. The element-wise multiplication in the frequency domain corresponds to a convolution in the token domain, and suitable choices of $\mathbf{W}$ allow the convolution to approximate self-attention. The use of modReLU on the complex coefficients enriches the effective convolution kernel beyond what a purely linear approach can achieve.
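The convolution-theorem step of this argument can be illustrated with a short numerical check (again a sketch, not code from the paper): applying a frequency-domain filter element-wise and inverting is identical to circularly convolving the tokens with the kernel whose spectrum is that filter.

```python
import numpy as np

n = 8
x = np.random.randn(n)                            # a single channel of token values
w_freq = np.fft.fft(np.random.randn(n))           # an arbitrary frequency-domain filter W

# Filtering as in FFTNet: multiply the spectrum element-wise, then invert
y_spectral = np.fft.ifft(np.fft.fft(x) * w_freq)

# Equivalent token-domain view: circular convolution with the kernel IFFT(W)
kernel = np.fft.ifft(w_freq)
y_conv = np.array([sum(kernel[j] * x[(i - j) % n] for j in range(n)) for i in range(n)])

assert np.allclose(y_spectral, y_conv)
```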

A theorem is presented to demonstrate that under mild regularity conditions on $\mathbf{W}$ and the activation, there exists a parameterization such that $\mathbf{Y} \approx \mathbf{Y}_{\text{attn}}$, and the presence of the nonlinear activation extends the expressive capacity beyond that of purely linear self-attention.

The paper compares FFTNet with FNet, noting that while FNet also employs the DFT to mix tokens, it lacks adaptation to specific input distributions. FFTNet introduces a learnable filter conditioned on a global context vector and incorporates a complex-domain activation (modReLU) to capture higher-order phenomena.
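For contrast, FNet's mixing step is commonly described as a fixed two-dimensional DFT with only the real part retained. The sketch below (shapes are illustrative assumptions) highlights that this mixing contains no input-dependent or learnable weighting, which is precisely what the adaptive filter and modReLU add.

```python
import torch

def fnet_mixing(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, dim); DFT over the hidden dim, then the sequence dim, keep the real part
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=1).real

y = fnet_mixing(torch.randn(2, 256, 64))   # same output shape, but the mixing is input-independent
```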

The paper evaluates FFTNet on the LRA benchmark and the ImageNet classification task, comparing it to FNet and standard self-attention-based Transformers. The results on LRA show that FFTNet achieves higher accuracy on most tasks, including 37.65% accuracy on ListOps. On ImageNet, FFTNetViT often achieves lower FLOPs than ViT for comparable model sizes while maintaining strong accuracy. Ablation studies confirm the importance of each FFTNet component (spectral gating, adaptive module).

Authors (4)
  1. Jacob Fein-Ashley (12 papers)
  2. Neelesh Gupta (7 papers)
  3. Rajgopal Kannan (65 papers)
  4. Viktor Prasanna (76 papers)