- The paper introduces CAT, a novel attention mechanism that reduces self-attention complexity from O(N²) to O(N log N) using FFT-based circular convolution.
- It is built on an Engineering-Isomorphic framework that emphasizes softmax preservation, parameter efficiency, and minimal hyperparameter overhead.
- Experiments on ImageNet-1k and WikiText-103 show that CAT achieves competitive performance in both masked and causal language modeling tasks.
This paper introduces Circular-Convolutional Attention (CAT), a novel attention mechanism for Transformers designed to overcome the quadratic O(N²) complexity bottleneck of standard self-attention. CAT achieves O(N log N) computational complexity while preserving the core global softmax weighting structure, positioning it as an "Engineering-Isomorphic Transformer".
The authors first define the "Engineering-Isomorphism" framework, which outlines four key principles for developing efficient attention alternatives:
- Softmax Preservation: The mechanism must retain a global, data-dependent softmax weighting similar to standard attention.
- Sub-Quadratic Complexity: The computation must be strictly faster than O(N²).
- Comparable or Reduced Parameter Count: The number of learnable parameters should not increase compared to standard attention.
- Minimal Hyperparameter Overhead: The mechanism should not introduce new hyperparameters that depend on sequence length or require extensive tuning.
CAT achieves these goals by replacing the standard QK⊤ matrix multiplication with a circular convolution operation implemented efficiently using the Fast Fourier Transform (FFT).
The core steps are:
- Project the input sequence x ∈ ℝ^{N×D} using a learnable matrix W_A ∈ ℝ^{D×1} to get a vector z = xW_A ∈ ℝ^{N×1}.
- Apply softmax along the sequence dimension of this vector: z∗ = softmax(z).
- Construct a circulant matrix Roll(z∗), where each row is a circular shift of the previous one.
- Project the input x using a value matrix W_V ∈ ℝ^{D×D} to get V = xW_V.
- Compute the output F_cat = Roll(z∗)V (a minimal sketch of these steps follows this list).
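To make the steps concrete, here is a minimal NumPy sketch that follows the list literally, materialising the N×N circulant matrix for clarity. The function and variable names (`cat_attention_naive`, `W_A`, `W_V`) are illustrative assumptions, not the paper's code; the FFT-based formulation described next avoids this explicit construction.

```python
import numpy as np

def cat_attention_naive(x, W_A, W_V):
    """O(N^2) reference for the steps above (illustrative names, not the paper's code).

    x   : (N, D) input sequence
    W_A : (D, 1) learnable projection producing the attention vector
    W_V : (D, D) learnable value projection
    """
    N = x.shape[0]
    z = x @ W_A                                   # (N, 1)
    z_star = np.exp(z - z.max())
    z_star = (z_star / z_star.sum())[:, 0]        # softmax over the sequence axis -> (N,)

    # Circulant matrix Roll(z*): each row is the previous row circularly shifted by one.
    # The shift convention here is chosen so the result matches the FFT formula quoted below.
    idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N
    roll = z_star[idx]                            # roll[n, m] = z_star[(n - m) mod N]

    V = x @ W_V                                   # (N, D)
    return roll @ V                               # F_cat, shape (N, D)
```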
Crucially, the product Roll(z∗)V can be computed in O(N log N) time using FFTs:
F_cat = IFFT(FFT(z∗) ⊙ FFT(V)), where ⊙ denotes element-wise multiplication and the transforms are taken along the sequence dimension (per feature column of V). This avoids constructing the N×N circulant matrix explicitly, reducing both time and memory complexity. The paper also proposes an "Averaged-Key" variant that is closer to standard QKV attention, but finds that the simpler W_A approach (termed qv in the ablations) performs well with fewer parameters.
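Under the same assumptions and illustrative names as the sketch above, the FFT path can be written as follows; it transforms along the sequence axis and never builds the N×N circulant matrix.

```python
def cat_attention_fft(x, W_A, W_V):
    """FFT-based CAT sketch: IFFT(FFT(z*) ⊙ FFT(V)) along the sequence axis.

    Costs O(N log N) per feature dimension and avoids the O(N^2) circulant matrix.
    """
    z = x @ W_A
    z_star = np.exp(z - z.max())
    z_star = (z_star / z_star.sum())[:, 0]                       # (N,)
    V = x @ W_V                                                  # (N, D)
    out = np.fft.ifft(np.fft.fft(z_star)[:, None] *              # broadcast weights over features
                      np.fft.fft(V, axis=0), axis=0)
    return out.real                                              # inputs are real-valued


# Sanity check: both paths should agree up to floating-point error.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
W_A, W_V = rng.normal(size=(8, 1)), rng.normal(size=(8, 8))
assert np.allclose(cat_attention_naive(x, W_A, W_V), cat_attention_fft(x, W_A, W_V))
```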
Experiments were conducted on ImageNet-1k (using ViT CLIP-B/L) and WikiText-103 (using Transformer-XL and GPT-2 small).
- ImageNet-1k: CAT performed particularly well with average pooling, often matching or exceeding standard attention. A variant called CAT-Alter, which replaces only half the attention layers with CAT, showed robust performance across different pooling strategies, sometimes outperforming the baseline.
- WikiText-103: CAT showed significant improvements in masked language modeling (lower perplexity). For causal language modeling, which requires modifying the roll operation so that positions cannot peek at future tokens, CAT-Alter was more competitive, often matching or slightly improving upon standard attention (one possible causal construction is sketched after this list).
- The authors noted training instability with linear attention baselines under their setup.
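For intuition only, below is a hedged sketch of one standard way to make such a convolution causal: zero-pad the FFTs so the convolution becomes linear rather than circular, which keeps position n from mixing in values at positions m > n. This is an assumption about how "preventing future peeking" could be implemented, not necessarily the paper's exact modification of the roll operation; in particular, the paper may also restrict the softmax normalisation to past positions, which this sketch does not do.

```python
def cat_attention_causal_sketch(x, W_A, W_V):
    """Hedged causal variant: output at position n uses only V[m] for m <= n.

    Implemented as a zero-padded (linear) convolution via FFT; illustrative only,
    not claimed to be the paper's exact construction.
    """
    N = x.shape[0]
    z = x @ W_A
    z_star = np.exp(z - z.max())
    z_star = (z_star / z_star.sum())[:, 0]        # note: still normalised over all N positions
    V = x @ W_V
    L = 2 * N                                     # pad so circular wrap-around cannot leak future values
    out = np.fft.ifft(np.fft.fft(z_star, n=L)[:, None] *
                      np.fft.fft(V, n=L, axis=0), axis=0)
    return out[:N].real                           # keep the first N (causal) outputs
```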
Ablation studies confirmed that the qv parameterization (CAT's default) strikes a good balance between performance and parameter efficiency compared to a full qkv (Averaged-Key) split or simpler q-only/v-only versions.
The paper concludes that CAT offers a viable approach to reduce Transformer complexity while adhering to the principles of engineering-isomorphism, making it a promising candidate for scaling models to longer sequences, especially in settings like masked language modeling or vision tasks using average pooling. Future work includes scaling to extremely long sequences, combining CAT with other efficiency techniques, and developing hardware-optimized implementations.