- The paper introduces CAT, a novel attention mechanism that reduces self-attention complexity from O(N²) to O(N log N) using FFT-based circular convolution.
- It is built on an Engineering-Isomorphic framework that emphasizes softmax preservation, parameter efficiency, and minimal hyperparameter overhead.
- Experiments on ImageNet-1k and WikiText-103 show that CAT achieves competitive performance in both masked and causal language modeling tasks.
This paper introduces Circular-Convolutional Attention (CAT), a novel attention mechanism for Transformers designed to overcome the quadratic O(N²) complexity bottleneck of standard self-attention. CAT achieves O(N log N) computational complexity while preserving the core global softmax weighting structure, positioning it as an "Engineering-Isomorphic Transformer".
The authors first define the "Engineering-Isomorphism" framework, which outlines four key principles for developing efficient attention alternatives:
- Softmax Preservation: The mechanism must retain a global, data-dependent softmax weighting similar to standard attention.
- Sub-Quadratic Complexity: The computation must be strictly faster than O(N²).
- Comparable or Reduced Parameter Count: The number of learnable parameters should not increase compared to standard attention.
- Minimal Hyperparameter Overhead: The mechanism should not introduce new hyperparameters that depend on sequence length or require extensive tuning.
CAT achieves these goals by replacing the standard QK⊤ matrix multiplication with a circular convolution operation implemented efficiently using the Fast Fourier Transform (FFT).
The core steps are:
- Project the input sequence x ∈ ℝ^{N×D} using a learnable matrix W_A ∈ ℝ^{D×1} to get a vector z = xW_A ∈ ℝ^{N×1}.
- Apply softmax along the sequence dimension of this vector: z∗ = softmax(z).
- Construct a circulant matrix Roll(z∗), where each row is a circular shift of the previous one.
- Project the input x using a value matrix W_V ∈ ℝ^{D×D} to get V = xW_V.
- Compute the output F_cat = Roll(z∗)V (a minimal sketch of these steps follows this list).
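To make the steps concrete, here is a minimal NumPy sketch that follows the list literally, materialising the N×N circulant matrix for clarity. The function and variable names (`cat_attention_naive`, `W_A`, `W_V`) are illustrative assumptions, not the paper's code; the FFT-based formulation described next avoids this explicit construction.

```python
import numpy as np

def cat_attention_naive(x, W_A, W_V):
    """O(N^2) reference for the steps above (illustrative names, not the paper's code).

    x   : (N, D) input sequence
    W_A : (D, 1) learnable projection producing the attention vector
    W_V : (D, D) learnable value projection
    """
    N = x.shape[0]
    z = x @ W_A                                   # (N, 1)
    z_star = np.exp(z - z.max())
    z_star = (z_star / z_star.sum())[:, 0]        # softmax over the sequence axis -> (N,)

    # Circulant matrix Roll(z*): each row is the previous row circularly shifted by one.
    # The shift convention here is chosen so the result matches the FFT formula quoted below.
    idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N
    roll = z_star[idx]                            # roll[n, m] = z_star[(n - m) mod N]

    V = x @ W_V                                   # (N, D)
    return roll @ V                               # F_cat, shape (N, D)
```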
Crucially, the product Roll(z∗)V can be computed in O(N log N) time using FFTs:
F_cat = IFFT(FFT(z∗) ⊙ FFT(V)), where ⊙ denotes element-wise multiplication and the transforms are taken along the sequence dimension (per feature column of V). This avoids constructing the N×N circulant matrix explicitly, reducing both time and memory complexity. The paper also proposes an "Averaged-Key" variant that is closer to standard QKV attention, but finds that the simpler W_A approach (termed qv in the ablations) performs well with fewer parameters.
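Under the same assumptions and illustrative names as the sketch above, the FFT path can be written as follows; it transforms along the sequence axis and never builds the N×N circulant matrix.

```python
def cat_attention_fft(x, W_A, W_V):
    """FFT-based CAT sketch: IFFT(FFT(z*) ⊙ FFT(V)) along the sequence axis.

    Costs O(N log N) per feature dimension and avoids the O(N^2) circulant matrix.
    """
    z = x @ W_A
    z_star = np.exp(z - z.max())
    z_star = (z_star / z_star.sum())[:, 0]                       # (N,)
    V = x @ W_V                                                  # (N, D)
    out = np.fft.ifft(np.fft.fft(z_star)[:, None] *              # broadcast weights over features
                      np.fft.fft(V, axis=0), axis=0)
    return out.real                                              # inputs are real-valued


# Sanity check: both paths should agree up to floating-point error.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
W_A, W_V = rng.normal(size=(8, 1)), rng.normal(size=(8, 8))
assert np.allclose(cat_attention_naive(x, W_A, W_V), cat_attention_fft(x, W_A, W_V))
```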
Experiments were conducted on ImageNet-1k (using ViT CLIP-B/L) and WikiText-103 (using Transformer-XL and GPT-2 small).
- ImageNet-1k: CAT performed particularly well with average pooling, often matching or exceeding standard attention. A variant called CAT-Alter, which replaces only half the attention layers with CAT, showed robust performance across different pooling strategies, sometimes outperforming the baseline.
- WikiText-103: CAT showed significant improvements in masked language modeling (lower perplexity). For causal language modeling, which requires modifying the roll operation so that positions cannot peek at future tokens, CAT-Alter was more competitive, often matching or slightly improving upon standard attention (one possible causal construction is sketched after this list).
- The authors noted training instability with linear attention baselines under their setup.
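For intuition only, below is a hedged sketch of one standard way to make such a convolution causal: zero-pad the FFTs so the convolution becomes linear rather than circular, which keeps position n from mixing in values at positions m > n. This is an assumption about how "preventing future peeking" could be implemented, not necessarily the paper's exact modification of the roll operation; in particular, the paper may also restrict the softmax normalisation to past positions, which this sketch does not do.

```python
def cat_attention_causal_sketch(x, W_A, W_V):
    """Hedged causal variant: output at position n uses only V[m] for m <= n.

    Implemented as a zero-padded (linear) convolution via FFT; illustrative only,
    not claimed to be the paper's exact construction.
    """
    N = x.shape[0]
    z = x @ W_A
    z_star = np.exp(z - z.max())
    z_star = (z_star / z_star.sum())[:, 0]        # note: still normalised over all N positions
    V = x @ W_V
    L = 2 * N                                     # pad so circular wrap-around cannot leak future values
    out = np.fft.ifft(np.fft.fft(z_star, n=L)[:, None] *
                      np.fft.fft(V, n=L, axis=0), axis=0)
    return out[:N].real                           # keep the first N (causal) outputs
```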
Ablation studies confirmed that the qv parameterization (CAT's default) strikes a good balance between performance and parameter efficiency compared to a full qkv (Averaged-Key) split or simpler q-only/v-only versions.
The paper concludes that CAT offers a viable approach to reduce Transformer complexity while adhering to the principles of engineering-isomorphism, making it a promising candidate for scaling models to longer sequences, especially in settings like masked language modeling or vision tasks using average pooling. Future work includes scaling to extremely long sequences, combining CAT with other efficiency techniques, and developing hardware-optimized implementations.