
Cottention: Cosine Linear Transformers

Updated 27 October 2025
  • The paper introduces Cottention, a cosine-based linear attention mechanism that replaces quadratic softmax attention with efficient content-based aggregation.
  • It leverages L2 normalization and a rearranged computation to achieve linear memory complexity and constant inference memory for long-context tasks.
  • The approach employs custom CUDA kernels and a recurrent formulation, offering scalable performance in NLP, computer vision, and similar domains.

Cottention is a linear transformer attention mechanism that replaces the quadratic-cost softmax attention with efficient content-based aggregation via cosine similarity. By leveraging L2 normalization and rearrangement of the attention computation, Cottention achieves true linear memory complexity and constant inference memory for long-context tasks. This design has foundational implications across transformer architectures in natural language processing, computer vision, and other sequence domains, affecting both modeling capacity and practical deployment scaling.

1. Motivation and Historical Context

Traditional self-attention mechanisms, as popularized by transformer models like BERT and GPT, compute the similarity between queries (Q) and keys (K) using dot-products, followed by softmax normalization. This design results in an $O(s^2)$ memory and compute cost for a sequence of length $s$, limiting scalability for long-context applications. Previous work on linear transformers achieved linear run-time by kernelizing or factorizing the attention computation, but often lost key distributional properties, reduced modeling expressivity, or failed to match softmax's local concentration and adaptivity (Qin et al., 2022, Fan et al., 1 Jul 2025).

Cosine attention variants, including Cottention, cosFormer, and discrete cosine transform (DCT)-augmented methods, replace dot-product similarity with cosine similarity (i.e., normalized inner product) to attain bounded, scale-independent attention scores and open the door to architectures with $O(s)$ or even $O(1)$ memory scaling (Mongaras et al., 27 Sep 2024, Qin et al., 2022). Recent work also highlights that the neglect of query magnitude and the absence of adaptive concentration mechanisms can limit the expressiveness of simple linear attention, motivating stabilized, magnitude-aware, or doubly-stochastic alternatives (Fan et al., 1 Jul 2025, Shahbazi et al., 27 Sep 2025).

2. Mathematical Formulation of Cosine Attention

In Cottention, attention is defined via L2 normalization over queries and keys, producing a cosine similarity matrix:

$$S(Q, K) = \mathcal{N}(Q)\,\mathcal{N}(K)^\top \qquad \text{where} \qquad \mathcal{N}(X)_i = \frac{X_i}{\lVert X_i \rVert_2}$$

The core attention computation is

$$O = S(Q, K)\, V$$

This similarity matrix is stabilized in Cottention by dividing each attention row by $s^{\sigma(m)}$, with $m$ a learned scalar (per head, passed through a sigmoid to interpolate between no scaling and division by $s$). This normalization addresses instability arising from row sums scaling with the sequence length, ensuring consistent attention distribution during training and inference.
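The following PyTorch sketch illustrates this formulation in its naive, quadratic-memory form. The function name, toy shapes, and the handling of the learned scalar $m$ as a plain tensor (rather than a registered parameter) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cosine_attention_naive(Q, K, V, m):
    """Naive O(s^2) reference for bidirectional cosine attention (a sketch).

    Q, K, V: (batch, seq, dim) tensors; m: learned per-head scalar.
    This version materializes the full (seq, seq) similarity matrix, so it
    serves as a correctness reference rather than the efficient rearrangement.
    """
    s = Q.shape[1]
    Qn = F.normalize(Q, p=2, dim=-1)        # N(Q): L2-normalize each query row
    Kn = F.normalize(K, p=2, dim=-1)        # N(K): L2-normalize each key row
    S = Qn @ Kn.transpose(-2, -1)           # cosine similarity matrix S(Q, K)
    S = S / (s ** torch.sigmoid(m))         # divide each row by s^{sigma(m)}
    return S @ V                            # O = S(Q, K) V

# Toy usage
Q, K, V = (torch.randn(2, 16, 64) for _ in range(3))
m = torch.tensor(0.0)                       # would be a learnable parameter in practice
O = cosine_attention_naive(Q, K, V, m)      # shape (2, 16, 64)
```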

For causal Transformers, the rearrangement used for bidirectional models cannot be applied directly because of the causal masking constraint. Instead, Cottention employs a recurrent grouping and cumulative summation, allowing the attention output to be incrementally updated at each time-step (see Section 4).

3. Linear Memory Scaling and Rearrangement

A defining property of Cottention is that the attention computation can be reorganized to exploit associativity and distributivity, shifting the bottleneck from $O(s^2)$ to $O(d^2)$, with $d$ the model dimension per head. In the bidirectional case:

$$O = \mathcal{N}(Q)\, [\mathcal{N}(K)^\top V]$$

This grouping allows one to compute $[\mathcal{N}(K)^\top V]$ once and then aggregate with each query, sidestepping the need to materialize the full attention map. For long sequences ($s \gg d$), this rearrangement yields native linear (or sublinear) memory usage.
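A minimal PyTorch sketch of this rearrangement follows; the function name and stabilization handling mirror the naive sketch above and are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_attention_linear(Q, K, V, m):
    """Rearranged bidirectional form: N(Q) [N(K)^T V] (a sketch).

    The (dim, dim) context matrix [N(K)^T V] is computed once, so the full
    (seq, seq) attention map is never materialized and the extra memory
    cost is O(d^2) rather than O(s^2).
    """
    s = Q.shape[1]
    Qn = F.normalize(Q, p=2, dim=-1)               # N(Q)
    Kn = F.normalize(K, p=2, dim=-1)               # N(K)
    ctx = Kn.transpose(-2, -1) @ V                 # [N(K)^T V], shape (batch, dim, dim)
    return (Qn @ ctx) / (s ** torch.sigmoid(m))    # same s^{sigma(m)} stabilization

# Agrees with the naive form from Section 2 up to floating-point error:
# torch.allclose(cosine_attention_linear(Q, K, V, m),
#                cosine_attention_naive(Q, K, V, m), atol=1e-5)
```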

In the causal setting, where only past context is available, Cottention applies cumulative blockwise reductions (akin to masked attention) while retaining a fixed-size hidden state, enabled by a custom CUDA kernel.

4. Recurrent Neural Network Interpretation

Cottention's grouping can be interpreted as a recurrent accumulation:

$$H_t = H_{t-1} + (V_t \odot K_t) \qquad O_t = (H_t \odot Q_t).\mathrm{sum}(-1)$$

Here, $H_t$ is a running hidden state summarizing past key-value contributions, and $O_t$ is the contextual output for token $t$. Since $H_t$ is fixed size and does not scale with $t$, inference requires constant memory per token, an essential property for generative decoding and streaming applications.

This RNN reformulation permits efficient caching and reuse of hidden states, distinguishing Cottention from vanilla softmax attention (whose key-value cache must scale with $s$). A plausible implication is that systems using Cottention can be deployed on memory-constrained hardware, or generate extremely long outputs without quadratic cache growth.
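A per-token decoding sketch of this recurrence is shown below, assuming a single head and omitting the stabilization scalar for brevity; the function name cottention_decode_step is illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def cottention_decode_step(H, q_t, k_t, v_t):
    """One causal decoding step of the recurrent view (a sketch).

    H:              running (dim, dim) state summarizing past key/value pairs
    q_t, k_t, v_t:  (dim,) vectors for the current token
    Returns the updated state and the output o_t. H never grows with t, so
    per-token inference memory stays constant.
    """
    q_t = F.normalize(q_t, dim=-1)                 # N(Q_t)
    k_t = F.normalize(k_t, dim=-1)                 # N(K_t)
    H = H + v_t[:, None] * k_t[None, :]            # H_t = H_{t-1} + (V_t ⊙ K_t), via broadcasting
    o_t = (H * q_t[None, :]).sum(-1)               # O_t = (H_t ⊙ Q_t).sum(-1)
    return H, o_t

# Streaming usage over a toy sequence
dim = 64
H = torch.zeros(dim, dim)
for q_t, k_t, v_t in zip(torch.randn(10, dim), torch.randn(10, dim), torch.randn(10, dim)):
    H, o_t = cottention_decode_step(H, q_t, k_t, v_t)
```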

5. Implementation and Custom CUDA Kernels

Efficient implementation of Cottention, particularly for causal inference, hinges on optimized kernel design. The custom CUDA kernel

  • Computes cumulative sums and blockwise reductions over normalized key-value tensors
  • Exploits parallelization within $d^2$ blocks across threads
  • Minimizes kernel launches and leverages shared memory to avoid $O(sd^2)$ growth

Both the forward and backward passes are mapped to these kernels, enabling competitive speed and memory efficiency compared to stock PyTorch or generic kernel libraries.
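As a point of reference, the computation that the causal kernel fuses can be expressed in stock PyTorch as a cumulative sum over per-token outer products. This unfused sketch (names are illustrative, stabilization omitted) materializes the very $O(sd^2)$ intermediate that the custom kernel is designed to avoid, which is why the fused implementation matters in practice.

```python
import torch
import torch.nn.functional as F

def causal_cottention_reference(Q, K, V):
    """Unfused reference for the causal computation (a sketch).

    Builds the running state H_t = sum_{i <= t} outer(V_i, K_i) with a
    cumulative sum, then contracts each H_t with the normalized query.
    Materializes a (batch, seq, dim, dim) tensor, i.e. O(s d^2) memory.
    """
    Qn = F.normalize(Q, dim=-1)
    Kn = F.normalize(K, dim=-1)
    outer = V.unsqueeze(-1) * Kn.unsqueeze(-2)     # (b, s, d, d) per-token outer products
    H = torch.cumsum(outer, dim=1)                 # running hidden state at every step
    return (H * Qn.unsqueeze(-2)).sum(-1)          # contract with N(Q): (b, s, d)
```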

Bidirectional variants (e.g. for BERT-style models) benefit from the rearrangement described above; causal variants (e.g. GPT-family) rely on cumulative state updates with causal masking.

6. Performance Characteristics and Benchmark Results

Empirical evaluations of Cottention demonstrate:

  • Comparable GLUE task accuracy versus softmax attention in BERT
  • Similar loss and perplexity trends in GPT-J models with sizes up to 1.2B parameters
  • Linear memory scaling with sequence length for cosine attention (as observed in plots)
  • Less aggressive memory growth with respect to model dimension than predicted by theory
  • The stabilization scalar $m$ decays over training, suggesting that it acts as an early-stage regularizer, with reliance on aggressive normalization diminishing as training converges

For other cosine-based linear transformer variants, including cosFormer and DCT-initialized architectures, competitive results are observed in language modeling, text classification, image classification (ImageNet-1K, CIFAR-10), and long-context benchmarks (Long Range Arena) (Qin et al., 2022, Pan et al., 22 May 2024). Applications span NLP, computer vision, audio, and time series, with application-dependent trade-offs in locality bias, global context, and concentration properties.

7. Advantages, Limitations, and Future Directions

Advantages:

  • True linear memory complexity and constant inference state, critical for scaling to long contexts or deployment on constrained hardware
  • Comparable modeling capacity to softmax attention on diverse benchmarks
  • Amenability to drop-in replacement in existing transformer models; only the attention module needs to be swapped
  • Implementation flexibility (bidirectional/causal, CUDA kernels)

Limitations and Considerations:

  • Cosine similarity normalization ignores the absolute magnitude of the query; it may smooth the attention distribution and decrease local concentration compared to softmax attention (Fan et al., 1 Jul 2025)
  • Stabilization via learnable scalars mitigates instability, but tuning is required for optimal training convergence
  • In causal grouping, some computational overhead may arise from blockwise reductions and kernel management, especially for smaller batch sizes

Future Directions:

  • Exploration of hybrid architectures combining cosine and magnitude-aware kernels to reintroduce adaptive concentration (Fan et al., 1 Jul 2025)
  • Extension of DCT-based initialization and compression within cosine attention frameworks to optimize spectral sensitivity (Pan et al., 22 May 2024)
  • Scaling and benchmarking in ultra-long context tasks (documents, biological sequences, code modeling)
  • Systematic investigation into doubly-stochastic, low-rank optimal transport-based attention for further robustness and balanced information flow (Shahbazi et al., 27 Sep 2025)
  • Additional CUDA kernel optimizations, dynamic normalization strategies, and gated recurrent variants (e.g., GRU/LSTM-style state updates) for further modeling expressivity

Cottention and related cosine attention mechanisms offer a rigorous framework for efficient, scalable, and expressive transformer models, facilitating the processing of increasingly long and diverse sequences with constant or linear resource profile. This family of attention operators stands as a promising direction for future transformer system design, subject to continued empirical and theoretical analysis of concentration, magnitude adaptivity, and normalization trade-offs.
