Cottention: Linear Transformers with Cosine Attention
- Cottention is a cosine-based attention mechanism that normalizes queries and keys to achieve linear memory complexity.
- It replaces quadratic softmax attention with cosine similarity, enabling efficient long-sequence and causal decoding.
- Empirical benchmarks on BERT and GPT-J architectures demonstrate near-parity in accuracy with substantially improved memory scaling and, for long sequences, lower latency.
Cottention denotes an attention mechanism for transformers that replaces the softmax-based scoring kernel with a cosine similarity kernel and exploits the resulting associativity to achieve inference-time memory that is natively linear in sequence length (and, for causal decoding, constant). Developed as an alternative to traditional softmax attention, whose quadratic memory complexity limits scalability on long sequences, Cottention demonstrates comparable expressivity on standard benchmarks while offering substantial memory savings and potential computational savings. The concept and architecture were systematically presented and evaluated in "Cottention: Linear Transformers With Cosine Attention" (Mongaras et al., 2024).
1. Motivation: Limitations of Softmax Attention in Transformers
Transformers leveraging self-attention have achieved state-of-the-art results across natural language processing and related domains, in part owing to the expressivity of the softmax-normalized dot-product attention kernel:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $Q, K, V \in \mathbb{R}^{B \times H \times S \times d}$ for batch size $B$, number of heads $H$, sequence length $S$, and key/value dimensionality $d$. The $O(S^2 d)$ time and, more critically, $O(S^2)$ memory cost of this mechanism, due to explicit storage of the $S \times S$ attention map for every head, becomes impractical for large $S$, particularly during inference.
Cottention addresses this bottleneck by dispensing with the softmax normalization in favor of a cosine similarity kernel, enabling algebraic rearrangements that directly yield resource-efficient computation, crucial for long-sequence or streaming contexts (Mongaras et al., 2024).
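To make the bottleneck concrete, the following minimal PyTorch sketch (illustrative sizes, not from the paper) materializes the full attention map that softmax attention requires:

```python
import torch

B, H, S, d = 1, 8, 4096, 64                     # illustrative sizes
Q, K, V = (torch.randn(B, H, S, d) for _ in range(3))

scores = Q @ K.transpose(-1, -2) / d ** 0.5     # (B, H, S, S): quadratic in S
out = torch.softmax(scores, dim=-1) @ V         # (B, H, S, d)
print(f"{scores.numel() * scores.element_size() / 2**20:.0f} MiB for the attention maps")
```

At these sizes the score tensor alone occupies 512 MiB, and it grows quadratically as $S$ increases.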
2. Mathematical Formulation and Core Algorithm
Cottention replaces the softmax attention kernel by computing row-normalized queries and keys, followed by matrix multiplication:

$$\hat{Q} = \frac{Q}{\lVert Q \rVert_2}, \qquad \hat{K} = \frac{K}{\lVert K \rVert_2}, \qquad \mathrm{Cottention}(Q, K, V) = \hat{Q}\hat{K}^\top V.$$

Here, each query and key vector is $L_2$-normalized row-wise, so cosine similarity is computed as the dot product of unit vectors: $\hat{q}_i^\top \hat{k}_j = \cos(q_i, k_j) \in [-1, 1]$. To mitigate the scale growth of summed similarities with sequence length, a scalar parameter $s$ is trained per head; the output is stabilized by dividing through by $S^{\sigma(s)}$ (where $\sigma$ is the sigmoid function), yielding:

$$\mathrm{Cottention}(Q, K, V) = \frac{\hat{Q}\hat{K}^\top V}{S^{\sigma(s)}}.$$
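A minimal PyTorch sketch of this quadratic-form computation follows; the function name, tensor shapes, and the placement of the $S^{\sigma(s)}$ stabilizer are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def cottention_quadratic(Q, K, V, s):
    """Cosine attention in its naive O(S^2) form.
    Q, K, V: (B, H, S, d); s: (H,) learned per-head stabilizer parameters."""
    S = Q.shape[-2]
    Q_hat = F.normalize(Q, dim=-1)               # row-wise L2 normalization
    K_hat = F.normalize(K, dim=-1)
    scores = Q_hat @ K_hat.transpose(-1, -2)     # cosine similarities in [-1, 1]
    norm = S ** torch.sigmoid(s)                 # per-head stabilizer S^{sigmoid(s)}
    return (scores @ V) / norm.view(1, -1, 1, 1)
```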
By associativity, one can compute $\hat{K}^\top V$ first (shape $d \times d$), then multiply by $\hat{Q}$, bypassing storage of the $S \times S$ attention map and reducing memory to $O(Sd + d^2)$. For bidirectional attention, this yields linear scaling in $S$.
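The linear-memory rearrangement only reorders the same matrix products; a sketch under the same assumptions as above:

```python
import torch
import torch.nn.functional as F

def cottention_linear(Q, K, V, s):
    """Associativity-exploiting form: K_hat^T V is only d x d, so no S x S map
    is ever materialized; memory is O(Sd + d^2) instead of O(S^2)."""
    S = Q.shape[-2]
    Q_hat = F.normalize(Q, dim=-1)
    K_hat = F.normalize(K, dim=-1)
    KV = K_hat.transpose(-1, -2) @ V             # (B, H, d, d), independent of S
    norm = S ** torch.sigmoid(s)
    return (Q_hat @ KV) / norm.view(1, -1, 1, 1)

# Agrees with the quadratic form above up to floating-point error:
# torch.allclose(cottention_quadratic(Q, K, V, s), cottention_linear(Q, K, V, s), atol=1e-5)
```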
3. Causal Masking, RNN Reformulation, and Inference Efficiency
For autoregressive (causal) attention, direct factorization is blocked by the triangular mask. The Cottention algorithm circumvents this by reformulating causal attention computation as a recurrent neural network:
- The hidden state at step $t$ is $H_t = \sum_{i \le t} \hat{k}_i v_i^\top \in \mathbb{R}^{d \times d}$, tracked by the recurrence $H_t = H_{t-1} + \hat{k}_t v_t^\top$ (with $H_0 = 0$).
- The output for token $t$ is $o_t = \hat{q}_t^\top H_t$, stabilized as in the bidirectional case.

For streaming or stepwise inference, only $H_t$ need be stored and updated, so total memory remains $O(d^2)$, independent of $S$. This property eliminates the need for storing or recomputing the full past $K$, $V$ tensors ("kv-caching") required by softmax attention.
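The recurrence translates directly into a constant-memory decoding loop. The following PyTorch sketch is illustrative; in particular, the per-position stabilizer $t^{\sigma(s)}$ is an assumption mirroring the bidirectional normalizer, not a detail confirmed here:

```python
import torch
import torch.nn.functional as F

class CottentionDecoder:
    """Constant-memory causal decoding: only the d x d hidden state per head
    is kept between steps (a sketch of the RNN view, not the paper's kernel)."""

    def __init__(self, num_heads, d, s):
        self.H = torch.zeros(num_heads, d, d)    # H_t = sum_{i<=t} k_hat_i v_i^T
        self.t = 0
        self.s = s                               # (num_heads,) stabilizer parameters

    def step(self, q_t, k_t, v_t):
        """One decoding step; q_t, k_t, v_t: (num_heads, d). Returns (num_heads, d)."""
        q_hat = F.normalize(q_t, dim=-1)
        k_hat = F.normalize(k_t, dim=-1)
        self.H += k_hat.unsqueeze(-1) @ v_t.unsqueeze(-2)   # rank-1 update, O(d^2)
        self.t += 1
        norm = self.t ** torch.sigmoid(self.s)   # per-position stabilizer t^{sigmoid(s)} (assumed)
        out = q_hat.unsqueeze(-2) @ self.H       # (num_heads, 1, d)
        return out.squeeze(-2) / norm.unsqueeze(-1)
```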
A custom CUDA kernel implements this algorithm with one thread block per head-row and per-step accumulations, storing only $d^2$ floats per head, enabling low-latency inference.
4. Computational Complexity Analysis
Cottention’s memory and time complexity are outlined in the following table:
| Mechanism | Training Memory | Inference Memory (causal) | Time per step |
|---|---|---|---|
| Softmax attention | $O(S^2)$ | $O(Sd)$ (kv-cache) | $O(Sd)$ |
| Cottention (bidirectional) | $O(Sd + d^2)$ | — | — |
| Cottention (causal, inf.) | — | $O(d^2)$ (const.) | $O(d^2)$ (const.) |
Bidirectional Cottention provides memory linear in $S$; in causal (autoregressive) inference, the memory footprint is constant in $S$, whereas softmax attention always requires an $O(Sd)$ kv-cache.
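A back-of-the-envelope calculation illustrates the gap at fp32 precision (sizes chosen arbitrarily for illustration):

```python
# fp32 memory for 8 heads at S = 16384, d = 64 (illustrative sizes):
S, d, heads, bytes_fp32 = 16384, 64, 8, 4
softmax_maps = heads * S * S * bytes_fp32        # explicit S x S attention maps
cottention_state = heads * d * d * bytes_fp32    # recurrent d x d states
print(f"softmax: {softmax_maps / 2**30:.1f} GiB vs. Cottention: {cottention_state / 2**10:.0f} KiB")
# softmax: 8.0 GiB vs. Cottention: 128 KiB
```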
5. Empirical Evaluation and Benchmarking
Cottention was benchmarked as a drop-in replacement for softmax attention in both BERT (bidirectional) and GPT-J (causal) architectures. Empirical results show:
- On GLUE with BERT, Cottention attains scores within approximately 1.3 points of standard softmax attention on average, indicating near-parity in downstream task accuracy.
- In GPT-J next-token prediction experiments on The Pile, both 300M and 1.2B parameter models achieve final perplexities nearly identical to softmax attention (e.g., 1.2B: softmax 9.5, Cottention 9.6).
- Empirical measurements on A100 GPUs confirm the predicted linear/constant scaling of memory usage with sequence length for Cottention, versus quadratic for softmax.
- Wall-clock times favor Cottention for long sequences (when $S \gg d$), though for large $d$ and short $S$, softmax's lower multiplicative work can yield slightly lower training times.
The learned stabilization exponents $\sigma(s)$ converge to $0.1$–$0.2$ per head after training from an initialization of $0.5$ (i.e., $s = 0$), indicating reduced reliance on normalization at convergence.
6. Implementation and Practical Details
Mongaras et al. provide a fully detailed CUDA kernel for Cottention, exploiting fused operations and memory locality to minimize both peak memory and compute time. Backpropagation is handled via a closed-form reversal of the forward cumulative-sum steps for the $Q$, $K$, and $V$ gradients. No intermediate $S \times S$ attention maps or per-step $H_t$ arrays are stored; only the minimal recurrent state is maintained throughout.
This design supports easy integration into existing transformer codebases as a drop-in replacement for standard attention modules.
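A hypothetical sketch of such a drop-in module, wrapping the linear bidirectional form behind the standard $(B, S, D)$ self-attention interface (layer names, projection layout, and initialization are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CottentionSelfAttention(nn.Module):
    """Bidirectional Cottention behind the usual (B, S, D) self-attention
    interface; a sketch, not the reference implementation."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.s = nn.Parameter(torch.zeros(num_heads))  # sigmoid(0) = 0.5 at init

    def forward(self, x):                              # x: (B, S, D)
        B, S, D = x.shape
        q, k, v = self.qkv(x).view(B, S, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        kv = k.transpose(-1, -2) @ v                   # (B, h, d, d): no S x S map
        y = (q @ kv) / (S ** torch.sigmoid(self.s)).view(1, -1, 1, 1)
        return self.out(y.transpose(1, 2).reshape(B, S, D))
```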
7. Implications and Future Directions
Cottention is distinguished by its ability to match the modeling capacity of softmax attention while reducing memory scaling, especially at inference where constant memory enables long-context generation and streaming. The RNN perspective suggests synergies with continual or online transformers and potentially hybrid architectures incorporating LSTM- or GRU-like gating on the incremental state.
Future work includes scaling Cottention to models beyond 10B parameters, optimizing kernel-level compute to close any remaining throughput gaps versus specialized fast-attention implementations (e.g., FlashAttention), experimenting with alternative normalization schedules, and exploiting Cottention's algebraic structure for low-rank or matrix-factorized key–value pathways.
This reconceptualization of the attention mechanism paves the way to more efficient transformer architectures, especially for resource-constrained or real-time sequence modeling scenarios (Mongaras et al., 2024).