Cosine Similarity Attention
- Cosine similarity attention is an attention mechanism that employs normalized cosine similarity to achieve scale-invariant, adaptive feature weighting without softmax normalization.
- It integrates seamlessly into multimodal fusion and transformer architectures by leveraging L2 normalization and compact feedforward modules to reduce computational complexity.
- Empirical results show that cosine similarity attention maintains competitive accuracy with enhanced speed, lower memory usage, and improved real-time performance in segmentation and language models.
Cosine similarity attention refers to a family of attention mechanisms that replace the traditional compatibility computation (scaled dot product followed by softmax normalization) with cosine similarity as the central operation for measuring feature alignment. Unlike standard softmax-based attention, cosine similarity attention yields content-adaptive, scale-invariant weighting without reliance on exponential normalization. This structural change can produce substantial computational and architectural advantages in both encoder-decoder fusion networks for multimodal perception and in transformer-based sequence models.
1. Mathematical Foundations and Formulation
Cosine similarity attention mechanisms compute the normalized inner product between feature vectors, yielding an alignment score in the range $[-1, 1]$. For a pair of feature vectors $u$ and $v$, the general formulation is:

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

This metric is applied either at the local (e.g., channel-wise) or global (sequence-wide) level, depending on the architecture.
In the Cosine Similarity Attention Fusion Module (CS-AFM) from the Cosine Similarity Fusion Network (CSFNet), local channel-wise similarity is computed after pooling and reshaping: each channel $c$ of the two modality feature maps $F^{(1)}$ and $F^{(2)}$ yields an alignment score

$$s_c = \frac{f_c^{(1)} \cdot f_c^{(2)}}{\|f_c^{(1)}\|_2 \, \|f_c^{(2)}\|_2},$$

and the resulting modality alignment vector $s$ is mapped via a compact feedforward "attention head" to multiplicative fusion weights:

$$\alpha = \sigma\!\left(W_2\, \delta(W_1 s)\right),$$

with $\delta$ a ReLU and $\sigma$ a sigmoid. The rectified and fused outputs are computed as:

$$\hat{F}^{(m)} = \delta\!\left(\alpha \odot F^{(m)}\right), \qquad F_{\text{fused}} = \hat{F}^{(1)} + \hat{F}^{(2)},$$

where $\odot$ denotes broadcasted channel-wise scaling (Qashqai et al., 2024).
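The channel-wise similarity and the small feedforward head can be sketched as follows. This is a minimal illustration, not the paper's implementation: the head weights (`W1`, `b1`, `W2`, `b2`) and the complementary-weight fusion rule at the end are assumptions chosen for clarity; CS-AFM's exact pooling and rectified-fusion details may differ.

```python
import numpy as np

def cs_afm_fuse(F1, F2, W1, b1, W2, b2):
    """Channel-wise cosine-similarity fusion, a sketch in the spirit of CS-AFM.

    F1, F2: (C, H, W) feature maps from two modality branches.
    W1, b1, W2, b2: a small two-layer "attention head" (shapes assumed here).
    """
    C = F1.shape[0]
    a = F1.reshape(C, -1)  # reshape each channel to a vector
    b = F2.reshape(C, -1)
    # Channel-wise cosine similarity: one alignment score per channel.
    s = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    # Compact feedforward head maps alignment to multiplicative fusion weights.
    h = np.maximum(W1 @ s + b1, 0.0)            # ReLU
    w = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))    # sigmoid -> weights in (0, 1)
    # Broadcasted channel-wise scaling; complementary weighting is an
    # assumption of this sketch.
    return w[:, None, None] * F1 + (1.0 - w)[:, None, None] * F2

rng = np.random.default_rng(1)
C, H, W = 8, 4, 4
F1, F2 = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
W1, b1 = rng.standard_normal((C, C)) * 0.1, np.zeros(C)
W2, b2 = rng.standard_normal((C, C)) * 0.1, np.zeros(C)
fused = cs_afm_fuse(F1, F2, W1, b1, W2, b2)
```

Because the head consumes only a length-$C$ similarity vector, its parameter count is negligible relative to the encoder branches it fuses.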
In Cottention, a variant for transformers, all queries and keys are L2-normalized row-wise:

$$\hat{Q}_i = \frac{Q_i}{\|Q_i\|_2}, \qquad \hat{K}_j = \frac{K_j}{\|K_j\|_2}.$$

The attention matrix is then:

$$A = \frac{\hat{Q}\hat{K}^{\top}}{N^{\sigma(s)}},$$

with $s$ a learned per-head scalar for stabilization ($\sigma$ the sigmoid) and $N$ the sequence length. The output is:

$$O = AV = \frac{\hat{Q}\left(\hat{K}^{\top} V\right)}{N^{\sigma(s)}}$$

(Mongaras et al., 2024).
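A minimal sketch of the bidirectional case makes the key point concrete: because there is no softmax, the product can be reordered as $\hat{Q}(\hat{K}^{\top}V)$ and the $N \times N$ attention matrix never needs to be materialized. The function name and the single-head, unbatched shapes are simplifications for illustration.

```python
import numpy as np

def cottention_bidirectional(Q, K, V, s):
    """Bidirectional cosine-similarity attention (Cottention-style sketch).

    Q, K: (N, d); V: (N, d_v); s: learned per-head stabilization scalar.
    Rows of Q and K are L2-normalized, softmax is dropped, and the matmul is
    reordered so only a (d, d_v) intermediate is formed.
    """
    N = Q.shape[0]
    Qh = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kh = K / np.linalg.norm(K, axis=-1, keepdims=True)
    scale = N ** (1.0 / (1.0 + np.exp(-s)))     # N^{sigmoid(s)} normalizer
    return (Qh @ (Kh.T @ V)) / scale            # associativity: Qh (Kh^T V)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out = cottention_bidirectional(Q, K, V, s=0.0)

# Associativity check: identical to materializing the N x N attention matrix.
Qh = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
Kh = K / np.linalg.norm(K, axis=-1, keepdims=True)
ref = (Qh @ Kh.T) @ V / 6 ** 0.5                # sigmoid(0) = 0.5 -> N^{0.5}
assert np.allclose(out, ref)
```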
2. Architectural Integration and Mechanism Placement
In CSFNet, CS-AFM modules interface dual-branch encoders—one per modality (e.g., RGB, depth, thermal, polarization)—and are placed at several points:
- Encoder Stages 1–3: Each branch produces a feature map, with CS-AFM fusing modalities after every stage to form a single shared representation.
- Encoder Stages 4–5: A single-branch encoder processes the fused features for high-level semantics.
- Decoder: During upsampling, skip-connection fusion utilizes CS-AFM to adaptively merge encoder and decoder features, improving boundary sharpness with negligible parametric overhead.
Cottention replaces softmax-based attention layers entirely in transformers, impacting both bidirectional (BERT) and causal (GPT) architectures. The reordering of the matrix multiplication allows:
- Bidirectional context: Employing associativity, $\hat{Q}(\hat{K}^{\top}V)$, to reduce the attention memory footprint from $\mathcal{O}(N^2)$ to $\mathcal{O}(d^2)$ for the intermediate product, with $d$ the head dimension.
- Causal context: Utilizing cumulative-sum kernels to achieve streaming, non-quadratic inference (Mongaras et al., 2024).
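The causal case described above can be sketched with a cumulative sum over outer products: the running state $S_i = \sum_{j \le i} \hat{k}_j v_j^{\top}$ is a fixed-size $(d, d_v)$ matrix, which is what enables streaming, non-quadratic inference. This unbatched numpy version is an illustration of the idea, not the paper's CUDA kernel; the equivalence check against the masked quadratic form verifies the reformulation.

```python
import numpy as np

def cottention_causal(Q, K, V, s):
    """Causal cosine-similarity attention via cumulative sums (sketch).

    Position i attends only to j <= i. With no softmax, the running sum
    S_i = sum_{j<=i} k_j v_j^T is a (d, d_v) state, so per-step memory is
    independent of sequence length (RNN-style streaming).
    """
    N, d = Q.shape
    Qh = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kh = K / np.linalg.norm(K, axis=-1, keepdims=True)
    scale = N ** (1.0 / (1.0 + np.exp(-s)))
    # Cumulative sum over outer products k_j v_j^T (the recurrent state).
    S = np.cumsum(Kh[:, :, None] * V[:, None, :], axis=0)   # (N, d, d_v)
    return np.einsum('nd,ndv->nv', Qh, S) / scale

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((5, 3)) for _ in range(3))
out = cottention_causal(Q, K, V, s=0.0)

# Equivalence to the masked quadratic computation.
Qh = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
Kh = K / np.linalg.norm(K, axis=-1, keepdims=True)
mask = np.tril(np.ones((5, 5)))
ref = ((Qh @ Kh.T) * mask) @ V / 5 ** 0.5
assert np.allclose(out, ref)
```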
3. Computational Complexity and Efficiency
Cosine similarity attention mechanisms exhibit favorable computational and memory efficiency properties:
- CSFNet (CS-AFM, STDC1 backbone): 11.31M parameters, 47.28 GFLOPs, 106.1 FPS (1024×512 input). A deeper two-branch model would have 12.62M parameters, 86.9 GFLOPs, and 74.4 FPS, indicating substantial savings via early fusion.
- Speed comparison: CSFNet-1 is roughly 2× faster than RGB-D models such as SGACNet (50.1 FPS) while also improving mIoU.
- Cottention: By reordering the matrix product, Cottention achieves $\mathcal{O}(Nd^2)$ time and $\mathcal{O}(d^2)$ intermediate memory for bidirectional attention; with custom CUDA kernels and a recurrent-neural-network reformulation for the causal case, it maintains a memory profile that is constant per step, or at most linear in sequence length.
- CUDA implementation: Cottention deploys fused custom kernels that store normalized keys/values in shared memory, further lowering memory requirements and runtime (Mongaras et al., 2024).
4. Empirical Performance and Quantitative Results
Empirical evaluations demonstrate that cosine similarity attention mechanisms maintain or closely approach the accuracy of conventional methods, with clear gains in speed and memory consumption:
- CSFNet-1, Cityscapes: 74.73% mIoU @ 106.1 FPS.
- CSFNet-2, Cityscapes: 76.36% mIoU @ 72.3 FPS (prior best: 73.3% mIoU @ 50.1 FPS).
- MFNet (RGB-Thermal): CSFNet-1: 56.05% mIoU @ 106.3 FPS; CSFNet-2: 59.98% mIoU @ 72.7 FPS (state-of-the-art CRM-RGBTSeg: 61.4% but much slower).
- ZJU (RGB-Polarization): CSFNet-2: 91.40% mIoU @ 75.0 FPS, outperforming real-time baselines and rivaling slow transformer models.
Ablation studies confirm that leveraging CS-AFM for skip fusion yields a +0.45% mIoU improvement on Cityscapes with negligible parameter overhead.
In transformer models, Cottention achieves nearly identical perplexity and loss to softmax attention in GPT-J experiments, while BERT evaluation on GLUE tasks shows an average performance drop of ~1–2 points with dramatically lower memory consumption (Mongaras et al., 2024).
| Model/Dataset | mIoU (%) | FPS | Param. (M) | Notes |
|---|---|---|---|---|
| CSFNet-1/Cityscapes | 74.73 | 106.1 | 11.31 | Real-time SOTA speed |
| CSFNet-2/Cityscapes | 76.36 | 72.3 | – | Outperforms all prior real-time RGB-X |
| Cottention/BERT Avg. | 81.8 | – | – | ~1–2 pts gap to softmax, much lower mem. |
5. Auxiliary Modules and Fusion Strategies
Cosine similarity attention mechanisms interact efficiently with other modules:
- CSFNet Efficient Context Module: Positioned between encoder and decoder, it captures long-range dependencies via directional convolution but does not integrate CS-AFM internally, instead accepting its outputs.
- Decoder Fusion: Lightweight decoders with learned upsampling utilize CS-AFM for all fusions, rather than plain addition, providing content-aware skip connections with marginal additional computation.
This modular design enables generality across different RGB-X modalities, making the approach adaptable for depth, thermal, and polarization cues (Qashqai et al., 2024).
6. Practical Implementation and Stabilization
Practical deployment of cosine similarity attention mechanisms requires specific normalization and stabilization strategies:
- Normalization: All query and key vectors are L2-normalized per row or channel prior to similarity computation, enforcing uniform scale.
- Scaling and Stabilization: Because cosine-similarity rows lack the bounded row sums that softmax guarantees, Cottention normalizes with a trainable per-head scalar that is passed through a sigmoid and used as an exponent of the sequence length. This preserves stability at initialization and imparts flexibility during training.
- Kernel Efficiency: Custom CUDA kernels for Cottention avoid explicit quadratic intermediate tensors using fused operations and parallel block accumulation (Mongaras et al., 2024).
No additional regularization beyond established dropout or weight decay is required for these mechanisms to train stably.
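The sequence-length-exponent normalizer described above is simple to state in code. This is an illustrative helper (the function name is ours, not the paper's); note that at $s = 0$ the scale reduces to $N^{0.5} = \sqrt{N}$, and training can move it anywhere between $N^0 = 1$ and $N^1 = N$.

```python
import numpy as np

def cottention_norm_scale(s, N):
    """Stabilizing normalizer: sequence length N raised to sigmoid(s).

    s: trainable per-head scalar; N: sequence length. sigmoid maps s into
    (0, 1), so the resulting scale is bounded between 1 and N.
    """
    return N ** (1.0 / (1.0 + np.exp(-s)))

# At initialization (s = 0) the normalizer is sqrt(N).
assert np.isclose(cottention_norm_scale(0.0, 64), 8.0)
```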
7. Scope, Limitations, and Comparisons
Cosine similarity attention provides significant advances in efficiency, especially for multimodal fusion and sequence modeling with long contexts:
- Advantages: Strong modularity (enabling early multimodal fusion), robustness to input scale, state-of-the-art real-time performance across several driving scene datasets, and dramatic reductions in memory complexity for transformer models.
- Limitations: In transformers, cosine similarity attention is typically associated with a small mean drop in accuracy metrics compared to softmax, though these are within competitive margins for many applications. For the most demanding accuracy regimes, conventional softmax attention or resource-intensive transformer backbones (e.g., CMX-B4) may outperform at the cost of throughput (Qashqai et al., 2024, Mongaras et al., 2024).
- Application Areas: Driving scene segmentation (RGB-X), high-throughput semantic segmentation, and transformers processing long sequences.
Cosine similarity attention mechanisms represent a distinct computational and architectural alternative to standard attention, balancing small trade-offs in expressiveness/accuracy with substantial acceleration and scalability gains.