Scaled Cosine Attention in Transformers
- Scaled Cosine Attention is a transformer mechanism that replaces the dot product with cosine similarity, decoupling magnitude from direction in feature comparison.
- Variants like cosine², CosScale, and Cottention employ L2 normalization and temperature scaling to sharpen attention distributions and enhance stability.
- Empirical results in hyperspectral image classification and language modeling show improved accuracy, memory efficiency, and effective extrapolation.
Scaled Cosine Attention is a family of attention mechanisms in transformer architectures where the raw dot-product similarity between queries and keys is replaced with (optionally scaled) cosine similarity. This decouples magnitude and orientation in feature comparison, providing a sharper angular inductive bias and magnitude invariance, and, when used with appropriate scaling, improved stability, extrapolation, and memory efficiency in both high-dimensional domains (e.g., hyperspectral imagery) and long-sequence regimes (e.g., language modeling). Scaled Cosine Attention encompasses several variants, including cosine-squared scoring, entropy-invariant temperature scaling (CosScale), and linearized ("softmax-free") forms as in Cottention.
1. Geometric and Algorithmic Motivations
Cosine attention mechanisms are motivated by the observation that in many high-dimensional tasks, especially those exhibiting significant magnitude variation (e.g., variations due to illumination or sensor response), the most discriminative information lies in the direction (angle) rather than in the absolute magnitude of feature vectors. Standard dot-product attention computes scores as $a_{ij} = q_i^\top k_j / \sqrt{d_k}$, which is sensitive to both norm and angle; this may amplify irrelevant magnitude effects and dilute meaningful angular relationships between tokens.
Cosine attention achieves magnitude invariance by projecting both queries and keys onto the unit hypersphere before scoring. For high-dimensional data like hyperspectral images, this ensures that similarity better reflects intrinsic spectral structure rather than extrinsic scaling (Ahmad et al., 2 Apr 2026).
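A minimal sketch (plain NumPy, with synthetic random vectors) makes the invariance concrete: scaling a key by a gain factor changes the dot-product score but leaves the cosine score untouched.

```python
# Sketch: cosine scoring is invariant to per-token magnitude; the raw dot
# product is not. All values here are synthetic and for illustration only.
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

k_gained = 5.0 * k  # simulate an extrinsic magnitude distortion (e.g., sensor gain)

print(q @ k, q @ k_gained)                  # dot product scales with the gain
print(cos_sim(q, k), cos_sim(q, k_gained))  # cosine score is unchanged
```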
2. Mathematical Formulations
The core definition of scaled cosine attention replaces raw dot-product logits with cosine similarity, optionally squaring or scaling them. Three main formulations are prominent:
2.1. Cosine-Normalized (Cosine²) Attention
Given query $q_i$ and key $k_j$ vectors per head, both are first $\ell_2$-normalized: $\hat{q}_i = q_i / \lVert q_i \rVert_2$ and $\hat{k}_j = k_j / \lVert k_j \rVert_2$.
The score is then $a_{ij} = (\hat{q}_i^\top \hat{k}_j)^2 = \cos^2\theta_{ij}$. This squared cosine sharpens the distinction between aligned ($\cos\theta_{ij} \approx 1$) and misaligned vectors (Ahmad et al., 2 Apr 2026).
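A minimal sketch of this scoring rule (NumPy; the helper name is ours, not from the paper):

```python
# Cosine² scores: L2-normalize queries and keys row-wise, take the cosine
# similarity matrix, and square it elementwise.
import numpy as np

def cos2_scores(Q, K, eps=1e-8):
    """Q: (n, d), K: (m, d) -> (n, m) squared-cosine logits in [0, 1]."""
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    return (Qh @ Kh.T) ** 2
```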
2.2. Scaled Cosine Attention (CosScale)
The "CosScale" variant introduces a tunable temperature : Here, the norm is enforced or ensured by normalization. The hyperparameter governs sharpness and helps preserve entropy invariance as sequence length increases (Li et al., 15 Jan 2025).
2.3. Linear (Softmax-Free) Cosine Attention
Cottention uses raw (or scaled) cosine similarities without softmax:

$A = s\,\hat{Q}\hat{K}^\top$,

with $s$ being a learned stabilization parameter. The attention output is then $O = AV = s\,\hat{Q}(\hat{K}^\top V)$, bypassing softmax normalization and enabling linear complexity and constant memory (Mongaras et al., 2024).
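Because no row-wise normalization intervenes between the similarity matrix and the values, the matrix product can be regrouped. A minimal non-causal sketch (NumPy; not the paper's CUDA implementation):

```python
# Softmax-free cosine attention: regrouping (Qhat @ Khat.T) @ V as
# Qhat @ (Khat.T @ V) makes the cost linear in sequence length n.
import numpy as np

def linear_cosine_attention(Q, K, V, s=0.5, eps=1e-8):
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    state = Kh.T @ V         # (d, d_v): size independent of n
    return s * (Qh @ state)  # O(n * d * d_v) time, O(d * d_v) extra memory
```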
3. Integration in Transformer Architectures
3.1. Cosine² in Spatial–Spectral Transformers
Integration proceeds analogously to standard multi-head attention, with key steps (see the sketch after this list):
- The token matrix is linearly projected to queries $Q$, keys $K$, and values $V$, and split per head.
- Queries and keys for each head are $\ell_2$-normalized row-wise.
- Cosine similarity matrix is computed and squared elementwise.
- Row-wise softmax yields attention weights.
- Weighted sum over values, head concatenation, and output projection proceed as usual (Ahmad et al., 2 Apr 2026).
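A self-contained sketch of this forward pass (NumPy; the random projection weights stand in for learned parameters):

```python
# Multi-head cosine² attention: project, split heads, L2-normalize Q and K,
# square the cosine similarities, softmax row-wise, and recombine heads.
import numpy as np

def multihead_cos2_attention(X, Wq, Wk, Wv, Wo, n_heads, eps=1e-8):
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split per head: (n_heads, n, dh)
    Q = Q.reshape(n, n_heads, dh).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, dh).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, dh).transpose(1, 0, 2)
    Q /= np.linalg.norm(Q, axis=-1, keepdims=True) + eps   # row-wise L2 norm
    K /= np.linalg.norm(K, axis=-1, keepdims=True) + eps
    logits = (Q @ K.transpose(0, 2, 1)) ** 2               # squared cosine
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                     # row-wise softmax
    out = (A @ V).transpose(1, 0, 2).reshape(n, d)         # concat heads
    return out @ Wo                                        # output projection

rng = np.random.default_rng(0)
n, d, h = 16, 64, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(4))
print(multihead_cos2_attention(X, Wq, Wk, Wv, Wo, h).shape)  # (16, 64)
```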
3.2. CosScale in LLMs
Scaled cosine is slotted in by replacing the usual dot-product-scaled logits with $s \cdot \hat{q}_i^\top \hat{k}_j$ in the softmax. The temperature $s$ is tuned to offset attention-mass dilution as context length increases, preserving effective entropy. The empirical procedure involves sweeping $s$ as sequence length grows (Li et al., 15 Jan 2025).
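The effect of this sweep can be checked directly. A small synthetic experiment (NumPy, random unit vectors) shows mean attention entropy climbing with context length at small $s$ and staying concentrated at larger $s$:

```python
# Mean row entropy of CosScale attention vs. context length n and scale s.
import numpy as np

def mean_attention_entropy(n, d, s, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(n, d)); K = rng.normal(size=(n, d))
    Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
    K /= np.linalg.norm(K, axis=-1, keepdims=True)
    logits = s * (Q @ K.T)
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return -(A * np.log(A + 1e-12)).sum(axis=-1).mean()

for n in (128, 512, 2048):
    print(n, [round(mean_attention_entropy(n, 64, s), 2) for s in (1, 10, 50)])
```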
3.3. Cottention: Linearized Cosine Attention
Queries and keys are normalized, and the cosine similarity matrix is calculated and scaled by a per-head factor $s$. The output is computed as $s\,\hat{Q}(\hat{K}^\top V)$ directly. Key variants include an algorithmic reformulation for causal (autoregressive) processing, which enables streaming evaluation with constant (sequence-length-independent) memory. The mechanism can be interpreted as an unnormalized, RNN-like scan over the input, and implemented efficiently as a fused CUDA kernel (Mongaras et al., 2024).
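A minimal sketch of the causal scan (NumPy; the paper's fused CUDA kernel is not reproduced here):

```python
# Causal, RNN-like cosine attention: a fixed-size (d, d_v) state accumulates
# normalized-key/value outer products, so memory does not grow with length.
import numpy as np

def causal_cosine_attention_stream(Q, K, V, s=0.5, eps=1e-8):
    n, d = Q.shape
    dv = V.shape[-1]
    state = np.zeros((d, dv))           # constant-size recurrent state
    out = np.empty((n, dv))
    for t in range(n):                  # streaming, one token at a time
        k = K[t] / (np.linalg.norm(K[t]) + eps)
        q = Q[t] / (np.linalg.norm(Q[t]) + eps)
        state += np.outer(k, V[t])      # accumulate k_t v_t^T
        out[t] = s * (q @ state)        # attend over the prefix only
    return out
```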
4. Theoretical Analysis and Empirical Effects
Cosine-normalized attention imparts a robust angular inductive bias. It suppresses the influence of tokens with large norm but poor directional alignment, rendering attention robust to extrinsic magnitude distortions (illumination, sensor gain, etc.), especially relevant for hyperspectral classification (Ahmad et al., 2 Apr 2026).
The squaring operation in cosine² further sharpens the attention distribution (lower entropy), enhancing discriminability when classes are angularly proximate. Controlled ablations show consistent superiority of cosine-based scoring, most notably cosine², in low-label, high-dimensional regimes.
For language modeling and long-sequence extrapolation, CosScale controls entropy and combats attention-score dilution. A large $s$ forces the softmax to peak sharply on the most aligned keys, and in the limit, CosScale approaches windowed attention, restricting focus locally (Li et al., 15 Jan 2025).
Cottention's linearized approach enables transformer inference with memory scaling as $O(1)$ in sequence length (rather than $O(n)$), significantly reducing real-world resource requirements on long sequences while maintaining performance rivaling softmax attention (Mongaras et al., 2024).
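The memory gap is easy to quantify with back-of-the-envelope arithmetic (the dimensions below are illustrative):

```python
# Per-head inference memory: a softmax KV cache grows linearly with context
# length n, while the linearized scan keeps only a fixed (d, d_v) state.
d = dv = 64
for n in (1_000, 10_000, 100_000):
    kv_cache = n * (d + dv)   # floats cached by softmax attention
    state = d * dv            # floats held by the linearized form
    print(f"n={n:>7}: KV cache {kv_cache:>12,} floats vs state {state:,}")
```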
5. Experimental Performance and Ablation Results
5.1. Hyperspectral Image Classification
Cosine² and Cosine attention variants consistently rank among the top-performing attention mechanisms under extremely label-scarce (1%) regimes. Highlighted results on three benchmarks ($\kappa$ = Cohen's kappa, OA = Overall Accuracy, AA = Average Accuracy):
| Dataset | Variant | κ | OA | AA |
|---|---|---|---|---|
| Salinas | Cosine² | 99.15 | 99.23 | 99.18 |
| Salinas | SDP | 99.18 | 99.26 | 99.06 |
| Salinas | Dot-prod | 97.75 | 97.98 | 98.02 |
| HH | Cosine² | 96.94 | 97.58 | 93.05 |
| HH | Cosine | 97.02 | 97.64 | 92.38 |
| HH | SDP | 97.87 | 98.32 | 94.48 |
| TD | Cosine | 98.68 | 98.84 | 97.23 |
| TD | Cosine² | 98.17 | 98.39 | 94.92 |
| TD | SDP | 98.30 | 98.51 | 95.88 |
Normalization and squaring ablations confirm that jointly normalizing $Q$ and $K$ and employing the squared cosine score further improve accuracy (Ahmad et al., 2 Apr 2026).
5.2. Long-Context Language Modeling
CosScale on GAU-α and related models achieves substantial improvements when extrapolating far beyond the training context:
| Model | PPL | ACC |
|---|---|---|
| Baseline GAU-α | >500 | <0.1 |
| GAU-α w/ CosScale | 49.45 | 0.32 |
| PoSE w/ CosScale | 22.03 | 0.41 |
| ReRoPE w/ CosScale | 6.36 | 0.63 |
| GAU-α w/ CosScale+InfoSc. | 44.07 | 0.34 |
Model accuracy and perplexity improvements persist when extrapolating up to 64× training length. Entropy-invariant tuning of $s$ is critical for these gains (Li et al., 15 Jan 2025).
5.3. Cottention: Linear Memory Scaling
Experiments with Cottention on BERT and GPT tasks demonstrate performance comparable to softmax-based attention, with substantial memory savings (native linear complexity in sequence length), and learned scaling factors $s$ that decay as training stabilizes (Mongaras et al., 2024).
6. Practical Considerations and Recommendations
- For spatial–spectral Vision Transformers, full $\ell_2$-normalization of $Q$ and $K$, followed by the cosine² score and softmax, provides increased robustness to noise and magnitude shifts.
- In long-context LLMs using CosScale, $s$ should be increased as sequence length grows to counter attention-score dilution; larger values are swept in as the context window extends, while smaller values suffice under windowed masking (Li et al., 15 Jan 2025).
- For linearized cosine attention (Cottention), initialization of the per-head scaling parameter to 0.5 secures early training stability, and dynamic adaptation during optimization removes the need for fixed manual scaling. Custom CUDA kernels are critical for achieving optimal efficiency (Mongaras et al., 2024).
- Monitoring effective entropy, training loss, and gradient flow is essential; excessive scaling can induce vanishing gradients (Li et al., 15 Jan 2025).
7. Implications and Extensions
Scaled cosine attention directly aligns the inductive biases of transformer layers to domains where angular relationships are paramount, such as hyperspectral imagery or long-range sequence modeling. Empirical evidence demonstrates consistent top-rank performance for cosine-based scoring, especially cosine², as well as enabling practical advances in efficient inference on long sequences. A plausible implication is that as transformer-based models are further scaled, the explicit control over attention sharpness and entropy provided by tunable cosine scaling (CosScale) may become critical for robust generalization across both spatial and sequential domains (Ahmad et al., 2 Apr 2026, Li et al., 15 Jan 2025, Mongaras et al., 2024).