
Scaled Cosine Attention in Transformers

Updated 14 April 2026
  • Scaled Cosine Attention is a transformer mechanism that replaces dot-product with cosine similarity to decouple magnitude from directional features.
  • Variants like cosine², CosScale, and Cottention employ L2 normalization and temperature scaling to sharpen attention distributions and enhance stability.
  • Empirical results in hyperspectral image classification and language modeling show improved accuracy, memory efficiency, and effective extrapolation.

Scaled Cosine Attention is a family of attention mechanisms in transformer architectures where the raw dot-product similarity between queries and keys is replaced with (optionally scaled) cosine similarity. This decouples magnitude from orientation in feature comparison, providing a sharper angular inductive bias and magnitude invariance, and, when used with appropriate scaling, improved stability, extrapolation, and memory efficiency in both high-dimensional domains (e.g., hyperspectral imagery) and long-sequence regimes (e.g., language modeling). Scaled Cosine Attention encompasses several variants, including cosine-squared scoring, entropy-invariant temperature scaling, and linearized ("softmax-free") forms as in Cottention.

1. Geometric and Algorithmic Motivations

Cosine attention mechanisms are motivated by the observation that in many high-dimensional tasks, especially those exhibiting significant magnitude variation (e.g., variations due to illumination or sensor response), the most discriminative features lie in the direction (angle) rather than the absolute magnitude of feature vectors. Standard dot-product attention computes scores as $q^\top k$, which is sensitive to both norm and angle; this may amplify irrelevant magnitude effects and dilute meaningful angular relationships between tokens.

Cosine attention achieves magnitude invariance by projecting both queries $q$ and keys $k$ onto the unit hypersphere before scoring. For high-dimensional data like hyperspectral images, this ensures that similarity better reflects intrinsic spectral structure rather than extrinsic scaling (Ahmad et al., 2 Apr 2026).

2. Mathematical Formulations

The core definition of scaled cosine attention replaces raw dot-product logits with cosine similarity, optionally squaring or scaling them. Three main formulations are prominent:

2.1. Cosine-Normalized (Cosine²) Attention

Given query and key vectors per head,

$$\tilde{q} = \frac{q}{\|q\|_2}, \qquad \tilde{k} = \frac{k}{\|k\|_2}$$

The score is then

$$\text{score}(q, k) = (\tilde{q}^\top \tilde{k})^2 = \cos^2\theta$$

This squared cosine sharpens the distinction between aligned ($\theta \approx 0$) and misaligned vectors (Ahmad et al., 2 Apr 2026).
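As a concrete illustration, the cosine² score can be computed in a few lines of NumPy (a minimal sketch; the function name and `eps` guard are illustrative, not from the cited work):

```python
import numpy as np

def cosine_sq_scores(q, k, eps=1e-8):
    """Squared-cosine attention logits.

    q: (n_q, d) queries, k: (n_k, d) keys.
    Returns an (n_q, n_k) score matrix with entries in [0, 1].
    """
    q_hat = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    cos = q_hat @ k_hat.T   # cos(theta) in [-1, 1]
    return cos ** 2         # squaring sharpens aligned vs. misaligned

# A query perfectly aligned with a key scores ~1 regardless of magnitude.
q = np.array([[2.0, 0.0]])
k = np.array([[10.0, 0.0], [0.0, 1.0]])
scores = cosine_sq_scores(q, k)   # ~[[1.0, 0.0]]
```

Note that squaring also maps anti-aligned pairs ($\theta \approx \pi$) to high scores; in the cited setting the subsequent softmax operates on these magnitude-invariant logits.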

2.2. Scaled Cosine Attention (CosScale)

The "CosScale" variant introduces a tunable temperature α\alpha: βij=exp⁔(α cos⁔θij)āˆ‘ā„“=1nexp⁔(α cos⁔θiā„“)\beta_{ij} = \frac{\exp(\alpha\, \cos \theta_{ij})}{\sum_{\ell=1}^n \exp(\alpha\, \cos \theta_{i\ell})} Here, the norm is enforced or ensured by normalization. The hyperparameter α\alpha governs sharpness and helps preserve entropy invariance as sequence length increases (Li et al., 15 Jan 2025).

2.3. Linear (Softmax-Free) Cosine Attention

Cottention uses raw (or scaled) cosine similarities without softmax:

$$S = s\,\mathcal{N}(Q)\,\mathcal{N}(K)^\top$$

with $s$ being a learned stabilization parameter and $\mathcal{N}(\cdot)$ denoting row-wise $\ell_2$ normalization. The attention output is then $SV$, bypassing softmax normalization and enabling linear complexity and constant memory (Mongaras et al., 2024).
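Because no softmax sits between $S$ and $V$, the product can be reassociated as $\mathcal{N}(Q)\,(\mathcal{N}(K)^\top V)$, which is the source of the linear complexity. A NumPy sketch of this softmax-free form (a simplified reading with a scalar $s$, not the fused implementation from the paper):

```python
import numpy as np

def linear_cosine_attention(q, k, v, s=1.0, eps=1e-8):
    """Softmax-free cosine attention, reassociated for linear cost.

    Computes s * N(Q) @ (N(K).T @ V), so the cost is linear in
    sequence length n and the (n x n) similarity matrix is never
    materialized.
    """
    q_hat = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    kv = k_hat.T @ v          # (d, d_v) summary, independent of n
    return s * (q_hat @ kv)
```

The result matches the quadratic form $s\,(\mathcal{N}(Q)\,\mathcal{N}(K)^\top)\,V$ exactly, since only matrix associativity is exploited.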

3. Integration in Transformer Architectures

3.1. Cosine² in Spatial–Spectral Transformers

Integration proceeds analogously to standard multi-head attention but with key steps:

  1. The token matrix is linearly projected to queries $Q$, keys $K$, and values $V$, and split per head.
  2. Queries and keys for each head are $\ell_2$-normalized row-wise.
  3. The cosine similarity matrix is computed and squared elementwise.
  4. Row-wise softmax yields attention weights.
  5. Weighted sum over values, head concatenation, and output projection proceed as usual (Ahmad et al., 2 Apr 2026).
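The five steps above can be sketched end-to-end in NumPy (a minimal single-sequence sketch under stated assumptions; function, weight, and shape conventions are illustrative, not from the cited implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cosine_sq_mha(x, w_q, w_k, w_v, w_o, n_heads, eps=1e-8):
    """Multi-head attention with squared-cosine scoring, per the steps above."""
    # 1. Project and split per head -> (n_heads, n, d_head).
    q = np.stack(np.split(x @ w_q, n_heads, axis=-1))
    k = np.stack(np.split(x @ w_k, n_heads, axis=-1))
    v = np.stack(np.split(x @ w_v, n_heads, axis=-1))
    # 2. Row-wise l2-normalize queries and keys.
    q /= np.linalg.norm(q, axis=-1, keepdims=True) + eps
    k /= np.linalg.norm(k, axis=-1, keepdims=True) + eps
    # 3. Cosine similarity per head, squared elementwise.
    scores = (q @ k.transpose(0, 2, 1)) ** 2
    # 4. Row-wise softmax.
    attn = softmax(scores)
    # 5. Weighted sum over values, concat heads, output projection.
    out = np.concatenate(attn @ v, axis=-1)
    return out @ w_o
```

Relative to standard multi-head attention, only steps 2–3 change; the projections, softmax, and output path are untouched, which is what makes the variant a drop-in replacement.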

3.2. CosScale in LLMs

Scaled cosine attention is slotted in by replacing the usual dot-product-scaled logits with $\alpha \cos\theta_{ij}$ in the softmax. The temperature $\alpha$ is tuned to offset attention-mass dilution as context length increases, preserving effective entropy. The empirical procedure involves sweeping $\alpha$ as sequence length grows (Li et al., 15 Jan 2025).

3.3. Cottention: Linearized Cosine Attention

Queries and keys are normalized, and the cosine similarity matrix is computed, scaled by a learned per-head factor $s$. The output is computed directly as $SV$. Key variants include an algorithmic reformulation for causal (autoregressive) processing, which enables streaming evaluation with constant $O(1)$ memory per step. The mechanism can be interpreted as an unnormalized, RNN-like scan over the input and implemented efficiently as a fused CUDA kernel (Mongaras et al., 2024).
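The RNN-like causal form can be sketched as a running key–value state (a simplified NumPy reading with a scalar $s$, not the fused CUDA kernel):

```python
import numpy as np

def causal_linear_cosine_attention(q, k, v, s=1.0, eps=1e-8):
    """Causal softmax-free cosine attention as a constant-memory scan.

    Maintains a running (d, d_v) state sum over k_hat_t' v_t'^T for
    t' <= t, so per-step memory is O(d * d_v), independent of n.
    """
    q_hat = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    n, d = q.shape
    state = np.zeros((d, v.shape[-1]))
    out = np.empty((n, v.shape[-1]))
    for t in range(n):
        state += np.outer(k_hat[t], v[t])   # accumulate k v^T
        out[t] = s * (q_hat[t] @ state)     # attend over the prefix only
    return out
```

The scan is mathematically identical to applying a lower-triangular (causal) mask to the full similarity matrix before multiplying by $V$, but the per-step state never grows with sequence length.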

4. Theoretical Analysis and Empirical Effects

Cosine-normalized attention imparts a robust angular inductive bias. It suppresses the influence of tokens with large norm but poor directional alignment, rendering attention robust to extrinsic magnitude distortions (illumination, sensor gain, etc.), especially relevant for hyperspectral classification (Ahmad et al., 2 Apr 2026).

The squaring operation in cosine² further sharpens the attention distribution (lower entropy), enhancing discriminability when classes are angularly proximate. Controlled ablations show consistent superiority of cosine-based scoring, most notably cosine², in low-label, high-dimensional regimes.

For language modeling and long-sequence extrapolation, CosScale controls entropy and combats attention-score dilution. A large $\alpha$ forces the softmax to peak sharply on the most aligned keys; in the limit, CosScale approaches windowed attention, restricting focus locally (Li et al., 15 Jan 2025).

Cottention's linearized approach enables transformer inference with memory scaling as $O(1)$ in sequence length (rather than the $O(n)$ of a growing softmax key–value cache), significantly reducing real-world resource requirements on long sequences while maintaining performance rivaling softmax attention (Mongaras et al., 2024).

5. Experimental Performance and Ablation Results

5.1. Hyperspectral Image Classification

Cosine² and Cosine attention variants consistently rank among the top-performing attention mechanisms under extremely label-scarce (1%) regimes. Highlighted results on three benchmarks (OA = Overall Accuracy, $\kappa$ = Cohen's kappa, AA = Average Accuracy):

Dataset  Variant   κ      OA     AA
Salinas  Cosine²   99.15  99.23  99.18
Salinas  SDP       99.18  99.26  99.06
Salinas  Dot-prod  97.75  97.98  98.02
HH       Cosine²   96.94  97.58  93.05
HH       Cosine    97.02  97.64  92.38
HH       SDP       97.87  98.32  94.48
TD       Cosine    98.68  98.84  97.23
TD       Cosine²   98.17  98.39  94.92
TD       SDP       98.30  98.51  95.88

Normalization and squaring ablations confirm that jointly $\ell_2$-normalizing $q$ and $k$ and employing the squared cosine score further improve accuracy (Ahmad et al., 2 Apr 2026).

5.2. Long-Context Language Modeling

CosScale on GAU-α and related models achieves substantial improvements when extrapolating well beyond the training context length:

Model                       PPL    ACC
Baseline GAU-α              >500   <0.1
GAU-α w/ CosScale           49.45  0.32
PoSE w/ CosScale            22.03  0.41
ReRoPE w/ CosScale          6.36   0.63
GAU-α w/ CosScale+InfoSc.   44.07  0.34

Model accuracy and perplexity improvements persist when extrapolating up to 64× the training length. Entropy-invariant tuning of $\alpha$ is critical for these gains (Li et al., 15 Jan 2025).

5.3. Cottention: Linear Memory Scaling

Experiments with Cottention on BERT and GPT tasks demonstrate performance comparable to softmax-based attention, with substantial memory savings (native linear complexity in sequence length) and learned scaling factors $s$ that decay as training stabilizes (Mongaras et al., 2024).

6. Practical Considerations and Recommendations

  • For spatial–spectral Vision Transformers, full $\ell_2$-normalization of $q$ and $k$, followed by cosine² scoring and softmax, provides increased robustness to noise and magnitude shifts.
  • In long-context LLMs using CosScale, $\alpha$ should be increased as sequence length grows to counter attention-score dilution; a practical procedure is to sweep $\alpha$ at each target context length, using smaller values when windowed masking is applied (Li et al., 15 Jan 2025).
  • For linearized cosine attention (Cottention), initialization of the per-head scaling parameter to 0.5 secures early training stability, and dynamic adaptation during optimization removes the need for fixed manual scaling. Custom CUDA kernels are critical for achieving optimal efficiency (Mongaras et al., 2024).
  • Monitoring effective entropy, training loss, and gradient flow is essential; excessive scaling can induce vanishing gradients (Li et al., 15 Jan 2025).
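The entropy-monitoring recommendation above can be operationalized with a small diagnostic (an illustrative helper, not from the cited papers):

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Mean Shannon entropy (nats) of the rows of an attention matrix.

    weights: (n_q, n_k) rows that each sum to 1.
    """
    return float(-(weights * np.log(weights + eps)).sum(axis=-1).mean())

# Sharper (larger-alpha) rows show lower entropy; a collapse toward
# zero entropy during training warns of vanishing gradients.
uniform = np.full((1, 4), 0.25)             # entropy = ln(4) ~ 1.386 nats
peaked = np.array([[0.97, 0.01, 0.01, 0.01]])
```

Tracking this statistic per layer across training steps and context lengths makes attention-mass dilution (entropy drifting up) and over-sharpening (entropy collapsing) both visible.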

7. Implications and Extensions

Scaled cosine attention directly aligns the inductive biases of transformer layers with domains where angular relationships are paramount, such as hyperspectral imagery or long-range sequence modeling. Empirical evidence demonstrates consistent top-rank performance for cosine-based scoring, especially cosine², and enables practical advances in efficient inference on long sequences. A plausible implication is that as transformer-based models are further scaled, the explicit control over attention sharpness and entropy provided by tunable cosine scaling (CosScale) may become critical for robust generalization across both spatial and sequential domains (Ahmad et al., 2 Apr 2026, Li et al., 15 Jan 2025, Mongaras et al., 2024).
