Scaled Cosine Attention in Transformers
- Scaled Cosine Attention is a transformer mechanism that replaces the dot product with cosine similarity, decoupling magnitude from direction in feature comparison.
- Variants like cosine², CosScale, and Cottention employ L2 normalization and temperature scaling to sharpen attention distributions and enhance stability.
- Empirical results in hyperspectral image classification and language modeling show improved accuracy, memory efficiency, and effective extrapolation.
Scaled Cosine Attention is a family of attention mechanisms in transformer architectures where the raw dot-product similarity between queries and keys is replaced with (optionally scaled) cosine similarity. This decouples magnitude and orientation in feature comparison, providing a sharper angular inductive bias and magnitude invariance, and, when used with appropriate scaling, improved stability, extrapolation, and memory efficiency in both high-dimensional domains (e.g., hyperspectral imagery) and long-sequence regimes (e.g., language modeling). Scaled Cosine Attention encompasses several variants, including cosine-squared scoring, entropy-invariant temperature scaling (CosScale), and linearized ("softmax-free") forms as in Cottention.
1. Geometric and Algorithmic Motivations
Cosine attention mechanisms are motivated by the observation that in many high-dimensional tasks, especially those exhibiting significant magnitude variation (e.g., variations due to illumination or sensor response), the most discriminative information lies in the direction (angle) rather than in the absolute magnitude of feature vectors. Standard dot-product attention computes scores as $a_{ij} = q_i^\top k_j / \sqrt{d_k}$, which is sensitive to both norm and angle; this may amplify irrelevant magnitude effects and dilute meaningful angular relationships between tokens.
Cosine attention achieves magnitude invariance by projecting both queries and keys onto the unit hypersphere before scoring. For high-dimensional data like hyperspectral images, this ensures that similarity better reflects intrinsic spectral structure rather than extrinsic scaling (Ahmad et al., 2 Apr 2026).
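A minimal sketch (plain NumPy, with synthetic random vectors) makes the invariance concrete: scaling a key by a gain factor changes the dot-product score but leaves the cosine score untouched.

```python
# Sketch: cosine scoring is invariant to per-token magnitude; the raw dot
# product is not. All values here are synthetic and for illustration only.
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

k_gained = 5.0 * k  # simulate an extrinsic magnitude distortion (e.g., sensor gain)

print(q @ k, q @ k_gained)                  # dot product scales with the gain
print(cos_sim(q, k), cos_sim(q, k_gained))  # cosine score is unchanged
```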
2. Mathematical Formulations
The core definition of scaled cosine attention replaces raw dot-product logits with cosine similarity, optionally squaring or scaling them. Three main formulations are prominent:
2.1. Cosine-Normalized (Cosine²) Attention
Given query $q_i$ and key $k_j$ vectors per head, both are first $\ell_2$-normalized: $\hat{q}_i = q_i / \lVert q_i \rVert_2$ and $\hat{k}_j = k_j / \lVert k_j \rVert_2$.
The score is then $a_{ij} = (\hat{q}_i^\top \hat{k}_j)^2 = \cos^2\theta_{ij}$. This squared cosine sharpens the distinction between aligned ($\cos\theta_{ij} \approx 1$) and misaligned vectors (Ahmad et al., 2 Apr 2026).
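A minimal sketch of this scoring rule (NumPy; the helper name is ours, not from the paper):

```python
# Cosine² scores: L2-normalize queries and keys row-wise, take the cosine
# similarity matrix, and square it elementwise.
import numpy as np

def cos2_scores(Q, K, eps=1e-8):
    """Q: (n, d), K: (m, d) -> (n, m) squared-cosine logits in [0, 1]."""
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    return (Qh @ Kh.T) ** 2
```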
2.2. Scaled Cosine Attention (CosScale)
The "CosScale" variant introduces a tunable temperature : Here, the norm is enforced or ensured by normalization. The hyperparameter governs sharpness and helps preserve entropy invariance as sequence length increases (Li et al., 15 Jan 2025).
2.3. Linear (Softmax-Free) Cosine Attention
Cottention uses raw (or scaled) cosine similarities without softmax:

$A = s\,\hat{Q}\hat{K}^\top$,

with $s$ being a learned stabilization parameter. The attention output is then $O = AV = s\,\hat{Q}(\hat{K}^\top V)$, bypassing softmax normalization and enabling linear complexity and constant memory (Mongaras et al., 2024).
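Because no row-wise normalization intervenes between the similarity matrix and the values, the matrix product can be regrouped. A minimal non-causal sketch (NumPy; not the paper's CUDA implementation):

```python
# Softmax-free cosine attention: regrouping (Qhat @ Khat.T) @ V as
# Qhat @ (Khat.T @ V) makes the cost linear in sequence length n.
import numpy as np

def linear_cosine_attention(Q, K, V, s=0.5, eps=1e-8):
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    state = Kh.T @ V         # (d, d_v): size independent of n
    return s * (Qh @ state)  # O(n * d * d_v) time, O(d * d_v) extra memory
```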
3. Integration in Transformer Architectures
3.1. Cosine² in Spatial–Spectral Transformers
Integration proceeds analogously to standard multi-head attention, with key steps (see the sketch after this list):
- The token matrix is linearly projected to queries $Q$, keys $K$, and values $V$, and split per head.
- Queries and keys for each head are $\ell_2$-normalized row-wise.
- Cosine similarity matrix is computed and squared elementwise.
- Row-wise softmax yields attention weights.
- Weighted sum over values, head concatenation, and output projection proceed as usual (Ahmad et al., 2 Apr 2026).
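A self-contained sketch of this forward pass (NumPy; the random projection weights stand in for learned parameters):

```python
# Multi-head cosine² attention: project, split heads, L2-normalize Q and K,
# square the cosine similarities, softmax row-wise, and recombine heads.
import numpy as np

def multihead_cos2_attention(X, Wq, Wk, Wv, Wo, n_heads, eps=1e-8):
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split per head: (n_heads, n, dh)
    Q = Q.reshape(n, n_heads, dh).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, dh).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, dh).transpose(1, 0, 2)
    Q /= np.linalg.norm(Q, axis=-1, keepdims=True) + eps   # row-wise L2 norm
    K /= np.linalg.norm(K, axis=-1, keepdims=True) + eps
    logits = (Q @ K.transpose(0, 2, 1)) ** 2               # squared cosine
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                     # row-wise softmax
    out = (A @ V).transpose(1, 0, 2).reshape(n, d)         # concat heads
    return out @ Wo                                        # output projection

rng = np.random.default_rng(0)
n, d, h = 16, 64, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(4))
print(multihead_cos2_attention(X, Wq, Wk, Wv, Wo, h).shape)  # (16, 64)
```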
3.2. CosScale in LLMs
Scaled cosine is slotted in by replacing the usual dot-product-scaled logits with $s \cdot \hat{q}_i^\top \hat{k}_j$ in the softmax. The temperature $s$ is tuned to offset attention-mass dilution as context length increases, preserving effective entropy. The empirical procedure involves sweeping $s$ as sequence length grows (Li et al., 15 Jan 2025).
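The effect of this sweep can be checked directly. A small synthetic experiment (NumPy, random unit vectors) shows mean attention entropy climbing with context length at small $s$ and staying concentrated at larger $s$:

```python
# Mean row entropy of CosScale attention vs. context length n and scale s.
import numpy as np

def mean_attention_entropy(n, d, s, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(n, d)); K = rng.normal(size=(n, d))
    Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
    K /= np.linalg.norm(K, axis=-1, keepdims=True)
    logits = s * (Q @ K.T)
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return -(A * np.log(A + 1e-12)).sum(axis=-1).mean()

for n in (128, 512, 2048):
    print(n, [round(mean_attention_entropy(n, 64, s), 2) for s in (1, 10, 50)])
```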
3.3. Cottention: Linearized Cosine Attention
Queries and keys are normalized, and the cosine similarity matrix is calculated and scaled by a per-head factor $s$. The output is computed as $s\,\hat{Q}(\hat{K}^\top V)$ directly. Key variants include an algorithmic reformulation for causal (autoregressive) processing, which enables streaming evaluation with constant (sequence-length-independent) memory. The mechanism can be interpreted as an unnormalized, RNN-like scan over the input, and implemented efficiently as a fused CUDA kernel (Mongaras et al., 2024).
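A minimal sketch of the causal scan (NumPy; the paper's fused CUDA kernel is not reproduced here):

```python
# Causal, RNN-like cosine attention: a fixed-size (d, d_v) state accumulates
# normalized-key/value outer products, so memory does not grow with length.
import numpy as np

def causal_cosine_attention_stream(Q, K, V, s=0.5, eps=1e-8):
    n, d = Q.shape
    dv = V.shape[-1]
    state = np.zeros((d, dv))           # constant-size recurrent state
    out = np.empty((n, dv))
    for t in range(n):                  # streaming, one token at a time
        k = K[t] / (np.linalg.norm(K[t]) + eps)
        q = Q[t] / (np.linalg.norm(Q[t]) + eps)
        state += np.outer(k, V[t])      # accumulate k_t v_t^T
        out[t] = s * (q @ state)        # attend over the prefix only
    return out
```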
4. Theoretical Analysis and Empirical Effects
Cosine-normalized attention imparts a robust angular inductive bias. It suppresses the influence of tokens with large norm but poor directional alignment, rendering attention robust to extrinsic magnitude distortions (illumination, sensor gain, etc.), especially relevant for hyperspectral classification (Ahmad et al., 2 Apr 2026).
The squaring operation in cosine² further sharpens the attention distribution (lower entropy), enhancing discriminability when classes are angularly proximate. Controlled ablations show consistent superiority of cosine-based scoring, most notably cosine², in low-label, high-dimensional regimes.
For language modeling and long-sequence extrapolation, CosScale controls entropy and combats attention-score dilution. A large $s$ forces the softmax to peak sharply on the most aligned keys, and in the limit, CosScale approaches windowed attention, restricting focus locally (Li et al., 15 Jan 2025).
Cottention's linearized approach enables transformer inference with memory scaling as $O(1)$ in sequence length (rather than $O(n)$), significantly reducing real-world resource requirements on long sequences while maintaining performance rivaling softmax attention (Mongaras et al., 2024).
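The memory gap is easy to quantify with back-of-the-envelope arithmetic (the dimensions below are illustrative):

```python
# Per-head inference memory: a softmax KV cache grows linearly with context
# length n, while the linearized scan keeps only a fixed (d, d_v) state.
d = dv = 64
for n in (1_000, 10_000, 100_000):
    kv_cache = n * (d + dv)   # floats cached by softmax attention
    state = d * dv            # floats held by the linearized form
    print(f"n={n:>7}: KV cache {kv_cache:>12,} floats vs state {state:,}")
```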
5. Experimental Performance and Ablation Results
5.1. Hyperspectral Image Classification
Cosine² and Cosine attention variants consistently rank among the top-performing attention mechanisms under extremely label-scarce (1%) regimes. Highlighted results on three benchmarks ($\kappa$ = Cohen's kappa, OA = Overall Accuracy, AA = Average Accuracy):
| Dataset | Variant | κ | OA | AA |
|---|---|---|---|---|
| Salinas | Cosine² | 99.15 | 99.23 | 99.18 |
| Salinas | SDP | 99.18 | 99.26 | 99.06 |
| Salinas | Dot-prod | 97.75 | 97.98 | 98.02 |
| HH | Cosine² | 96.94 | 97.58 | 93.05 |
| HH | Cosine | 97.02 | 97.64 | 92.38 |
| HH | SDP | 97.87 | 98.32 | 94.48 |
| TD | Cosine | 98.68 | 98.84 | 97.23 |
| TD | Cosine² | 98.17 | 98.39 | 94.92 |
| TD | SDP | 98.30 | 98.51 | 95.88 |
Normalization and squaring ablations confirm that jointly normalizing $Q$ and $K$ and employing the squared cosine score further improve accuracy (Ahmad et al., 2 Apr 2026).
5.2. Long-Context Language Modeling
CosScale on GAU-α and related models achieves substantial improvements when extrapolating far beyond the training context:
| Model | PPL | ACC |
|---|---|---|
| Baseline GAU-α | >500 | <0.1 |
| GAU-α w/ CosScale | 49.45 | 0.32 |
| PoSE w/ CosScale | 22.03 | 0.41 |
| ReRoPE w/ CosScale | 6.36 | 0.63 |
| GAU-α w/ CosScale+InfoSc. | 44.07 | 0.34 |
Model accuracy and perplexity improvements persist when extrapolating up to 64× training length. Entropy-invariant tuning of $s$ is critical for these gains (Li et al., 15 Jan 2025).
5.3. Cottention: Linear Memory Scaling
Experiments with Cottention on BERT and GPT tasks demonstrate performance comparable to softmax-based attention, with substantial memory savings (native linear complexity in sequence length), and learned scaling factors $s$ that decay as training stabilizes (Mongaras et al., 2024).
6. Practical Considerations and Recommendations
- For spatial–spectral Vision Transformers, full $\ell_2$-normalization of $Q$ and $K$, followed by the cosine² score and softmax, provides increased robustness to noise and magnitude shifts.
- In long-context LLMs using CosScale, $s$ should be increased as sequence length grows to counter attention-score dilution; larger values are swept in as the context window extends, while smaller values suffice under windowed masking (Li et al., 15 Jan 2025).
- For linearized cosine attention (Cottention), initialization of the per-head scaling parameter to 0.5 secures early training stability, and dynamic adaptation during optimization removes the need for fixed manual scaling. Custom CUDA kernels are critical for achieving optimal efficiency (Mongaras et al., 2024).
- Monitoring effective entropy, training loss, and gradient flow is essential; excessive scaling can induce vanishing gradients (Li et al., 15 Jan 2025).
7. Implications and Extensions
Scaled cosine attention directly aligns the inductive biases of transformer layers to domains where angular relationships are paramount, such as hyperspectral imagery or long-range sequence modeling. Empirical evidence demonstrates consistent top-rank performance for cosine-based scoring, especially cosine², as well as enabling practical advances in efficient inference on long sequences. A plausible implication is that as transformer-based models are further scaled, the explicit control over attention sharpness and entropy provided by tunable cosine scaling (CosScale) may become critical for robust generalization across both spatial and sequential domains (Ahmad et al., 2 Apr 2026, Li et al., 15 Jan 2025, Mongaras et al., 2024).