Injective Linear Attention (InLine)
- Injective Linear Attention (InLine) is a linear attention mechanism that restores injectivity, so that distinct queries receive distinct attention weights, by replacing divisive normalization with a subtractive one.
- It keeps linear complexity in the number of tokens, which suits high-resolution vision tasks, and adds an explicit local residual module to strengthen local context modeling.
- Empirical evaluations show that InLine outperforms both Softmax and prior linear attention methods on ImageNet-1K classification and on downstream detection and segmentation, at equal or lower computational cost.
Injective Linear Attention (InLine) is an attention mechanism designed to reconcile the performance gap between conventional Softmax attention and linear attention, particularly in large-scale vision transformer models. InLine achieves linear computational complexity while restoring crucial properties—injectivity and effective local modeling—that underlie the superior expressiveness of Softmax attention. Its core theoretical and algorithmic innovations enable it to outperform both standard Softmax and prior linear attention methods in diverse visual tasks, all while maintaining efficient scaling for high-resolution input (Han et al., 2024).
1. Distinctions Between Softmax and Linear Attention
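Both Softmax and linear attention compute context-sensitive outputs from queries $Q \in \mathbb{R}^{N \times d}$, keys $K \in \mathbb{R}^{N \times d}$, and values $V \in \mathbb{R}^{N \times d}$, outputting $O \in \mathbb{R}^{N \times d}$, where $N$ is the number of tokens and $d$ the channel dimension. They diverge sharply in algorithmic complexity and representational power:
- Softmax attention computes the full similarity matrix, requiring $\mathcal{O}(N^2 d)$ computation:
$$O_i = \sum_{j=1}^{N} \frac{\exp(Q_i^\top K_j)}{\sum_{m=1}^{N} \exp(Q_i^\top K_m)}\, V_j$$
This formulation captures both local and long-range dependencies but becomes computationally prohibitive as $N$ increases.
- Linear attention reduces complexity to $\mathcal{O}(N d^2)$ by substituting the exponential kernel with a feature map $\phi(\cdot)$:
$$O_i = \frac{\left(\sum_{j=1}^{N} V_j\, \phi(K_j)^\top\right) \phi(Q_i)}{\left(\sum_{j=1}^{N} \phi(K_j)\right)^\top \phi(Q_i)}$$
Because matrix multiplication is associative, the key-value aggregates can be computed once and shared by all queries, which gives time linear in $N$. Empirically, however, linear attention consistently underperforms Softmax attention in vision tasks.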
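The reordering that makes linear attention cheap can be checked numerically. The following is a minimal NumPy sketch, assuming a ReLU-based feature map with a small positive offset (chosen here only to keep the normalizer positive) and random inputs. The function names are illustrative and not taken from the paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Quadratic attention: materializes the full N x N similarity matrix."""
    scores = Q @ K.T                                      # (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                                    # (N, d)

def phi(x):
    # Assumed feature map for illustration: ReLU plus a small positive offset.
    return np.maximum(x, 0.0) + 1e-6

def linear_attention_quadratic(Q, K, V):
    """Linear attention evaluated the slow way, through the N x N matrix."""
    sim = phi(Q) @ phi(K).T                               # (N, N)
    return (sim / sim.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V):
    """Same result, reordered so that no N x N matrix is ever formed."""
    kv = phi(K).T @ V                                     # (d, d), shared by all queries
    k_sum = phi(K).sum(axis=0)                            # (d,)
    return (phi(Q) @ kv) / (phi(Q) @ k_sum)[:, None]

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out_slow = linear_attention_quadratic(Q, K, V)
out_fast = linear_attention(Q, K, V)
print(np.allclose(out_slow, out_fast))
```

The two linear variants agree up to floating-point error. Only the second avoids the $N \times N$ intermediate, which is where the $\mathcal{O}(N d^2)$ cost comes from.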
2. Injectivity in Attention Functions
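Injectivity, defined as the property that $p \neq q$ implies $f(p) \neq f(q)$, is crucial for ensuring that distinct queries yield distinct attention weight distributions. For fixed keys $K \in \mathbb{R}^{N \times d}$ (with $\phi(K)$ denoting $\phi$ applied to each row):
- Softmax mapping: $S_K(q) = \mathrm{Softmax}(Kq)$ is injective when $[K, \mathbf{1}_{N \times 1}]$ has full column rank, as $S_K(p) \neq S_K(q)$ for $p \neq q$.
- Linear mapping: $L_K(q) = \phi(K)\phi(q) \,/\, \left(\mathbf{1}^\top \phi(K)\phi(q)\right)$ may fail to distinguish collinear or scaling-equivalent queries, leading to non-injectivity.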
This property is central in preventing semantic confusion during attention computation, which can otherwise result if different queries receive identical attention weights.
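A small numerical illustration of the difference, assuming a ReLU feature map and a hand-picked key matrix (both are choices made for this sketch only): two collinear queries receive different Softmax weights but identical linear attention weights.

```python
import numpy as np

def softmax_weights(q, K):
    """Softmax attention weights of one query over all keys."""
    s = K @ q
    e = np.exp(s - s.max())
    return e / e.sum()

def linear_weights(q, K, phi):
    """Divisively normalized linear attention weights of one query."""
    s = phi(K) @ phi(q)
    return s / s.sum()

def relu(x):
    return np.maximum(x, 0.0)

# Three keys in R^2; the augmented matrix [K, 1] has full column rank.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([0.5, 1.0])
p = 2.0 * q                      # collinear with q, but a different query

soft_q, soft_p = softmax_weights(q, K), softmax_weights(p, K)
lin_q, lin_p = linear_weights(q, K, relu), linear_weights(p, K, relu)
print("softmax weights differ:", not np.allclose(soft_q, soft_p))
print("linear weights identical:", np.allclose(lin_q, lin_p))
```

Scaling the query sharpens the Softmax distribution, whereas the scale factor cancels in the linear attention ratio.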
3. Non-Injectivity of Vanilla Linear Attention
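Proposition 2 establishes that the linear attention weight map $L_K(\cdot)$ is not injective for any continuous $\phi: \mathbb{R}^d \to \mathbb{R}^d$:
- If $\phi$ is non-injective, queries $p \neq q$ with $\phi(p) = \phi(q)$ trivially map to identical outputs: $L_K(p) = L_K(q)$.
- If $\phi$ is injective, invariance of domain implies there exist $p \neq q$ and a scalar $\alpha \neq 1$ such that $\phi(p) = \alpha\,\phi(q)$, resulting in $L_K(p) = L_K(q)$ after kernel normalization.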
Consequently, vanilla linear attention "collapses" the attention behavior of collinear queries, significantly reducing its discriminative capacity relative to Softmax attention (Han et al., 2024).
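The collapse for collinear feature vectors follows in one line. As a short worked step, write $L_K(q)_j$ for the $j$-th linear attention weight of query $q$ (notation assumed here) and suppose $\phi(p) = \alpha\,\phi(q)$ for some scalar $\alpha \neq 0$:

```latex
L_K(p)_j
  = \frac{\phi(p)^\top \phi(K_j)}{\sum_{m=1}^{N} \phi(p)^\top \phi(K_m)}
  = \frac{\alpha\, \phi(q)^\top \phi(K_j)}{\alpha \sum_{m=1}^{N} \phi(q)^\top \phi(K_m)}
  = L_K(q)_j,
  \qquad j = 1, \dots, N.
```

The scale factor cancels between numerator and denominator, so divisive normalization discards exactly the information that separates $p$ from $q$.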
4. The InLine Mechanism: Injective Linear Attention
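InLine resolves the injectivity limitation by replacing divisive normalization with a subtractive form, guaranteeing that the attention weights of each query sum to one and enforcing injectivity under mild rank conditions:
$$\mathrm{InL}_K(Q_i) = \left[\phi(Q_i)^\top \phi(K_1), \dots, \phi(Q_i)^\top \phi(K_N)\right]^\top - \frac{1}{N} \sum_{s=1}^{N} \phi(Q_i)^\top \phi(K_s) + \frac{1}{N}$$
$$O_i = V^\top\, \mathrm{InL}_K(Q_i)$$
Here the two scalar terms are added to every entry of the vector. Under full-rank assumptions for $\phi(K)$ and the augmented matrix $[\phi(K), \mathbf{1}_{N \times 1}]$, and with an injective $\phi$, the resulting attention map $\mathrm{InL}_K$ is injective. The InLine computation is efficiently realized in time linear in the number of tokens: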
- Precompute:
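- $A = \sum_{j=1}^{N} \phi(K_j)\, V_j^\top$
- $b = \sum_{j=1}^{N} \phi(K_j)$
- $\bar{v} = \frac{1}{N} \sum_{j=1}^{N} V_j$
- Compute outputs:
$$O_i = A^\top \phi(Q_i) - \left(b^\top \phi(Q_i) - 1\right) \bar{v}$$
with $\mathcal{O}(N d^2)$ cost.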
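The precompute-then-apply steps above can be sketched in NumPy as follows. This is an illustrative sketch, assuming the identity kernel and random inputs; the function and variable names are hypothetical. It checks the linear-time output against an explicit $N \times N$ evaluation of the subtractively normalized weights.

```python
import numpy as np

def inline_weights(Q, K, phi):
    """Explicit N x N InLine attention weights (used for checking only)."""
    sim = phi(Q) @ phi(K).T                                 # (N, N)
    # Subtract each row's mean similarity, then add 1/N to every entry.
    return sim - sim.mean(axis=1, keepdims=True) + 1.0 / K.shape[0]

def inline_attention(Q, K, V, phi):
    """Linear-time InLine output from three precomputed aggregates."""
    kv = phi(K).T @ V              # sum_j phi(K_j) V_j^T, shape (d, d)
    k_sum = phi(K).sum(axis=0)     # sum_j phi(K_j),       shape (d,)
    v_mean = V.mean(axis=0)        # (1/N) sum_j V_j,      shape (d,)
    fq = phi(Q)                    # (N, d)
    return fq @ kv - (fq @ k_sum - 1.0)[:, None] * v_mean[None, :]

def identity(x):
    return x                       # identity kernel

rng = np.random.default_rng(0)
N, d = 50, 6
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

W = inline_weights(Q, K, identity)
out_fast = inline_attention(Q, K, V, identity)
print(np.allclose(W.sum(axis=1), 1.0))     # each row of weights sums to one
print(np.allclose(W @ V, out_fast))        # both evaluations give the same output
```

Note that, unlike Softmax weights, the weights in this sketch are not constrained to be nonnegative; only their sum is fixed.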
5. Enhancing Local Modeling
Empirical analyses reveal that Softmax attention exhibits a strong local bias in early layers, which is critical for visual pattern recognition: observed local selection rates in Softmax attention rise markedly above the $9/N$ baseline (the share a 3×3 neighborhood would receive under uniform attention over $N$ tokens), ranging from 15–30%. Both vanilla linear attention and InLine lack this property by default. To close this gap, InLine incorporates an explicit local residual:
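- Compute the global InLine output $V^\top\, \mathrm{InL}_K(Q_i)$.
- Compute the overall token mean $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$.
- Feed $\bar{x}$ through a small MLP predicting residual weights $r \in \mathbb{R}^{9}$.
- Collect the 3×3 neighborhood values $V_j^{\mathcal{N}(i)}$, $j = 1, \dots, 9$.
- Output:
$$O_i = V^\top\, \mathrm{InL}_K(Q_i) + \sum_{j=1}^{9} r_j\, V_j^{\mathcal{N}(i)}$$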
This operation increases overhead by $\mathcal{O}(N d)$ but preserves the overall linear scaling in the number of tokens.
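A rough NumPy sketch of the local residual, under several assumptions made only for illustration: tokens lie on an H×W grid, borders are zero-padded, the MLP has two layers with random hypothetical weights `W1` and `W2`, and a random array stands in for the global InLine output.

```python
import numpy as np

def local_residual(V_grid, r):
    """Weighted sum of each token's 3x3 neighbourhood values.

    V_grid: (H, W, d) values laid out on the image grid.
    r:      (9,) residual weights shared by all tokens.
    Borders are zero-padded (an assumption of this sketch).
    """
    H, W, _ = V_grid.shape
    padded = np.pad(V_grid, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(V_grid)
    for j in range(9):
        dy, dx = divmod(j, 3)                    # offset inside the 3x3 window
        out += r[j] * padded[dy:dy + H, dx:dx + W]
    return out

def predict_residual_weights(x, W1, W2):
    """Tiny two-layer MLP on the mean token; W1 and W2 are hypothetical."""
    x_mean = x.mean(axis=0)                      # (d,)
    return np.maximum(x_mean @ W1, 0.0) @ W2     # (9,)

rng = np.random.default_rng(0)
H, W, d = 4, 5, 8
x = rng.standard_normal((H * W, d))              # input tokens
V = rng.standard_normal((H * W, d))              # values
global_out = rng.standard_normal((H * W, d))     # stand-in for the global output
W1 = rng.standard_normal((d, 16))
W2 = rng.standard_normal((16, 9))

r = predict_residual_weights(x, W1, W2)
local = local_residual(V.reshape(H, W, d), r).reshape(H * W, d)
out = global_out + local
print(out.shape)
```

Each token touches nine neighbours with `d` channels each, so the added cost grows linearly with the number of tokens.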
6. Implementation in Vision Transformers
InLine is implemented by replacing all Softmax attention layers in leading Vision Transformer backbones—such as Swin, DeiT, PVT, and CSWin—with the injective InLine and local residual module:
- Kernel choices: $\phi(x) = x$ (identity) or nonnegative variants such as ReLU/exp.
- Complexity:
- Softmax: $\mathcal{O}(N^2 d)$
- Linear, InLine: $\mathcal{O}(N d^2)$ per layer
- Training protocol (ImageNet-1K): 300 epochs from scratch, AdamW, cosine decay, 20-epoch warmup, and standard augmentation (RandAugment, Mixup, CutMix, Random Erasing). Identical protocols are used for downstream detection/segmentation tasks (Han et al., 2024).
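To make the complexity entries in the list above concrete, the following back-of-the-envelope sketch compares leading-order multiply-add counts for quadratic and linear attention. The channel dimension of 64 and the token grid sizes are assumed values for illustration, and constant factors and projection layers are ignored.

```python
def softmax_cost(n_tokens, dim):
    # Leading-order multiply-adds: Q K^T plus the weighted sum over V.
    return 2 * n_tokens ** 2 * dim

def linear_cost(n_tokens, dim):
    # Leading-order multiply-adds: phi(K)^T V plus phi(Q) times that aggregate.
    return 2 * n_tokens * dim ** 2

dim = 64  # assumed per-head channel dimension
for side in (14, 28, 56, 112):
    n = side * side
    ratio = softmax_cost(n, dim) / linear_cost(n, dim)
    print(f"{side}x{side} tokens: N={n}, softmax/linear cost ratio = {ratio:.1f}")
```

Under these assumptions the ratio is simply $N/d$, so the two costs match at $N = d$ and the advantage of linear attention grows with resolution.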
7. Empirical Results and Significance
A comprehensive evaluation on vision benchmarks shows that InLine matches or exceeds the accuracy of Softmax attention at equal or lower computational cost, and outperforms prior linear attention methods across classification, detection, and segmentation:

ImageNet-1K top-1 accuracy (FLOPs in parentheses):

| Model (FLOPs) | Softmax | Linear (baseline) | InLine |
|---|---|---|---|
| DeiT-T (1.2 G) | 72.2% | ~70.0% | 74.5% |
| PVT-S (3.8 G) | 79.8% | ~77.3% | 82.0% |
| Swin-T (4.5 G) | 81.3% | ~77.3% | 82.4% |
| CSWin-T (4.3 G) | 82.7% | — | 83.2% |

Object detection (APᵇ: box AP, APᵐ: mask AP):

| Backbone | APᵇ | APᵐ | FLOPs |
|---|---|---|---|
| PVT-S | 40.4 | 37.8 | 305 G |
| InLine-PVT-S | 43.4 | 40.1 | 250 G |

Semantic segmentation:

| Backbone | mIoU | FLOPs |
|---|---|---|
| Swin-T | 44.51 | 945 G |
| InLine-Swin-T | 45.57 | 941 G |

Inference speed is retained as window or image size grows, in contrast to the significant throughput degradation observed for Softmax attention with increasing $N$.
8. Theoretical and Practical Impact
Softmax attention's empirical success is attributed to two distinct properties: injectivity (ensuring unique attention mappings for distinct queries) and an emergent local bias. Standard linear attention loses both, which explains its lower accuracy in vision models. InLine remedies both deficiencies: it employs a subtraction-based normalization to restore injectivity and supplements the global attention output with a learned local residual. As a result, InLine closes and, in many regimes, reverses the performance gap between linear and Softmax attention, at significantly lower computational cost for high-resolution inputs (Han et al., 2024).
A plausible implication is that the key determinants of attention quality in vision models are injectivity and explicit local modeling capacity, rather than the specific use of Softmax normalization. This suggests that further variants—enforcing these properties—could push the scalability and accuracy of attention-based architectures even higher in future work.