
Injective Linear Attention (InLine)

Updated 21 April 2026
  • Injective Linear Attention (InLine) is an attention mechanism that restores injectivity, so that distinct queries map to distinct attention distributions, by replacing divisive normalization with a subtractive scheme.
  • It retains linear complexity, which suits high-resolution vision tasks, and adds an explicit local residual module that enhances local context modeling.
  • Reported evaluations show that InLine outperforms both Softmax and prior linear attention methods on key benchmarks with reduced computational overhead.

Injective Linear Attention (InLine) is an attention mechanism designed to close the performance gap between conventional Softmax attention and linear attention, particularly in large-scale vision transformer models. InLine achieves linear computational complexity while restoring two properties, injectivity and effective local modeling, that underlie the superior expressiveness of Softmax attention. Its core theoretical and algorithmic innovations enable it to outperform both standard Softmax and prior linear attention methods in diverse visual tasks, while maintaining efficient scaling for high-resolution input (Han et al., 2024).

1. Distinctions Between Softmax and Linear Attention

Both Softmax and linear attention mechanisms compute context-sensitive outputs from queries $Q \in \mathbb{R}^{N \times d}$, keys $K \in \mathbb{R}^{N \times d}$, and values $V \in \mathbb{R}^{N \times C}$, outputting $O \in \mathbb{R}^{N \times C}$. They diverge sharply in algorithmic complexity and representational power:

  • Softmax attention computes the full $N \times N$ similarity matrix, requiring $\mathcal{O}(N^2 C)$ computation:

$$S_i = \mathrm{softmax}(Q_i K^\top), \quad O_i^S = S_i^\top V.$$

This formulation enables strong capture of both local and long-range dependencies but is computationally prohibitive as $N$ increases.

  • Linear attention reduces complexity to $\mathcal{O}(NC)$ by substituting the exponential kernel with a feature map $\phi(\cdot)$:

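$$O_i^L = \sum_{j=1}^{N} \frac{\phi(Q_i)^\top \phi(K_j)}{\sum_{s=1}^{N} \phi(Q_i)^\top \phi(K_s)} V_j^\top = \frac{\phi(Q_i)^\top \left( \sum_{j=1}^{N} \phi(K_j) V_j^\top \right)}{\phi(Q_i)^\top \left( \sum_{j=1}^{N} \phi(K_j) \right)},$$

where $Q_i, K_j \in \mathbb{R}^{d}$ and $V_j \in \mathbb{R}^{C}$ denote the $i$-th query, $j$-th key, and $j$-th value as column vectors.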

The associativity property allows kernel value aggregation in linear time, but empirical results show consistent under-performance compared to Softmax attention in vision tasks.
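The contrast between the two cost profiles can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the shifted-ReLU kernel `phi`, the tensor sizes, and the function names are assumptions chosen for the example, and the Softmax version omits the usual scaling factor to match the formula above.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Quadratic-cost Softmax attention: builds the full N x N weight matrix."""
    scores = Q @ K.T                                      # (N, N) similarities
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)         # each row sums to 1
    return weights @ V                                    # (N, C)

def linear_attention(Q, K, V, phi):
    """Linear attention via associativity: never forms the N x N matrix."""
    q, k = phi(Q), phi(K)                                 # (N, d) feature maps
    kv = k.T @ V                                          # (d, C): sum_j phi(K_j) V_j^T
    k_sum = k.sum(axis=0)                                 # (d,):   sum_j phi(K_j)
    return (q @ kv) / (q @ k_sum)[:, None]                # (N, C)

def linear_attention_quadratic(Q, K, V, phi):
    """Reference form that does build the N x N matrix, for checking only."""
    weights = phi(Q) @ phi(K).T
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, d, C = 64, 8, 16
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, C))
phi = lambda x: np.maximum(x, 0.0) + 1e-6   # assumed kernel: shifted ReLU, keeps denominators positive

same = np.allclose(linear_attention(Q, K, V, phi),
                   linear_attention_quadratic(Q, K, V, phi))
print("linear form matches quadratic reference:", same)
```

Both linear variants return the same output; only the order of the matrix products differs, which is what removes the $N \times N$ intermediate.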

2. Injectivity in Attention Functions

Injectivity of a map $f$, meaning that $f(p) \neq f(q)$ whenever $p \neq q$, is crucial for ensuring that distinct queries yield distinct attention weight distributions. For fixed keys $K \in \mathbb{R}^{N \times d}$:

  • Softmax mapping: $S_K(q) = \mathrm{softmax}(Kq)$ is injective under a mild rank condition on $K$, as $S_K(p) \neq S_K(q)$ for $p \neq q$.
  • Linear mapping: $L_K(q) = \phi(K)\phi(q) / \big(\mathbf{1}^\top \phi(K)\phi(q)\big)$ may fail to distinguish collinear or scaling-equivalent queries, leading to non-injectivity.

This property is central in preventing semantic confusion during attention computation, which can otherwise result if different queries receive identical attention weights.
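A short derivation makes the Softmax case concrete. The notation $S_K(q) = \mathrm{softmax}(Kq)$ for a query $q \in \mathbb{R}^{d}$ and the rank condition on $[K, \mathbf{1}_N]$ are assumptions stated here for the sketch; the only fact used is that two score vectors share a softmax exactly when they differ by a constant shift.

```latex
\begin{aligned}
S_K(p) = S_K(q)
  &\iff Kp - Kq = c\,\mathbf{1}_N \quad \text{for some } c \in \mathbb{R} \\
  &\iff [\,K,\ \mathbf{1}_N\,] \begin{bmatrix} p - q \\ -c \end{bmatrix} = 0 .
\end{aligned}
% If [K, 1_N] has full column rank d + 1, the only solution is
% p - q = 0 and c = 0, so S_K(p) = S_K(q) forces p = q.
```

Under that rank assumption, equal attention distributions therefore imply equal queries, which is the injectivity property described above.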

3. Non-Injectivity of Vanilla Linear Attention

Proposition 2 establishes that standard linear attention is not injective for any continuous $\phi: \mathbb{R}^{d} \to \mathbb{R}^{d}$:

  • If $\phi$ is non-injective, queries $p \neq q$ with $\phi(p) = \phi(q)$ trivially map to identical outputs: $L_K(p) = L_K(q)$.
  • If $\phi$ is injective, invariance of domain implies there exist $p \neq q$ and a scalar $\alpha \neq 1$ such that $\phi(p) = \alpha\,\phi(q)$, resulting in $L_K(p) = L_K(q)$ after kernel normalization.

Consequently, vanilla linear attention "collapses" the attention behavior of collinear queries, significantly reducing its discriminative capacity relative to Softmax attention (Han et al., 2024).
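The collapse can be reproduced on a toy example. The sketch below assumes a ReLU kernel and a hand-picked key matrix, both chosen for illustration only; it compares the attention weights of a query and a scaled copy of it under divisive (linear) and Softmax normalization.

```python
import numpy as np

def softmax_weights(q, K):
    """Softmax attention weights of a single query q over the keys K."""
    s = K @ q
    e = np.exp(s - s.max())
    return e / e.sum()

def linear_weights(q, K, phi):
    """Linear attention weights with divisive normalization."""
    s = phi(K) @ phi(q)        # (N,) kernel similarities
    return s / s.sum()         # dividing by the sum cancels any scaling of q

relu = lambda x: np.maximum(x, 0.0)   # assumed kernel for this example

K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
p = np.array([1.0, 2.0])
q = 3.0 * p                    # collinear with p, but three times longer

print("linear :", linear_weights(p, K, relu), linear_weights(q, K, relu))
print("softmax:", softmax_weights(p, K), softmax_weights(q, K))
```

The two linear weight vectors coincide because the scale factor appears in both numerator and denominator, while the Softmax weights differ: the longer query produces a sharper distribution.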

4. The InLine Mechanism: Injective Linear Attention

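InLine resolves the injectivity limitation via a normalization scheme that replaces divisive normalization with a subtractive form, guaranteeing that the attention weights of each query sum to one, $\sum_{j=1}^{N} \mathrm{InL}_K(Q_i)_j = 1$, and enforcing injectivity under mild rank conditions:

$$\mathrm{InL}_K(Q_i) = \left[ \phi(Q_i)^\top \phi(K_1), \ldots, \phi(Q_i)^\top \phi(K_N) \right]^\top - \frac{1}{N} \sum_{s=1}^{N} \phi(Q_i)^\top \phi(K_s) + \frac{1}{N}$$

$$O_i = \mathrm{InL}_K(Q_i)^\top V$$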
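The effect of the subtractive form can be checked by hand. The derivation below assumes the weights take the form $\mathrm{InL}_K(q)_j = \phi(q)^\top \phi(K_j) - \frac{1}{N} \sum_{s} \phi(q)^\top \phi(K_s) + \frac{1}{N}$; the symbol $c_j(q)$ for the centered score is introduced here for convenience and is not taken from the source.

```latex
\begin{aligned}
c_j(q) &:= \phi(q)^\top \phi(K_j) - \frac{1}{N} \sum_{s=1}^{N} \phi(q)^\top \phi(K_s),
\qquad \mathrm{InL}_K(q)_j = c_j(q) + \frac{1}{N}, \\
\sum_{j=1}^{N} \mathrm{InL}_K(q)_j
  &= \underbrace{\sum_{j=1}^{N} c_j(q)}_{=\,0} + N \cdot \frac{1}{N} = 1, \\
\phi(p) = \alpha\,\phi(q)
  &\;\Longrightarrow\; \mathrm{InL}_K(p)_j - \frac{1}{N}
   = \alpha \left( \mathrm{InL}_K(q)_j - \frac{1}{N} \right).
\end{aligned}
```

The weights still sum to one, but a rescaled query now stretches the deviation from the uniform value $1/N$ instead of being cancelled, so collinear queries stay distinguishable whenever the centered scores are not all zero.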

Under full-rank assumptions for $\phi(K)$ and the augmented matrix $[\phi(K), \mathbf{1}_{N \times 1}]$, and for an injective feature map $\phi$, the resulting attention map $\mathrm{InL}_K$ is injective. The InLine computation is realized in time linear in the number of tokens:

  • Precompute:
    • $A = \sum_{j=1}^{N} \phi(K_j) V_j^\top \in \mathbb{R}^{d \times C}$
    • $b = \sum_{j=1}^{N} \phi(K_j) \in \mathbb{R}^{d}$
    • $\bar{V} = \frac{1}{N} \sum_{j=1}^{N} V_j \in \mathbb{R}^{C}$
  • Compute outputs:

$$O_i = \phi(Q_i)^\top A - \left( \phi(Q_i)^\top b - 1 \right) \bar{V}^\top$$

with cost linear in the sequence length $N$.
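The precompute-then-combine procedure above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the reference code: it assumes subtractive weights of the form kernel score minus the per-query mean score plus $1/N$, uses the identity kernel, and all function and variable names are made up for the example.

```python
import numpy as np

def inline_weights(Q, K, phi):
    """Quadratic reference: subtractive normalization of the kernel scores."""
    N = K.shape[0]
    scores = phi(Q) @ phi(K).T                            # (N_q, N)
    return scores - scores.mean(axis=1, keepdims=True) + 1.0 / N

def inline_attention(Q, K, V, phi):
    """Linear-time form built from three summaries of K and V."""
    q, k = phi(Q), phi(K)
    kv = k.T @ V                   # (d, C): sum_j phi(K_j) V_j^T
    k_sum = k.sum(axis=0)          # (d,):   sum_j phi(K_j)
    v_mean = V.mean(axis=0)        # (C,):   (1/N) sum_j V_j
    return q @ kv - (q @ k_sum - 1.0)[:, None] * v_mean[None, :]

rng = np.random.default_rng(0)
N, d, C = 32, 4, 8
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, C))
identity = lambda x: x             # kernel choice phi(x) = x

weights = inline_weights(Q, K, identity)
fast = inline_attention(Q, K, V, identity)

# A query and a scaled copy of it now receive different weights.
p = Q[:1]
collinear_gap = np.abs(inline_weights(p, K, identity)
                       - inline_weights(3.0 * p, K, identity)).max()

print("rows sum to one:", np.allclose(weights.sum(axis=1), 1.0))
print("linear form matches reference:", np.allclose(weights @ V, fast))
print("gap between collinear queries:", collinear_gap)
```

Unlike Softmax weights, these weights can be negative; the normalization only constrains their sum. The linear-time function touches `K` and `V` once to build the summaries and then does a fixed amount of work per query.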

5. Enhancing Local Modeling

Empirical analyses reveal that Softmax attention exhibits a strong local bias in early layers, which is critical for visual pattern recognition: observed local selection rates in Softmax attention rise markedly above the baseline, reaching 15–30%. Both vanilla linear attention and InLine lack this property by default. To close this gap, InLine incorporates an explicit local residual:

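  1. Compute the global InLine output $\mathrm{InL}_K(Q_i)^\top V$.
  2. Compute the overall token mean $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ of the input tokens $x_i$.
  3. Feed $\bar{x}$ through a small MLP predicting the residual weights $r = \mathrm{MLP}(\bar{x}) \in \mathbb{R}^{9}$.
  4. Collect the 3×3 neighborhood values $V_j^{N(i)}$, $j = 1, \ldots, 9$, around token $i$.
  5. Output:

$$O_i = \mathrm{InL}_K(Q_i)^\top V + \sum_{j=1}^{9} r_j V_j^{N(i)}$$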

This operation increases overhead by $\mathcal{O}(NC)$ but preserves the overall linear scaling in sequence length.
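The local residual steps can be sketched as below. Several details are assumptions made for the illustration and are not confirmed by the source: tokens are laid out row-major on an `H` by `W` grid, borders are zero padded, the MLP is a hypothetical two-layer network with random weights `W1` and `W2`, and one set of nine weights is shared by all tokens and channels.

```python
import numpy as np

def predict_residual_weights(x, W1, W2):
    """Hypothetical two-layer MLP applied to the mean token; returns 9 weights."""
    x_mean = x.mean(axis=0)                     # (C_in,) overall token mean
    hidden = np.maximum(W1 @ x_mean, 0.0)       # ReLU hidden layer
    return W2 @ hidden                          # (9,) one weight per neighbor

def local_residual(V, r, H, W):
    """Weighted sum of each token's 3x3 neighborhood values (zero padded).

    V: (H*W, C) values in row-major grid order; r: (9,) residual weights.
    """
    C = V.shape[1]
    grid = V.reshape(H, W, C)
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))   # zeros around the border
    out = np.zeros_like(grid)
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the grid: neighbor at offset (dy - 1, dx - 1).
            out += r[3 * dy + dx] * padded[dy:dy + H, dx:dx + W]
    return out.reshape(H * W, C)

rng = np.random.default_rng(0)
H, W, C = 4, 5, 6
x = rng.normal(size=(H * W, C))                 # input tokens
V = rng.normal(size=(H * W, C))                 # value vectors
W1 = rng.normal(size=(8, C))                    # hypothetical MLP weights
W2 = rng.normal(size=(9, 8))

r = predict_residual_weights(x, W1, W2)
residual = local_residual(V, r, H, W)           # (H*W, C), added to the global output
print(residual.shape)
```

Each token receives nine scaled value vectors, so the extra work grows with the number of tokens rather than with its square. The residual would then be added to the global attention output of the same shape.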

6. Implementation in Vision Transformers

InLine is implemented by replacing all Softmax attention layers in leading Vision Transformer backbones—such as Swin, DeiT, PVT, and CSWin—with the injective InLine and local residual module:

  • Kernel choices: $\phi(x) = x$ (identity) or nonnegative variants such as ReLU/exp.
  • Complexity:
    • Softmax: $\mathcal{O}(N^2 C)$
    • Linear, InLine: $\mathcal{O}(NC)$ per layer
  • Training protocol (ImageNet-1K): 300 epochs from scratch, AdamW, cosine decay, 20-epoch warmup, and standard augmentation (RandAugment, Mixup, CutMix, Erasing). Identical protocols are used for downstream detection/segmentation tasks (Han et al., 2024).

7. Empirical Results and Significance

A comprehensive evaluation on vision benchmarks shows that InLine matches or exceeds the accuracy of Softmax attention at comparable or lower computation, outperforming prior linear attention methods across classification, detection, and segmentation settings:

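Classification (top-1 accuracy; FLOPs in parentheses):

| Model (FLOPs) | Softmax | Linear-base | InLine |
|---|---|---|---|
| DeiT-T (1.2 G) | 72.2% | ~70.0% | 74.5% |
| PVT-S (3.8 G) | 79.8% | ~77.3% | 82.0% |
| Swin-T (4.5 G) | 81.3% | ~77.3% | 82.4% |
| CSWin-T (4.3 G) | 82.7% | n/a | 83.2% |

Detection (box AP and mask AP):

| Backbone | APᵇ | APᵐ | FLOPs |
|---|---|---|---|
| PVT-S | 40.4 | 37.8 | 305 G |
| InLine-PVT-S | 43.4 | 40.1 | 250 G |

Segmentation:

| Backbone | mIoU | FLOPs |
|---|---|---|
| Swin-T | 44.51 | 945 G |
| InLine-Swin-T | 45.57 | 941 G |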

Inference speed is retained as window or image size grows, in contrast to significant throughput degradation observed for Softmax attention with increasing $N$.
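A back-of-the-envelope count illustrates why the gap widens with resolution. The sketch below is a rough cost model, not a measurement: it counts only the multiply-adds of the two main matrix products in each mechanism, ignores projections, normalization, and memory traffic, and the grid sizes and dimensions are hypothetical.

```python
def softmax_attention_macs(n_tokens, dim, channels):
    """Multiply-adds for Q K^T (N*N*d) plus weights @ V (N*N*C)."""
    return n_tokens * n_tokens * (dim + channels)

def linear_attention_macs(n_tokens, dim, channels):
    """Multiply-adds for phi(K)^T V (N*d*C) plus phi(Q) @ (phi(K)^T V) (N*d*C)."""
    return 2 * n_tokens * dim * channels

for side in (14, 28, 56):                 # tokens per image side (hypothetical)
    n = side * side
    ratio = softmax_attention_macs(n, 64, 64) / linear_attention_macs(n, 64, 64)
    print(f"N={n:5d}  softmax/linear multiply-add ratio = {ratio:.2f}")
```

With equal head and value dimensions of 64, the ratio in this model works out to $N/64$, so doubling the image side quadruples the relative cost of the quadratic form.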

8. Theoretical and Practical Impact

Softmax attention’s empirical success is attributed to two distinct properties: injectivity (ensuring unique attention mappings for distinct queries) and emergent local bias. Standard linear attention loses both—providing explainable grounds for its lower efficacy in vision models. InLine systematically remedies both deficiencies: it employs a subtraction-based normalization to restore injectivity and supplements the global attention output with a learned local residual. As a result, InLine closes and, in many regimes, reverses the performance gap between linear and Softmax attention at significantly lower computational cost for large input domains (Han et al., 2024).

A plausible implication is that the key determinants of attention quality in vision models are injectivity and explicit local modeling capacity, rather than the specific use of Softmax normalization. This suggests that further variants—enforcing these properties—could push the scalability and accuracy of attention-based architectures even higher in future work.
