SoLA-Vision: Softmax-Linear Attention Vision
- SoLA-Vision is a hybrid attention paradigm that strategically integrates softmax and linear attention to balance global expressivity with computational scalability.
- Its design leverages layer-wise hybridization and modules like Agent Attention, cosFormer, and MALA to reduce the quadratic complexity of softmax while preserving accuracy.
- Empirical results demonstrate that SoLA-Vision achieves competitive accuracy on benchmarks with lower FLOPs, making it ideal for high-resolution tasks and edge deployments.
SoLA-Vision (Softmax-Linear Attention Vision) refers to a family of architectures, modules, and design principles that combine the global expressiveness of softmax self-attention with the scalability and hardware efficiency of linear attention for vision tasks. The SoLA-Vision paradigm arises from the need to reconcile the powerful all-to-all modeling capabilities of softmax attention—crucial for many Vision Transformer (ViT) applications—with the computational and memory constraints imposed by high-resolution inputs, where standard softmax attention becomes prohibitive due to its quadratic complexity.
1. Theoretical Motivation and Formulation
Softmax-based attention computes, for input tokens $X \in \mathbb{R}^{N \times d}$:

$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $Q = XW_Q$, $K = XW_K$, $V = XW_V$. While this delivers strong pairwise modeling, its quadratic cost ($O(N^2)$ computation and memory) is unsuited for large $N$ (e.g., high-resolution images).
Linear attention reparameterizes the attention kernel to a decomposable form, typically replacing the softmax by a kernel feature map $\phi(\cdot)$, so that

$$\mathrm{Attn}(X)_i = \frac{\phi(q_i)\left(\sum_j \phi(k_j)^\top v_j\right)}{\phi(q_i)\sum_j \phi(k_j)^\top},$$

reducing complexity to $O(Nd^2)$ with $O(d^2)$ memory, but at the cost of compressing token interactions, leading to information decay and reduced long-range expressivity.
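The contrast between the two forms is easiest to see in code. Below is a minimal NumPy sketch (not from any of the cited papers): `softmax_attention` materializes the full $N \times N$ matrix, while `linear_attention` uses an assumed $\mathrm{elu}(x)+1$ feature map so that only $d \times d$ summaries are ever formed.

```python
# Minimal sketch contrasting softmax and linear attention.
# Shapes and the elu(x)+1 feature map are illustrative assumptions.
import numpy as np

def softmax_attention(Q, K, V):
    """O(N^2 d): materializes the full N x N attention matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (N, N)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A @ V

def linear_attention(Q, K, V, phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """O(N d^2): never forms the N x N matrix; phi is elu(x)+1 here."""
    Qp, Kp = phi(Q), phi(K)                            # nonnegative features
    KV = Kp.T @ V                                      # (d, d) summary of keys/values
    Z = Qp @ Kp.sum(axis=0)                            # (N,) normalizer
    return (Qp @ KV) / Z[:, None]

N, d = 196, 64                                         # e.g. 14x14 patch tokens
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```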
The SoLA-Vision concept is instantiated in several forms:
- Fine-grained, layer-wise hybrid backbones interleaving softmax and linear attention (Li et al., 16 Jan 2026);
- Module-level softmax-linear blends via, e.g., Agent (Pool-Softmax) attention (Han et al., 2023);
- Advanced linear kernels (cosFormer, MALA) restoring crucial aspects of softmax while retaining linear scaling (Fan et al., 1 Jul 2025, Qin et al., 2022);
- Quantization- and hardware-tailored softmax-free ViTs for edge deployment (Shi et al., 2024).
The central insight is that strategic integration—not mere replacement—of softmax and linear attention recovers the best trade-offs in compute, memory, and accuracy.
2. Layer-wise Softmax-Linear Hybridization
The core SoLA-Vision approach, formalized in (Li et al., 16 Jan 2026), proposes a fine-grained, per-layer mixture of softmax and linear attention:
- Early/high-resolution stages use exclusively linear attention (e.g., WKV variant) for scaling.
- Later/lower-resolution stages interleave softmax layers among linear ones (e.g., L L S L L S) to restore global coupling.
- Softmax layers are sparsely placed (2–3 per 6–10 transformer blocks), identified via systematic ablation to maximize accuracy per FLOP.
- A "Hidden-State Bridge" (HSB) selectively injects shallow linear features into deep softmax blocks, reintroducing high-resolution context at minimal softmax cost.
Stacking multiple linear layers grows the effective receptive field only gradually (roughly linearly in the number of stacked layers $L$), whereas a single softmax layer immediately restores global interactions for all tokens. Empirical evidence shows that hybridizing at the per-layer level (not per-stage/block) achieves comparable or better accuracy than pure softmax or pure linear models at significantly reduced cost.
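As an illustration, here is a hedged PyTorch sketch of such a per-layer schedule. `LinearAttnBlock` and `SoftmaxAttnBlock` are simplified stand-ins, not the paper's WKV or Hidden-State Bridge modules; only the interleaving pattern is the point.

```python
# Sketch of a per-layer softmax/linear schedule such as "LLSLLS".
# Both block classes are placeholders for the paper's actual modules.
import torch
import torch.nn as nn

class LinearAttnBlock(nn.Module):          # stand-in for an O(N) linear-attention block
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return x + self.proj(x)            # placeholder token mixing

class SoftmaxAttnBlock(nn.Module):         # standard multi-head softmax block
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)        # O(N^2) global interaction
        return x + out

def build_stage(dim, schedule="LLSLLS"):
    """Interleave blocks per a string schedule: 'L' = linear, 'S' = softmax."""
    blocks = [LinearAttnBlock(dim) if c == "L" else SoftmaxAttnBlock(dim)
              for c in schedule]
    return nn.Sequential(*blocks)

stage = build_stage(dim=256, schedule="LLSLLS")
tokens = torch.randn(2, 196, 256)          # (batch, N tokens, dim)
print(stage(tokens).shape)                 # torch.Size([2, 196, 256])
```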
3. Algorithmic Modules and Efficient Variants
Several softmax-free or hybrid modules underpin SoLA-Vision systems:
- Agent Attention (Han et al., 2023): Inserts a small set of agent tokens via pooling or learning. Attention proceeds in two steps:
- Agent Aggregation: Agents attend to all keys/values with softmax.
- Agent Broadcast: Queries attend to agents with softmax. Mathematically, for $n$ agent tokens $A$,
$$\mathrm{AgentAttn}(Q, K, V) = \sigma\!\left(QA^\top\right)\,\sigma\!\left(AK^\top\right)V,$$
where $\sigma$ denotes row-wise softmax. This is provably a generalized linear attention with softmax-learned feature maps; the cost is $O(Nnd)$ for $n \ll N$ (a runnable sketch appears after this list).
- cosFormer (Qin et al., 2022): Linear attention kernel with a nonnegativity-enforcing activation (e.g., ReLU) plus a cosine-based positional reweighting that decays with distance. For sequence positions $i, j$ in a sequence of length $M$,
$$s(q_i, k_j) = \mathrm{ReLU}(q_i)\,\mathrm{ReLU}(k_j)^\top \cos\!\left(\frac{\pi}{2M}(i - j)\right).$$
Since $\cos(x - y) = \cos x \cos y + \sin x \sin y$, the reweighting splits into per-position factors, so the entire machinery is $O(N)$ and exactly linearizable via Ptolemy's identity.
- MALA (Magnitude-Aware Linear Attention) (Fan et al., 1 Jul 2025): Addresses the inability of vanilla linear attention to adapt to the query magnitude. MALA modifies the attention computation so that score ratios respond to the magnitude of the query, restoring softmax-like "sharpening" and interpolating between vanilla linear and exponential softmax ratios.
- ReLU Attention / LinAttn (Wortsman et al., 2023, Shi et al., 2024): Replaces the softmax with ReLU and divides by sequence length,
$$\mathrm{Attn}(X)_i = \sum_{j=1}^{N} \frac{\mathrm{ReLU}\!\left(q_i^\top k_j / \sqrt{d}\right)}{N}\, v_j,$$
delivering near-softmax scaling behavior when normalized by sequence length in this way (see the sketch after this list).
- SOFT (Lu et al., 2022): Uses a Gaussian kernel in place of softmax, with low-rank Nyström approximation for further scalability, and symmetric normalization to control spectral norm.
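Two of these modules are compact enough to sketch directly. The NumPy code below illustrates Agent Attention's two-step softmax and length-normalized ReLU attention; the strided agent pooling and all shapes are illustrative assumptions, not the papers' exact designs.

```python
# Sketches of Agent Attention and ReLU attention. Agent tokens are taken
# by simple strided pooling here; the papers use pooled/learned agents
# with multi-head projections.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, K, V, num_agents=16):
    """Two-step softmax via n agent tokens: O(N*n*d) instead of O(N^2*d)."""
    stride = max(1, Q.shape[0] // num_agents)
    A = Q[::stride][:num_agents]                   # (n, d) pooled agent tokens
    agg = softmax(A @ K.T) @ V                     # agents aggregate keys/values
    return softmax(Q @ A.T) @ agg                  # queries read from agents

def relu_attention(Q, K, V):
    """ReLU in place of softmax, normalized by sequence length N.
    Note: still forms the N x N score matrix; the win is removing
    exp/softmax, which is what quantization and hardware care about."""
    N, d = Q.shape
    scores = np.maximum(Q @ K.T / np.sqrt(d), 0.0) # nonnegative scores
    return (scores / N) @ V                        # length normalization

N, d = 196, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(agent_attention(Q, K, V).shape, relu_attention(Q, K, V).shape)
```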
4. Implementation, Hardware, and Quantization
SoLA-Vision methodologies extend into implementation and hardware contexts:
- In quantization-sensitive and edge-compute contexts, softmax removal is crucial, as exponentiation/division are hardware-unfriendly. Trio-ViT (Shi et al., 2024) demonstrates softmax-free ViTs with ReLU-based linear attention, batch normalization, and convolutional compensation modules. Block-wise post-training quantization with specific handling for divisors (log quantization) and inter-channel scale migration remedies issues peculiar to linear attention.
- Hardware accelerators: Both ViTALiTy (Dass et al., 2022) and Trio-ViT (Shi et al., 2024) present custom FPGA designs exploiting stream/pipeline partitioning, MAT (Multiplier+Adder) or systolic arrays for fast matmul, fast computation of kernel sums, and bit-shift division replacing floating-point division.
- These approaches yield roughly 3–7× frame-rate and 2–6× DSP-efficiency improvements for ViTs on Xilinx FPGAs, with minimal top-1 accuracy drop compared to floating-point softmax models (a sketch of the bit-shift division trick follows below).
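To make the bit-shift trick concrete, here is a hedged sketch of log quantization of a divisor: rounding it to the nearest power of two turns division into an arithmetic shift. The rounding policy is an illustrative assumption, not the exact Trio-ViT scheme.

```python
# Quantize a positive divisor to the nearest power of two so that
# division becomes a bit shift. Rounding policy is an assumption.
import numpy as np

def log2_quantize(divisor):
    """Round a positive divisor to the nearest power of two; return the exponent."""
    return int(np.round(np.log2(divisor)))

def shift_divide(x, divisor):
    """Approximate x / divisor with an arithmetic shift by the quantized exponent."""
    k = log2_quantize(divisor)
    return x >> k if k >= 0 else x << (-k)

x = np.array([4096, 10240, 65536], dtype=np.int32)
print(shift_divide(x, 60.0))   # divisor ~ 2^6 -> shift by 6: [64, 160, 1024]
```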
5. Empirical Benchmarks and Performance
SoLA-Vision variants uniformly outperform or match both quadratic softmax-only and linear-only counterparts across major vision benchmarks, typically at reduced FLOPs and parameter counts.
| Task | Softmax Baseline | Linear/Efficient Baseline | SoLA-Vision (Hybrid) | Source |
|---|---|---|---|---|
| ImageNet-1K Top-1 | DeiT-T: 72.2% | EffFormer-L1: 79.2% | SoLA-T: 79.8% | (Li et al., 16 Jan 2026) |
| COCO Det. (AP) | Swin-T: 43.7 | VRWKV: ≈42.2 | SoLA-S: 46.6 | (Li et al., 16 Jan 2026) |
| ADE20K Seg. (mIoU) | Swin-B: 48.1 | MambaVision-B: 49.1 | SoLA-B: 50.5 | (Li et al., 16 Jan 2026) |
Specific highlights:
- Inserting a single softmax block into a 6-block transformer stack recovers most of the accuracy lost by a pure linear stack; an optimized two-softmax schedule yields further gains (Li et al., 16 Jan 2026).
- Agent Attention improves the ImageNet-1K top-1 accuracy of the Swin-T backbone, with large gains on detection/segmentation (Han et al., 2023).
- MALA raises the accuracy of tiny ViT models above DeiT-T and the best vanilla linear attention baselines, and closes the gap in detection/segmentation (Fan et al., 1 Jul 2025).
- EfficientViT/Trio-ViT achieves near-baseline accuracy at substantially higher inference speed on edge FPGAs (Shi et al., 2024).
6. Ablation Studies, Trade-offs, and Limitations
Key results from systematic ablations include:
- Position and number of softmax layers: two per stage (not grouped) is the optimal trade-off for most workloads (Li et al., 16 Jan 2026).
- Pure linear models ("LLLLLL") underperform proper hybrids ("LLSLLS") by 1.1 points or more; adding more than 2–3 softmax layers yields diminishing returns (Li et al., 16 Jan 2026).
- Dynamic agent-token selection yields higher accuracy than static, and window-size tuning further improves global context at negligible extra cost (Han et al., 2023).
- MALA’s sharpening effect interpolates between linear and softmax, theoretically and empirically bounding accuracy gaps (Fan et al., 1 Jul 2025).
Current limitations:
- Softmax layers still carry quadratic cost: at extremely high token counts $N$, further windowing or approximations may be needed (Li et al., 16 Jan 2026).
- Placement schedules for softmax/linear are found via ablation; no principled or learnable adaptive approach yet.
- Information decay in deep/long linear-only networks unless global context is periodically restored (Li et al., 16 Jan 2026).
- Real hardware performance gains vary with MBConv, normalization, and path design (Shi et al., 2024).
7. Future Directions and Open Problems
Critical avenues for advancing SoLA-Vision include:
- Automatically learning or dynamically adapting the placement of softmax vs. linear layers per task, resolution, or stage (Li et al., 16 Jan 2026).
- Enriching linear attention kernels to further slow information decay without sacrificing linearity.
- Extension to data types with even longer sequences (video, 3D, non-Euclidean data).
- Theoretical understanding of why scaling and magnitude-aware corrections (e.g., MALA) appropriately mimic softmax in the vision regime (Fan et al., 1 Jul 2025, Wortsman et al., 2023).
- Hardware-software co-design, enabling efficient, fully quantized ViTs with hybrid attention for real-time/low-power vision.
SoLA-Vision thus defines a principled framework for reconciling the strengths and weaknesses of softmax and linear attention in vision transformer backbones. It yields efficient, scalable models for both cloud and edge deployment, and its design space remains a subject of active research across algorithm, architecture, and hardware domains.