Learned Linear Attention

Updated 26 February 2026
  • Learned linear attention is a mechanism that replaces fixed kernel mappings with neural parameterization for efficient O(n) computation while boosting expressivity.
  • Techniques such as LUNA, SALAD, and LOTFormer adapt kernel functions end-to-end to achieve significant speedups and high accuracy in long-context sequence tasks.
  • Empirical benchmarks demonstrate that learned linear attention models offer increased throughput and robustness across language, vision, and scientific computing applications with reduced memory use.

Learned linear attention refers to the class of attention mechanisms that retain linear time (and/or memory) complexity in sequence length by expressing attention as a kernel operation with a learned, rather than fixed, feature map or structure. Unlike early linear attention variants employing static random projections or hand-designed kernels, learned linear attention parameterizes some or all of the kernel, projection, or state update via neural weights, end-to-end optimized for the downstream task. This approach facilitates a balance between computational efficiency and expressive power, typically offering significant gains in accuracy and robustness compared to static linear attention approximations, while maintaining scalability to very long input sequences.

1. Mathematical Principles of Learned Linear Attention

Core linear attention methods replace the standard quadratic attention $\operatorname{Attn}(Q,K,V) = \operatorname{softmax}(QK^\top/\sqrt{d})V$ with a kernel formulation

$$\operatorname{Attn}(Q,K,V) = \frac{\phi(Q)\,[\phi(K)^\top V]}{\phi(Q)\,[\phi(K)^\top \mathbf{1}_n]}$$

where $\phi(\cdot)$ is a feature map. Early approaches used fixed random features, such as random Fourier mappings or elementwise functions (ELU+1, ReLU), but learned linear attention replaces $\phi$ by a parameterized neural network, often an MLP, allowing the kernel to be adapted to the data and task (Shahbazi et al., 8 Dec 2025).
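The kernel formulation above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the fixed ELU+1 feature map as a stand-in for a learned $\phi$; the function names `elu_plus_one`, `linear_attention`, and `quadratic_reference` are hypothetical and not taken from any cited paper. It shows that computing the summaries $\phi(K)^\top V$ and $\phi(K)^\top \mathbf{1}_n$ once gives the same output as materializing the full $n \times n$ weight matrix.

```python
import numpy as np

def elu_plus_one(x):
    """Fixed feature map ELU(x) + 1, which is strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V, phi=elu_plus_one):
    """Non-causal kernel attention, linear in the sequence length n."""
    q, k = phi(Q), phi(K)        # (n, m) feature-mapped queries and keys
    kv = k.T @ V                 # (m, d_v) summary phi(K)^T V
    z = k.sum(axis=0)            # (m,)     summary phi(K)^T 1_n
    return (q @ kv) / (q @ z)[:, None]

def quadratic_reference(Q, K, V, phi=elu_plus_one):
    """Same kernel, but forming the n x n weight matrix explicitly."""
    A = phi(Q) @ phi(K).T
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = linear_attention(Q, K, V)
```

Swapping `phi` for a trainable network is what turns this into learned linear attention; the reordering of the matrix products is unchanged.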

Notable variants:

  • Kernelized streaming: Compute key-value “sufficient statistics,” e.g., $S_{KV} = \phi(K)^\top V$, and then perform low-cost lookups for each query.
  • Recurrent state updating: Gated or decayed fast-weight updates, such as $S_t = \alpha_t S_{t-1} + \beta_t k_t v_t^\top$, with learned gating functions and/or per-dimension decay (Team et al., 30 Oct 2025, He et al., 23 Oct 2025).
  • OT-based or doubly-stochastic approaches: Learn intermediate transport plans or pivot measures (not strictly a kernel but can be viewed as a learned low-rank attention factorization) (Shahbazi et al., 27 Sep 2025).

For attention to be “learned linear,” the key is end-to-end training of the kernel $\phi$, the fast-weight update parameters, or the transport structure, such that the overall mechanism maintains $O(n)$ or $O(nr)$ scaling.
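The recurrent view can be sketched as a loop over tokens. This is an illustrative sketch, assuming a key-value outer-product write $k_t v_t^\top$, scalar gates per step, and a readout $o_t = q_t^\top S_t$; the gates here come from random logits as a stand-in for learned gating functions, and the name `gated_linear_recurrence` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_recurrence(Q, K, V, alpha, beta):
    """Run S_t = alpha_t * S_{t-1} + beta_t * k_t v_t^T and read out q_t^T S_t."""
    n, d_k = K.shape
    S = np.zeros((d_k, V.shape[1]))      # fast-weight state, constant size in n
    out = np.empty((n, V.shape[1]))
    for t in range(n):
        S = alpha[t] * S + beta[t] * np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 4, 3
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
alpha = sigmoid(rng.normal(size=n))      # stand-in for a learned decay gate
beta = sigmoid(rng.normal(size=n))       # stand-in for a learned write gate
out = gated_linear_recurrence(Q, K, V, alpha, beta)
```

With `alpha` and `beta` fixed to one, the loop reduces to unnormalized causal linear attention, which is a convenient sanity check for any gated variant.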

2. Notable Architectures and Algorithms

2.1 LUNA (Linear Universal Neural Attention)

LUNA parameterizes the kernel $\phi(x)$ as a neural network with learnable projections and MLPs per channel,

$$\phi(x) = \frac{h(x)}{\sqrt{m}}\,\bigl[\psi_\ell(w_i^\top x)\bigr]_{i,\ell}$$

where $w_i$ are learnable projections, $\psi_\ell$ are 1D MLPs, and $h(x)$ is an optional envelope function, all trained end-to-end. This construction induces a positive-definite kernel and results in task-adaptive, streaming linear attention. LUNA achieves state-of-the-art accuracy among efficient Transformers under compute parity on Long Range Arena, and nearly perfect recovery in post-hoc softmax replacement for BERT and ViT (Shahbazi et al., 8 Dec 2025).
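The shape of such a feature map can be illustrated as follows. This is a sketch under stated assumptions, not the LUNA implementation: the sizes, the one-hidden-layer tanh form of each $\psi_\ell$, and the Gaussian-style envelope are all hypothetical choices, and the weights are random rather than trained. The sketch stops at the induced kernel $k(x,y) = \phi(x)^\top \phi(y)$, since the normalization used with signed features is not described here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, L, hidden = 8, 16, 3, 4                # illustrative sizes (assumed)

W = rng.normal(size=(d, m)) / np.sqrt(d)     # projections w_i, one per column
A1 = rng.normal(size=(L, hidden))            # first layer of each scalar MLP psi_l
b1 = np.zeros((L, hidden))
A2 = rng.normal(size=(L, hidden)) / np.sqrt(hidden)  # second layer of each psi_l

def psi(u):
    """Apply every 1D MLP psi_l to every scalar in u: (n, m) -> (n, m, L)."""
    h = np.tanh(u[..., None, None] * A1 + b1)        # (n, m, L, hidden)
    return (h * A2).sum(axis=-1)

def envelope(X):
    """Hypothetical envelope h(x); a Gaussian-style damping of large inputs."""
    return np.exp(-0.5 * (X ** 2).sum(axis=-1, keepdims=True) / X.shape[-1])

def feature_map(X):
    """phi(x) = h(x) / sqrt(m) * [psi_l(w_i^T x)]_{i,l}, flattened to m * L features."""
    u = X @ W                                        # (n, m) projected scalars
    feats = psi(u).reshape(X.shape[0], -1)           # (n, m * L)
    return envelope(X) * feats / np.sqrt(m)

X = rng.normal(size=(10, d))
Phi = feature_map(X)
gram = Phi @ Phi.T                                   # induced kernel matrix
```

Because the kernel is an inner product of explicit features, the Gram matrix is positive semidefinite by construction, whatever values the parameters take.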

2.2 SALAD

SALAD introduces a learned linear branch in parallel to a sparse attention backbone. It applies a ReLU kernel $\phi(x) = \max(x,0)$ and learns both the output projection and a gate $G=\sigma(X W_G + b_G)$, where $W_G, b_G$ are adapted during fine-tuning. The branch is parameter-efficient via the use of LoRA adapters and is gated at the block level to combine with the sparse output. Fine-tuning on 2,000 videos for 1,600 steps is sufficient to reach approximately full-attention quality with only $O(ND^2)$ compute (Fang et al., 23 Jan 2026).
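A gated parallel branch of this kind might look like the sketch below. It assumes an additive combination `sparse_out + G * (linear_out @ W_O)`, an elementwise gate of shape $(N, D)$, and a small `eps` in the normalizer; none of these details are confirmed by the source, the sparse output is a random placeholder, and the names `linear_branch` and `gated_hybrid` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_branch(Q, K, V, eps=1e-6):
    """Linear attention with the ReLU kernel phi(x) = max(x, 0)."""
    q, k = np.maximum(Q, 0.0), np.maximum(K, 0.0)
    kv = k.T @ V                                  # (D, D) summary, O(N D^2) to build
    z = k.sum(axis=0)                             # (D,) normalizer summary
    return (q @ kv) / ((q @ z)[:, None] + eps)    # eps guards all-zero ReLU rows

def gated_hybrid(X, Q, K, V, sparse_out, W_G, b_G, W_O):
    """Add the gated, projected linear branch to the sparse-attention output."""
    gate = sigmoid(X @ W_G + b_G)                 # G = sigma(X W_G + b_G)
    return sparse_out + gate * (linear_branch(Q, K, V) @ W_O)

rng = np.random.default_rng(0)
N, D = 12, 8
X, Q, K, V = (rng.normal(size=(N, D)) for _ in range(4))
sparse_out = rng.normal(size=(N, D))              # placeholder for the sparse backbone
W_G, b_G = 0.1 * rng.normal(size=(D, D)), np.zeros(D)
W_O = np.zeros((D, D))                            # zero-initialized output projection
y = gated_hybrid(X, Q, K, V, sparse_out, W_G, b_G, W_O)
```

With `W_O` initialized to zero the hybrid output equals the sparse output exactly, so fine-tuning starts from the unmodified backbone.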

2.3 Kimi Linear (Kimi Delta Attention)

Kimi Delta Attention generalizes Gated DeltaNet with a learned per-dimension decay gate $\boldsymbol{\alpha}_t$ and a scalar input-dependent $\beta_t$. Updates to the fast weight state use a delta rule with learned Householder corrections:

$$S_t = (I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\boldsymbol{\alpha}_t)\,S_{t-1} + \beta_t k_t v_t^\top$$

Chunkwise algorithms implement efficient block-wise WY or UT transforms in linear space and time (Team et al., 30 Oct 2025).
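A single step of this update can be written without forming any $d_k \times d_k$ matrix. The sketch below is a token-by-token reference, not the chunkwise WY/UT algorithm; it assumes a state of shape $(d_k, d_v)$ and unit-norm keys, and the name `kda_step` is hypothetical.

```python
import numpy as np

def kda_step(S, k, v, alpha, beta):
    """One step of S_t = (I - beta k k^T) Diag(alpha) S_{t-1} + beta k v^T."""
    decayed = alpha[:, None] * S                              # Diag(alpha) S_{t-1}
    corrected = decayed - beta * np.outer(k, k @ decayed)     # (I - beta k k^T) applied
    return corrected + beta * np.outer(k, v)                  # write the new association

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_prev = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                       # unit-norm key (assumed)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)      # stand-in for the learned per-dimension decay
S_next = kda_step(S_prev, k, v, alpha, beta=0.7)
```

For a unit-norm key with `beta = 1` and no decay, reading the state back with the same key returns exactly `v`, which is the sense in which the delta rule overwrites an association instead of accumulating it.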

2.4 RWKV-SCCTX and Bi-RWKV

RWKV-SCCTX applies bidirectional linear attention (WKV) modules and a channel-wise context model in image compression, using only $O(nc)$ compute (for $n$ tokens, $c$ channels). All projections, decay parameters, and value mappings are learned, with context mixing achieved through sequential residual connections (Feng et al., 9 Feb 2025).

2.5 Parallel and Hybrid Approaches

Hybrid models combine learned linear attention backbones with sparse mixers, learnable token eviction (LTE), or input-dependent gating. For example, laLTE (He et al., 23 Oct 2025) combines constant-memory linear attention with sliding-window and learnable token retention modules, trained jointly to improve retrieval and generalization.

2.6 LOTFormer

LOTFormer constrains the attention map to be low-rank and doubly-stochastic by learning a small set of pivot locations and masses, forming two entropic-OT (Sinkhorn) couplings. All pivots and intermediate costs are learned end-to-end. It achieves $O(nr)$ complexity with $r \ll n$, with state-of-the-art LRA results among linear and transport-based approaches (Shahbazi et al., 27 Sep 2025).
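The idea of gluing two couplings through pivots can be illustrated with plain Sinkhorn iterations. This is a sketch of the general construction, not the LOTFormer implementation: the squared-Euclidean cost, its normalization to $[0,1]$, the regularization `eps`, the uniform token marginals, and the random pivots and masses (learned in the actual model) are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, iters=200):
    """Entropic OT plan for cost C with row marginals a and column marginals b."""
    Kmat = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (Kmat.T @ u)      # match column marginals
        u = a / (Kmat @ v)        # match row marginals
    return u[:, None] * Kmat * v[None, :]

def sq_dists(X, Y):
    """Pairwise squared distances, scaled to [0, 1] to keep Sinkhorn well conditioned."""
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return D / D.max()

def pivot_attention(Q, K, V, pivots, nu):
    """Apply A = n * P_qz Diag(1/nu) P_zk to V in O(n r) without forming A."""
    n = Q.shape[0]
    mu = np.full(n, 1.0 / n)                           # uniform mass on tokens
    P_qz = sinkhorn(sq_dists(Q, pivots), mu, nu)       # (n, r) queries -> pivots
    P_zk = sinkhorn(sq_dists(pivots, K), nu, mu)       # (r, n) pivots -> keys
    return n * (P_qz @ ((P_zk @ V) / nu[:, None]))

rng = np.random.default_rng(0)
n, r, d = 12, 3, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
pivots = rng.normal(size=(r, d))                       # stand-in for learned pivot locations
nu = np.full(r, 1.0 / r)                               # stand-in for learned pivot masses
out = pivot_attention(Q, K, V, pivots, nu)
A = pivot_attention(Q, K, np.eye(n), pivots, nu)       # dense map, for inspection only
```

Passing the identity as `V` recovers the implicit attention matrix, whose rows and columns each sum to one and whose rank is at most the number of pivots.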

3. Efficiency and Expressivity Trade-offs

The central trade-off historically faced in linear attention was the loss of expressivity due to the use of fixed kernels. Learned linear attention mechanisms address this by increasing the effective rank and modeling capacity of the kernel via neural parameterization, resulting in approximation properties close to, or surpassing, quadratic softmax attention (Shahbazi et al., 8 Dec 2025, Yau et al., 2024).

Empirical evidence from LUNA, SALAD, LOTFormer, and Kimi Linear demonstrates that learned kernels or gates recover, and often surpass, dense attention quality on language, vision, and scientific computing tasks, with consistent reductions in memory and wall-clock compute time, particularly for long or high-dimensional sequences (Fang et al., 23 Jan 2026, Shahbazi et al., 8 Dec 2025, Team et al., 30 Oct 2025, Shahbazi et al., 27 Sep 2025).

4. Training Procedures and Adaptation

Learned linear attention architectures typically leverage:

  • Parameter-efficient tuning: LoRA adaptation for new kernels/projections, with core pretrained weights frozen (Fang et al., 23 Jan 2026).
  • End-to-end learning: All kernel, gate, and context-mixing parameters are trained using standard optimizers (AdamW, etc.), no auxiliary loss required.
  • Streaming or chunkwise algorithms: Enable training and inference on enormous context lengths or image resolutions.
  • Specialized initialization: Such as zero-initializing output projections to match the initial sparse/dense baseline, which improves early-stage stability (Fang et al., 23 Jan 2026).
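The chunkwise idea in the list above can be sketched for plain causal linear attention. This is a minimal illustration, assuming unnormalized attention on inputs that have already passed through the feature map, with no gating or decay; the function name and the chunk size are hypothetical. Each chunk combines a contribution from the running state (earlier chunks) with a small masked product inside the chunk.

```python
import numpy as np

def causal_linear_attention_chunked(Q, K, V, chunk=4):
    """Unnormalized causal linear attention, processed one chunk at a time."""
    n, d_k = Q.shape
    S = np.zeros((d_k, V.shape[1]))          # running sum of k_s v_s^T over past chunks
    out = np.empty((n, V.shape[1]))
    for start in range(0, n, chunk):
        q, k, v = (M[start:start + chunk] for M in (Q, K, V))
        inter = q @ S                        # contribution of all earlier chunks
        intra = np.tril(q @ k.T) @ v         # causal contribution within the chunk
        out[start:start + chunk] = inter + intra
        S = S + k.T @ v                      # fold this chunk into the state
    return out

rng = np.random.default_rng(0)
n, d_k, d_v = 10, 4, 3
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = causal_linear_attention_chunked(Q, K, V, chunk=4)
```

Memory for the state stays fixed as the sequence grows, and the only quadratic term is in the chunk size, which is the property that makes very long contexts tractable.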

Fine-tuning or hybridization enables rapid adaptation to new modalities or backbones with modest data and compute (Fang et al., 23 Jan 2026, Shahbazi et al., 8 Dec 2025).

5. Applications and Modalities

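Learned linear attention mechanisms are prominent in long-context language modeling and reinforcement learning (Kimi Linear), video generation (SALAD), learned image compression (RWKV-SCCTX), and long-range sequence benchmarks such as Long Range Arena (LUNA, LOTFormer).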

6. Theoretical Properties and Generalization

Recent work provides the first strong agnostic PAC-learning results for linear attention models, showing that single-layer transformers with linear attention (parameterized as $\phi$ in a finite RKHS) are efficiently learnable via reduction to linear empirical risk minimization in the feature space (Yau et al., 2024). The class is closed under associative memory, deterministic finite-state computation, and bounded-time universal Turing machines (given polynomial resources).

LUNA proves that a sufficiently wide MLP-parameterized feature map $\phi$ can uniformly approximate any continuous PD kernel, and gives sampling error bounds controlling the kernel approximation. The class of functions realizable by linear attention with learned $\phi$ has Rademacher complexity scaling as $\tilde O(1/\sqrt{n})$ in the number of tokens, supporting generalization at scale (Shahbazi et al., 8 Dec 2025).

7. Empirical Benchmarks and Ablations

The following summarizes key experimental results (see cited papers):

| Model | Domain | Key Metric(s) | Improvement over Baseline | Reference |
|---|---|---|---|---|
| LUNA | LRA, BERT | LRA acc. 65.44, BERT 99.5% | +7 pts (LRA), +0.2% (BERT FT) | (Shahbazi et al., 8 Dec 2025) |
| SALAD | Video Gen. | 1.72× speedup, full quality | +0.66 SC, +0.46 IQ | (Fang et al., 23 Jan 2026) |
| Kimi Linear | LM, RL | 6× throughput, higher acc. | +1.8 BBH, +1.6 MMLU | (Team et al., 30 Oct 2025) |
| LOTFormer | LRA | 62.9 avg acc., lin. runtime | +1.5 vs. Performer | (Shahbazi et al., 27 Sep 2025) |

Ablation studies in these works consistently demonstrate that learned gates, projections, or kernels recover much of the quality gap versus quadratic attention, outperform fixed-feature baselines (Performer), and exhibit robustness to varying context length, out-of-distribution samples, and hybridization with sparse attention (Shahbazi et al., 8 Dec 2025, Feng et al., 9 Feb 2025, Fang et al., 23 Jan 2026).

