
Memory-Efficient Attention

Updated 18 March 2026
  • Memory-efficient attention is a family of techniques that reduce the quadratic memory overhead in Transformer models by approximating or sparsifying token interactions.
  • It includes methods like dimensionality reduction, sophisticated KV-cache compression, and adaptive retention that balance compute efficiency with minimal quality loss.
  • Techniques such as AQUA, SiftAttention, and CMAB demonstrate significant reductions in memory usage while maintaining near state-of-the-art performance on language and vision tasks.

Memory-efficient attention encompasses a broad family of architectural, algorithmic, and systems-level innovations designed to mitigate the prohibitive memory and compute footprint of modern attention mechanisms, especially in large-scale Transformer models. As context lengths and model sizes increase, quadratic scaling in memory (and compute) becomes the principal bottleneck for both training and inference. This entry surveys the fundamental ideas, algorithmic techniques, complexity analyses, and empirical results from recent research on memory-efficient attention, emphasizing both theoretical foundations and practical deployments across language and vision domains.

1. The Quadratic Bottleneck in Standard Attention

In the standard ("full") attention mechanism, each token attends to all others: for input length N and feature size d, softmax attention computes an N × N affinity matrix, resulting in O(N^2 d) time and space complexity. This scaling creates severe practical challenges:

  • KV-cache memory: For autoregressive inference or training with long contexts, the size of the cache storing keys and values becomes dominant.
  • Activation memory: For large batch sizes or long input sequences (in language or dense high-resolution images), the explicit construction of attention maps or intermediate matrices often exceeds hardware budgets.
  • Bandwidth/memory-bound kernels: Even with efficient matmul implementations, data movement between DRAM (or HBM), cache, and ALU is often the principal limiting factor rather than peak floating-point throughput.

Reducing this quadratic footprint while minimizing quality degradation is the central concern of memory-efficient attention research.
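
To make the scaling concrete, the following minimal sketch (illustrative only, not drawn from any cited paper) computes the per-head memory needed to materialize a dense score matrix in FP16:

```python
# Illustrative sketch: bytes needed to materialize one dense N x N
# attention score matrix per head, stored in FP16 (2 bytes/element).
def attn_matrix_bytes(n_tokens: int, bytes_per_elem: int = 2) -> int:
    return n_tokens * n_tokens * bytes_per_elem

# Doubling the context quadruples the score-matrix memory.
assert attn_matrix_bytes(2048) == 4 * attn_matrix_bytes(1024)

# At a 128K-token context, one FP16 score matrix per head is already 32 GiB.
assert attn_matrix_bytes(131072) == 32 * 2**30
```

The quadratic blow-up is visible directly: each doubling of context multiplies the score-matrix footprint by four, which is why long-context workloads quickly exceed accelerator memory.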

2. Dimensionality Reduction and Approximate Attention

A primary class of methods targets the reduction of per-head dimensions or the selective computation of attention scores.

AQUA: Projection-Magnitude Pruning

AQUA (S et al., 14 Sep 2025) introduces a two-phase approach:

  • Offline Calibration: An SVD is computed on a calibration set of queries and keys to obtain a universal, orthogonal projection matrix P for each attention layer.
  • Online Inference: Queries and keys are projected into this rotated basis, and at each step only the top-k dimensions (by query magnitude) are used in the attention dot-product.

Let q_i, k_i ∈ ℝ^d; after projection q_i ↦ q_i P, the largest k components (by magnitude) are selected. This reduces compute and memory linearly in k/d while preserving most performance (<1% accuracy drop at 25% sparsity on Llama-3.1-8B).

AQUA also provides:

  • A formal break-even point for compute efficiency: for d = 128, k = 96, the O(d^2) overhead of the projection is amortized after 512 tokens.
  • KV-cache compression: static or dynamic slicing in the projected space yields proportional memory reduction.
  • Synergy with token eviction: cheap approximate scores suffice to select which tokens to retain, enabling further reductions in both computation and KV-memory.
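
A minimal NumPy sketch of the projection-magnitude idea (all sizes and the random calibration data are hypothetical; the exact SVD target and selection rule differ in the AQUA paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 16  # head dimension and retained dimensions (hypothetical)

# Offline calibration: an SVD over calibration queries/keys yields an
# orthogonal basis P. (On real activations, energy concentrates in a few
# dimensions; the random data here is purely for illustration.)
calib = rng.standard_normal((4096, d))
_, _, vt = np.linalg.svd(calib, full_matrices=False)
P = vt.T  # d x d orthogonal projection matrix

def pruned_score(q, key, k):
    """Approximate q . key using only the top-k projected dims of q."""
    qp, kp = q @ P, key @ P
    idx = np.argsort(-np.abs(qp))[:k]  # largest-|q| dimensions
    return qp[idx] @ kp[idx]

q, key = rng.standard_normal(d), rng.standard_normal(d)
# P is orthogonal, so the *full* projected dot product is exact;
# pruning merely drops the small-magnitude query dimensions.
assert np.isclose((q @ P) @ (key @ P), q @ key)
approx = pruned_score(q, key, k)
```

Because rotation by an orthogonal P preserves inner products exactly, all approximation error comes from the dropped dimensions, which is what makes magnitude-based selection effective.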

SiftAttention: Power-law Dynamic Sifting

SiftAttention (Koley et al., 5 Jun 2025) replaces expensive GPU top-k selection with an element-wise thresholding mechanism. Empirically, the τ-quantile of attention scores decays as a power law in context position i: θ_{i,τ} ≈ α·i^(−β). After a short warmup, a log-log regression yields per-step thresholds. Only key-value pairs exceeding this threshold are loaded, directly reducing memory bandwidth demand and enabling fast, high-sparsity attention. SiftAttention shows up to 30–40% bandwidth savings and up to 10% speedup with negligible quality loss at moderate sparsity levels.
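
The threshold-fitting step can be sketched as follows, with synthetic quantiles standing in for the warmup statistics (the α, β values and all sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Warmup statistics: empirical tau-quantiles of attention scores at each
# position i, synthesized here to follow alpha * i^(-beta) plus noise.
alpha, beta = 0.5, 0.7  # hypothetical ground-truth decay parameters
pos = np.arange(1, 65)
theta = alpha * pos**-beta * np.exp(rng.normal(0, 0.01, pos.size))

# A log-log linear regression recovers the power law for later positions.
slope, intercept = np.polyfit(np.log(pos), np.log(theta), 1)
beta_hat, alpha_hat = -slope, np.exp(intercept)

def threshold(i: int) -> float:
    """Predicted per-step score threshold at context position i."""
    return alpha_hat * i ** -beta_hat

# Element-wise sifting: load only KV pairs whose score clears the bar.
scores = rng.uniform(0.0, 0.5, size=1000)
kept = scores > threshold(1000)
assert abs(beta_hat - beta) < 0.05 and abs(alpha_hat - alpha) < 0.05
```

The key design point is that the threshold is a cheap scalar function of position, so no sorting or top-k kernel is needed at inference time.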

LOOKAT: Product Quantization for KV Compression

LOOKAT (Karmore, 15 Jan 2026) applies vector-database techniques (product quantization and asymmetric distance computation) to the KV-cache. By representing each key as m subspace indices (with K centroids per subspace), the memory required per key drops from d_k FP16 values to m bytes. Inner products are computed from per-query lookup tables, avoiding dequantization. LOOKAT achieves up to 64× KV-cache compression, retaining >0.95 Spearman rank correlation (ρ) and maintaining output fidelity above 95% on GPT-2 inference benchmarks, with no retraining.
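
A small NumPy sketch of product quantization with table-based (asymmetric) scoring; the codebooks here are random samples rather than k-means centroids, and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, K = 64, 16, 256  # key dim, subspaces, centroids per subspace
sub = d // m           # dimensions per subspace

keys = rng.standard_normal((10000, d)).astype(np.float32)

# "Train" codebooks: K centroids per subspace (random samples here; the
# real method would run k-means). Each key is then stored as m uint8
# codes, i.e. 16 bytes instead of 128 bytes of FP16 (8x compression).
codebooks = keys[rng.choice(len(keys), K, replace=False)]
codebooks = codebooks.reshape(K, m, sub).transpose(1, 0, 2)  # (m, K, sub)
codes = np.empty((len(keys), m), dtype=np.uint8)
for j in range(m):
    seg = keys[:, j*sub:(j+1)*sub]
    d2 = ((seg[:, None, :] - codebooks[j][None]) ** 2).sum(-1)
    codes[:, j] = d2.argmin(1)

def adc_scores(q):
    """Asymmetric distance computation: per-query lookup tables give
    approximate q . key scores without dequantizing any key."""
    tables = np.stack([q[j*sub:(j+1)*sub] @ codebooks[j].T for j in range(m)])
    return sum(tables[j][codes[:, j]] for j in range(m))

q = rng.standard_normal(d).astype(np.float32)
approx, exact = adc_scores(q), keys @ q
assert np.corrcoef(approx, exact)[0, 1] > 0.8  # ranks largely preserved
```

The "asymmetric" part is that only the keys are quantized; the query stays in full precision, so the m table lookups per key retain most of the ranking signal.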

3. Sparse, Pruned, and Adaptive Retention

Sparse attention variants restrict the set of key-value pairs each query can attend to.

Top-k Attention

Row-wise top-k (Gupta et al., 2021) retains only the largest k affinities per query, discarding the rest before softmax. A chunked implementation achieves O(Lk) memory complexity for sequence length L. The method is a drop-in replacement for vanilla attention without retraining, preserves model accuracy even at k/L ≈ 1–5%, and dramatically increases the sequence lengths amenable to modern hardware.
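
A dense NumPy sketch of row-wise top-k (the published chunked implementation avoids materializing the full L × L logits; shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
L, d, k = 256, 32, 16  # sequence length, head dim, keys kept per query

Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

def topk_attention(Q, K, V, k):
    """Keep only the k largest logits per query row; mask the rest to
    -inf before softmax, so dropped keys get exactly zero weight."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    kth = np.sort(logits, axis=1)[:, -k][:, None]  # k-th largest per row
    masked = np.where(logits >= kth, logits, -np.inf)
    w = np.exp(masked - masked.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

out_sparse = topk_attention(Q, K, V, k)
out_dense = topk_attention(Q, K, V, L)  # k = L recovers full softmax
assert out_sparse.shape == (L, d)
```

Setting k = L recovers vanilla attention exactly, which is why the method can be dropped into a pre-trained model and tuned down gradually.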

Adaptive Retention

Adaptive probabilistic retention (Rafiuddin et al., 9 Oct 2025) learns per-token binary gates (parameterized as Bernoulli variables via a hard-concrete relaxation), trained under a global retention budget M. At inference time, exactly the M tokens with the largest gate probabilities are kept. Integrating this selection layer-wise enables up to 1.8× throughput speedup and 35–45% memory reduction on classification and summarization tasks, at ≥95% of full-model quality.
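
A sketch of the inference-time selection rule (the gate probabilities below are random stand-ins for the learned hard-concrete gates, and all sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
L, d, M = 128, 16, 48  # tokens, hidden size, global retention budget

H = rng.standard_normal((L, d))        # token representations
gate_logits = rng.standard_normal(L)   # stand-in for learned per-token gates
p_keep = 1.0 / (1.0 + np.exp(-gate_logits))  # Bernoulli keep-probabilities

# At inference, keep exactly the M tokens with the largest gate
# probabilities (training instead samples relaxed Bernoulli gates under
# a budget penalty, which is omitted here).
keep = np.sort(np.argsort(-p_keep)[:M])  # restore original token order
H_kept = H[keep]

assert H_kept.shape == (M, d)
# Every kept token's gate probability dominates every dropped token's.
assert p_keep[keep].min() >= np.delete(p_keep, keep).max()
```

Because exactly M tokens survive regardless of input, downstream memory and compute are deterministic, which is what makes the budget enforceable at deployment time.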

Sinkhorn, Sort, and Pyramid-based Sparsification

Sparse Sinkhorn Attention (Tay et al., 2020) sorts grouped blocks of tokens via a differentiable (Sinkhorn) permutation, followed by local-window attention in the permuted space, effectively yielding a quasi-global receptive field at only O(Lb) or O(L^1.5) memory. SortCut truncates attention to a fixed top-k number of blocks. On document classification and language modeling, as little as 1/3 of the full memory suffices with no significant drop in accuracy.

Hierarchical methods, such as SPAN (Wu et al., 2024), build a multi-scale pyramid with sparse windowed attention and global tokens, with alternating shifted windows for cross-window interaction. Sequence length is rapidly reduced at each level. This approach enables whole-slide image analysis (hundreds of thousands of tokens) within conventional GPU memory footprints (<4 GB vs. >15 GB for dense attention), with improved or preserved accuracy.

4. Explicit Architectural and Kernel Modifications

Constant Memory Attention Block (CMAB)

CMAB (Feng et al., 2023) and derived architectures (e.g., CMANP) use a small, fixed set of learned latents ("block" and "input" embeddings) to perform cross-attention against (unbounded) input sets in a streaming fashion, updating only rolling log-normalizers and context summaries. All subsequent attention is restricted to these latents, which decouples time and memory complexity from input size N (true O(1) memory in context size). Empirically, CMANP matches or slightly outperforms comparable state-of-the-art attentive neural process baselines while requiring only constant GPU memory.
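
The rolling-normalizer idea can be sketched as an online-softmax cross-attention from a fixed latent set to a streamed input. This is a simplification: CMAB's actual block structure and update equations differ, and all sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_latents = 16, 8

class StreamingCrossAttention:
    """Fixed latent queries attend to an unbounded input stream while
    storing only a running max, normalizer, and value sum per latent
    (the online-softmax trick), so memory is O(1) in stream length."""

    def __init__(self, latents):
        self.q = latents
        self.m = np.full(len(latents), -np.inf)  # running max logit
        self.s = np.zeros(len(latents))          # running sum of exp
        self.num = np.zeros_like(latents)        # running weighted values

    def update(self, K, V):
        logits = self.q @ K.T / np.sqrt(self.q.shape[1])
        m_new = np.maximum(self.m, logits.max(axis=1))
        scale = np.exp(self.m - m_new)           # rescale old statistics
        w = np.exp(logits - m_new[:, None])
        self.s = self.s * scale + w.sum(axis=1)
        self.num = self.num * scale[:, None] + w @ V
        self.m = m_new

    def read(self):
        return self.num / self.s[:, None]

latents = rng.standard_normal((n_latents, d))
K_all, V_all = rng.standard_normal((100, d)), rng.standard_normal((100, d))
sca = StreamingCrossAttention(latents)
for i in range(0, 100, 25):                      # stream in chunks
    sca.update(K_all[i:i+25], V_all[i:i+25])

# The streamed result matches full-batch softmax cross-attention.
logits = latents @ K_all.T / np.sqrt(d)
w = np.exp(logits - logits.max(1, keepdims=True))
full = (w / w.sum(1, keepdims=True)) @ V_all
assert np.allclose(sca.read(), full)
```

The rescaling by exp(m_old − m_new) is what keeps the running statistics numerically stable while remaining exactly equivalent to a single softmax over the whole stream.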

Linearized and Kernelized Attention

Methods such as Linear Attention via Orthogonal Memory (Zhang et al., 2023) project the context onto a fixed number of orthogonal basis vectors. Global context is compressed into k ≪ n slots, attended with linear complexity per token. Local (windowed) attention and embedded positional encoding supply fine structure. This combination yields robust scaling (up to 128K tokens, with per-position cost O(kd)) and strong extrapolation in long-context language modeling.
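
As a generic illustration of the linear/kernelized family (not LAVO's orthogonal-basis construction specifically), causal linear attention maintains a constant-size state instead of a growing KV cache:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 512, 32

def phi(x):
    """ELU(x) + 1 feature map, a common positive kernel choice."""
    return np.where(x > 0, x + 1.0, np.exp(x))

Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# A d x d state S and a d-vector z summarize the entire prefix, so
# per-token cost and memory are independent of sequence length n.
S, z = np.zeros((d, d)), np.zeros(d)
out = np.empty((n, d))
for i in range(n):
    fk = phi(K[i])
    S += np.outer(fk, V[i])   # accumulate key-value associations
    z += fk                   # accumulate normalizer
    fq = phi(Q[i])
    out[i] = (fq @ S) / (fq @ z)

# Matches the explicit kernelized attention for the last position.
wts = phi(Q[-1]) @ phi(K).T
assert np.allclose(out[-1], (wts / wts.sum()) @ V)
```

Replacing softmax(qkᵀ) with φ(q)·φ(k) is what permits the associativity trick: the sum over keys is folded into S once, instead of being recomputed per query.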

Windowed, Grouped, and Structured Attention in Vision

HRMedSeg (Xu et al., 8 Apr 2025) replaces quadratic O(N^2) self-attention with a dual-gated linear attention block (embedding nonlinearity and gating on both keys/queries and values) for linear complexity. EfficientViT (Liu et al., 2023) reorders the transformer block as FFN–MHSA–FFN ("sandwich"), using only one memory-bound attention per block, and cascaded group attention (splitting channel groups among heads) to reduce redundancy. AgileIR (Cai et al., 2024) replaces fully windowed MSA with "group shifted window attention" (GSWA), sequentially processing grouped attention heads and reducing peak memory by >50% with minimal PSNR degradation on super-resolution tasks.

5. Low-Level and Systems Techniques

Gradient and Projection Compression (PAMM)

Point-Approximate Matrix Multiplication (Khalaf et al., 3 Jun 2025) compresses Q/K/V projection activations in attention layers by approximating the activation matrix with a small number of generators and scalar weights per row. Backpropagated gradients are then computed using only O(rbd) memory for a compression ratio r ≪ 1, achieving up to 512× activation-memory reduction for Q/K/V projections with <0.2% loss in accuracy. PAMM composes with kernel-level optimizations (e.g., FlashAttention), almost erasing the entire attention-block activation footprint.

Kernel and Distributed Implementations

DistFlashAttn (Li et al., 2023) extends FlashAttention to distributed training over long contexts by partitioning the sequence across P GPUs, overlapping communication and computation, and checkpointing outputs immediately after blockwise attention to avoid redundant recomputation. This achieves near-linear scaling in sequence length and a 1.26–1.88× speedup over prior distributed attention schemes, supporting contexts up to 512K tokens.

Streaming and Sublinear Memory Sketching

One-pass streaming (Addanki et al., 2023) achieves attention approximation in sublinear memory via polynomial-based kernel expansion and randomized sketching (Johnson–Lindenstrauss and ℓ2/ℓ2 sparse recovery). It requires only o(n) memory for the attention computation over n ≫ 10^5 tokens, with provably vanishing error as n → ∞, and outputs a k-sparse representation per token suitable for downstream layers. Uniquely, the full context never needs to be stored in memory at once.

6. Trade-offs, Benchmarks, and Practical Implications

A recurring observation is the inherent trade-off between memory/compute reduction and accuracy. Many techniques provide tunable "knobs" (e.g., AQUA's k-ratio, SiftAttention's or top-k's sparsity threshold, hierarchical window size in vision transformers, or the number of latents in CMAB), allowing practitioners to strike their own balance. Several approaches, especially those targeting inference efficiency (e.g., EL-attention (Yan et al., 2021), LOOKAT), are explicitly "lossless", i.e., they can achieve full vanilla-attention accuracy with a significant reduction in active memory, bandwidth usage, or cache size.

Key empirical benchmarks include:

  • Llama-3.1-8B and GPT-2: AQUA and LOOKAT deliver 25–64× reductions in compute/memory with a <1–2% drop on standard benchmarks.
  • Classification, QA, summarization: Adaptive Retention cuts memory by 35–45% (e.g., a full Transformer from 12 GB to 7.5 GB) for a ≤2% accuracy drop.
  • Image super-resolution and medical segmentation: GSWA (AgileIR) and DGLA (HRMedSeg) demonstrate 50–90% memory savings with ≤1% PSNR or Dice drop.

Notably, compositionality with other efficient attention and quantization techniques is a consistent theme—most methods are compatible with (and often orthogonal to) activation checkpointing, blockwise processing, kernel/low-rank approximations, and hardware-specific optimizations.

7. Outlook and Open Challenges

Memory-efficient attention continues to be an area of intensive research and rapid evolution. Ongoing directions include:

  • Universal, model-agnostic approximate attention that works as a drop-in for pre-trained LLMs without retraining or significant accuracy loss.
  • Dynamic, data-conditional sparsification and retention that adapts the attention pattern for each context and workload.
  • Hybrid memory systems combining local, global, and compressed representations (cf. AllMem (Wang et al., 14 Feb 2026)) to balance long-range modeling with resource budgets.
  • Preservation of attention invariances (e.g., order, permutation) and contextual completeness under aggressive compression.
  • Extension to emerging modalities and tasks: video, scientific data, point clouds, multiscale biological imagery.

As models and workloads scale, the capacity for intelligent memory management, approximation, and efficient kernel execution will likely be decisive for the next generation of large language and vision models.
