Expected Attention for KV Cache Compression

Updated 2 October 2025
  • Expected attention is a principled, training-free method that analytically estimates the future contribution of each KV pair in transformer models.
  • It employs a Gaussian-based scoring algorithm to rank KV pairs, enabling aggressive memory reduction with minimal output quality loss.
  • KVPress, a complementary benchmarking library, supports rapid evaluation and deployment of over 20 cache-compression strategies across diverse inference tasks.

Expected attention is a principled, training-free approach for compressing the key–value (KV) cache in LLMs, enabling efficient inference by predicting and ranking the future usefulness of cached KV pairs. Operating entirely at inference time, the method analytically estimates each KV pair's expected contribution to subsequent model outputs. This estimation relies on the statistical distribution of future query activations and enables aggressive memory reduction with minimal effect on output quality, outperforming prior attention-score-based compression schemes. The approach is released alongside “KVPress,” a benchmarking library supporting rapid evaluation and deployment of over 20 cache-compression strategies (Devoto et al., 1 Oct 2025).

1. Motivation and Problem Setup

The self-attention mechanism in transformer-based LLMs requires storing all previous key–value pairs (the KV cache) to support attention-based context memory. As input sequences grow, the KV cache becomes the dominant source of memory consumption: KV storage scales linearly with sequence length and model width. Existing methods that prune the KV cache by measuring past attention scores are fundamentally limited because:

  • Attention scores from future queries are unknown at compression time—yet these determine a KV entry’s importance for downstream predictions.
  • Many contemporary implementations (e.g., Flash Attention) do not materialize or retain attention matrices, making traditional score-based access impractical during inference.

Expected attention addresses these issues by analytically estimating, for each cached KV pair, its expected importance given the distribution of all possible future queries, yielding a rankable, model-agnostic saliency criterion for pruning.
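
As a back-of-the-envelope illustration of the memory pressure described above (the configuration below is a typical Llama-style assumption, not a figure from the paper):

# KV cache = 2 (keys and values) x layers x kv_heads x head_dim x seq_len x bytes/element
layers, kv_heads, head_dim, seq_len, bytes_fp16 = 32, 8, 128, 128_000, 2
cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16
print(f"{cache_bytes / 1e9:.1f} GB per sequence")  # ~16.8 GB at a 128k-token context

The total grows linearly with sequence length, which is why pruning cached KV pairs translates directly into longer feasible contexts on fixed hardware.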

2. Analytical Derivation and Scoring Algorithm

Expected attention exploits the empirical observation that hidden states and queries in LLMs are well-approximated by Gaussian random variables. For a fixed key $k_i$ and future query vectors $q \sim \mathcal{N}(\overline{\mu}_q, \overline{\Sigma}_q)$, the expected (unnormalized) attention score is:

\tilde{z}_i = \mathbb{E}_{q}\left[\exp\left(\frac{q^\top k_i}{\sqrt{d}}\right)\right]

Applying the moment-generating function of the Gaussian, this is:

\tilde{z}_i = \exp\left( \frac{\overline{\mu}_q^\top k_i}{\sqrt{d}} + \frac{1}{2d}\, k_i^\top \overline{\Sigma}_q\, k_i \right)
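
The identity used here is the moment-generating function of a multivariate Gaussian, evaluated at $t = k_i / \sqrt{d}$:

\mathbb{E}_{q \sim \mathcal{N}(\mu,\, \Sigma)}\left[\exp\left(t^\top q\right)\right] = \exp\left(\mu^\top t + \tfrac{1}{2}\, t^\top \Sigma\, t\right)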

Given all keys $\{k_j\}$, the expected normalized attention is:

\hat{a}_i = \tilde{z}_i / \sum_j \tilde{z}_j

Finally, to quantify the expected contribution to the model output, the method multiplies this attention with the norm of the projected value, yielding:

\left\| \Delta \hat{h}_i \right\| = (\hat{a}_i + \epsilon) \cdot \left\| W_O v_i \right\|

where $W_O$ is the output projection, $v_i$ is the cached value, and $\epsilon$ is a small stabilizing constant.

A simplified PyTorch-style computation, adapted from the paper's pseudocode, illustrates the scoring step:

import torch

# k: (n, d) cached keys, v: (n, d) cached values, W_O: (d_model, d) output projection
# mu_bar / Sigma_bar: mean and covariance of the estimated future-query distribution
z = torch.exp((k @ mu_bar) / sqrt_d + 0.5 * ((k @ Sigma_bar) * k).sum(-1) / d)
a_hat = z / z.sum()                                      # expected normalized attention
delta_h = (a_hat + eps) * torch.norm(v @ W_O.T, dim=-1)  # expected contribution per KV pair

KV pairs are then ranked by this score and the lowest-scoring entries pruned.
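
A minimal sketch of that pruning step (tensor names and the compression_ratio knob are illustrative assumptions, not the paper's exact code):

compression_ratio = 0.5                                  # fraction of KV pairs to drop (illustrative)
n_keep = int(delta_h.numel() * (1 - compression_ratio))
keep_idx = torch.topk(delta_h, n_keep).indices.sort().values  # keep top scorers, preserve order
k_pruned, v_pruned = k[keep_idx], v[keep_idx]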

3. Practical Implementation in Decoding and Prefilling

A crucial feature is compatibility with both decoding (autoregressive, streaming generation) and prefilling (one-shot processing of the full prompt before generation):

  • In decoding, the method uses the most recent window of activations to estimate the upcoming query distribution.
  • In prefilling, analogous Gaussian statistics are obtained by averaging over RoPE-transformed hidden activations for immediate upcoming token positions.

Unlike dynamic attention-score schemes (requiring actual query-key evaluation at each time step), expected attention precomputes all required statistics from current cacheable activations and then applies a single analytic formula per key, introducing minimal computational burden.
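
As a rough sketch of how the required statistics can be estimated during decoding (the window tensor q_window and its size are illustrative assumptions rather than the paper's exact procedure):

# q_window: (w, d) query activations from the most recent w positions of one attention head
mu_bar = q_window.mean(dim=0)                                    # empirical mean of recent queries
centered = q_window - mu_bar
Sigma_bar = (centered.T @ centered) / (q_window.shape[0] - 1)    # empirical covariance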

4. Performance and Empirical Evaluation

Experimental results on diverse benchmarks (LongBench, Ruler, Needle in a Haystack for prefilling; Aime25, MATH-500 for decoding) demonstrate that expected attention:

  • Maintains output quality (low downstream error and minimal residual stream distortion) even at high compression ratios—often retaining only 50% or fewer KV pairs.
  • Outperforms state-of-the-art baselines, including TOVA, SnapKV, and KeyDiff, across multiple LLM architectures and task types.
  • Achieves lower reconstruction error and preserves critical information needed for tasks requiring long-context memory.

The following summarizes performance trade-offs (as reported in the paper):

Compression Method    Memory Reduction    Retained Accuracy    Reconstruction Error (↓)
Expected Attention    High                High                 Low
TOVA                  Moderate            High/Moderate        Moderate
SnapKV                Moderate            Moderate             Moderate

A plausible implication is that the analytic design, which does not require observing any future queries or training-specific artifacts, enables robust, general-purpose pruning across settings and datasets.

5. KVPress: Research and Benchmarking Framework

KVPress is presented as a research library to accelerate research on and adoption of KV cache compression algorithms. Key features include:

  • Plug-and-play integration with Hugging Face Transformers via forward hooks.
  • Modular, layer-wise compression routines—no alteration to underlying model weights or architecture required.
  • Inclusion of over 20 compression techniques (including expected attention), supporting apples-to-apples comparison under a unified interface.
  • Public benchmark leaderboard, facilitating standardized evaluation on established long-context tasks.

This enables rapid prototyping and large-scale benchmarking for both heuristic and trainable cache reduction methodologies.
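
A minimal usage sketch following KVPress's hook-based, pipeline-style integration (the model name and exact argument names here are illustrative; consult the library for the released API):

from transformers import pipeline
from kvpress import ExpectedAttentionPress

# Compression is applied via forward hooks during prefilling; model weights are untouched
pipe = pipeline("kv-press-text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
press = ExpectedAttentionPress(compression_ratio=0.5)   # drop roughly half of the KV pairs
answer = pipe(context, question=question, press=press)["answer"]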

6. Applications and Implications for Efficient LLM Inference

Expected attention directly addresses a major practical bottleneck in real-world LLM applications: memory footprint during inference. Its key benefits and applications include:

  • Scalable Long-Context Inference: Systems can process much longer sequences on fixed hardware budgets without architectural changes or degrading answer quality.
  • Streaming and Edge Deployment: Enables LLM inference on resource-limited devices by keeping memory bounded even as generated sequence length grows.
  • No Retraining Requirement: Because expected attention is training-free and architecture-agnostic, it can be applied to existing or deployed model weights.
  • Layer-wise Flexibility: The method supports separate or joint pruning across attention heads and layers, affording fine-grained memory–quality trade-offs.

In summary, expected attention delivers a mathematically grounded, analytically tractable, and empirically validated approach for KV cache compression, advancing the efficiency of LLM inference scenarios (Devoto et al., 1 Oct 2025).
