Towards Tight Bounds for Streaming Attention

Published 5 Jun 2026 in cs.DS and cs.LG | (2606.07205v1)

Abstract: The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (tokens) in order to generate the next one. The problem of implementing a transformer in limited space, known as KV cache compression, has received much interest over the past few years, spurring the development of powerful heuristics. Recent works of Haris et al, COLT'25 and Kochetkova et al, NeurIPS'25, formalized KV cache compression as the streaming attention approximation problem, providing both upper bounds (based on discrepancy theory) and information theoretic lower bounds. However, those papers left open a significant gap between the upper and lower bounds. For example, the space usage of their algorithms increases with the precision parameter, but the lower bound does not get stronger. In this work, we revisit the streaming attention approximation problem and provide nearly tight bounds on its space complexity. On the algorithmic side, we achieve the result through a surprisingly tight interplay between three distinct methods for kernel density estimation: discrepancy-based coreset constructions (e.g., Charikar-Kapralov-Waingarten'24), the polynomial method (e.g., Greengard-Rokhlin'87, Alman-Song'23), and space partitioning (e.g., Andoni-Laarhoven-Razenshteyn-Waingarten'17, Charikar-Kapralov-Nouri-Siminelakis'20). On the lower bound side, our main technical contribution is a new technique for using the INDEX problem with a large amount of side information that we hope will prove useful in other high dimensional geometric estimation problems.


Authors (6)

Summary

The paper revisits KV cache compression, formalized in prior work as streaming attention approximation and treated here as online kernel density estimation, and obtains nearly tight space bounds that are sublinear in 1/ε in the high temperature regime.
It combines polynomial sketching with coreset constructions and leverages discrepancy theory to effectively bound memory usage within streaming attention mechanisms.
The study details both high and low temperature regimes, showing sublinear scaling in bounded contexts and unavoidable exponential scaling for large context radii, impacting practical transformer deployments.

Tight Bounds for Streaming Attention: A Technical Overview

Introduction and Problem Formulation

The paper addresses the memory complexity of the streaming attention mechanism, a foundational component of transformer architectures. In traditional transformer models, maintaining the full key-value (KV) cache takes space linear in the sequence length $n$, which becomes a bottleneck in resource-constrained or large-scale inference settings. Contemporary approaches often rely on approximation or quantization but lack provable guarantees on their space-accuracy tradeoffs. The research treats KV cache compression as an online kernel density estimation (KDE) problem with the exponential (softmax) kernel, and asks how much space a streaming data structure needs in order to answer arbitrary queries to precision $\epsilon$.

Mathematically, for keys $k_i \in \mathbb{R}^d$, values $v_i \in \mathbb{R}^d$, and queries $q \in \mathbb{R}^d$ (with $\|k_i\|_2, \|q\|_2 \leq r$), attention is defined as:

$\mathrm{Attn}(K, q, V) = \frac{\sum_{i=1}^n e^{\langle k_i, q \rangle} v_i}{\sum_{i=1}^n e^{\langle k_i, q \rangle}}$
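As a point of reference, the definition above can be evaluated directly by a structure that stores every key and value it has seen. The following is a minimal sketch of that linear-space baseline, not code from the paper; the names `attention` and `StreamingExactAttention` are hypothetical, and the max-shift inside the exponential is a standard numerical-stability trick that leaves the ratio unchanged.

```python
import numpy as np

def attention(K, q, V):
    """Exact softmax attention Attn(K, q, V) for a single query q."""
    scores = K @ q                            # <k_i, q> for every key
    weights = np.exp(scores - scores.max())   # shift cancels in the ratio
    return (weights @ V) / weights.sum()

class StreamingExactAttention:
    """Baseline: sees (k_i, v_i) one at a time but keeps all of them,
    so its space grows linearly with the number of tokens n."""

    def __init__(self):
        self.keys, self.values = [], []

    def update(self, k, v):
        self.keys.append(np.asarray(k, dtype=float))
        self.values.append(np.asarray(v, dtype=float))

    def query(self, q):
        # Requires at least one prior update.
        return attention(np.array(self.keys), q, np.array(self.values))
```

Any compressed data structure for this problem has to approximate the output of `query` while storing far less than the full lists of keys and values.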

Recent studies [Kochetkova et al., NeurIPS'25] provide nontrivial upper and lower bounds, yet a gap persists, especially in $\epsilon$ dependence: upper bounds scale linearly with $1/\epsilon$ whereas lower bounds lack $\epsilon$ dependence for moderate error.

Main Results: High and Low Temperature Regimes

High Temperature Regime: Small $r$, Polylogarithmic $d$

For bounded $r$, the authors establish that the space needed for streaming attention computation depends sublinearly on $1/\epsilon$ in logarithmic and polylogarithmic dimensions $d$. The upper bound is matched by a lower bound, up to lower-order terms in the exponent, obtained via a reduction from the INDEX problem in communication complexity. The resulting bound is provably tighter than the prior linear $1/\epsilon$ scaling.

The construction combines polynomial sketching (truncated Taylor expansions of the exponential kernel) for the low-degree moments with coreset constructions for the "tail" terms beyond the truncation. Discrepancy theory, specifically Banaszczyk's theorem, plays a major role in demonstrating the existence of small coresets for the tail terms. The tight lower bound relies on a reduction that accounts for side information (the low-order moments) and shows that the sublinear scaling in $1/\epsilon$ is inherent.
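To make the polynomial-sketching half of this construction concrete, here is a minimal sketch that handles only the low-degree part and ignores the coreset for the tail. It relies on the identity $\langle k, q \rangle^j = \langle k^{\otimes j}, q^{\otimes j} \rangle$, so the degree-$j$ Taylor term of $\sum_i e^{\langle k_i, q \rangle} v_i$ depends on the stream only through the moment tensor $\sum_i k_i^{\otimes j} \otimes v_i$. The class name `TaylorAttentionSketch`, the helper `tensor_power`, and the dense storage of moment tensors are assumptions made for illustration, not the paper's data structure.

```python
import math
from functools import reduce
import numpy as np

def tensor_power(x, j):
    """Flattened j-fold tensor power of x (length d**j); j = 0 gives [1.0]."""
    return reduce(np.kron, [x] * j, np.ones(1))

class TaylorAttentionSketch:
    """Stores moment tensors up to `degree`; space is independent of n."""

    def __init__(self, d, d_v, degree):
        self.degree = degree
        # den[j] = sum_i k_i^{(x)j},  num[j] = sum_i k_i^{(x)j} (x) v_i
        self.den = [np.zeros(d ** j) for j in range(degree + 1)]
        self.num = [np.zeros((d ** j, d_v)) for j in range(degree + 1)]

    def update(self, k, v):
        for j in range(self.degree + 1):
            p = tensor_power(k, j)
            self.den[j] += p
            self.num[j] += np.outer(p, v)

    def query(self, q):
        num, den = 0.0, 0.0
        for j in range(self.degree + 1):
            p = tensor_power(q, j) / math.factorial(j)
            den += p @ self.den[j]          # sum_i <k_i, q>^j / j!
            num = num + p @ self.num[j]     # sum_i <k_i, q>^j v_i / j!
        return num / den
```

The storage is $\sum_{j \le D} d^j (d_v + 1)$ numbers for truncation degree $D$, which is only practical for small $d$ and $D$. That cost is one reason a truncation by itself is not enough, and why the text pairs it with coresets for the tail.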

Low Temperature Regime: Large $r$, Moderate to Polylogarithmic $d$

As the context radius $r$ increases, the kernel can concentrate almost entirely on a few keys, making the problem resemble maximum inner product search. In this regime the paper proves a space bound that grows exponentially with the context radius, with a matching lower bound (again up to lower-order factors); this term dominates the $\epsilon$-dependence once $r$ is large. The regime exhibits an information-theoretic barrier in which combinatorial content outweighs approximation difficulty.
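The concentration effect described here is easy to observe numerically. The sketch below is an illustration under assumed inputs (random unit-norm keys and a unit-norm query, both scaled by $r$), not an experiment from the paper. It reports the largest single softmax weight, which can only grow as $r$ grows because scaling by $r$ multiplies every score $\langle k_i, q \rangle$ by $r^2$.

```python
import numpy as np

def max_attention_weight(K_unit, q_unit, r):
    """Largest softmax weight when unit-norm keys and query are scaled to radius r."""
    scores = (r * K_unit) @ (r * q_unit)      # equals r^2 * <k_i, q>
    w = np.exp(scores - scores.max())         # stable: the top score maps to 1
    return float(w.max() / w.sum())

def concentration_profile(K_unit, q_unit, radii):
    """Max weight at each radius; non-decreasing in r for fixed directions."""
    return [max_attention_weight(K_unit, q_unit, r) for r in radii]
```

At small $r$ the weights are close to uniform, so a summary of low-order moments captures most of the answer. At large $r$ nearly all the mass sits on the key with the largest inner product, which is the maximum-inner-product-search behaviour mentioned above.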

Key technical advances include a recursive partitioning (pseudorandomification) of input keys, isolating tightly clustered regions that allow improved compression via recentering and rescaling, and the use of weighted Merge-and-Reduce for coresets in the streaming paradigm.
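The Merge-and-Reduce pattern mentioned above can be sketched generically. In the version below, a block of `m` points sits at each occupied level of a binary counter; two blocks at the same level are merged and halved, and the survivors move up one level with doubled weight. The halving step here keeps a uniformly random half, which is only a placeholder assumption: the paper's reduction is discrepancy-based, and the class name `MergeReduceCoreset` is hypothetical.

```python
import numpy as np

class MergeReduceCoreset:
    """Weighted Merge-and-Reduce over a stream of (key, value) pairs."""

    def __init__(self, m, seed=0):
        self.m = m
        self.rng = np.random.default_rng(seed)
        self.buffer = []    # raw pairs, weight 1 each
        self.levels = {}    # level -> (K, V); each point has weight 2**level

    def _halve(self, K, V):
        # Placeholder reduction: keep a random half (NOT discrepancy-based).
        keep = self.rng.permutation(len(K))[: len(K) // 2]
        return K[keep], V[keep]

    def update(self, k, v):
        self.buffer.append((np.asarray(k, float), np.asarray(v, float)))
        if len(self.buffer) < self.m:
            return
        K = np.array([key for key, _ in self.buffer])
        V = np.array([val for _, val in self.buffer])
        self.buffer, level = [], 0
        while level in self.levels:               # carry, like a binary counter
            K2, V2 = self.levels.pop(level)
            K, V = self._halve(np.vstack([K, K2]), np.vstack([V, V2]))
            level += 1
        self.levels[level] = (K, V)

    def weighted_points(self):
        # Requires at least one prior update.
        Ks = [key[None, :] for key, _ in self.buffer]
        Vs = [val[None, :] for _, val in self.buffer]
        ws = [np.ones(1) for _ in self.buffer]
        for level, (K, V) in self.levels.items():
            Ks.append(K); Vs.append(V); ws.append(np.full(len(K), 2.0 ** level))
        return np.vstack(Ks), np.vstack(Vs), np.concatenate(ws)

    def query(self, q):
        K, V, w = self.weighted_points()
        s = K @ q
        e = w * np.exp(s - s.max())
        return (e @ V) / e.sum()
```

The structure stores at most `m` points per level and roughly $\log_2(n/m)$ levels, and the total weight always equals the number of tokens seen. The accuracy of `query` depends entirely on the quality of the halving step, which is where a discrepancy-based reduction would replace the random one.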

Technical Contributions

Polynomial Method: Applying truncated Taylor and Hermite expansions as compressed sufficient statistics for the low-degree part of the kernel.
Asymmetric Coreset Constructions: Utilizing recent advances in vector balancing and discrepancy theory to obtain data-dependent subset selection for kernel tail approximation.
Side Information in Communication Complexity Reductions: Introducing an information-theoretically optimal way to transmit low-order statistics in INDEX lower bound reductions, which closes the upper/lower bound gap for the $\epsilon$ dependence across broad parameter ranges.
Recursive Pseudorandomification: Partitioning the bulk of the data to exploit local geometric concentration, improving coreset efficacy in the large-$r$ regime.
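For the polynomial method in the list above, the standard Taylor remainder bound shows how the truncation degree interacts with the radius $r$ and the precision $\epsilon$. This is textbook calculus stated under the norm assumption $\|k_i\|_2, \|q\|_2 \leq r$ (so $|\langle k_i, q \rangle| \leq r^2$ by Cauchy-Schwarz); the degree $D$ is an illustrative symbol, and the paper's actual parameter choices may differ.

```latex
\left| e^{x} - \sum_{j=0}^{D} \frac{x^{j}}{j!} \right|
  \;\le\; \frac{|x|^{D+1}}{(D+1)!}\, e^{|x|}
  \;\le\; \frac{r^{2(D+1)}}{(D+1)!}\, e^{r^{2}}
  \qquad \text{for } x = \langle k_i, q \rangle,\ |x| \le r^{2}.

% Using (D+1)! \ge ((D+1)/e)^{D+1}:
\frac{r^{2(D+1)}}{(D+1)!}\, e^{r^{2}}
  \;\le\; \left( \frac{e\, r^{2}}{D+1} \right)^{D+1} e^{r^{2}}
  \;\le\; \epsilon
  \qquad \text{whenever } D + 1 \ge \max\!\left( e^{2} r^{2},\; r^{2} + \ln(1/\epsilon) \right).
```

So for bounded $r$ a degree of order $\log(1/\epsilon)$ already suffices for the per-term truncation error, while for large $r$ the required degree grows with $r^2$, which is consistent with the split into high and low temperature regimes.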

Numerical and Theoretical Strengths

The sublinear space-precision scaling in $1/\epsilon$ for bounded $r$ significantly improves over the prior linear $1/\epsilon$ dependence and is nearly tight in logarithmic and polylogarithmic dimensions.
In the low temperature regime, the exponential dependence on the context radius $r$ is shown to be unavoidable.
The methods provide attention-approximation guarantees matching or improving all previous bounds for online, adversarially generated token streams.

The results show that, for a practically relevant range of transformer dimensions and expected attention radii, streaming attention can be implemented in space that depends sublinearly on the target accuracy parameter $1/\epsilon$.

Implications and Future Developments

Theoretical Implications

This work narrows the gap in our understanding of how statistical redundancy (as captured by low-degree moments) and adversarial query choice interact in memory-limited kernel density problems. The reduction methodology and the combination of discrepancy theory with kernel polynomial approximations may be applicable in lower-bounding other high-dimensional streaming geometric estimation problems.

Practical Implications

Efficient, low-space streaming attention has direct consequences for:

Resource-constrained transformer inference (edge/IoT, large batch server deployment);
Fast and memory-efficient context window extension for long-context LLMs;
Hardware design for attention accelerators supporting compressed-cache protocols.

Future Directions

Adaptive/Robust Streaming: Extending methods to cope with adaptively chosen token streams, where queries are functions of previous KV pairs, an open direction linked to adversarial streaming [ben2022framework].
General Kernels Beyond Softmax: While the results focus on softmax, extension to the Gaussian or other shift-invariant kernels would further impact KDE, especially for high-dimensional smooth function estimation.
Quantization and Sub-bit Precision Storage: Integrating variable quantization strategies consistent with the structure of coresets.
Learning Theoretic Reductions: Adapting these arguments for compressed neural network inference with theoretical bounds on approximation error versus compression.

Conclusion

The paper establishes nearly tight upper and lower bounds for the space complexity of streaming attention, sharply characterizing the dependence on $\epsilon$ and $r$ and closing the substantial prior gap. By combining algebraic, geometric, and information-theoretic tools, the results give both algorithms for, and limits on, the memory requirements of attention-approximation data structures in large-scale transformers. The methods are relevant to scalable model deployment and lay groundwork for further work on efficient neural architectures.