Towards Tight Bounds for Streaming Attention

Published 5 Jun 2026 in cs.DS and cs.LG | (2606.07205v1)

Abstract: The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (tokens) in order to generate the next one. The problem of implementing a transformer in limited space, known as KV cache compression, has received much interest over the past few years, spurring the development of powerful heuristics. Recent works of Haris et al, COLT'25 and Kochetkova et al, NeurIPS'25, formalized KV cache compression as the streaming attention approximation problem, providing both upper bounds (based on discrepancy theory) and information theoretic lower bounds. However, those papers left open a significant gap between the upper and lower bounds. For example, the space usage of their algorithms increases with the precision parameter, but the lower bound does not get stronger. In this work, we revisit the streaming attention approximation problem and provide nearly tight bounds on its space complexity. On the algorithmic side, we achieve the result through a surprisingly tight interplay between three distinct methods for kernel density estimation: discrepancy-based coreset constructions (e.g., Charikar-Kapralov-Waingarten'24), the polynomial method (e.g., Greengard-Rokhlin'87, Alman-Song'23), and space partitioning (e.g., Andoni-Laarhoven-Razenshteyn-Waingarten'17, Charikar-Kapralov-Nouri-Siminelakis'20). On the lower bound side, our main technical contribution is a new technique for using the INDEX problem with a large amount of side information that we hope will prove useful in other high dimensional geometric estimation problems.

Summary

  • The paper presents a novel formulation of KV cache compression via online kernel density estimation, achieving nearly tight sublinear space complexity in terms of 1/ε in high temperature regimes.
  • It combines polynomial sketching with coreset constructions and leverages discrepancy theory to effectively bound memory usage within streaming attention mechanisms.
  • The study details both high and low temperature regimes, showing sublinear scaling in bounded contexts and unavoidable exponential scaling for large context radii, impacting practical transformer deployments.

Tight Bounds for Streaming Attention: A Technical Overview


Introduction and Problem Formulation

The paper addresses the memory complexity of the streaming attention mechanism, a foundational component in transformer architectures. In traditional transformer models, maintaining the full key-value (KV) cache incurs linear space in sequence length $n$, leading to scaling bottlenecks in resource-constrained or large-scale inference settings. Contemporary approaches often utilize approximation or quantization but lack provable guarantees on their space-accuracy tradeoffs. The research formulates KV cache compression as an online kernel density estimation (KDE) problem with the exponential (softmax) kernel, focusing on streaming data structure space for arbitrary queries and precision $\epsilon$.

Mathematically, for keys $k_i \in \mathbb{R}^d$, values $v_i \in \mathbb{R}^d$, and queries $q \in \mathbb{R}^d$ (with $\|k_i\|_2, \|q\|_2 \leq r$), attention is defined as:

$$\mathrm{Attn}(K, q, V) = \frac{\sum_{i=1}^n e^{\langle k_i, q \rangle} v_i}{\sum_{i=1}^n e^{\langle k_i, q \rangle}}$$
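As a concrete reference point, the definition above can be evaluated directly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the one-pass variant assumes the query is fixed before the keys arrive, which is not the setting the paper studies.

```python
import numpy as np

def attention(K, q, V):
    """Exact Attn(K, q, V) for one query q; K and V have shape (n, d)."""
    scores = K @ q
    weights = np.exp(scores - scores.max())   # shift by the max for numerical stability
    return weights @ V / weights.sum()

def attention_one_pass(kv_pairs, q):
    """Same value from a single pass over (k_i, v_i), keeping O(d) state.

    This works only because q is known before the pass starts.
    """
    running_max = -np.inf
    numerator = 0.0      # becomes a length-d vector after the first update
    denominator = 0.0
    for k, v in kv_pairs:
        score = float(k @ q)
        new_max = max(running_max, score)
        rescale = np.exp(running_max - new_max)   # exp(-inf) == 0.0 on the first step
        weight = np.exp(score - new_max)
        numerator = numerator * rescale + weight * v
        denominator = denominator * rescale + weight
        running_max = new_max
    return numerator / denominator
```

With a known query, a constant number of $d$-dimensional accumulators suffices. The space question in the paper arises because the summary of the keys and values must answer queries that are not known while the stream is being processed.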

Recent studies [Kochetkova et al., NeurIPS'25] provide nontrivial upper and lower bounds, yet a gap persists, especially in $\epsilon$ dependence: upper bounds scale linearly with $1/\epsilon$ whereas lower bounds lack $\epsilon$ dependence for moderate error.


Main Results: High and Low Temperature Regimes

High Temperature Regime: Small $r$, Polylogarithmic $d$

For bounded $r$, the authors establish that the space needed for streaming attention computation depends sublinearly on $1/\epsilon$ in logarithmic and polylogarithmic dimensions. Concretely, for […] ([…]), the optimal space is

[…]

with a matching lower bound (up to […] in the exponent) obtained via a reduction from the INDEX problem in communication complexity. This bound is provably tighter than the prior $1/\epsilon$ scaling and approaches nearly […] as […].

The construction combines polynomial sketching (truncated Taylor expansions of the exponential kernel) for the low-degree moments with coreset constructions for the "tail" terms beyond the truncation. Discrepancy theory, specifically Banaszczyk's theorem, is used to show that small coresets exist for the tail terms. The tight lower bound relies on an optimal reduction that supplies side information (the low-order moments) and shows that this scaling is inherent.
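The low-degree half of this construction can be illustrated on the softmax denominator alone. The sketch below is an assumption-laden illustration rather than the paper's algorithm: the class name `TaylorMomentSketch` and its parameters are hypothetical, the tail terms are simply dropped instead of being handled by a coreset, and the moment tensors are stored densely. It uses the identity $\sum_i \langle k_i, q\rangle^j = \langle \sum_i k_i^{\otimes j}, q^{\otimes j}\rangle$, so the degree-$j$ term depends on the keys only through one tensor of size $d^j$.

```python
import numpy as np
from math import factorial

class TaylorMomentSketch:
    """Stores sum_i k_i^{(tensor power j)} for j = 0..degree (hypothetical helper)."""

    def __init__(self, d, degree):
        self.degree = degree
        # moments[j] has shape (d,) * j; moments[0] is a 0-d array holding the key count.
        self.moments = [np.zeros((d,) * j) for j in range(degree + 1)]

    def update(self, k):
        """Add one key to every stored moment tensor."""
        power = np.array(1.0)                     # zeroth tensor power of k
        for j in range(self.degree + 1):
            if j > 0:
                power = np.multiply.outer(power, k)
            self.moments[j] += power

    def estimate_denominator(self, q):
        """Truncated Taylor estimate of sum_i exp(<k_i, q>)."""
        total = 0.0
        for j, moment in enumerate(self.moments):
            value = moment
            for _ in range(j):
                value = value @ q                 # contract the last axis with q
            total += float(value) / factorial(j)
        return total
```

The state size is independent of the number of keys but grows as $d^j$ with the degree, which is why a sketch like this only makes sense for low degrees and leaves the remaining terms to a different mechanism.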

Low Temperature Regime: Large $r$, Moderate to Polylogarithmic $d$

As the context radius $r$ increases, the kernel can concentrate sharply on a few keys, making the problem resemble maximum inner product search. For […], the paper proves the space complexity is

[…]

with a matching lower bound (again up to […] factors), dominating the […]-dependence as […] becomes superpolynomial. This regime recovers the information-theoretic barrier where combinatorial content outweighs approximation difficulty.
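The concentration effect that separates the two regimes is easy to observe numerically. The following sketch assumes keys and a query drawn as random directions and scaled to norm exactly $r$; the function name and the choices of `n`, `d`, and the radii are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def top_softmax_weight(r, n=1000, d=16, seed=0):
    """Largest softmax weight when all keys and the query have norm exactly r."""
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    query_dir = rng.normal(size=d)
    query_dir /= np.linalg.norm(query_dir)
    scores = (r * directions) @ (r * query_dir)   # <k_i, q> = r^2 * cos(angle)
    weights = np.exp(scores - scores.max())       # the best-aligned key gets weight 1 here
    return float(weights.max() / weights.sum())

for r in (0.5, 2.0, 8.0):
    print(r, top_softmax_weight(r))
```

For a fixed set of directions, the largest weight can only grow as $r$ grows, because every score gap to the best-aligned key is multiplied by $r^2$. At small radius the weights are close to uniform, while at large radius the output is dominated by the few best-aligned keys.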

Key technical advances include a recursive partitioning (pseudorandomification) of the input keys, which isolates tightly clustered regions that compress better after recentering and rescaling, and the use of weighted Merge-and-Reduce to maintain coresets in the streaming setting.
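The Merge-and-Reduce pattern mentioned above can be sketched generically. In this illustration the reduce step is plain uniform subsampling with doubled weights, which is a stand-in and not the discrepancy-based halving the paper relies on; the class name, `block_size`, and the other identifiers are hypothetical.

```python
import numpy as np

class MergeReduceCoreset:
    """Weighted Merge-and-Reduce over a stream of keys (illustrative stand-in)."""

    def __init__(self, d, block_size, seed=0):
        self.d = d
        self.b = block_size
        self.rng = np.random.default_rng(seed)
        self.buffer = []     # raw keys that do not yet fill a block
        self.levels = []     # levels[l] is None or a (keys, weights) block with weights 2**l

    def _reduce(self, keys, weights):
        # Stand-in reduce step: keep a uniformly random half and double its weights.
        keep = self.rng.choice(len(keys), size=len(keys) // 2, replace=False)
        return keys[keep], 2.0 * weights[keep]

    def insert(self, k):
        self.buffer.append(np.asarray(k, dtype=float))
        if len(self.buffer) < self.b:
            return
        block = (np.stack(self.buffer), np.ones(self.b))
        self.buffer = []
        level = 0
        while True:                       # carry upward like binary addition
            if level == len(self.levels):
                self.levels.append(None)
            if self.levels[level] is None:
                self.levels[level] = block
                return
            other = self.levels[level]
            self.levels[level] = None
            merged_keys = np.concatenate([block[0], other[0]])
            merged_weights = np.concatenate([block[1], other[1]])
            block = self._reduce(merged_keys, merged_weights)
            level += 1

    def weighted_points(self):
        keys = [np.zeros((0, self.d))]
        weights = [np.zeros(0)]
        if self.buffer:
            keys.append(np.stack(self.buffer))
            weights.append(np.ones(len(self.buffer)))
        for block in self.levels:
            if block is not None:
                keys.append(block[0])
                weights.append(block[1])
        return np.concatenate(keys), np.concatenate(weights)

    def estimate_denominator(self, q):
        """Weighted estimate of sum_i exp(<k_i, q>) from the stored points."""
        keys, weights = self.weighted_points()
        return float(weights @ np.exp(keys @ q))
```

At most one block is kept per level, so the number of stored points grows logarithmically with the stream length for a fixed block size, and the total weight always equals the number of keys seen. The quality of the estimate depends entirely on the reduce step, which is the part a discrepancy-based construction would replace.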


Technical Contributions

  • Polynomial Method: Applying truncated Taylor and Hermite expansions as compressed sufficient statistics for the low-degree part of the kernel.
  • Asymmetric Coreset Constructions: Utilizing recent advances in vector balancing and discrepancy theory to obtain data-dependent subset selection for kernel tail approximation.
  • Side Information in Communication Complexity Reductions: Introducing an information-theoretically optimal way to transmit low-order statistics in INDEX lower bound reductions, which closes the upper/lower bound gap for $\epsilon$ dependence across broad parameter ranges.
  • Recursive Pseudorandomification: Partitioning the bulk of the data to exploit local geometric concentration, improving coreset efficacy in the large-$r$ regime.
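The split between the low-degree part and the tail that the first two items refer to can be written out explicitly. This is a standard Taylor remainder calculation under the norm bound $\|k\|_2, \|q\|_2 \leq r$ given in the problem formulation; the truncation degree $p$ is notation introduced here for illustration, and the degree actually chosen in the paper is not reproduced.

```latex
e^{\langle k, q\rangle}
  = \underbrace{\sum_{j=0}^{p} \frac{\langle k, q\rangle^{j}}{j!}}_{\text{low-degree part}}
  \;+\; \underbrace{R_p\bigl(\langle k, q\rangle\bigr)}_{\text{tail}},
\qquad
|\langle k, q\rangle| \le \|k\|_2 \,\|q\|_2 \le r^2,
\qquad
|R_p(x)| \le \frac{e^{|x|}\,|x|^{p+1}}{(p+1)!}
         \le \frac{e^{r^2}\, r^{2(p+1)}}{(p+1)!}
\quad \text{for } |x| \le r^2 .
```

The factorial in the denominator makes the tail shrink quickly once $p$ exceeds a constant multiple of $r^2$, which is consistent with the polynomial part being most useful when $r$ is bounded.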

Numerical and Theoretical Strengths

  • The space-precision scaling established for the high temperature regime significantly improves over the prior linear dependence on $1/\epsilon$ and is nearly tight.
  • In the low temperature regime, the exponential dependence on $r$ is shown to be unavoidable.
  • The methods provide attention-approximation guarantees matching or improving all previous bounds for online, adversarially generated token streams.

The results show that for a practically relevant range of transformer dimensions and expected attention radii, streaming attention can be implemented with sublinear dependence on target accuracy in space.


Implications and Future Developments

Theoretical Implications

This work narrows the gap in our understanding of how statistical redundancy (as captured by low-degree moments) and adversarial query choice interact in memory-limited kernel density problems. The reduction methodology and the combination of discrepancy theory with kernel polynomial approximations may be applicable in lower-bounding other high-dimensional streaming geometric estimation problems.

Practical Implications

Efficient, low-space streaming attention has direct consequences for:

  • Resource-constrained transformer inference (edge/IoT, large batch server deployment);
  • Fast and memory-efficient context window extension for long-context LLMs;
  • Hardware design for attention accelerators supporting compressed-cache protocols.

Future Directions

  • Adaptive/Robust Streaming: Extending methods to cope with adaptively chosen token streams, where queries are functions of previous KV pairs, an open direction linked to adversarial streaming [ben2022framework].
  • General Kernels Beyond Softmax: While the results focus on softmax, extension to the Gaussian or other shift-invariant kernels would further impact KDE, especially for high-dimensional smooth function estimation.
  • Quantization and Sub-bit Precision Storage: Integrating variable quantization strategies consistent with the structure of coresets.
  • Learning Theoretic Reductions: Adapting these arguments for compressed neural network inference with theoretical bounds on approximation error versus compression.

Conclusion

The paper establishes nearly tight upper and lower bounds for the space complexity of streaming attention, sharply characterizing the $\epsilon$ and $r$ dependence and closing the substantial prior gap. By combining algebraic, geometric, and information-theoretic tools, it both reduces and lower-bounds the memory requirements of attention-approximation data structures in large-scale transformers. The methods are relevant to scalable model deployment and lay groundwork for further work on efficient neural architectures.
