- The paper presents a novel formulation of KV cache compression via online kernel density estimation, achieving nearly tight sublinear space complexity in terms of 1/ε in high temperature regimes.
- It combines polynomial sketching with coreset constructions and leverages discrepancy theory to effectively bound memory usage within streaming attention mechanisms.
- The study details both high and low temperature regimes, showing sublinear scaling in bounded contexts and unavoidable exponential scaling for large context radii, impacting practical transformer deployments.
Tight Bounds for Streaming Attention: A Technical Overview
The paper addresses the memory complexity of the streaming attention mechanism, a foundational component in transformer architectures. In traditional transformer models, maintaining the full key-value (KV) cache takes space linear in the sequence length n, which creates scaling bottlenecks in resource-constrained or large-scale inference settings. Existing approaches often rely on approximation or quantization but lack provable guarantees on their space-accuracy tradeoffs. The research formulates KV cache compression as an online kernel density estimation (KDE) problem with the exponential (softmax) kernel and asks how much space a streaming data structure needs to answer arbitrary queries to precision ϵ.
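Mathematically, for keys $k_i \in \mathbb{R}^d$, values $v_i \in \mathbb{R}^d$, and queries $q \in \mathbb{R}^d$ (with $\|k_i\|_2, \|q\|_2 \le r$), attention is defined as:
$$\mathrm{Attn}(K, q, V) = \frac{\sum_{i=1}^{n} e^{\langle k_i, q \rangle} v_i}{\sum_{i=1}^{n} e^{\langle k_i, q \rangle}}$$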
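As a point of reference, the attention definition can be evaluated exactly when all n key-value pairs are kept in memory. The following is a minimal NumPy sketch of that baseline, not code from the paper; the function name `attention`, the random data, and the choice r = 1 are illustrative assumptions.

```python
import numpy as np

def attention(K, V, q):
    """Exact softmax attention for one query:
    sum_i exp(<k_i, q>) v_i / sum_i exp(<k_i, q>)."""
    scores = K @ q                            # inner products <k_i, q>, shape (n,)
    weights = np.exp(scores - scores.max())   # shifting by the max cancels in the ratio
    return (weights @ V) / weights.sum()

# Illustrative data (assumption): n keys and one query, all of Euclidean norm r.
rng = np.random.default_rng(0)
n, d, r = 1000, 8, 1.0
K = rng.normal(size=(n, d))
K *= r / np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(n, d))
q = rng.normal(size=d)
q *= r / np.linalg.norm(q)

out = attention(K, V, q)   # shape (d,), uses O(n d) memory for K and V
```

This baseline stores every key and value, which is the linear-in-n cost that a compressed streaming data structure tries to avoid.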
Recent studies [Kochetkova et al., NeurIPS'25] provide nontrivial upper and lower bounds, yet a gap persists, especially in ϵ dependence: upper bounds scale linearly with 1/ϵ whereas lower bounds lack ϵ dependence for moderate error.
Main Results: High and Low Temperature Regimes
High Temperature Regime: Small r, Polylogarithmic d
For bounded r, the authors establish that the space for streaming attention computation depends sublinearly on 1/ϵ in logarithmic and polylogarithmic dimensions. Concretely, for […] ([…]), the optimal space is
[…]
with matching lower bound (up to […] in the exponent) provided via a reduction from the INDEX problem in communication complexity. This bound is provably tighter than the prior linear 1/ϵ scaling and approaches nearly […] as […].
The construction combines polynomial sketching (truncated Taylor expansions for the exponential kernel) for low-degree moments and coreset constructions for "tail" terms beyond the truncation. Discrepancy theory, specifically Banaszczyk's theorem, plays a major role in demonstrating the existence of small-size coresets for the tail terms. The tight lower bound leverages an optimal reduction that utilizes side information (low-order moments) and shows that the resulting ϵ-scaling is inherent.
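The polynomial-sketching half of this construction can be illustrated with the standard identity ⟨k, q⟩^j = ⟨k^{⊗j}, q^{⊗j}⟩: summing tensor powers of the keys gives a summary whose size does not depend on n. The sketch below is a generic truncated-Taylor summary under that identity, not the paper's data structure. It omits the coreset for the tail terms entirely, and the class name `TaylorSketch`, the degree, and the unit-norm data are assumptions chosen for illustration.

```python
import math
import numpy as np

def tensor_power(x, j):
    """Flattened j-fold tensor power of x (length d**j); j = 0 gives [1.0]."""
    out = np.ones(1)
    for _ in range(j):
        out = np.kron(out, x)
    return out

class TaylorSketch:
    """Stores sum_i k_i^{(x)j} v_i^T and sum_i k_i^{(x)j} for j = 0..degree."""

    def __init__(self, d, degree):
        self.degree = degree
        self.num = [np.zeros((d ** j, d)) for j in range(degree + 1)]
        self.den = [np.zeros(d ** j) for j in range(degree + 1)]

    def update(self, k, v):
        # One streaming step: fold the new (key, value) pair into every moment.
        for j in range(self.degree + 1):
            t = tensor_power(k, j)
            self.num[j] += np.outer(t, v)
            self.den[j] += t

    def query(self, q):
        # exp(<k, q>) is replaced by sum_{j <= degree} <k, q>^j / j!.
        num, den = 0.0, 0.0
        for j in range(self.degree + 1):
            t = tensor_power(q, j) / math.factorial(j)
            num = num + t @ self.num[j]
            den = den + t @ self.den[j]
        return num / den

def unit_rows(X):
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

# Illustrative data (assumption): unit-norm keys, values, and query, so r = 1.
rng = np.random.default_rng(0)
n, d, degree = 200, 4, 6
K = unit_rows(rng.normal(size=(n, d)))
V = unit_rows(rng.normal(size=(n, d)))
q = unit_rows(rng.normal(size=d))

sketch = TaylorSketch(d, degree)
for k, v in zip(K, V):
    sketch.update(k, v)

w = np.exp(K @ q)
exact = (w @ V) / w.sum()
approx = sketch.query(q)
err = np.linalg.norm(approx - exact)
stored = sum(a.size + b.size for a, b in zip(sketch.num, sketch.den))
```

The summary holds the sum over j of d^j (d + 1) numbers regardless of n, but that count grows quickly with the degree and the dimension. This is one reason a truncation at low degree has to be paired with some other mechanism for the remaining terms.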
Low Temperature Regime: Large r, Moderate to Polylogarithmic d
As the context radius r increases, the kernel can concentrate almost entirely on a few keys, making the problem resemble maximum inner product search. For […], the paper proves the space complexity is
[…]
with matching lower bound (again up to […] factors), dominating the ϵ-dependence as […] becomes superpolynomial. This regime recovers the information-theoretic barrier where combinatorial content outweighs approximation difficulty.
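The concentration effect described for large radii is easy to observe numerically. The following small experiment is an illustration only, not taken from the paper: it scales fixed unit directions by a radius r and reports the largest softmax weight, which is non-decreasing in r and approaches 1 as the top key takes over. The helper name `max_softmax_weight` and the chosen radii are assumptions.

```python
import numpy as np

def max_softmax_weight(K, q):
    """Largest normalized weight exp(<k_i, q>) / sum_j exp(<k_j, q>)."""
    s = K @ q
    w = np.exp(s - s.max())   # stable: the largest entry becomes exactly 1
    return w.max() / w.sum()

# Illustrative data (assumption): random unit directions, rescaled to radius r.
rng = np.random.default_rng(1)
n, d = 512, 16
U = rng.normal(size=(n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
u = rng.normal(size=d)
u /= np.linalg.norm(u)

radii = (1.0, 4.0, 16.0)
top = [max_softmax_weight(r * U, r * u) for r in radii]
for r, t in zip(radii, top):
    print(f"r = {r:5.1f}   largest attention weight = {t:.4f}")
```

At r = 1 all inner products lie in [-1, 1], so no single key can carry more than a small fraction of the total weight. As r grows, the scores scale like r² and the output is increasingly determined by the keys with the largest inner product, which is the sense in which the problem starts to resemble maximum inner product search.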
Key technical advances include a recursive partitioning (pseudorandomification) of the input keys, which isolates tightly clustered regions that compress better after recentering and rescaling, and the use of weighted Merge-and-Reduce to maintain coresets in the streaming setting.
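The Merge-and-Reduce pattern mentioned above can be sketched generically: full blocks of the stream become level-0 summaries, and two summaries at the same level are merged and then reduced back to one block, like a carry in a binary counter. In the sketch below the reduce step is a deliberately simple stand-in (keep a random half and double its weights); the paper's reduce step is described as discrepancy-based and is not reproduced here. The names `MergeReduce`, `reduce_half`, and the block size are assumptions.

```python
import numpy as np

def reduce_half(keys, vals, wts, rng):
    """Stand-in reduce step (assumption): keep a random half, double its weights."""
    m = len(keys)
    idx = rng.permutation(m)[: m // 2]
    return keys[idx], vals[idx], 2.0 * wts[idx]

class MergeReduce:
    def __init__(self, block, seed=0):
        self.block = block
        self.rng = np.random.default_rng(seed)
        self.buf = []       # raw (key, value) pairs not yet forming a full block
        self.levels = {}    # level -> one weighted summary (keys, vals, wts)

    def update(self, k, v):
        self.buf.append((k, v))
        if len(self.buf) < self.block:
            return
        ks, vs = zip(*self.buf)
        self.buf = []
        summary, lvl = (np.array(ks), np.array(vs), np.ones(self.block)), 0
        while lvl in self.levels:                 # binary-counter carry
            other = self.levels.pop(lvl)
            merged = tuple(np.concatenate(p) for p in zip(summary, other))
            summary = reduce_half(*merged, self.rng)
            lvl += 1
        self.levels[lvl] = summary

    def summary(self):
        """All stored points with their weights (assumes something was stored)."""
        parts = list(self.levels.values())
        if self.buf:
            ks, vs = zip(*self.buf)
            parts.append((np.array(ks), np.array(vs), np.ones(len(ks))))
        return tuple(np.concatenate(p) for p in zip(*parts))

    def query(self, q):
        keys, vals, wts = self.summary()
        s = keys @ q
        w = wts * np.exp(s - s.max())             # weighted softmax numerators
        return (w @ vals) / w.sum()

# Illustrative stream (assumption): 1000 random key-value pairs in dimension 8.
rng = np.random.default_rng(0)
n, d, block = 1000, 8, 32
K = rng.normal(size=(n, d)) / np.sqrt(d)
V = rng.normal(size=(n, d))
mr = MergeReduce(block)
for k, v in zip(K, V):
    mr.update(k, v)

keys, vals, wts = mr.summary()
q = rng.normal(size=d) / np.sqrt(d)
approx = mr.query(q)
```

With this layout at most one summary of `block` points lives at each level, so the number of stored points grows logarithmically in the stream length, and the weights always sum to the number of items seen. The accuracy of the answer depends entirely on the reduce step; random halving is used here only to keep the example short.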
Technical Contributions
- Polynomial Method: Applying truncated Taylor and Hermite expansions as compressed sufficient statistics for the low-degree part of the kernel.
- Asymmetric Coreset Constructions: Utilizing recent advances in vector balancing and discrepancy theory to obtain data-dependent subset selection for kernel tail approximation.
- Side Information in Communication Complexity Reductions: Introducing an information-theoretically optimal way to transmit low-order statistics in INDEX lower bound reductions, which closes the upper/lower bound gap for the ϵ dependence across broad parameter ranges.
- Recursive Pseudorandomification: Partitioning the bulk of the data to exploit local geometric concentration, improving coreset efficacy in the large-r regime.
Numerical and Theoretical Strengths
- The space-precision scaling […] for […] significantly improves over […] and is tight for […].
- In the low temperature regime, the exponential dependence on r is shown to be unavoidable.
- The methods provide attention-approximation guarantees matching or improving all previous bounds for online, adversarially generated token streams.
The results show that, for a practically relevant range of transformer dimensions and expected attention radii, streaming attention can be implemented in space that depends sublinearly on 1/ϵ.
Implications and Future Developments
Theoretical Implications
This work narrows the gap in our understanding of how statistical redundancy (as captured by low-degree moments) and adversarial query choice interact in memory-limited kernel density problems. The reduction methodology and the combination of discrepancy theory with kernel polynomial approximations may be applicable in lower-bounding other high-dimensional streaming geometric estimation problems.
Practical Implications
Efficient, low-space streaming attention has direct consequences for:
- Resource-constrained transformer inference (edge/IoT, large batch server deployment);
- Fast and memory-efficient context window extension for long-context LLMs;
- Hardware design for attention accelerators supporting compressed-cache protocols.
Future Directions
- Adaptive/Robust Streaming: Extending methods to cope with adaptively chosen token streams, where queries are functions of previous KV pairs, an open direction linked to adversarial streaming [ben2022framework].
- General Kernels Beyond Softmax: While the results focus on softmax, extension to the Gaussian or other shift-invariant kernels would further impact KDE, especially for high-dimensional smooth function estimation.
- Quantization and Sub-bit Precision Storage: Integrating variable quantization strategies consistent with the structure of coresets.
- Learning Theoretic Reductions: Adapting these arguments for compressed neural network inference with theoretical bounds on approximation error versus compression.
Conclusion
The paper establishes nearly tight upper and lower bounds for the space complexity of streaming attention, sharply characterizing the ϵ and r dependence and closing the substantial prior gap. By combining algebraic, geometric, and information-theoretic tools, the results both reduce and lower-bound the memory requirements of attention-approximation data structures in large-scale transformers. The methods matter for scalable model deployment and lay groundwork for further work on efficient neural architectures.