
BalanceKV: Streaming ε-Approximate Attention

Updated 6 January 2026
  • BalanceKV is a streaming algorithm that approximates attention in large language models by compressing the key–value cache using geometric vector balancing.
  • It leverages the SoftmaxBalance method to select a balanced subset of key–value pairs in a single pass while ensuring rigorous relative-error guarantees on attention outputs.
  • The approach achieves sublinear memory usage and demonstrates superior empirical performance in long-context tasks on benchmarks like TriviaQA and LongBench-E.

BalanceKV is a streaming algorithm designed for $\varepsilon$-approximate attention computations in LLMs, fundamentally addressing memory scalability for long-context token generation. By leveraging discrepancy theory and techniques from geometric vector balancing, BalanceKV constructs compressed representations of the key and value (KV) cache, supporting rigorous relative-error guarantees on the attention output while maintaining space sublinear in sequence length. This section provides a comprehensive examination of the theoretical foundations, algorithmic structure, practical trade-offs, and empirical performance of BalanceKV (Han et al., 11 Feb 2025).

1. Streaming Attention and $\varepsilon$-Approximation

The context accumulation in decoder-only Transformers presents a challenge for memory consumption, as the attention operation at step $j$ requires the full history of queries, keys, and values $(q_1,k_1,v_1),\dots,(q_j,k_j,v_j)$ with $q_i,k_i,v_i\in\mathbb{R}^d$. Conventionally, the key and value matrices $K_j,V_j$ are stacked to compute

$$\mathrm{Attn}(q_j,K_j,V_j) := \frac{1}{Z_j}\,\exp\Bigl(\tfrac{1}{\sqrt d}K_j q_j\Bigr)^\top V_j,$$

where

$$Z_j = \sum_{i=1}^{j} \exp\bigl(\langle k_i, q_j\rangle/\sqrt d\bigr).$$

Maintaining the entire KV cache incurs $O(jd)$ memory cost, which scales linearly in the context length. The objective set forth by BalanceKV is to construct a compressed cache and produce, for each token $j$, an estimator $z_j$ such that

$$\|z_j - \mathrm{Attn}(q_j,K_j,V_j)\|_2 \leq \varepsilon\,\bigl\|\mathrm{softmax}\bigl(\tfrac{1}{\sqrt d}K_j q_j\bigr)\bigr\|_2\,\|V_j\|_F,$$

for prescribed precision $\varepsilon > 0$, meeting the criterion of an $\varepsilon$-approximation with high probability.
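
For concreteness, the following minimal NumPy sketch (illustrative only; function names are assumptions, not the paper's reference implementation) computes the exact attention output from a full KV cache and checks the $\varepsilon$-approximation criterion above.

```python
# Illustrative sketch (not the paper's code): exact attention at step j and the
# epsilon-approximation criterion it is measured against.
import numpy as np

def exact_attention(q_j, K_j, V_j):
    """Exact Attn(q_j, K_j, V_j) from the full KV cache (O(j*d) memory)."""
    d = q_j.shape[0]
    logits = K_j @ q_j / np.sqrt(d)            # shape (j,)
    weights = np.exp(logits - logits.max())    # numerically stabilized softmax numerator
    return (weights @ V_j) / weights.sum()

def is_eps_approximation(z_j, q_j, K_j, V_j, eps):
    """Check ||z_j - Attn||_2 <= eps * ||softmax(K_j q_j / sqrt(d))||_2 * ||V_j||_F."""
    d = q_j.shape[0]
    s = np.exp(K_j @ q_j / np.sqrt(d))
    s /= s.sum()                               # softmax weights
    target = s @ V_j
    return np.linalg.norm(z_j - target) <= eps * np.linalg.norm(s) * np.linalg.norm(V_j)
```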

2. Geometric Token Selection via SoftmaxBalance

The central component of BalanceKV is the SoftmaxBalance algorithm, a single-pass, discrepancy-theoretic sampler. For a batch $C=\{(k_1,v_1),\dots,(k_n,v_n)\}$, SoftmaxBalance divides $C$ into subsets $C'$ and $C\setminus C'$ such that, for arbitrary $q\in\mathbb{R}^d$,

$$\sum_{(k,v)\in C'} \exp(\langle k,q\rangle/\sqrt d)\, v \;\approx\; \sum_{(k,v)\notin C'} \exp(\langle k,q\rangle/\sqrt d)\, v.$$

This property allows one to use only $C'$ and scale up its contribution, effectively compressing the weighted sum required for attention. The algorithm ensures, with high probability,

$$\Bigl\|\sum_{(k,v)\in C'} e^{\langle k,q\rangle/\sqrt d}\, v - \sum_{(k,v)\notin C'} e^{\langle k,q\rangle/\sqrt d}\, v\Bigr\|_2 \leq O\bigl(\sqrt{d}\,\log(n/\delta)\bigr)\, \exp\Bigl(\frac{\max_i \|k_i\|_2^2}{2\sqrt d}\Bigr) \exp\Bigl(\frac{\|q\|_2^2}{2\sqrt d}\Bigr) \max_i \|v_i\|_2.$$

Algorithmically, SoftmaxBalance implements a self-balancing walk, extending techniques of Bansal–Liu–Sawhney, over the exponential feature map $\varphi(k)$ defined such that $\langle\varphi(k), \varphi(q)\rangle = e^{\langle k, q\rangle/\sqrt d}$. The walk is computationally efficient: it requires only pairwise inner products, with time complexity $O(n^2(d+\dim v))$ and space complexity $O(n(d+\dim v))$.
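
A plausible single-pass realization of this walk, under assumed constants and sign probabilities (the paper's exact parameter choices may differ), evaluates all required inner products through the kernel identity $\langle\varphi(k_i)\otimes v_i,\ \varphi(k_j)\otimes v_j\rangle = e^{\langle k_i,k_j\rangle/\sqrt d}\,\langle v_i,v_j\rangle$, so the infinite-dimensional feature map is never materialized:

```python
# Sketch of a SoftmaxBalance-style self-balancing walk (hypothetical constants and
# sign probabilities; not the paper's reference implementation). Signs are assigned
# online using only pairwise kernel evaluations, giving O(n^2 (d + dim v)) time.
import numpy as np

def softmax_balance(K, V, c, rng=None):
    """Return indices of a subset C' whose signed sum is balanced against its complement.

    K: (n, d) keys, V: (n, d_v) values, c: discrepancy scale ~ O(sqrt(d) log(n/delta)).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = K.shape
    kernel = np.exp(K @ K.T / np.sqrt(d))    # <phi(k_i), phi(k_j)> for all pairs
    gram = kernel * (V @ V.T)                # <phi(k_i) (x) v_i, phi(k_j) (x) v_j>
    signs = np.zeros(n)
    for i in range(n):
        # correlation of the running signed sum with the new vector u_i = phi(k_i) (x) v_i
        corr = signs @ gram[:, i]
        p_plus = 0.5 * (1.0 - np.clip(corr / c, -1.0, 1.0))   # bias against growing |corr|
        signs[i] = 1.0 if rng.random() < p_plus else -1.0
    return np.flatnonzero(signs > 0)         # the surviving subset C'
```

In expectation roughly half of the batch survives; the retained half is later rescaled by a factor of two per compression level in the estimator of Section 4.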

3. Theoretical Foundations: Banaszczyk’s Vector Balancing Framework

SoftmaxBalance’s correctness is rooted in a vector balancing theorem of Banaszczyk, with formalization and streaming extension by Alweiss–Liu–Sawhney (ALS21). Their result asserts that for vectors $u_1,\dots,u_n\in\mathbb{R}^D$ with $\|u_i\|_2\leq r$, there exists a randomized streaming sign assignment $\varepsilon_i\in\{\pm 1\}$ such that, for all $u\in\mathbb{R}^D$,

$$\biggl|\sum_{i:\varepsilon_i=+1} \langle u_i, u\rangle - \sum_{i:\varepsilon_i=-1} \langle u_i, u\rangle\biggr| \leq O\bigl(r\,\|u\|_2\,\log(n/\delta)\bigr).$$

BalanceKV exploits this for $u_i = \varphi(k_i) \otimes v_i$, wherein the discrepancy bound translates directly to controlling the error of compressed attention sums. The merge–reduce structure further amplifies the effect, enabling streaming operation over arbitrarily long token sequences.

4. Algorithmic Structure and Streaming Protocol

BalanceKV operates by segmenting the incoming stream into batches of size $t$, running SoftmaxBalance on each to select $\lfloor t/2 \rfloor$ survivors. Multiple merge–reduce levels ($T = \log_2(n/t)$) are used, recursively merging paired survivor sets and reapplying SoftmaxBalance, culminating in $O(t)$ representatives. Denominator estimation (for softmax normalization) employs an identical procedure with all-ones values. The overall storage requirement is $O(t \log(n/t))$ pairs. The estimator for the attention output at step $j$ is then

$$z_j = \frac{\displaystyle\sum_{\ell=0}^{T} 2^\ell \sum_{(k,v)\in C^\ell} \exp(\langle k, q_j\rangle/\sqrt d)\, v}{\displaystyle\sum_{\ell=0}^{T} 2^\ell \sum_{k\in K^\ell} \exp(\langle k, q_j\rangle/\sqrt d)}.$$

Full pseudocode specifies initialization of merge–reduce trees per value-norm bucket, the streaming update, bucket assignment, compression, and estimator calculation. Importantly, bucketing by $\|v_j\|$ regulates the error bound by preventing outsized contributions from large-norm values.
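
One way such a merge–reduce cache could be organized is sketched below (hypothetical class and parameter names; value-norm bucketing and the separate all-ones denominator instance are omitted for brevity, with the numerator's survivor sets reused for normalization). It relies on the `softmax_balance` sketch from Section 2:

```python
# Illustrative merge-reduce cache (a simplified sketch, not the paper's pseudocode).
# Each level ell holds at most one survivor set of ~t/2 pairs, weighted by 2^ell.
import numpy as np

class BalanceKVCache:
    def __init__(self, t, c):
        self.t, self.c = t, c
        self.buffer_K, self.buffer_V = [], []   # uncompressed level-0 batch
        self.levels = {}                        # level -> (K_subset, V_subset)

    def _push(self, level, K, V):
        if level not in self.levels:
            self.levels[level] = (K, V)
            return
        # merge with the occupant of this level, recompress, and promote one level up
        K_old, V_old = self.levels.pop(level)
        K_m, V_m = np.vstack([K_old, K]), np.vstack([V_old, V])
        keep = softmax_balance(K_m, V_m, self.c)        # roughly half survive
        self._push(level + 1, K_m[keep], V_m[keep])

    def insert(self, k, v):
        self.buffer_K.append(k); self.buffer_V.append(v)
        if len(self.buffer_K) == self.t:                # full batch: compress to level 1
            K_b, V_b = np.array(self.buffer_K), np.array(self.buffer_V)
            keep = softmax_balance(K_b, V_b, self.c)
            self._push(1, K_b[keep], V_b[keep])
            self.buffer_K, self.buffer_V = [], []

    def estimate(self, q):
        d = q.shape[0]
        num, den = 0.0, 0.0
        items = ([(0, np.array(self.buffer_K), np.array(self.buffer_V))]
                 if self.buffer_K else [])
        items += [(ell, K, V) for ell, (K, V) in self.levels.items()]
        for ell, K, V in items:
            w = (2.0 ** ell) * np.exp(K @ q / np.sqrt(d))   # level-ell weight 2^ell
            num = num + w @ V
            den = den + w.sum()
        return num / den
```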

5. Space–Time Complexity and Lower Bounds

BalanceKV delivers an $\varepsilon$-approximation under the conditions $\|q_j\|_2, \|k_j\|_2 \leq r$ with batch size

$$t = O^*\Bigl(\exp\bigl(2r^2/\sqrt d\bigr)/\varepsilon\Bigr), \qquad T = \log_2(n/t),$$

yielding overall space $O^*(t)$ and time per step $O^*(t^2)$. The salient guarantee is memory sublinear in $n$ for fixed $r,d,\varepsilon$. An exponential factor $\exp(2r^2/\sqrt d)$ enters the batch size, but it may be practical for moderate norm bounds. No nontrivial streaming lower bound is established for attention approximation; standard $\ell_2$-sampling theory implies $\Omega(\varepsilon^{-2})$ tokens are needed for naive sampling, and reconstructing general length-$n$ sequences incurs $\Omega(n)$ memory. Closing the gap between exponential and polynomial bounds remains open.
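
As a rough numerical illustration of these bounds (hypothetical values, ignoring the polylogarithmic factors hidden in the $O^*(\cdot)$ notation):

```python
# Illustrative parameter arithmetic for the bounds above (all values are assumptions).
import math

r, d, eps, n = 4.0, 64, 0.1, 131072
t = math.ceil(math.exp(2 * r * r / math.sqrt(d)) / eps)   # t = O*(exp(2r^2/sqrt(d))/eps) ~ 547
T = math.ceil(math.log2(n / t))                            # ~8 merge-reduce levels
print(t, T)   # stored pairs ~ O(t log(n/t)), versus n = 131072 for the exact cache
```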

6. Empirical Evaluation and Benchmarking

Experiments demonstrate BalanceKV’s superiority over existing token selection and pruning methods:

  • Single-layer attention approximation: On TriviaQA prompts, using LLaMA 2 and Mistral, at compression rates $1/2,1/4,1/8,1/16$ and layers 1, 2, 5, BalanceKV halved Frobenius-norm error compared to uniform sampling.
  • End-to-end long-context performance: On LongBench-E benchmarks (single-doc QA, multi-doc QA, summarization, few-shot, synthetic, code), using LLaMA 2 at $1/4$ compression, BalanceKV outperformed PyramidKV, SnapKV, StreamingLLM, and uniform sampling in average task-specific accuracy (44.99 vs. 44.82/44.57/44.03/42.51) and remained competitive in summarization (23.82 vs. 24.31 for uniform sampling and 19–20 for the other baselines).

These results indicate that discrepancy-based, geometric selection outperforms independent sampling and heuristic token pruning.

7. Practical Implementation and Considerations

BalanceKV is applied at inference time only, requiring no retraining or architectural modification. In practice, KV caches are partitioned into blocks (e.g., $b=256$), with SoftmaxBalance applied in parallel across blocks and survivors reclustered to optimize GPU utilization. Bucketing by $\|v_j\|$ is central to controlling error propagation. Typical parameter selection involves $t \approx O(1/\varepsilon)$ for moderate precision and tuning $b$ for device compatibility.
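
A minimal sketch of such blocked compression, assuming the `softmax_balance` routine from Section 2 and an illustrative discrepancy scale `c` (parameter values are assumptions, not the reference implementation):

```python
# Hypothetical block-wise compression of an existing KV cache: each block of b tokens
# is compressed independently, so blocks can be processed in parallel on a GPU.
import numpy as np

def compress_blocks(K, V, b=256, c=50.0):
    kept_K, kept_V = [], []
    for start in range(0, K.shape[0], b):
        Kb, Vb = K[start:start + b], V[start:start + b]
        keep = softmax_balance(Kb, Vb, c)        # reuse the sketch from Section 2
        kept_K.append(Kb[keep]); kept_V.append(Vb[keep])
    return np.vstack(kept_K), np.vstack(kept_V)
```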

A plausible implication is that architectural simplicity and inference-time applicability make BalanceKV suitable for integration into existing LLM deployment pipelines, particularly those subject to memory constraints in extended-context settings.


BalanceKV establishes the first streaming, discrepancy-theoretic approach for sublinear-in-$n$ attention approximation, integrating geometric selection and vector balancing theory with strong theoretical guarantees and demonstrated empirical advantages in modern LLM workloads (Han et al., 11 Feb 2025).
