A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance (2512.07647v1)
Abstract: We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat P$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds -- from a single boundary gap through multi-gap and blockwise variants -- that control $\mathrm{TV}(P,\hat P)$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2=\tau\,\|\mu_{\mathrm{tail}}-\mu_{\mathrm{head}}\|_2$ with $\tau=\mathrm{TV}(P,\hat P)$, yielding a new head-tail diameter bound $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2\le\tau\,\mathrm{diam}_{H,T}$ and refinements linking the error to $\mathrm{Var}_P(V)$. Under an i.i.d. Gaussian score model $s_i\sim\mathcal N(\mu,\sigma^2)$ we derive closed-form tail masses and an asymptotic rule for the minimal $k_\varepsilon$ ensuring $\mathrm{TV}(P,\hat P)\le\varepsilon$, namely $k_\varepsilon/n\approx\Phi^{c}(\sigma+\Phi^{-1}(\varepsilon))$. Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of $k_\varepsilon/n$ and show that certified Top-$k$ can reduce scored keys by 2-4$\times$ on average while meeting the prescribed total-variation budget.
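The identities quoted in the abstract are easy to check numerically. The sketch below (not the authors' code; variable names such as `scores`, `V`, and `head` are illustrative) verifies on a random instance that the total-variation distance of a renormalized Top-$k$ truncation equals the discarded softmax tail mass, that it satisfies $\mathrm{TV}=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, and that the output error factorizes as $\tau\,\|\mu_{\mathrm{tail}}-\mu_{\mathrm{head}}\|_2$.

```python
# Minimal numerical check of the TV/KL identity and the head-tail factorization,
# assuming Top-k truncation means keeping the k largest softmax weights and
# renormalizing them; this is a sketch, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8                        # keys, value dimension, retained keys

scores = rng.normal(size=n)                # attention logits s_i for one query
V = rng.normal(size=(n, d))                # value vectors

P = np.exp(scores - scores.max())
P /= P.sum()                               # full softmax distribution P

head = np.argsort(P)[-k:]                  # indices of the k largest probabilities
tail = np.setdiff1d(np.arange(n), head)

P_hat = np.zeros(n)
P_hat[head] = P[head] / P[head].sum()      # renormalized Top-k truncation

tail_mass = P[tail].sum()
tv = 0.5 * np.abs(P - P_hat).sum()
kl = np.sum(P_hat[head] * np.log(P_hat[head] / P[head]))

print(f"discarded tail mass   = {tail_mass:.6f}")
print(f"TV(P, P_hat)          = {tv:.6f}")               # equals the tail mass
print(f"1 - exp(-KL)          = {1 - np.exp(-kl):.6f}")  # equals TV

# Exact head-tail decomposition of the attention output error.
attn_full = P @ V                          # Attn(q, K, V)
attn_topk = P_hat @ V                      # Attn_k(q, K, V)
mu_head = P[head] @ V[head] / P[head].sum()
mu_tail = P[tail] @ V[tail] / tail_mass

lhs = np.linalg.norm(attn_full - attn_topk)
rhs = tv * np.linalg.norm(mu_tail - mu_head)
print(f"||Attn - Attn_k||_2   = {lhs:.6f}")
print(f"tau * ||mu_t - mu_h|| = {rhs:.6f}")              # equal up to float error
```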
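The Gaussian-score rule can likewise be probed with a rough Monte Carlo check. The sketch below (again an illustrative assumption-laden example, with made-up names like `min_k_for_budget`) samples i.i.d. scores $s_i\sim\mathcal N(0,\sigma^2)$, finds the smallest $k$ whose discarded tail mass stays within the budget $\varepsilon$, and compares the average fraction $k_\varepsilon/n$ with the asymptotic prediction $\Phi^{c}(\sigma+\Phi^{-1}(\varepsilon))$; agreement should only be approximate at finite $n$.

```python
# Rough Monte Carlo sanity check of k_eps/n ~ Phi^c(sigma + Phi^{-1}(eps))
# under i.i.d. Gaussian logits; Phi^c is the standard normal survival function.
import numpy as np
from scipy.stats import norm

def min_k_for_budget(scores: np.ndarray, eps: float) -> int:
    """Smallest k whose Top-k truncation has discarded tail mass (= TV) <= eps."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    head_mass = np.cumsum(np.sort(p)[::-1])
    return int(np.searchsorted(head_mass, 1.0 - eps) + 1)

rng = np.random.default_rng(1)
n, sigma, eps, trials = 8192, 2.0, 0.05, 20

empirical = np.mean([min_k_for_budget(rng.normal(0.0, sigma, size=n), eps) / n
                     for _ in range(trials)])
predicted = norm.sf(sigma + norm.ppf(eps))   # Phi^c(sigma + Phi^{-1}(eps))

print(f"empirical k_eps/n ~ {empirical:.4f}")
print(f"predicted k_eps/n ~ {predicted:.4f}")
```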