
SiftAttention: Efficient Transformer Attention

Updated 6 January 2026
  • The paper introduces SiftAttention, which replaces expensive top-k selection with a dynamic, threshold-based filtering method to reduce memory bandwidth usage.
  • The method uses a brief warmup phase to fit a power-law model on attention score quantiles, enabling accurate threshold estimation during inference.
  • Empirical results show SiftAttention achieves 10–15% kernel latency reduction and 31% HBM read savings while maintaining near-full attention accuracy.

SiftAttention is an approximate attention mechanism for transformer-based sequence models that improves the efficiency of LLM inference, particularly on modern GPUs where memory bandwidth between High Bandwidth Memory (HBM) and SRAM is a bottleneck. SiftAttention replaces the computationally expensive top-k selection commonly used in sparse/approximate attention with a differentiable, data-parallel, element-wise thresholding operation. The pruning threshold is determined dynamically at each generation step by fitting the empirical τ-th quantile of attention scores to a power law over the course of an initial warmup phase. This methodology yields substantial reductions in memory bandwidth and kernel latency on real hardware, while preserving model performance at high sparsity levels (Koley et al., 5 Jun 2025).

1. Core Mechanism and Departure from Top-k Attention

Traditional top-k approximate attention in LLMs operates by first computing the full softmax attention vector and then selecting the indices of the k largest entries, using a sorting or selection algorithm of O(S log S) (or O(S) in the best case), followed by gathering the value vectors at those locations. This step incurs significant synchronization and computational overhead, resulting in inefficient kernel execution on GPUs.

SiftAttention eliminates the need for this expensive top-k step by substituting it with a threshold-based filter:

  • During a short warmup phase (w steps), SiftAttention records the τ-th quantile θ_{t,τ} of the attention score vector at each generation step t.
  • It then fits a two-parameter power-law model, θ_{t,τ} ≈ α t^{-β}, to the quantile trajectory via least-squares regression in log-log space.
  • In subsequent generation steps (t > w), SiftAttention computes a threshold η_t = α t^{-β} and zeros out all attention weights below η_t.
  • This per-element thresholding is fully parallelizable and highly efficient on GPU architectures, compared with the global sort or selection required by top-k.

The result is a mechanism that maintains attention output quality while reducing computational and bandwidth overhead.
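The contrast between the two selection strategies can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's kernel; the sequence length, k, and the quantile-based stand-in for the fitted threshold are assumed example values):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1024)
attn = np.exp(scores - scores.max())
attn /= attn.sum()                      # softmax attention weights

# Top-k: a global selection over all S entries.
k = 256
topk_idx = np.argpartition(attn, -k)[-k:]

# Threshold filter: one comparison per entry, no global sort/selection.
eta = np.quantile(attn, 0.75)           # stand-in for the fitted threshold
kept_idx = np.flatnonzero(attn > eta)

# With eta placed at the 0.75-quantile, the filter keeps exactly the
# top quarter of entries, matching top-k with k = S/4.
```

The point is operational: the threshold path needs only an element-wise compare and a gather, which maps cleanly onto GPU data parallelism.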

2. Power-law Structure in Attention Score Quantiles

A central empirical finding substantiating SiftAttention is that the τ-th quantile θ_{t,τ} of the attention weights decays with the generation step t according to a power law. In symbols,

\theta_{t,\tau} = Q(\tau, t) \approx \alpha t^{-\beta} \implies \log \theta_{t,\tau} \approx \log \alpha - \beta \log t

Here,

  • t: current generation step (equivalently, the length of the key-value cache)
  • τ ∈ (0, 1): quantile level (e.g., τ = 0.75)
  • θ_{t,τ}: τ-th quantile of the attention scores at step t
  • α, β: empirically fitted, strictly positive constants

This structure was validated across five transformer LLMs and two datasets, with the fit quality measured by R² statistics showing median values in [0.6, 0.8] and 5th–95th percentile ranges within [0.4, 0.9]. This predictable decay allows the threshold to be estimated robustly after a short warmup.
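The regularity is easy to probe with a small synthetic experiment (a sketch under assumed constants α = 0.08, β = 0.6 and mild log-normal noise, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1, 257)                       # steps 1..w with w = 256
alpha_true, beta_true = 0.08, 0.6
theta = alpha_true * t**-beta_true * np.exp(rng.normal(0.0, 0.1, t.size))

# Fit log(theta) ~ log(alpha) - beta * log(t) by OLS in log-log space.
x, y = np.log(t), np.log(theta)
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

beta_hat = -slope                           # decay exponent estimate
```

At this noise level the fit recovers β to within a few percent, with R² well above 0.5.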

3. Threshold Estimation, Warmup Procedure, and Filtering Operation

During the warmup phase (t = 1, …, w), the system computes the exact θ_{t,τ} at every step, then fits the power law by ordinary least squares:

\beta = -\frac{\sum_{t=1}^{w} (\log t - \overline{\log t})\,(\log \theta_{t,\tau} - \overline{\log \theta})}{\sum_{t=1}^{w} (\log t - \overline{\log t})^2}

\alpha = \exp\left(\overline{\log \theta} + \beta\,\overline{\log t}\right)

After the warmup, for each subsequent generation step t > w, the threshold is predicted dynamically as

\eta_t = \alpha t^{-\beta}
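The closed-form fit and threshold prediction can be transcribed directly (a sketch, not the released implementation; β is taken as the negated log-log OLS slope so that it is positive for decaying quantiles):

```python
import numpy as np

def fit_power_law(theta_warmup):
    """OLS fit of theta_t ~ alpha * t**-beta from warmup quantiles.

    theta_warmup[t-1] holds the tau-quantile recorded at warmup step t.
    """
    w = len(theta_warmup)
    log_t = np.log(np.arange(1, w + 1))
    log_theta = np.log(theta_warmup)
    dt = log_t - log_t.mean()
    # Slope of log(theta) vs log(t) is -beta; negate so beta > 0.
    beta = -np.sum(dt * (log_theta - log_theta.mean())) / np.sum(dt**2)
    alpha = np.exp(log_theta.mean() + beta * log_t.mean())
    return alpha, beta

def threshold(alpha, beta, t):
    """Predicted pruning threshold eta_t = alpha * t**-beta for t > w."""
    return alpha * t**-beta

# Sanity check: an exact power-law trajectory is recovered exactly.
w = 128
theta = 0.05 * np.arange(1, w + 1) ** -0.8
alpha, beta = fit_power_law(theta)
```

On noiseless input the recovered parameters match (α, β) = (0.05, 0.8) up to floating-point error.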

Given the attention vector a_t at step t,

  • The retained indices are computed as I_t = { i : a_t[i] > η_t }
  • The pruned weights a_t' = a_t[I_t] and pruned value vectors V_{:t}' = V_{:t}[I_t] are gathered
  • The output is o_t = a_t' V_{:t}'

This approach requires only a comparison per entry and an indexed gather, with no global sort.
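The filtering step above can be sketched in NumPy (the paper uses a custom Triton kernel; the shapes, the peaked synthetic attention vector, and the choice η_t = 1/t here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
t, d = 512, 64
scores = 3.0 * rng.normal(size=t)   # peaked scores, as attention tends to be
a = np.exp(scores - scores.max())
a /= a.sum()                        # attention weights a_t over t cached keys
V = rng.normal(size=(t, d))         # cached value vectors V_{:t}
eta = 1.0 / t                       # stand-in for the predicted threshold

keep = a > eta                      # one comparison per entry, no sort
o_sparse = a[keep] @ V[keep]        # o_t = a_t' V_{:t}'
o_full = a @ V

rel_err = np.linalg.norm(o_sparse - o_full) / np.linalg.norm(o_full)
```

Because the weight mass concentrates on a few keys, most entries fall below the threshold while the pruned output stays close to full attention.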

4. Efficiency, Computational Complexity, and Hardware Implications

Let S denote the sequence length, D the embedding dimension, p the average retained fraction (so the realized sparsity is 1 − p), and w the warmup length. Full softmax attention must transfer O(SD) words from HBM to SRAM. Top-k reduces the transfer to O(pSD), but incurs an O(S log S) (or O(S) at best) selection cost with high synchronization overhead.

SiftAttention's thresholding is O(S) with no thread divergence, supporting perfect data parallelism. On tested A100 hardware with sequence length 8192 and batch size 8:

Method                     HBM Reads (MB)    Kernel Latency (ms)
Full attention             840.8             1.46
SiftAttention (75% sp.)    579.8             1.30

  • HBM read savings: ≈ 31% at 75% sparsity
  • Kernel latency reduction: ≈ 10% at high sparsity

Bandwidth asymptotics are

  • Full attention: B_full = O(SD)
  • SiftAttention: B_sift = O(pSD)

These results demonstrate that throughput improves as sparsity increases, with the largest benefits in kernels that exploit the reduced memory traffic efficiently on GPU hardware.
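A back-of-envelope instance of these asymptotics (the head dimension, fp16 width, and p = 0.25 are assumed example values, not measurements from the paper):

```python
S, D = 8192, 128            # sequence length, per-head embedding dimension
bytes_per_elem = 2          # fp16
p = 0.25                    # retained fraction at 75% sparsity

full_bytes = S * D * bytes_per_elem              # B_full = O(S * D)
sift_bytes = int(p * S * D) * bytes_per_elem     # B_sift = O(p * S * D)

# Value-cache traffic shrinks proportionally to the retained fraction:
# 2.0 MiB per head per step at full attention vs 0.5 MiB when p = 0.25.
```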

5. Model Performance and Comparative Metrics

SiftAttention was evaluated using Llama-family models on WikiText-2 (perplexity), IFEval (accuracy), MATH-Hard (accuracy), and LongGenBench (accuracy). Key metrics with various warmup and sparsity settings are summarized below.

WikiText-2 Perplexity (Llama-3.1 8B, full = 5.45):

Method                  Sparsity    Perplexity    ΔPPL
Top-k (k/S ≈ 0.50)      50%         5.47          +0.02
SiftAttention           50%         5.49          +0.04
Top-k (k/S ≈ 0.75)      75%         5.65          +0.20
SiftAttention           75%         5.54          +0.09

IFEval Accuracy (Llama-3.1 8B, full = 91.2%):

Method            Sparsity    Accuracy
Top-k (0.75)      75%         89.5%
SiftAttention     75%         90.3%
Top-k (0.90)      90%         86.1%
SiftAttention     90%         88.8%

LongGenBench Accuracy (Llama-3.1 8B, full = 72.4%):

Method            Sparsity    Accuracy
Top-k (0.75)      75%         70.1%
SiftAttention     75%         71.0%
Top-k (0.90)      90%         67.3%
SiftAttention     90%         69.5%

Across all variants, perplexity degradation was ≤ 0.1 at up to roughly 75% sparsity, and accuracy drops at very high sparsity (90%) remained ≤ 3% with appropriate warmup (w = 256 or w = 512).

6. Implementation, Batching, and Practical Details

  • The warmup phase is implemented in PyTorch with torch.compile; the least-squares fit costs ≪ 1 ms in practice.
  • Approximate generation is executed via a custom Triton kernel. The current Triton implementation writes filtered indices to global memory and then performs a sparse × dense matrix multiply, with further optimizations possible in future CUDA backend releases.
  • Batching is fully supported, with the threshold and realized sparsity computed per batch member.
  • Thresholds are stable provided the power-law fit achieves R² > 0.5. Empirically, τ ∈ [0.75, 0.95] and w ∈ [128, 512] are reliable.
  • The warmup overhead is minor, affecting only the first w tokens (12.5% of a 4096-token generation for w = 512), after which sparse attention proceeds at full speed.
  • Latency benefits of 10–15% are observed in practice at high sparsity, with potential for further gains from improved dense-kernel fusion.
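Per-sample thresholding in a batch can be sketched as follows (the fitted (α, β) values here are illustrative assumptions; in the real system each sample's parameters come from its own warmup fit):

```python
import numpy as np

rng = np.random.default_rng(3)
B, t = 4, 1024
attn = rng.random((B, t))
attn /= attn.sum(axis=1, keepdims=True)      # one attention row per sample

alpha = np.array([0.05, 0.08, 0.03, 0.06])   # per-sample fitted parameters
beta = np.array([0.70, 0.60, 0.80, 0.65])
eta = alpha * t ** -beta                     # one threshold per batch member

mask = attn > eta[:, None]                   # broadcast compare, no sort
realized_sparsity = 1.0 - mask.mean(axis=1)  # fraction pruned, per sample
```

Each row is pruned against its own threshold, so realized sparsity naturally varies across batch members.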

7. Theoretical and Empirical Implications

SiftAttention demonstrates that practical, hardware-efficient approximate attention is achievable by leveraging predictable statistical regularities in transformer attention score distributions. The replacement of global selection with a parametric, dynamically fitted threshold is a key enabler of competitive performance and efficiency. This approach suggests broader applicability for power-law-guided dynamic sparsity in large sequence models, potentially extending beyond attention to other kernel-bottlenecked neural computations (Koley et al., 5 Jun 2025).
