SiftAttention: Efficient Transformer Attention
- The paper introduces SiftAttention, which replaces expensive top-k selection with a dynamic, threshold-based filtering method to reduce memory bandwidth usage.
- The method uses a brief warmup phase to fit a power-law model on attention score quantiles, enabling accurate threshold estimation during inference.
- Empirical results show SiftAttention achieves 10–15% kernel latency reduction and 31% HBM read savings while maintaining near-full attention accuracy.
SiftAttention is an approximate attention mechanism for transformer-based sequence models that improves the efficiency of LLM inference, particularly on modern GPUs where memory bandwidth between High Bandwidth Memory (HBM) and SRAM is a bottleneck. SiftAttention replaces the computationally expensive top-$k$ selection commonly used in sparse/approximate attention with a differentiable, data-parallel, element-wise thresholding operation. The threshold for pruning is dynamically determined at each generation step by fitting the empirical $\tau$-th quantile of attention scores to a power law over the course of an initial warmup phase. This methodology yields substantial reductions in memory bandwidth and kernel latency on real hardware, while preserving model performance at high sparsity levels (Koley et al., 5 Jun 2025).
1. Core Mechanism and Departure from Top-k Attention
Traditional top-$k$ approximate attention in LLMs operates by first computing the full softmax attention vector and then selecting the $k$ indices corresponding to the largest entries, using a sorting or selection algorithm of $O(n \log n)$ (or $O(n)$ in the best case), followed by a gather of the value vectors at those locations. This step incurs significant synchronization and computational overhead, resulting in inefficient kernel execution on GPUs.
SiftAttention eliminates the need for this expensive top-$k$ step by substituting it with a threshold-based filter:
- During a short warmup phase ($t \le T_w$ steps), SiftAttention records the $\tau$-th quantile $q_\tau(t)$ of the attention score vector at each token position $t$.
- It then fits a two-parameter power-law model, $q_\tau(t) \approx a \, t^{-b}$, to the quantile trajectory via least-squares regression in log-log space.
- In subsequent generation steps ($t > T_w$), SiftAttention computes a threshold $\hat{q}_\tau(t) = a \, t^{-b}$ and zeros out all attention weights below it.
- This per-element thresholding is fully parallelizable and highly efficient on GPU architectures, compared to the global sort or selection required by top-$k$.
The result is a mechanism that maintains attention output quality while reducing computational and bandwidth overhead.
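The contrast between the two filtering styles can be sketched as follows. This is a minimal NumPy illustration, not the paper's GPU kernel; the names `topk_filter` and `sift_filter` are ours:

```python
import numpy as np

def topk_filter(attn: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest attention weights; requires a global selection pass."""
    idx = np.argpartition(attn, -k)[-k:]  # O(n) selection, costly to synchronize on GPU
    out = np.zeros_like(attn)
    out[idx] = attn[idx]
    return out

def sift_filter(attn: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out weights below a precomputed threshold; one comparison per entry."""
    return np.where(attn >= threshold, attn, 0.0)  # element-wise, no sort, no sync

attn = np.array([0.40, 0.05, 0.30, 0.02, 0.20, 0.03])
kept_topk = topk_filter(attn, k=3)
kept_sift = sift_filter(attn, threshold=0.1)  # a matching threshold keeps the same support
```

When the threshold lands between the $k$-th and $(k{+}1)$-th largest scores, both filters retain the same entries, but the thresholded version avoids the global selection entirely.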
2. Power-law Structure in Attention Score Quantiles
A central empirical finding substantiating SiftAttention is that the $\tau$-th quantile of the attention weights decays with generation step $t$ according to a power law. In symbols,

$$q_\tau(t) \approx a \, t^{-b}$$

Here,
- $t$: current generation step or length of the key-value cache
- $\tau$: quantile level
- $q_\tau(t)$: $\tau$-th quantile of attention scores at step $t$
- $a, b$: empirically fitted strictly positive constants

This structure was validated across 5 transformer LLMs and two datasets, with goodness of fit evaluated by $R^2$ statistics whose median and 5th–95th-percentile values indicate a consistently strong fit. This predictable decay allows the threshold to be robustly estimated after a short warmup.
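The log-log least-squares fit can be sketched as follows (function names and the synthetic data are illustrative; the paper's implementation details may differ):

```python
import numpy as np

def fit_power_law(steps: np.ndarray, quantiles: np.ndarray) -> tuple[float, float]:
    """Fit q_tau(t) ~ a * t**(-b) by ordinary least squares in log-log space."""
    # log q = log a - b * log t is linear in log t, so a degree-1 polyfit suffices.
    slope, intercept = np.polyfit(np.log(steps), np.log(quantiles), deg=1)
    return float(np.exp(intercept)), float(-slope)  # a = exp(intercept), b = -slope

# Synthetic check: quantiles generated from an exact power law are recovered.
t = np.arange(1, 65)               # warmup steps 1..64 (illustrative length)
q = 0.5 * t.astype(float) ** -0.8  # ground truth: a = 0.5, b = 0.8
a, b = fit_power_law(t, q)
```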
3. Threshold Estimation, Warmup Procedure, and Filtering Operation
During the warmup phase ($t \le T_w$), the system computes the exact quantile $q_\tau(t)$ at every step, then fits the power law by ordinary least squares in log-log space:

$$\log q_\tau(t) \approx \log a - b \log t$$

After the warmup, for each subsequent generation step $t > T_w$, the threshold is dynamically predicted as

$$\hat{q}_\tau(t) = a \, t^{-b}$$

Given the attention vector $\alpha \in \mathbb{R}^t$,
- Retained indices are computed as $S = \{\, i : \alpha_i \ge \hat{q}_\tau(t) \,\}$
- The pruned weights $\{\alpha_i\}_{i \in S}$ and corresponding value vectors $\{v_i\}_{i \in S}$ are gathered
- The output is $o = \sum_{i \in S} \alpha_i v_i$
This approach requires only a comparison per entry and an indexed gather, with no global sort.
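Putting the pieces together, one post-warmup decode step might look like the following hedged sketch (whether the retained weights are renormalized is an implementation choice not specified here):

```python
import numpy as np

def sift_step(attn: np.ndarray, values: np.ndarray, a: float, b: float, t: int) -> np.ndarray:
    """One decode step: predict the threshold, filter, gather, and combine.

    attn:   (n,) softmax attention weights at step t
    values: (n, d) cached value vectors
    a, b:   power-law parameters fitted during warmup
    """
    threshold = a * t ** (-b)                # predicted tau-th quantile at step t
    keep = np.nonzero(attn >= threshold)[0]  # comparison per entry + indexed gather, no sort
    return attn[keep] @ values[keep]         # weighted sum of retained value vectors

attn = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
values = np.arange(15, dtype=float).reshape(5, 3)
out = sift_step(attn, values, a=0.8, b=1.0, t=4)  # threshold = 0.8 * 4**-1 = 0.2
```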
4. Efficiency, Computational Complexity, and Hardware Implications
Let $n$ denote the sequence length, $d$ the embedding dimension, $\rho$ the average retained fraction ("realized sparsity"), and $T_w$ the warmup length. Full softmax attention necessitates transferring $O(nd)$ words from HBM to SRAM. Top-$k$ also requires $O(nd)$ transfer, but incurs $O(n \log n)$ (or $O(n)$) selection costs with high synchronization overhead.
SiftAttention's thresholding is $O(n)$ with no thread divergence, supporting perfect data parallelism. On tested A100 hardware with sequence length 8192 and batch size 8:
| Method | HBM Reads (MB) | Kernel Latency (ms) |
|---|---|---|
| Full attention | 840.8 | 1.46 |
| SiftAttention (75% sp.) | 579.8 | 1.30 |
- HBM read savings: ≈31% at 75% sparsity
- Kernel latency reduction: 10–15% at high sparsity
Bandwidth asymptotics are
- Full: $O(nd)$
- SiftAttention: $O(\rho n d)$
These results demonstrate that throughput improves as sparsity increases, with the largest gains for kernels that fully exploit the GPU's memory hierarchy.
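The asymptotics admit a back-of-envelope check. The sketch below counts only value-vector reads and ignores score reads and other fixed traffic, so it overstates the savings relative to the measured ≈31%:

```python
def value_read_words(n: int, d: int, rho: float = 1.0) -> float:
    """Approximate HBM words read for value vectors with retained fraction rho."""
    return rho * n * d

full = value_read_words(n=8192, d=128)              # dense attention reads all values
sparse = value_read_words(n=8192, d=128, rho=0.25)  # 75% sparsity => rho = 0.25
savings = 1.0 - sparse / full                       # idealized fraction of reads avoided
```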
5. Model Performance and Comparative Metrics
SiftAttention was evaluated using Llama-family models on WikiText-2 (perplexity), IFEval (accuracy), MATH-Hard (accuracy), and LongGenBench (accuracy). Key metrics with various warmup and sparsity settings are summarized below.
WikiText-2 Perplexity (Llama-3.1 8B, full = 5.45):
| Method | Sparsity | Perplexity | ΔPPL |
|---|---|---|---|
| Top-k | 50% | 5.47 | +0.02 |
| SiftAttention | 50% | 5.49 | +0.04 |
| Top-k | 75% | 5.65 | +0.20 |
| SiftAttention | 75% | 5.54 | +0.09 |
IFEval Accuracy (Llama-3.1 8B, full = 91.2%):
| Method | Sparsity | Accuracy |
|---|---|---|
| Top-k (0.75) | 75% | 89.5% |
| SiftAttention | 75% | 90.3% |
| Top-k (0.90) | 90% | 86.1% |
| SiftAttention | 90% | 88.8% |
LongGenBench Accuracy (Llama-3.1 8B, full = 72.4%):
| Method | Sparsity | Accuracy |
|---|---|---|
| Top-k (0.75) | 75% | 70.1% |
| SiftAttention | 75% | 71.0% |
| Top-k (0.90) | 90% | 67.3% |
| SiftAttention | 90% | 69.5% |
Across all variants, perplexity degradation was small (under 0.1 for SiftAttention) at up to roughly 75% sparsity, and accuracy drops at very high sparsity (90%) remained within a few percentage points given an appropriate warmup length $T_w$.
6. Implementation, Batching, and Practical Details
- The warmup phase is implemented in PyTorch with `torch.compile`; the least-squares fit costs only milliseconds in practice.
- Approximate generation is executed via a custom Triton kernel. The current Triton implementation writes filtered indices to global memory, then performs a sparse–dense matrix multiply, with further optimizations possible in future CUDA backend releases.
- Batching is fully supported, with per-sample thresholding and sparsity computed for each batch member.
- Thresholds are stable provided the power-law fit achieves a sufficiently high $R^2$. Empirically, the tested warmup lengths and quantile levels are reliable.
- The warmup overhead is minor, affecting only the first $T_w$ tokens (typically 12.5% of generated tokens at the tested settings), after which sparse attention proceeds at full speed.
- Latency benefits of 10–15% are observed in practice at high sparsity, with potential for further gains with improved dense kernel fusion in hardware drivers.
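Per-sample batched thresholding, as described above, can be sketched with broadcasting. This is an assumption-laden illustration: we assume each batch member carries its own fitted $(a, b)$ from its warmup, which the source implies but does not spell out:

```python
import numpy as np

def batched_sift(attn: np.ndarray, a: np.ndarray, b: np.ndarray, t: int) -> np.ndarray:
    """Apply each sample's predicted threshold to its own attention row.

    attn: (B, n) attention weights; a, b: (B,) per-sample power-law fits.
    """
    thr = (a * t ** (-b))[:, None]           # (B, 1), broadcast across positions
    return np.where(attn >= thr, attn, 0.0)  # independent sparsity per batch member

attn = np.array([[0.5, 0.3, 0.1],
                 [0.6, 0.2, 0.1]])
a = np.array([0.8, 0.6])
b = np.array([1.0, 1.0])
filtered = batched_sift(attn, a, b, t=4)  # per-sample thresholds 0.2 and 0.15
```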
7. Theoretical and Empirical Implications
SiftAttention demonstrates that practical, hardware-efficient approximate attention is achievable by leveraging predictable statistical regularities in transformer attention score distributions. The replacement of global selection with a parametric, dynamically fitted threshold is a key enabler of competitive performance and efficiency. This approach suggests broader applicability for power-law-guided dynamic sparsity in large sequence models, potentially extending beyond attention to other kernel-bottlenecked neural computations (Koley et al., 5 Jun 2025).