SiftAttention: Efficient Transformer Attention
- The paper introduces SiftAttention, which replaces expensive top-k selection with a dynamic, threshold-based filtering method to reduce memory bandwidth usage.
- The method uses a brief warmup phase to fit a power-law model on attention score quantiles, enabling accurate threshold estimation during inference.
- Empirical results show SiftAttention achieves 10–15% kernel latency reduction and 31% HBM read savings while maintaining near-full attention accuracy.
SiftAttention is an approximate attention mechanism for transformer-based sequence models that improves the efficiency of LLM inference, particularly on modern GPUs where memory bandwidth between High Bandwidth Memory (HBM) and SRAM is a bottleneck. SiftAttention replaces the computationally expensive top-$k$ selection commonly used in sparse/approximate attention with a differentiable, data-parallel, element-wise thresholding operation. The threshold for pruning is dynamically determined at each generation step by fitting the empirical $\tau$-th quantile of attention scores to a power law over the course of an initial warmup phase. This methodology yields substantial reductions in memory bandwidth and kernel latency on real hardware, while preserving model performance at high sparsity levels (Koley et al., 5 Jun 2025).
1. Core Mechanism and Departure from Top-k Attention
Traditional top-$k$ approximate attention in LLMs operates by first computing the full softmax attention vector and then selecting the $k$ indices corresponding to the largest entries, using a sorting or selection algorithm of $O(n \log n)$ (or $O(n)$ in the best case), followed by a gather of the value vectors at those locations. This step incurs significant synchronization and computational overhead, resulting in inefficient kernel execution on GPUs.
SiftAttention eliminates the need for this expensive top-$k$ step by substituting it with a threshold-based filter:
- During a short warmup phase ($t \le T_w$ steps), SiftAttention records the $\tau$-th quantile $q_\tau(t)$ of the attention score vector at each token position $t$.
- It then fits a two-parameter power-law model, $q_\tau(t) \approx a \, t^{-b}$, to the quantile trajectory via least-squares regression in log-log space.
- In subsequent generation steps ($t > T_w$), SiftAttention computes a threshold $\hat{q}_\tau(t) = a \, t^{-b}$ and zeros out all attention weights below it.
- This per-element thresholding is fully parallelizable and highly efficient on GPU architectures, compared to the global sort or selection required by top-$k$.
The result is a mechanism that maintains attention output quality while reducing computational and bandwidth overhead.
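The contrast between the two filtering styles can be sketched as follows. This is a minimal NumPy illustration, not the paper's GPU kernel; the names `topk_filter` and `sift_filter` are ours:

```python
import numpy as np

def topk_filter(attn: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest attention weights; requires a global selection pass."""
    idx = np.argpartition(attn, -k)[-k:]  # O(n) selection, costly to synchronize on GPU
    out = np.zeros_like(attn)
    out[idx] = attn[idx]
    return out

def sift_filter(attn: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out weights below a precomputed threshold; one comparison per entry."""
    return np.where(attn >= threshold, attn, 0.0)  # element-wise, no sort, no sync

attn = np.array([0.40, 0.05, 0.30, 0.02, 0.20, 0.03])
kept_topk = topk_filter(attn, k=3)
kept_sift = sift_filter(attn, threshold=0.1)  # a matching threshold keeps the same support
```

When the threshold lands between the $k$-th and $(k{+}1)$-th largest scores, both filters retain the same entries, but the thresholded version avoids the global selection entirely.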
2. Power-law Structure in Attention Score Quantiles
A central empirical finding substantiating SiftAttention is that the $\tau$-th quantile of the attention weights decays with generation step $t$ according to a power law. In symbols,

$$q_\tau(t) \approx a \, t^{-b}$$

Here,
- $t$: current generation step or length of the key-value cache
- $\tau$: quantile level
- $q_\tau(t)$: $\tau$-th quantile of attention scores at step $t$
- $a, b$: empirically fitted strictly positive constants

This structure was validated across 5 transformer LLMs and two datasets, with goodness of fit evaluated by $R^2$ statistics whose median and 5th–95th-percentile values indicate a consistently strong fit. This predictable decay allows the threshold to be robustly estimated after a short warmup.
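The log-log least-squares fit can be sketched as follows (function names and the synthetic data are illustrative; the paper's implementation details may differ):

```python
import numpy as np

def fit_power_law(steps: np.ndarray, quantiles: np.ndarray) -> tuple[float, float]:
    """Fit q_tau(t) ~ a * t**(-b) by ordinary least squares in log-log space."""
    # log q = log a - b * log t is linear in log t, so a degree-1 polyfit suffices.
    slope, intercept = np.polyfit(np.log(steps), np.log(quantiles), deg=1)
    return float(np.exp(intercept)), float(-slope)  # a = exp(intercept), b = -slope

# Synthetic check: quantiles generated from an exact power law are recovered.
t = np.arange(1, 65)               # warmup steps 1..64 (illustrative length)
q = 0.5 * t.astype(float) ** -0.8  # ground truth: a = 0.5, b = 0.8
a, b = fit_power_law(t, q)
```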
3. Threshold Estimation, Warmup Procedure, and Filtering Operation
During the warmup phase ($t \le T_w$), the system computes the exact quantile $q_\tau(t)$ at every step, then fits the power law by ordinary least squares in log-log space:

$$\log q_\tau(t) \approx \log a - b \log t$$

After the warmup, for each subsequent generation step $t > T_w$, the threshold is dynamically predicted as

$$\hat{q}_\tau(t) = a \, t^{-b}$$

Given the attention vector $\alpha \in \mathbb{R}^t$,
- Retained indices are computed as $S = \{\, i : \alpha_i \ge \hat{q}_\tau(t) \,\}$
- The pruned weights $\{\alpha_i\}_{i \in S}$ and corresponding value vectors $\{v_i\}_{i \in S}$ are gathered
- The output is $o = \sum_{i \in S} \alpha_i v_i$
This approach requires only a comparison per entry and an indexed gather, with no global sort.
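Putting the pieces together, one post-warmup decode step might look like the following hedged sketch (whether the retained weights are renormalized is an implementation choice not specified here):

```python
import numpy as np

def sift_step(attn: np.ndarray, values: np.ndarray, a: float, b: float, t: int) -> np.ndarray:
    """One decode step: predict the threshold, filter, gather, and combine.

    attn:   (n,) softmax attention weights at step t
    values: (n, d) cached value vectors
    a, b:   power-law parameters fitted during warmup
    """
    threshold = a * t ** (-b)                # predicted tau-th quantile at step t
    keep = np.nonzero(attn >= threshold)[0]  # comparison per entry + indexed gather, no sort
    return attn[keep] @ values[keep]         # weighted sum of retained value vectors

attn = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
values = np.arange(15, dtype=float).reshape(5, 3)
out = sift_step(attn, values, a=0.8, b=1.0, t=4)  # threshold = 0.8 * 4**-1 = 0.2
```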
4. Efficiency, Computational Complexity, and Hardware Implications
Let $n$ denote the sequence length, $d$ the embedding dimension, $\rho$ the average retained fraction ("realized sparsity"), and $T_w$ the warmup length. Full softmax attention necessitates transferring $O(nd)$ words from HBM to SRAM. Top-$k$ also requires $O(nd)$ transfer, but incurs $O(n \log n)$ (or $O(n)$) selection costs with high synchronization overhead.
SiftAttention's thresholding is $O(n)$ with no thread divergence, supporting perfect data parallelism. On tested A100 hardware with sequence length 8192 and batch size 8:
| Method | HBM Reads (MB) | Kernel Latency (ms) |
|---|---|---|
| Full attention | 840.8 | 1.46 |
| SiftAttention (75% sp.) | 579.8 | 1.30 |
- HBM read savings: ≈31% at 75% sparsity
- Kernel latency reduction: 10–15% at high sparsity
Bandwidth asymptotics are
- Full: $O(nd)$
- SiftAttention: $O(\rho n d)$
These results demonstrate that throughput improves as sparsity increases, with the largest gains for kernels that fully exploit the GPU's memory hierarchy.
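The asymptotics admit a back-of-envelope check. The sketch below counts only value-vector reads and ignores score reads and other fixed traffic, so it overstates the savings relative to the measured ≈31%:

```python
def value_read_words(n: int, d: int, rho: float = 1.0) -> float:
    """Approximate HBM words read for value vectors with retained fraction rho."""
    return rho * n * d

full = value_read_words(n=8192, d=128)              # dense attention reads all values
sparse = value_read_words(n=8192, d=128, rho=0.25)  # 75% sparsity => rho = 0.25
savings = 1.0 - sparse / full                       # idealized fraction of reads avoided
```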
5. Model Performance and Comparative Metrics
SiftAttention was evaluated using Llama-family models on WikiText-2 (perplexity), IFEval (accuracy), MATH-Hard (accuracy), and LongGenBench (accuracy). Key metrics with various warmup and sparsity settings are summarized below.
WikiText-2 Perplexity (Llama-3.1 8B, full = 5.45):
| Method | Sparsity | Perplexity | ΔPPL |
|---|---|---|---|
| Top-k | 50% | 5.47 | +0.02 |
| SiftAttention | 50% | 5.49 | +0.04 |
| Top-k | 75% | 5.65 | +0.20 |
| SiftAttention | 75% | 5.54 | +0.09 |
IFEval Accuracy (Llama-3.1 8B, full = 91.2%):
| Method | Sparsity | Accuracy |
|---|---|---|
| Top-k (0.75) | 75% | 89.5% |
| SiftAttention | 75% | 90.3% |
| Top-k (0.90) | 90% | 86.1% |
| SiftAttention | 90% | 88.8% |
LongGenBench Accuracy (Llama-3.1 8B, full = 72.4%):
| Method | Sparsity | Accuracy |
|---|---|---|
| Top-k (0.75) | 75% | 70.1% |
| SiftAttention | 75% | 71.0% |
| Top-k (0.90) | 90% | 67.3% |
| SiftAttention | 90% | 69.5% |
Across all variants, perplexity degradation was small (under 0.1 for SiftAttention) at up to roughly 75% sparsity, and accuracy drops at very high sparsity (90%) remained within a few percentage points given an appropriate warmup length $T_w$.
6. Implementation, Batching, and Practical Details
- The warmup phase is implemented in PyTorch with `torch.compile`; the least-squares fit costs only milliseconds in practice.
- Approximate generation is executed via a custom Triton kernel. The current Triton implementation writes filtered indices to global memory, then performs a sparse–dense matrix multiply, with further optimizations possible in future CUDA backend releases.
- Batching is fully supported, with per-sample thresholding and sparsity computed for each batch member.
- Thresholds are stable provided the power-law fit achieves a sufficiently high $R^2$. Empirically, the tested warmup lengths and quantile levels are reliable.
- The warmup overhead is minor, affecting only the first $T_w$ tokens (typically 12.5% of generated tokens at the tested settings), after which sparse attention proceeds at full speed.
- Latency benefits of 10–15% are observed in practice at high sparsity, with potential for further gains with improved dense kernel fusion in hardware drivers.
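Per-sample batched thresholding, as described above, can be sketched with broadcasting. This is an assumption-laden illustration: we assume each batch member carries its own fitted $(a, b)$ from its warmup, which the source implies but does not spell out:

```python
import numpy as np

def batched_sift(attn: np.ndarray, a: np.ndarray, b: np.ndarray, t: int) -> np.ndarray:
    """Apply each sample's predicted threshold to its own attention row.

    attn: (B, n) attention weights; a, b: (B,) per-sample power-law fits.
    """
    thr = (a * t ** (-b))[:, None]           # (B, 1), broadcast across positions
    return np.where(attn >= thr, attn, 0.0)  # independent sparsity per batch member

attn = np.array([[0.5, 0.3, 0.1],
                 [0.6, 0.2, 0.1]])
a = np.array([0.8, 0.6])
b = np.array([1.0, 1.0])
filtered = batched_sift(attn, a, b, t=4)  # per-sample thresholds 0.2 and 0.15
```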
7. Theoretical and Empirical Implications
SiftAttention demonstrates that practical, hardware-efficient approximate attention is achievable by leveraging predictable statistical regularities in transformer attention score distributions. The replacement of global selection with a parametric, dynamically fitted threshold is a key enabler of competitive performance and efficiency. This approach suggests broader applicability for power-law-guided dynamic sparsity in large sequence models, potentially extending beyond attention to other kernel-bottlenecked neural computations (Koley et al., 5 Jun 2025).