BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding (2512.12087v1)

Published 12 Dec 2025 in cs.CL

Abstract: The growing demand for long-context inference capabilities in LLMs has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62x speedup for prefill at 74.7% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.

Summary

  • The paper introduces a training-free sparse attention mechanism leveraging online softmax statistics for dynamic block pruning to mitigate quadratic attention complexity.
  • It calibrates sparsity thresholds empirically to maintain predictable speedups and competitive accuracy across varying context lengths and attention variants.
  • Experimental results on H200 and B200 GPUs demonstrate up to 1.62x speedup and minimal accuracy loss in both compute-bound and memory-bound phases.

BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding — Expert Summary

Motivation and Context

The quadratic computational and memory complexity inherent in the Transformer attention mechanism presents critical challenges for scaling LLMs to longer context windows. FlashAttention and similar block-wise innovations have optimized memory and compute efficiency but still compute dense attention scores, leaving the core $O(n^2)$ bottleneck unmitigated. Existing sparse attention methods require costly pre-computation or rely on proxy scores of uncertain reliability, which limits both speedup realization and accuracy. BLASST ("Dynamic BLocked Attention Sparsity via Softmax Thresholding") introduces a kernel-level, training-free sparse attention mechanism that leverages online softmax statistics for dynamic block pruning without any proxy-based estimation, fitting seamlessly into FlashAttention kernels.

Algorithmic Contributions

BLASST’s central insight is the exploitation of the block-wise online softmax accumulation in FlashAttention. The technique maintains a running row-wise maximum and, for each block, computes its local maximum. If the block’s maximum is lower than the running maximum by more than a fixed threshold $\lambda$, the resulting softmax-normalized values from that block will be near-zero. Computation, value block loading from HBM, and the matrix multiplication for that block are skipped entirely. The skip decision uses only already-computed statistics and incurs negligible computational overhead, with just one per-block comparison and a coordinated skip predicate per CUDA warpgroup.
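
The skip rule can be illustrated with a short, self-contained sketch. The NumPy code below is a minimal, illustrative re-implementation of the block-skipping decision inside an online-softmax loop for a single query; the function and parameter names (blasst_attention_row, lam, block_k) are our own and not the paper's kernel code.

```python
import numpy as np

def blasst_attention_row(q, K, V, lam, block_k=64):
    """Online-softmax attention for one query vector q with block skipping.

    q: (d,) query; K, V: (n, d) keys/values; lam: skip threshold.
    """
    m = -np.inf                      # running row maximum of the logits
    l = 0.0                          # running softmax normalizer
    acc = np.zeros(V.shape[1])       # running weighted sum of value rows

    for start in range(0, K.shape[0], block_k):
        s = q @ K[start:start + block_k].T / np.sqrt(q.shape[0])  # block logits
        m_block = s.max()

        # Skip rule: if the block's best logit trails the running maximum by
        # more than lam, every softmax weight in the block is at most
        # exp(-lam), so the exp, the V-block load, and the matmul are skipped.
        if m_block < m - lam:
            continue

        m_new = max(m, m_block)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block_k]
        m = m_new

    return acc / l
```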

This method is general across attention variants (MHA, GQA, MQA, MLA) and is applicable to both the prefill (compute-bound) and decode (memory-bound) phases. Optimized CUDA kernels are provided for both, with targeted resource savings: compute-heavy operations in prefill and HBM bandwidth in decode.

Calibration and Sparsity-Aware Training

Sparsity control is critical for predictable deployment. Extensive empirical analysis shows a robust inverse relationship between optimal threshold and context length: $\lambda^\star \propto 1/L$, where $L$ is the sequence length. The authors develop an automated calibration procedure that empirically finds threshold values for target sparsity levels over a range of context lengths and fits a regression model, enabling stable sparsity control across deployment scenarios.
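
As a concrete illustration of the calibration idea (not the authors' code), one can sweep thresholds at a few context lengths, record the threshold that hits a target sparsity at each length, and fit the inverse law. The helper below and the example numbers are hypothetical.

```python
import numpy as np

def fit_inverse_threshold_law(context_lengths, thresholds):
    """Least-squares fit of lambda* = a / L through the origin."""
    x = 1.0 / np.asarray(context_lengths, dtype=float)
    y = np.asarray(thresholds, dtype=float)
    return float(x @ y / (x @ x))

# Hypothetical thresholds found by sweeping for a fixed target sparsity.
a = fit_inverse_threshold_law([8_192, 32_768, 131_072], [0.080, 0.020, 0.005])
threshold_for = lambda L: a / L   # predicted lambda* at a new context length
```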

For further accuracy preservation at high sparsity, BLASST is extended to sparsity-aware training, simply applying its block-skipping mechanism during fine-tuning. Skipped blocks receive no gradient, encouraging the model to concentrate critical information in high-scoring blocks. This pushes the accuracy-sparsity Pareto frontier beyond what is possible purely post-training.
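
A minimal PyTorch sketch of this idea is given below, assuming the skip rule is realized as a block-wise mask over dense logits during fine-tuning (the paper applies it inside the attention kernel); shapes, the block size, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparsity_aware_attention(q, k, v, lam, block=64):
    """q, k, v: (batch, heads, seq, dim); assumes seq divisible by block.
    Masked blocks get -inf logits, so they receive zero gradient."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5    # (b, h, sq, sk)
    b, h, sq, sk = scores.shape

    row_max = scores.amax(dim=-1, keepdim=True)              # (b, h, sq, 1)
    blocks = scores.reshape(b, h, sq, sk // block, block)
    block_max = blocks.amax(dim=-1, keepdim=True)            # (b, h, sq, nb, 1)

    # Keep a block only if its maximum is within lam of the row maximum;
    # the block containing the row maximum is therefore always kept.
    keep = block_max >= row_max.unsqueeze(-1) - lam
    blocks = blocks.masked_fill(~keep, float("-inf"))

    probs = F.softmax(blocks.reshape(b, h, sq, sk), dim=-1)
    return probs @ v
```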

Experimental Results and Analysis

Speedup and Accuracy

BLASST yields strong numerical results across benchmarks:

  • Prefill (compute-bound) kernels: Up to 1.62x speedup at 74.7% sparsity on H200 GPU, with minimal accuracy loss.
  • Decode (memory-bound) kernels: Up to 1.48x speedup at 73.2% sparsity on B200 GPU.
  • Predictable performance scaling: Greater sparsity results in greater speedups, with negligible baseline overhead at low sparsity rates.
  • Accuracy is maintained or marginally improved at moderate sparsity: On Qwen3-8B, at 50% sparsity, MATH500 and AIME tasks show slight accuracy improvements over dense attention.

Comparative Benchmarking

When evaluated against SOTA sparse attention and KV cache compression methods (MInference, FlexPrefill, Quest, RocketKV, etc.), BLASST achieves the best overall accuracy for both prefill and decode optimizations, particularly notable as it requires no proxy importance scores or pre-computation. BLASST also composes effectively with these other methods for end-to-end pipeline optimization.

Calibration Stability

A single threshold applied uniformly across context lengths induces high variance in sparsity, rendering deployment impractical. BLASST’s calibration maintains sparsity within 1–2 percentage points of the target across tasks and context lengths up to 128K tokens.

Robustness and Ablations

  • Sparsity-aware training demonstrably reduces the degradation in accuracy at aggressive sparsity settings and sometimes slightly exceeds dense training baselines.
  • Head and layer-level sparsity analysis shows substantial natural heterogeneity, which BLASST accommodates without explicit layer/head selection, enabling generalized pruning that adapts to intrinsic model statistics.
  • Extreme-length contexts (up to 200K tokens): BLASST maintains >50% sparsity and competitive accuracy, making efficient inference feasible for giant context windows.
  • Tile row reordering for cummax calculation provides dataset-dependent improvements and is supported by the flexible kernel scheduling.

Graceful Degradation

Compared to proxy-based block sparsity (e.g., XAttention), BLASST's use of actual softmax statistics results in more stable accuracy as sparsity increases, supporting deployment for aggressive efficiency scenarios.

Implications and Future Directions

BLASST’s design addresses both practical and theoretical challenges in sparse attention:

  • Efficiency: The direct use of online kernel statistics establishes a practical foundation for on-device, long-context inference with minimal overhead.
  • Hardware-alignment: Specialized kernels deliver realized speedups commensurate with theoretical compute/memory reductions.
  • Adaptability and extensibility: BLASST calibrates naturally across context lengths and layers, composes with other sparsity methods, and its approach is robust to reordering and task variation.
  • Scalability to extreme context sizes: Realizing block sparsity enables inference scaling beyond what dense attention can support, critical for code, document, and agentic AI use cases.

Looking forward, hybrid sparse patterns, hardware-aware kernel designs, and sparsity learned during training will be critical for future LLM architectures and agentic AI systems. BLASST provides a flexible, efficient basis for these explorations, with promising results for both practical deployment and further research in attention sparsity control and adaptive inference.

Conclusion

BLASST introduces an efficient, drop-in, training-free kernel-level sparse attention mechanism leveraging online block-wise softmax statistics, enabling scalable long-context inference with predictable accuracy-speedup trade-offs. Its calibration and sparsity-aware training extensions further expand deployment robustness and efficiency. The methodology is general, hardware-aligned, and practical for both prefill and decode phases, supporting future work in sparse attention and resource-efficient LLM inference.

Reference: "BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding" (2512.12087)
