SeerAttention-R: Sparse Attention for LLMs
- SeerAttention-R is a sparse attention framework that adapts LLMs for long-context autoregressive decoding using a data-driven gating mechanism.
- It employs architectural optimizations like block-sparse operations and a self-distilled gating network to prune unnecessary computations.
- The framework delivers near-lossless accuracy, up to 9× faster decoding, and efficient scaling on modern hardware.
SeerAttention-R is a sparse attention adaptation framework for long-context autoregressive decoding in LLMs, designed to enable accurate and efficient reasoning over extended sequences. Building on the data-driven attention-sparsity paradigm of the original SeerAttention, it introduces architectural modifications and hardware-oriented optimizations, most notably a self-distilled gating module and coarse-grained, plug-in block-sparse operations, that deliver near-lossless reasoning accuracy at dramatically reduced computational cost during long-sequence generation.
1. Architectural Features
SeerAttention-R retains the core innovation of SeerAttention: attention sparsity learned via a gating network, trained to predict which regions of the attention map are most semantically significant. The architectural adaptations to accommodate long autoregressive decoding include:
- Elimination of Query Pooling: Unlike SeerAttention, which performs pooling over queries (suitable for prefill or parallel attention), SeerAttention-R removes query pooling to preserve the token-by-token granularity essential for stepwise autoregressive generation.
- Block-Sparse Attention: The attention mechanism operates over a subset of key-value (KV) blocks, as determined by a gating network, rather than the entire history. This effectively prunes computation and memory proportional to the chosen sparsity level.
- Self-distilled Gating Network: The gating module is trained post hoc on the model’s own dense attention outputs, requiring only minor additional parameters and no modification of the core LLM weights.
- Grouped Query Attention (GQA) Integration: For models using GQA, all queries in a group share the same KV head and block selection decision, further reducing fragmentation and better aligning with hardware tiling strategies.
The gating mechanism can be summarized as follows: for the current query $q$ and the $j$-th block of cached keys $K_j$, the gate computes a block relevance score
$$s_j = q\,W_Q^{g}\,\bigl(\mathrm{pool}(K_j)\,W_K^{g}\bigr)^{\top},$$
where $\mathrm{pool}(\cdot)$ denotes the blockwise pooling operators applied to the keys, $W_Q^{g}$ and $W_K^{g}$ are the small learned gate projections, and all gating is performed before RoPE positional encoding is re-inserted; the top-scoring blocks are then selected for sparse attention.
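To make the selection step concrete, the following minimal PyTorch sketch scores KV blocks for a single decoding step. The projection names (`W_q_gate`, `W_k_gate`), the choice of mean pooling, and the tensor shapes are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn.functional as F

def gate_block_scores(q, k_cache, W_q_gate, W_k_gate, block_size=64, top_k=32):
    """Score KV blocks for one decoding step and return the selected block indices.

    q:        (num_kv_heads, group_size, head_dim)  current-step queries (one GQA group per KV head)
    k_cache:  (num_kv_heads, seq_len, head_dim)     cached keys (pre-RoPE in this sketch)
    W_q_gate: (head_dim, gate_dim)                  small learned query projection of the gate
    W_k_gate: (head_dim, gate_dim)                  small learned key projection of the gate
    Note: top_k blocks correspond to a token budget of roughly top_k * block_size.
    """
    num_kv_heads, seq_len, head_dim = k_cache.shape
    num_blocks = (seq_len + block_size - 1) // block_size

    # Blockwise pooling of keys: mean-pool each block of `block_size` keys.
    pad = num_blocks * block_size - seq_len
    k_padded = F.pad(k_cache, (0, 0, 0, pad))                       # (H_kv, num_blocks*B, D)
    k_blocks = k_padded.view(num_kv_heads, num_blocks, block_size, head_dim).mean(dim=2)

    # Low-dimensional gate projections (no query pooling: per-token query granularity).
    q_g = q @ W_q_gate                                              # (H_kv, G, gate_dim)
    k_g = k_blocks @ W_k_gate                                       # (H_kv, num_blocks, gate_dim)

    # Block relevance scores; all queries in a GQA group share one decision,
    # so scores are aggregated (summed here) over the group dimension.
    scores = torch.einsum("hgd,hbd->hgb", q_g, k_g).sum(dim=1)      # (H_kv, num_blocks)

    # Keep only the top-k blocks per KV head; sparse attention runs on these blocks only.
    top_k = min(top_k, num_blocks)
    selected = scores.topk(top_k, dim=-1).indices                   # (H_kv, top_k)
    return selected
```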
2. Integration and Training
SeerAttention-R is designed as a minimally invasive, plug-in module:
- No weight modification of the pretrained model: Only the small gating networks are updated during distillation.
- Post-training self-distillation: The gating module is trained to match the sparsity distribution of the model's own full-attention outputs, typically with a relatively small corpus (0.4B tokens in demonstrated experiments).
- Flexible block sizes: Block sizes of 64 or 128 are used, delivering both accuracy preservation and compute efficiency through hardware-aligned sparsity.
Plug-and-play deployment is facilitated by wrapping existing attention layers with the AttnGate logic. This design is compatible with a wide range of modern open-source LLMs.
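The self-distillation objective can be sketched as follows, assuming the frozen model's dense attention weights are available for a given query position. The max-pooled block target and the KL loss are one plausible instantiation for illustration, not necessarily the exact recipe; only the gate projections receive gradients.

```python
import torch.nn.functional as F

def gate_distillation_loss(gate_scores, dense_attn, block_size=64):
    """Self-distillation objective for the AttnGate (sketch only).

    gate_scores: (num_heads, num_blocks)  raw gate logits for one query position
    dense_attn:  (num_heads, seq_len)     the frozen model's dense attention weights
                                          for the same query position (already softmaxed)
    The dense attention map is pooled blockwise to form a soft target; the gate is
    trained to match it while the base model weights stay frozen.
    """
    num_heads, seq_len = dense_attn.shape
    num_blocks = gate_scores.shape[-1]

    # Blockwise pooling of the dense attention map (max-pooling is one common choice).
    pad = num_blocks * block_size - seq_len
    target = F.pad(dense_attn, (0, pad)).view(num_heads, num_blocks, block_size)
    target = target.amax(dim=-1)
    target = target / target.sum(dim=-1, keepdim=True)   # renormalize to a distribution

    # KL divergence between the gate's predicted block distribution and the pooled target.
    log_pred = F.log_softmax(gate_scores, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```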
3. Sparse Decoding Kernel Implementation
A distinctive contribution of SeerAttention-R is its highly optimized block-sparse decoding kernel, built with TileLang. Key technical features include:
- TileLang Tensor Compiler: The kernel leverages TileLang for hardware-aware scheduling, pipelining, warp specialization, and efficient memory access patterns.
- Dynamic block skipping: Inference traverses only those blocks activated by the gating module, avoiding unnecessary computation.
- K Compression Cache: A blockwise pooled-key cache occupies less than 1% of KV-cache memory, preserving memory savings while sustaining high batch throughput.
- GQA alignment: Grouped block selection and tile-based compute are harmonized for maximum tensor core (e.g., NVIDIA H100) utilization.
Empirical speedups verify near-theoretical scaling: up to 9× faster decoding than FlashAttention-3 at 90% sparsity, with actual throughput tightly tracking the I/O lower bound of the computation.
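The production kernel is written in TileLang and fuses gating, gathering, and attention, but its block-skipping behavior can be illustrated with an unfused PyTorch reference; all names and shapes here are illustrative assumptions:

```python
import torch

def block_sparse_decode(q, k_cache, v_cache, selected_blocks, block_size=64):
    """Reference (unfused) block-sparse attention for one decoding step.

    q:               (num_heads, head_dim)
    k_cache/v_cache: (num_heads, seq_len, head_dim)
    selected_blocks: (num_heads, top_k)  block indices chosen by the gate
    Only the selected KV blocks are read, so compute and memory traffic scale
    with the token budget rather than the full context length.
    """
    num_heads, seq_len, head_dim = k_cache.shape

    # Token indices covered by the selected blocks
    # (out-of-range positions are clamped for simplicity; a real kernel masks them).
    offsets = torch.arange(block_size, device=q.device)
    idx = selected_blocks.unsqueeze(-1) * block_size + offsets      # (H, top_k, B)
    idx = idx.clamp(max=seq_len - 1).reshape(num_heads, -1)         # (H, top_k*B)

    # Gather only the needed KV entries.
    k_sel = torch.gather(k_cache, 1, idx.unsqueeze(-1).expand(-1, -1, head_dim))
    v_sel = torch.gather(v_cache, 1, idx.unsqueeze(-1).expand(-1, -1, head_dim))

    # Standard scaled dot-product attention over the gathered subset.
    scores = torch.einsum("hd,hnd->hn", q, k_sel) / head_dim**0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hn,hnd->hd", probs, v_sel)
```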
4. Evaluation and Accuracy
SeerAttention-R achieves near-lossless accuracy on long-form mathematical reasoning and logic tasks (notably AIME24, AIME25, GPQA-Diamond, and MATH-500 benchmarks) under large block size (64/128) settings. Characteristic results include:
- Minimal gap versus dense attention: When using a 4k token budget at block size 64, accuracy matches or exceeds that of dense-attention models on the hardest tasks; for easier tasks, a 2k budget suffices.
- Robustness to block coarseness: Unlike other sparse or heuristic methods (e.g., Quest), SeerAttention-R's accuracy does not degrade with increased block size.
- Applicability across model scale: Larger models (Qwen3-8B, Qwen3-14B, DeepSeek-R1-Distill-Qwen-14B) maintain or improve robustness under extreme sparsity.
Example accuracy table (Qwen3-8B, AIME24):
| Method | Token Budget | Accuracy (%) | Generation Length (k tokens) |
|---|---|---|---|
| SeerAttention-R | 2k | 56.6 | 19.8 |
| SeerAttention-R | 4k | 72.3 | 16.3 |
| SeerAttention-R | 8k | 75.1 | 15.1 |
| Full attention | n/a | 74.5 | 15.1 |
5. Resource Efficiency and Practical Implications
SeerAttention-R’s efficiency profile is notable at both training and inference time:
- Training overhead: The gate can be distilled in 10–18 GPU hours on a single A100 for a 14B parameter model, using just 0.4B tokens.
- Memory and compute cost: At block size 64, the key compression cache is negligible (<1% of the KV cache; see the estimate after this list), enabling extremely long-sequence decoding (demonstrated up to 128k length).
- In-production scaling: Speedups grow with batch size and sequence length, reaching up to 8.6× the throughput of FlashAttention-3 at batch size 16 and sequence length 32k.
- Deployment: The lightweight design allows retrofitting to any LLM for applications requiring extended, accurate generation, such as mathematical proofs, scientific reasoning, long-context QA, and advanced agentic planning.
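A rough back-of-the-envelope estimate supports the <1% figure, assuming one pooled key per block of 64 keys and a full cache that stores one key and one value per token (the exact cache layout is an assumption):

```python
# Relative size of the blockwise K compression cache (sketch of the arithmetic only).
block_size = 64           # keys pooled into one compressed entry
kv_entries_per_token = 2  # full cache holds one key and one value per token

compression_ratio = 1 / (block_size * kv_entries_per_token)
print(f"compression cache ≈ {compression_ratio:.2%} of the full KV cache")  # ≈ 0.78% < 1%
```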
6. Technical Distinctions and Field Impact
SeerAttention-R diverges from heuristic or pattern-based sparsity approaches in several respects:
- Data-driven, adaptable sparsity: It does not rely on static or manually designed sparsity masks, enabling intrinsic adaptation to model, layer, input, or context.
- No base parameter modification: Base model weights are never altered during adaptation, preserving the original model's quality and simplifying licensing compliance.
- Coarse block granularity compatible with modern hardware: Large block sizes drastically reduce hardware fragmentation, allowing practical high-throughput inference and training efficiencies.
The framework enables scaling reasoning models to longer outputs and sequences with minimal loss in accuracy, addressing a key bottleneck for advanced applications of LLMs in domains requiring extended, step-by-step reasoning.
Summary Table
| Aspect | SeerAttention-R |
|---|---|
| Integration | Plug-in, post-training, no base weight changes |
| Accuracy | Near-lossless, even at large block sizes |
| Speed | Up to 9× faster decoding than FlashAttention-3 (H100, 90% sparsity) |
| Memory | Compression cache <1% of KV-cache size (block size 64) |
| Training | 0.4B tokens, 10–18 A100 GPU-hours |
| Applicability | Long reasoning, mathematical problem-solving, extended planning |
SeerAttention-R represents a hardware- and application-aligned solution for sparse attention in long-sequence LLM reasoning, combining data-driven block selection with intensive kernel-level optimizations. Code and resources are available at https://github.com/microsoft/SeerAttention.