SeerAttention-R: Sparse Attention for LLMs

Updated 30 June 2025
  • SeerAttention-R is a sparse attention framework that adapts LLMs for long-context autoregressive decoding using a data-driven gating mechanism.
  • It employs architectural optimizations like block-sparse operations and a self-distilled gating network to prune unnecessary computations.
  • The framework delivers near-lossless accuracy, up to 9× faster decoding, and efficient scaling on modern hardware.

SeerAttention-R is a sparse attention adaptation framework for long-context autoregressive decoding in LLMs, specifically designed to enable accurate and efficient reasoning across extended sequences. Building on the data-driven attentional sparsity paradigm of the original SeerAttention, SeerAttention-R introduces architectural modifications and hardware-oriented optimizations—most notably, a self-distilled gating module and coarse-grained, plug-in block-sparse operations—enabling near-lossless reasoning accuracy at dramatically reduced computational costs during long sequence generation.

1. Architectural Features

SeerAttention-R retains the core innovation of SeerAttention: attention sparsity learned via a gating network, trained to predict which regions of the attention map are most semantically significant. The architectural adaptations to accommodate long autoregressive decoding include:

  • Elimination of Query Pooling: Unlike SeerAttention, which performs pooling over queries (suitable for prefill or parallel attention), SeerAttention-R removes query pooling to preserve the token-by-token granularity essential to stepwise autoregressive generation.
  • Block-Sparse Attention: The attention mechanism operates over a subset of key-value (KV) blocks, as determined by a gating network, rather than the entire history. This effectively prunes computation and memory proportional to the chosen sparsity level.
  • Self-distilled Gating Network: The gating module is trained post hoc on the model’s own dense attention outputs, requiring only minor additional parameters and no modification of the core LLM weights.
  • Grouped Query Attention (GQA) Integration: For models using GQA, all queries in a group share the same KV head and block selection decision, further reducing fragmentation and better aligning with hardware tiling strategies.

The gating mechanism’s formulation can be summarized as follows:

\begin{align}
\mathbf{Q}_{\text{gate}} &= \mathrm{RoPE}\left( \mathbf{W}_{\text{gate}}^{q} \cdot \operatorname{reshape}\left(\mathbf{Q}_{\text{nope}},\, [\dots,\, g \cdot d]\right) \right) \\
\mathbf{K}_{\text{gate}} &= \mathrm{RoPE}\left( \mathbf{W}_{\text{gate}}^{k} \cdot \operatorname{concat}\left[\operatorname{P}_{\max}(\mathbf{K}_{\text{nope}}),\ \operatorname{P}_{\min}(\mathbf{K}_{\text{nope}}),\ \operatorname{P}_{\text{avg}}(\mathbf{K}_{\text{nope}})\right] \right) \\
\mathbf{S} &= \operatorname{softmax}\left( \frac{\mathbf{Q}_{\text{gate}} \mathbf{K}_{\text{gate}}^{\top}}{\sqrt{d_{\text{gate}}}} \right)
\end{align}

where $\operatorname{P}_{\max}, \operatorname{P}_{\min}, \operatorname{P}_{\text{avg}}$ are blockwise pooling operators, the gate operates on the pre-RoPE queries and keys ($\mathbf{Q}_{\text{nope}}$, $\mathbf{K}_{\text{nope}}$), and RoPE is re-applied within the gate's own projections. Each row of $\mathbf{S}$ scores the KV blocks for the current query, and the top-scoring blocks are retained for sparse attention.
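A minimal, PyTorch-style sketch of this gating step for a single decoding token is given below. It is an illustrative reconstruction under stated assumptions (tensor shapes, a generic `rope` callable, and a hypothetical top-k block-selection helper), not the reference implementation:

```python
# Illustrative sketch (not the official SeerAttention-R code) of the gating
# step for one decoding token of a single GQA group with g query heads.
import torch
import torch.nn.functional as F

def attn_gate_scores(q_nope, k_pooled, w_q, w_k, rope, d_gate):
    """
    q_nope:   [g, d]          pre-RoPE queries of the GQA group at this step
    k_pooled: [n_blocks, 3*d] concat of max/min/avg blockwise pooling of pre-RoPE keys
    w_q:      [g*d, d_gate]   gate query projection (W_gate^q)
    w_k:      [3*d, d_gate]   gate key projection (W_gate^k)
    rope:     callable applying rotary position embedding
    returns:  [n_blocks]      relevance score per KV block
    """
    q_gate = rope(q_nope.reshape(1, -1) @ w_q)      # [1, d_gate]
    k_gate = rope(k_pooled @ w_k)                   # [n_blocks, d_gate]
    scores = (q_gate @ k_gate.T) / (d_gate ** 0.5)  # [1, n_blocks]
    return F.softmax(scores, dim=-1).squeeze(0)

def select_blocks(scores, token_budget, block_size):
    # Keep the highest-scoring KV blocks within the token budget; all query
    # heads in the GQA group share this single selection.
    k = max(1, min(token_budget // block_size, scores.numel()))
    return torch.topk(scores, k).indices
```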

2. Integration and Training

SeerAttention-R is designed as a minimally invasive, plug-in module:

  • No weight modification of the pretrained model: Only the small gating networks are updated during distillation.
  • Post-training self-distillation: The gating module is trained to match the sparsity distribution of the model's own full-attention outputs, typically with a relatively small corpus (0.4B tokens in demonstrated experiments).
  • Flexible block sizes: Block sizes of 64 or 128 are used, delivering both accuracy preservation and compute efficiency through hardware-aligned sparsity.

Plug-and-play deployment is facilitated by wrapping existing attention layers with the AttnGate logic. This design is compatible with a wide range of modern open-source LLMs.
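As a rough sketch of what this post-training self-distillation can look like, the snippet below matches the gate's block distribution to the model's own dense attention, pooled over KV blocks. The KL-divergence loss, tensor names, and per-query formulation are assumptions for illustration, not the verbatim training recipe:

```python
# Hedged sketch of a self-distillation objective for the gate: only the gate
# parameters receive gradients; the base model's weights stay frozen.
import torch
import torch.nn.functional as F

def gate_distillation_loss(dense_attn, gate_scores, block_size):
    """
    dense_attn:  [seq_len]  dense attention probabilities for one query
                            (produced by the frozen base model)
    gate_scores: [n_blocks] gate softmax output for the same query
    """
    n_blocks = gate_scores.numel()
    # Blockwise-sum the dense attention map to form the target distribution.
    pad = n_blocks * block_size - dense_attn.numel()
    target = F.pad(dense_attn, (0, pad)).reshape(n_blocks, block_size).sum(-1)
    target = target / target.sum().clamp_min(1e-8)
    # KL divergence between the gate's prediction and pooled dense attention.
    return F.kl_div(gate_scores.clamp_min(1e-8).log(), target, reduction="sum")
```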

3. Sparse Decoding Kernel Implementation

A distinctive contribution of SeerAttention-R is its highly optimized block-sparse decoding kernel, built with TileLang. Key technical features include:

  • TileLang Tensor Compiler: The kernel leverages TileLang for optimal hardware scheduling, tensor pipelining, warp specialization, and efficient memory access patterns.
  • Dynamic block skipping: Inference traverses only those blocks activated by the gating module, avoiding unnecessary computation.
  • K Compression Cache: A blockwise-pooled key cache occupies less than 1% of KV-cache memory, preserving memory savings while sustaining high batch throughput.
  • GQA alignment: Grouped block selection and tile-based compute are aligned to maximize tensor core utilization (e.g., on NVIDIA H100).

Empirical speedups verify near-theoretical scaling: up to 9× faster decoding than FlashAttention-3 at 90% sparsity, with actual throughput tightly tracking the I/O lower bound of the computation.
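The kernel itself is written in TileLang, but its behavior can be conveyed by a reference-level PyTorch loop that attends only to the gate-activated blocks; the shapes and the gather-based formulation below are illustrative assumptions rather than the actual kernel:

```python
# Reference sketch of block-sparse decoding for one query token: only KV
# blocks selected by the gate are loaded and attended to, so skipped blocks
# cost neither compute nor memory traffic. The real kernel fuses these steps.
import torch

def sparse_decode(q, k_cache, v_cache, active_blocks, block_size):
    """
    q:             [n_heads, d]             query of the current decoding step
    k_cache:       [seq_len, n_kv_heads, d] cached keys
    v_cache:       [seq_len, n_kv_heads, d] cached values
    active_blocks: [n_active]               KV-block indices chosen by the gate
    """
    # Gather only the tokens belonging to the selected blocks.
    idx = (active_blocks[:, None] * block_size
           + torch.arange(block_size, device=active_blocks.device)).reshape(-1)
    idx = idx[idx < k_cache.shape[0]]
    k, v = k_cache[idx], v_cache[idx]                 # [n_sel, n_kv_heads, d]
    # GQA: each KV head serves a whole group of query heads.
    group = q.shape[0] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)             # [n_sel, n_heads, d]
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("hd,shd->hs", q, k) / q.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", probs, v)       # [n_heads, d]
```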

4. Evaluation and Accuracy

SeerAttention-R achieves near-lossless accuracy on long-form mathematical reasoning and logic tasks (notably AIME24, AIME25, GPQA-Diamond, and MATH-500 benchmarks) under large block size (64/128) settings. Characteristic results include:

  • Minimal gap versus dense attention: With a 4k token budget at block size 64, accuracy matches or exceeds dense-attention baselines on the hardest tasks; for easier tasks, a 2k budget suffices.
  • Robustness to block coarseness: Unlike other sparse or heuristic methods (e.g., Quest), SeerAttention-R's accuracy does not degrade with increased block size.
  • Applicability across model scale: Larger models (Qwen3-8B, Qwen3-14B, DeepSeek-R1-Distill-Qwen-14B) maintain or improve robustness under extreme sparsity.

Example accuracy table (Qwen3-8B, AIME24):

| Method | Token Budget | Accuracy | Generation Length (k tokens) |
|---|---|---|---|
| SeerAttention-R | 2k | 56.6 | 19.8 |
| SeerAttention-R | 4k | 72.3 | 16.3 |
| SeerAttention-R | 8k | 75.1 | 15.1 |
| Full attention | n/a | 74.5 | 15.1 |

5. Resource Efficiency and Practical Implications

SeerAttention-R is efficient at both training and inference time:

  • Training overhead: The gate can be distilled in 10–18 GPU hours on a single A100 for a 14B parameter model, using just 0.4B tokens.
  • Memory and compute cost: At block size 64, the key compression cache is negligible (<1% of the KV cache), enabling extremely long-sequence decoding (demonstrated up to 128k length); a back-of-the-envelope estimate follows this list.
  • In-production scaling: Speedups grow with batch size and sequence length, reaching up to 8.6× the throughput of FlashAttention-3 at batch size 16 and sequence length 32k.
  • Deployment: The lightweight design allows retrofitting to any LLM for applications requiring extended, accurate generation, such as mathematical proofs, scientific reasoning, long-context QA, and advanced agentic planning.
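As a back-of-the-envelope check on the cache figure, assume the compression cache stores one pooled key vector of roughly the head dimension $d$ per block of $B = 64$ tokens, while the dense KV cache stores $B$ key and $B$ value vectors of dimension $d$ per block; the relative overhead is then

$$\frac{d}{2 B d} = \frac{1}{2 \cdot 64} \approx 0.78\% < 1\%,$$

consistent with the reported sub-1% footprint at block size 64.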

6. Technical Distinctions and Field Impact

SeerAttention-R’s approach diverges from heuristic or pattern-based sparsity by:

  • Data-driven, adaptable sparsity: It does not rely on static or manually designed sparsity masks, enabling intrinsic adaptation to model, layer, input, or context.
  • No base parameter modification: Base model weights are never altered in the adaptation process, supporting robust compliance with original model quality and licensing.
  • Coarse block granularity compatible with modern hardware: Large block sizes drastically reduce hardware fragmentation, allowing practical high-throughput inference and training efficiencies.

The framework enables scaling reasoning models to longer outputs and sequences with minimal loss in accuracy, addressing a key bottleneck for advanced applications of LLMs in domains requiring extended, step-by-step reasoning.

Summary Table

| Aspect | SeerAttention-R |
|---|---|
| Integration | Plug-in, post-training, no base weight changes |
| Accuracy | Near-lossless, even at large block sizes |
| Speed | Up to 9× faster than dense (FA3, H100, 90% sparsity) |
| Memory | Compression cache <1% of KV size (block size 64) |
| Training | 0.4B tokens, 10–18 A100 GPU-hours |
| Applicability | Long reasoning, mathematical problem-solving, extended planning |

SeerAttention-R represents a hardware- and application-aligned solution for sparse attention in long-sequence LLM reasoning, combining data-driven block selection with intensive kernel-level optimizations. Code and resources are available at https://github.com/microsoft/SeerAttention.