
SeerAttention-R: Sparse Attention for LLMs

Updated 30 June 2025
  • SeerAttention-R is a sparse attention framework that adapts LLMs for long-context autoregressive decoding using a data-driven gating mechanism.
  • It employs architectural optimizations like block-sparse operations and a self-distilled gating network to prune unnecessary computations.
  • The framework delivers near-lossless accuracy, up to 9× faster decoding, and efficient scaling on modern hardware.

SeerAttention-R is a sparse attention adaptation framework for long-context autoregressive decoding in LLMs, specifically designed to enable accurate and efficient reasoning across extended sequences. Building on the data-driven attentional sparsity paradigm of the original SeerAttention, SeerAttention-R introduces architectural modifications and hardware-oriented optimizations—most notably, a self-distilled gating module and coarse-grained, plug-in block-sparse operations—enabling near-lossless reasoning accuracy at dramatically reduced computational costs during long sequence generation.

1. Architectural Features

SeerAttention-R retains the core innovation of SeerAttention: attention sparsity learned via a gating network, trained to predict which regions of the attention map are most semantically significant. The architectural adaptations to accommodate long autoregressive decoding include:

  • Elimination of Query Pooling: Unlike SeerAttention, which pools over queries (suitable for prefill or parallel attention), SeerAttention-R removes query pooling to preserve the token-by-token granularity essential for stepwise autoregressive generation.
  • Block-Sparse Attention: The attention mechanism operates over a subset of key-value (KV) blocks, as determined by a gating network, rather than the entire history. This effectively prunes computation and memory proportional to the chosen sparsity level.
  • Self-distilled Gating Network: The gating module is trained post hoc on the model’s own dense attention outputs, requiring only minor additional parameters and no modification of the core LLM weights.
  • Grouped Query Attention (GQA) Integration: For models using GQA, all queries in a group share the same KV head and block selection decision, further reducing fragmentation and better aligning with hardware tiling strategies.

The gating mechanism’s formulation can be summarized as follows:

\begin{align}
\mathbf{Q}_{\text{gate}} &= \mathrm{RoPE}\left( \mathbf{W}_{\text{gate}}^{q} \cdot \operatorname{reshape}\left(\mathbf{Q}_{\text{nope}},\ [\dots,\ g \cdot d]\right) \right) \\
\mathbf{K}_{\text{gate}} &= \mathrm{RoPE}\left( \mathbf{W}_{\text{gate}}^{k} \cdot \operatorname{concat}\left[\operatorname{P}_{\max}(\mathbf{K}_{\text{nope}}),\ \operatorname{P}_{\min}(\mathbf{K}_{\text{nope}}),\ \operatorname{P}_{\text{avg}}(\mathbf{K}_{\text{nope}})\right] \right) \\
\mathbf{S} &= \operatorname{softmax}\left( \frac{\mathbf{Q}_{\text{gate}} \mathbf{K}_{\text{gate}}^{\top}}{\sqrt{d_{\text{gate}}}} \right)
\end{align}

where $\operatorname{P}_{\max}$, $\operatorname{P}_{\min}$, $\operatorname{P}_{\text{avg}}$ are blockwise pooling operators, $\mathbf{Q}_{\text{nope}}$ and $\mathbf{K}_{\text{nope}}$ are the query and key representations before positional encoding, $g$ is the GQA group size, and RoPE is re-applied to the gate projections.
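The sketch below illustrates this gating computation in plain PyTorch. Tensor shapes, parameter names (`w_q_gate`, `w_k_gate`), and the `rope` callable are illustrative assumptions; the actual SeerAttention-R implementation fuses these steps into optimized kernels.

```python
import torch
import torch.nn.functional as F


def block_pool(k_nope: torch.Tensor, block_size: int) -> torch.Tensor:
    """Blockwise max/min/avg pooling of pre-RoPE keys.

    k_nope: [batch, kv_heads, seq_len, d], seq_len a multiple of block_size.
    Returns [batch, kv_heads, num_blocks, 3 * d] (max, min, avg concatenated).
    """
    b, h, s, d = k_nope.shape
    blocks = k_nope.view(b, h, s // block_size, block_size, d)
    return torch.cat(
        [blocks.max(dim=3).values, blocks.min(dim=3).values, blocks.mean(dim=3)],
        dim=-1,
    )


def gate_scores(q_nope, k_nope, w_q_gate, w_k_gate, rope, block_size, d_gate):
    """Block-level relevance scores S for one decoding step.

    q_nope:    [batch, q_heads, 1, d]  current query, before RoPE.
    w_q_gate:  [g * d, d_gate], w_k_gate: [3 * d, d_gate]  gate projections.
    rope:      callable applying rotary position embeddings to its argument.
    Queries within a GQA group are concatenated (g * d), so the whole group
    shares one block-selection decision per KV head.
    """
    b, qh, _, d = q_nope.shape
    kv_heads = k_nope.shape[1]
    g = qh // kv_heads
    q_grouped = q_nope.view(b, kv_heads, 1, g * d)
    q_gate = rope(q_grouped @ w_q_gate)                       # [b, kv_heads, 1, d_gate]
    k_gate = rope(block_pool(k_nope, block_size) @ w_k_gate)  # [b, kv_heads, n_blocks, d_gate]
    return F.softmax(q_gate @ k_gate.transpose(-1, -2) / d_gate ** 0.5, dim=-1)
```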

2. Integration and Training

SeerAttention-R is designed as a minimally invasive, plug-in module:

  • No weight modification of the pretrained model: Only the small gating networks are updated during distillation.
  • Post-training self-distillation: The gating module is trained to match the block-level sparsity distribution of the model's own full-attention outputs, typically on a relatively small corpus (0.4B tokens in the reported experiments); a hedged sketch of one such objective appears after this list.
  • Flexible block sizes: Block sizes of 64 or 128 are used, delivering both accuracy preservation and compute efficiency through hardware-aligned sparsity.
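The exact distillation objective is not spelled out above. One natural formulation, sketched below as an assumption rather than the verbatim recipe, treats the block-summed dense attention of the frozen model as the target distribution and minimizes a KL divergence against the gate's block scores; names such as `distill_loss`, `gate_probs`, and `dense_attn` are illustrative.

```python
import torch
import torch.nn.functional as F


def distill_loss(gate_probs: torch.Tensor,
                 dense_attn: torch.Tensor,
                 block_size: int) -> torch.Tensor:
    """KL(gate || block-pooled dense attention) for one query position.

    gate_probs: [batch, kv_heads, 1, num_blocks]  softmax output of the gate.
    dense_attn: [batch, kv_heads, 1, seq_len]     dense attention weights of
                the frozen model for the same query (self-distillation target).
    Only the gate parameters receive gradients; the base model stays frozen.
    """
    b, h, one, s = dense_attn.shape
    target = dense_attn.view(b, h, one, s // block_size, block_size).sum(-1)
    target = target / target.sum(-1, keepdim=True).clamp_min(1e-8)
    return F.kl_div(gate_probs.clamp_min(1e-8).log(), target, reduction="batchmean")
```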

Plug-and-play deployment is facilitated by wrapping existing attention layers with the AttnGate logic. This design is compatible with a wide range of modern open-source LLMs.
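To make the plug-in pattern concrete, here is a hedged sketch of how an existing attention layer might be wrapped. The `gate` module stands in for the AttnGate logic, and the `block_mask` argument on the wrapped layer is an illustrative interface, not the library's actual API.

```python
import torch.nn as nn


class GatedAttention(nn.Module):
    """Wraps a frozen attention layer with a lightweight block-selection gate."""

    def __init__(self, base_attn: nn.Module, gate: nn.Module,
                 block_size: int = 64, token_budget: int = 4096):
        super().__init__()
        self.base_attn = base_attn               # pretrained weights, never updated
        self.gate = gate                         # small trainable gating network
        self.num_blocks = token_budget // block_size

    def forward(self, hidden_states, kv_cache):
        # 1. Score KV blocks for the current decode step.
        scores = self.gate(hidden_states, kv_cache)          # [..., num_kv_blocks]
        # 2. Keep only the top-k blocks within the token budget.
        k = min(self.num_blocks, scores.shape[-1])
        top_blocks = scores.topk(k, dim=-1).indices
        # 3. Run the original attention restricted to the selected blocks
        #    (hypothetical block_mask interface, for illustration only).
        return self.base_attn(hidden_states, kv_cache, block_mask=top_blocks)
```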

3. Sparse Decoding Kernel Implementation

A distinctive contribution of SeerAttention-R is its highly optimized block-sparse decoding kernel, built with TileLang. Key technical features include:

  • TileLang Tensor Compiler: The kernel leverages TileLang for optimal hardware scheduling, tensor pipelining, warp specialization, and efficient memory access patterns.
  • Dynamic block skipping: Inference traverses only those blocks activated by the gating module, avoiding unnecessary computation (a plain-PyTorch reference sketch of this idea follows this list).
  • K compression cache: The blockwise pooled-key cache occupies less than 1% of KV-cache memory, keeping the gate's overhead negligible while sustaining high batch throughput.
  • GQA alignment: Grouped block selection and tile-based compute are co-designed to maximize tensor core utilization (e.g., on NVIDIA H100).
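TileLang kernel code is beyond the scope of this overview, but the functional sketch below shows the dynamic-block-skipping idea in plain PyTorch: gather only the gated KV blocks and run dense attention over them. It is a reference illustration, not the optimized kernel, and the helper name is hypothetical.

```python
import torch


def sparse_block_decode(q, k_cache, v_cache, block_indices, block_size=64, scale=None):
    """One decode step attending only to the KV blocks kept by the gate.

    q:                 [batch, heads, 1, d]
    k_cache, v_cache:  [batch, heads, seq_len, d], seq_len a multiple of block_size
    block_indices:     [batch, heads, k]  indices of blocks selected by the gate.
    """
    b, h, s, d = k_cache.shape
    scale = scale if scale is not None else d ** -0.5
    k_blk = k_cache.view(b, h, s // block_size, block_size, d)
    v_blk = v_cache.view(b, h, s // block_size, block_size, d)
    idx = block_indices[..., None, None].expand(-1, -1, -1, block_size, d)
    k_sel = torch.gather(k_blk, 2, idx).flatten(2, 3)   # [b, h, k * block_size, d]
    v_sel = torch.gather(v_blk, 2, idx).flatten(2, 3)
    attn = torch.softmax((q @ k_sel.transpose(-1, -2)) * scale, dim=-1)
    return attn @ v_sel                                  # [b, h, 1, d]
```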

Empirical speedups verify near-theoretical scaling: up to 9× faster decoding than FlashAttention-3 at 90% sparsity, with actual throughput tightly tracking the I/O lower bound of the computation.
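As a back-of-the-envelope consistency check (our arithmetic, not a figure from the paper): a memory-bound decode step that reads only a fraction $1-s$ of the KV blocks has an idealized speedup bound of $1/(1-s)$, so 90% sparsity caps the gain at roughly 10×, and the measured 9× sits close to that bound.

\[
\text{speedup}_{\max} = \frac{1}{1 - s}, \qquad s = 0.9 \;\Rightarrow\; \text{speedup}_{\max} = \frac{1}{0.1} = 10\times
\]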

4. Evaluation and Accuracy

SeerAttention-R achieves near-lossless accuracy on long-form mathematical reasoning and logic tasks (notably AIME24, AIME25, GPQA-Diamond, and MATH-500 benchmarks) under large block size (64/128) settings. Characteristic results include:

  • Minimal gap versus dense attention: With a 4k-token budget at block size 64, accuracy matches or exceeds the dense-attention baseline on the hardest tasks; for easier tasks, a 2k budget suffices (see the quick arithmetic after this list for what such budgets mean in practice).
  • Robustness to block coarseness: Unlike other sparse or heuristic methods (e.g., Quest), SeerAttention-R's accuracy does not degrade with increased block size.
  • Applicability across model scale: Larger models (Qwen3-8B, Qwen3-14B, DeepSeek-R1-Distill-Qwen-14B) maintain or improve robustness under extreme sparsity.
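To put these token budgets in perspective, a quick calculation (ours, using only numbers quoted elsewhere in this article: block size 64, a 4k-token budget, and the demonstrated 128k context):

```python
# Quick arithmetic: fraction of KV blocks visited per decode step with a
# 4k-token budget, block size 64, and a 128k-token context.
block_size = 64
token_budget = 4 * 1024
context_len = 128 * 1024

blocks_kept = token_budget // block_size    # 64 blocks retained per step
blocks_total = context_len // block_size    # 2048 blocks in the cache
print(blocks_kept / blocks_total)           # 0.03125 -> ~97% of blocks skipped
```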

Example accuracy table (Qwen3-8B, AIME24):

| Method | Token Budget | Accuracy (%) | Generation Length (k tokens) |
|---|---|---|---|
| SeerAttention-R | 2k | 56.6 | 19.8 |
| SeerAttention-R | 4k | 72.3 | 16.3 |
| SeerAttention-R | 8k | 75.1 | 15.1 |
| Full attention | n/a | 74.5 | 15.1 |

5. Resource Efficiency and Practical Implications

SeerAttention-R’s efficiency profile is salient at both training and inference:

  • Training overhead: The gate for a 14B-parameter model can be distilled in 10–18 GPU hours on a single A100, using just 0.4B tokens.
  • Memory and compute cost: At block size 64, the key-compression cache is negligible (<1% of the KV cache), enabling extremely long-sequence decoding (demonstrated up to 128k tokens); a rough sanity check of this figure appears after this list.
  • In-production scaling: Speedups grow proportionally with batch size and sequence length, reaching up to 8.6× the throughput of FlashAttention-3 at batch size 16 and a 32k sequence length.
  • Deployment: The lightweight design allows retrofitting to any LLM for applications requiring extended, accurate generation, such as mathematical proofs, scientific reasoning, long-context QA, and advanced agentic planning.
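The sub-1% figure is easy to sanity-check. The helper below is our estimate: it assumes the gate caches one compressed key vector per block per KV head, and the head dimension of 128 and gate dimension of 128 are illustrative assumptions, not values quoted in this article.

```python
def compression_cache_ratio(d_head: int = 128, d_gate: int = 128,
                            block_size: int = 64) -> float:
    """Rough ratio of the blockwise key-compression cache to the full KV cache.

    Assumptions (ours): the gate caches one d_gate vector per block per KV head,
    while the KV cache stores K and V (2 * d_head) per token per KV head.
    """
    kv_elems_per_block = block_size * 2 * d_head   # full K + V for one block
    gate_elems_per_block = d_gate                  # one compressed key per block
    return gate_elems_per_block / kv_elems_per_block


print(f"{compression_cache_ratio():.2%}")  # 0.78% with the assumed dimensions
```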

6. Technical Distinctions and Field Impact

SeerAttention-R’s approach diverges from heuristic or pattern-based sparsity by:

  • Data-driven, adaptable sparsity: It does not rely on static or manually designed sparsity masks, enabling intrinsic adaptation to model, layer, input, or context.
  • No base parameter modification: Base model weights are never altered in the adaptation process, supporting robust compliance with original model quality and licensing.
  • Coarse block granularity compatible with modern hardware: Large block sizes drastically reduce hardware fragmentation, allowing practical high-throughput inference and training efficiencies.

The framework enables scaling reasoning models to longer outputs and sequences with minimal loss in accuracy, addressing a key bottleneck for advanced applications of LLMs in domains requiring extended, step-by-step reasoning.

Summary Table

| Aspect | SeerAttention-R |
|---|---|
| Integration | Plug-in, post-training, no base weight changes |
| Accuracy | Near-lossless, even at large block sizes |
| Speed | Up to 9× faster than dense (FlashAttention-3 baseline, H100, 90% sparsity) |
| Memory | Compression cache <1% of KV cache size (block size 64) |
| Training | 0.4B tokens, 10–18 A100 GPU hours |
| Applicability | Long reasoning, mathematical problem-solving, extended planning |

SeerAttention-R represents a hardware- and application-aligned solution for sparse attention in long-sequence LLM reasoning, combining data-driven block selection with intensive kernel-level optimizations. Code and resources are available at https://github.com/microsoft/SeerAttention.
