Rectified Sparse Attention (2506.04108v2)

Published 4 Jun 2025 in cs.CL

Abstract: Efficient long-sequence generation is a critical challenge for LLMs. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42$\times$ end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.

Summary

  • The paper introduces ReSA, a method combining sparse attention with periodic dense rectification to minimize error accumulation in long-sequence generation.
  • It employs a block-sparse attention mechanism paired with periodic KV cache refreshing to maintain high fidelity to the pretraining distribution.
  • Experimental results show up to a 2.42× speedup while preserving generation quality, marking a significant efficiency improvement for LLMs.

Rectified Sparse Attention for Efficient Long-Sequence Generation

The paper "Rectified Sparse Attention" introduces a novel approach for improving the efficiency of long-sequence generation in LLMs. This approach, termed Rectified Sparse Attention (ReSA), effectively addresses the challenges associated with long-context inference by blending sparse attention mechanisms with periodic dense rectification, thus mitigating error accumulation and ensuring the robustness of generative outputs.

Overview

ReSA targets one of the critical challenges in the landscape of LLMs: the efficient processing and generation of long sequences, where memory-intensive attention operations become the computational bottleneck. With dense attention, each decoded token must attend to the entire cached history during autoregressive decoding, so cost grows with context length and scales poorly at long contexts. Sparse attention mechanisms alleviate this pressure by spending compute selectively on a subset of the sequence context.

Methodology

The core innovation of ReSA is the periodic correction of errors that accumulate under sparse approximation. The method employs a block-sparse attention framework that reduces the per-step computational load, supplemented at fixed intervals by a dense processing step that rectifies approximations in the key-value (KV) cache, maintaining alignment with the pretraining distribution.

Group Block Sparse Attention: This mechanism is used during the sparse decoding phases, where the relevant context blocks are dynamically selected based on the current query. Per-step compute and memory traffic stay low while the dependencies needed for accurate decoding are still captured.
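The block-selection step can be sketched concretely. The following is a minimal illustration in PyTorch, assuming mean-pooled block summaries as the relevance score and an independent top-k choice per head; the names (`select_blocks`, `block_size`, `top_k_blocks`) and the pooling choice are illustrative rather than taken from the released implementation.

```python
# Minimal sketch of query-aware KV block selection for sparse decoding.
# Assumptions (not from the paper's code): mean-pooled block summaries as the
# relevance proxy, and top-k block selection per attention head.
import torch

def select_blocks(query, key_cache, block_size=64, top_k_blocks=8):
    """Score cached KV blocks against the current query and keep the top-k.

    query:     (num_heads, head_dim) projection of the token being decoded.
    key_cache: (seq_len, num_heads, head_dim) cached keys.
    Returns:   (num_heads, top_k) indices of the selected blocks per head.
    """
    seq_len, num_heads, head_dim = key_cache.shape
    num_blocks = seq_len // block_size
    # Summarize each block by mean-pooling its keys.
    block_keys = key_cache[: num_blocks * block_size].reshape(
        num_blocks, block_size, num_heads, head_dim
    ).mean(dim=1)                                            # (num_blocks, num_heads, head_dim)
    # Block-level relevance: dot product between the query and each block summary.
    scores = torch.einsum("hd,bhd->hb", query, block_keys)   # (num_heads, num_blocks)
    top_k = min(top_k_blocks, num_blocks)
    return scores.topk(top_k, dim=-1).indices
```

Attention for the current token is then computed only over the selected blocks rather than over the full cache.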

Dense Rectification: The KV cache, holding the precomputed keys and values from past context, is refreshed at fixed intervals using a dense forward pass that recalculates and updates the cached entries. This bounds the accumulation of approximation error and regularly corrects any drift from the pretraining distribution.
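The interplay of the two components amounts to a simple decode loop. The sketch below shows only the control flow; `dense_prefill`, `sparse_decode_step`, and `dense_rectify` are hypothetical placeholders for model-specific kernels, and `rectify_interval=32` is an arbitrary illustrative value rather than the paper's setting.

```python
# Control-flow sketch of sparse decoding with periodic dense rectification.
# The helper functions are hypothetical placeholders, not APIs from the released code.
def generate(model, prompt_ids, max_new_tokens, rectify_interval=32):
    kv_cache = dense_prefill(model, prompt_ids)  # exact KV cache for the prompt
    tokens = list(prompt_ids)
    for step in range(max_new_tokens):
        # Cheap step: attend only to the selected KV blocks.
        next_token, kv_cache = sparse_decode_step(model, tokens, kv_cache)
        tokens.append(next_token)
        if (step + 1) % rectify_interval == 0:
            # Refresh the KV entries written during the last sparse interval with
            # a dense forward pass, bounding error accumulation across sparse steps.
            kv_cache = dense_rectify(model, tokens, kv_cache,
                                     num_recent=rectify_interval)
    return tokens
```

Because rectification runs only once every `rectify_interval` tokens, its dense cost is amortized over many cheap sparse steps.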

Experimental Evaluation

In comparative evaluations involving mathematical reasoning, language modeling, and retrieval tasks, ReSA is shown to achieve near-lossless generation quality while significantly boosting inference efficiency. Notable results include up to a 2.42× end-to-end speedup when decoding at a sequence length of 256K tokens, demonstrating its utility for scaling LLMs. The results indicate marked improvements over purely sparse decoding methods, which suffer from cumulative approximation errors over long decoding sequences.

Implications and Future Prospects

The implications of ReSA are substantial, as they suggest a practical path forward for deploying LLMs in environments where context length scalability is paramount. By addressing the alignment issues between training and inference phases, ReSA contributes to more reliable and efficient model deployments.

Looking forward, ReSA could be adapted and expanded upon by integrating it with various sparse computation strategies to further reduce computational load without sacrificing performance. The periodic rectification concept might also inspire similar strategies in other domains where approximation errors hinder model performance over long sessions or sequences.

ReSA represents a promising technique in the ongoing effort to optimize LLM efficiency, with potential applications ranging from real-time data processing to extensive text generation tasks, providing a robust framework for long-context inference in LLMs.
