Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers (2406.16747v1)

Published 24 Jun 2024 in cs.CL and cs.LG

Abstract: Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
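The core idea described in the abstract is that a small scoring network ranks KV pairs and a top-k operator keeps only a fixed budget of them, so attention cost scales with the budget rather than the full sequence length. The PyTorch sketch below is a simplified illustration of that idea, not the paper's implementation: it uses a hard (non-differentiable) top-k in place of the differentiable SPARSEK operator, selects one KV budget per sequence rather than per query, and all names (`topk_sparse_attention`, `score_net`, `num_kept`) are assumptions for illustration.

```python
# Minimal sketch of top-k sparse attention in the spirit of SPARSEK Attention.
# Simplifications vs. the paper: hard top-k instead of the differentiable
# SPARSEK mask, non-causal attention, and a shared KV budget per sequence.
import torch


def topk_sparse_attention(q, k, v, score_net, num_kept):
    """q, k, v: (batch, seq_len, dim). Keeps num_kept KV pairs per sequence."""
    # Scoring network assigns one importance score per key/value position.
    scores = score_net(k).squeeze(-1)                     # (batch, seq_len)
    topk = scores.topk(num_kept, dim=-1).indices          # (batch, num_kept)

    # Gather the selected KV pairs so attention costs O(seq_len * num_kept)
    # instead of O(seq_len^2).
    idx = topk.unsqueeze(-1).expand(-1, -1, k.size(-1))
    k_sel = k.gather(1, idx)                              # (batch, num_kept, dim)
    v_sel = v.gather(1, idx)

    attn = torch.softmax(q @ k_sel.transpose(-1, -2) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v_sel                                   # (batch, seq_len, dim)


if __name__ == "__main__":
    batch, seq_len, dim, num_kept = 2, 128, 64, 16
    q = torch.randn(batch, seq_len, dim)
    kv = torch.randn(batch, seq_len, dim)
    score_net = torch.nn.Linear(dim, 1)                   # toy scoring network
    out = topk_sparse_attention(q, kv, kv, score_net, num_kept)
    print(out.shape)                                      # torch.Size([2, 128, 64])
```

Because only `num_kept` KV pairs are retained regardless of sequence length, the memory held during generation stays constant, which is the property the paper highlights; the differentiable SPARSEK operator additionally lets gradients flow through the selection during training.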

Authors (4)
  1. Chao Lou (8 papers)
  2. Zixia Jia (15 papers)
  3. Zilong Zheng (63 papers)
  4. Kewei Tu (74 papers)
Citations (7)