The paper presents a novel design for sparse attention, named PowerAttention, which systematically addresses the efficiency bottleneck of processing long contexts in LLMs by enabling an exponentially growing receptive field. The authors frame sparse attention as the problem of choosing an optimal edge set in a directed acyclic graph (DAG) corresponding to the attention mask, thereby formalizing how token information flows across layers. Instead of relying on heuristic-driven patterns such as sliding windows, dilated windows, or strided approaches, PowerAttention places connections at power-of-2 distances so that, in a model with d layers, each output token can theoretically attend to up to 2^d input tokens. This exponential expansion contrasts with the linear or sub-quadratic growth of existing methods.
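As a concrete, unofficial illustration of this pattern, the sketch below builds a boolean attention mask in which each query position attends to itself and to every earlier position at a power-of-2 gap. The helper name power_of_two_mask is mine, and the published pattern may include further details (e.g., a local window) that are omitted here.

```python
import torch

def power_of_two_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where mask[i, j] is True iff query i may attend to key j,
    i.e. the gap i - j is 0 or an exact power of 2 (1, 2, 4, 8, ...)."""
    idx = torch.arange(seq_len)
    gap = idx.unsqueeze(1) - idx.unsqueeze(0)        # gap[i, j] = i - j
    is_pow2 = (gap > 0) & ((gap & (gap - 1)) == 0)   # positive and an exact power of 2
    return is_pow2 | (gap == 0)                      # keep self-attention

mask = power_of_two_mask(16)
print(mask[8].nonzero().flatten().tolist())   # [0, 4, 6, 7, 8]: gaps 8, 4, 2, 1, 0
```

Each row of this mask has only O(log N) True entries, which is the out-degree bound discussed below.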
Key Contributions and Theoretical Insights:
- Receptive Field Analysis Framework:
The authors introduce a framework for analyzing sparse attention in terms of the effective receptive field. By modeling the attention structure as a DAG whose nodes are tokens and whose edges are attention links, they show that the propagation depth (i.e., the minimum number of layers needed for one token to influence another) determines the effective context length. In particular, they demonstrate that a token's effective receptive field can be rigorously quantified by shortest-path lengths in this DAG (a reachability sketch illustrating the idea appears after this list).
- Exponential Receptive Field Growth:
PowerAttention is designed so that every token attends only to tokens at distances that are powers of 2. The resulting structure guarantees that the number of tokens any given token can indirectly access grows exponentially with the number of layers, while each token's out-degree remains O(log N) for a sequence of length N. A formal theorem proves that, when connections are restricted to these power-of-2 gaps, the distance between any two tokens in the DAG is bounded by O(log N): for a positional difference d, the number of hops required is at most the number of ones in the binary representation of d, which is itself at most log₂ N (a small hop-count check appears after this list).
- Static Sparse Attention Design:
Unlike dynamic sparse attention methods that modify the attention pattern during inference, PowerAttention is a static scheme. It imposes no additional computational overhead compared to sliding window methods because the pattern can be pre-computed and executed efficiently. The paper also provides pseudo-code for PowerAttention, highlighting its ease of implementation and direct compatibility with existing libraries for sparse attention computation.
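Because the pattern is static, plugging it into a stock attention call looks roughly like the following sketch (mine, not the paper's pseudo-code); the dense boolean mask is used only for clarity, whereas an efficient deployment would feed the same pattern to a block-sparse attention kernel.

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 4096, 64
q, k, v = (torch.randn(batch, heads, seq_len, head_dim) for _ in range(3))

# Static power-of-2 mask: computed once, reused for every layer and forward pass.
idx = torch.arange(seq_len)
gap = idx[:, None] - idx[None, :]                              # gap[i, j] = i - j
mask = (gap == 0) | ((gap > 0) & ((gap & (gap - 1)) == 0))     # True = may attend

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # mask broadcasts over batch/heads
print(out.shape)                                               # torch.Size([1, 8, 4096, 64])
```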
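The reachability sketch referenced in the first contribution: starting from a query position and repeatedly following attention edges backwards corresponds to shortest paths in the DAG, and with the power-of-2 pattern the contiguous prefix covered by the receptive field doubles with every additional layer. This is my own illustration under the same assumed mask as above.

```python
import torch

def power_of_two_mask(seq_len: int) -> torch.Tensor:
    # Power-of-2 pattern, same construction as in the sketches above.
    idx = torch.arange(seq_len)
    gap = idx[:, None] - idx[None, :]
    return (gap == 0) | ((gap > 0) & ((gap & (gap - 1)) == 0))

def reachable_after(mask: torch.Tensor, query: int, num_layers: int) -> torch.Tensor:
    """Positions whose information can reach `query` within `num_layers` layers,
    assuming the same static mask is applied at every layer."""
    reachable = torch.zeros(mask.shape[0], dtype=torch.bool)
    reachable[query] = True
    for _ in range(num_layers):
        # One more layer: anything attended to by an already-reachable position.
        reachable |= mask[reachable].any(dim=0)
    return reachable

mask = power_of_two_mask(1024)
for layers in range(1, 7):
    r = reachable_after(mask, query=1023, num_layers=layers)
    prefix_covered = bool(r[1024 - 2**layers:].all())   # the 2**layers tokens ending at the query
    print(f"{layers} layers: {int(r.sum()):4d} reachable, 2^{layers}-token prefix covered: {prefix_covered}")
```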
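And the hop-count check referenced in the second contribution: a positional gap can be crossed by peeling off one set bit per layer, so the required number of hops equals the popcount of the gap, which never exceeds ⌈log₂ N⌉. A minimal, self-contained check (the helper name is mine):

```python
def power_of_two_hops(gap: int) -> list[int]:
    """Decompose a positional gap into power-of-2 jumps, one attention edge each;
    the number of jumps equals the number of ones in the binary form of `gap`."""
    jumps = []
    while gap:
        lowest_bit = gap & -gap      # the lowest set bit is an exact power of 2
        jumps.append(lowest_bit)
        gap -= lowest_bit
    return jumps

gap = 1_000_000
jumps = power_of_two_hops(gap)
print(jumps)                              # [64, 512, 16384, 65536, 131072, 262144, 524288]
print(len(jumps), bin(gap).count("1"))    # 7 hops == popcount(gap), well below log2(10**6) ≈ 20
```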
Experimental Evaluations:
The empirical evaluation covers several dimensions:
On the PG19 test set, all sparse attention methods achieve competitive perplexity at context lengths up to 32K under a high sparsity ratio (approximately 94%). Notably, although sliding window schemes sometimes yield lower perplexity, the exponential receptive field growth of PowerAttention affords better performance on tasks where long-range dependencies are critical.
To probe the effectiveness of context retrieval, the authors design a passkey retrieval task. They show that while sliding window, dilated, and other existing sparse attention schemes suffer from incomplete information flow (some tokens remain unreachable within the model's number of layers), PowerAttention achieves full-sequence coverage. When the sequence length is extended to ultra-long contexts (e.g., 64K tokens), PowerAttention outperforms the alternatives, which the authors attribute to its discrete, phase-transition-like leaps in information propagation.
On the RULER benchmark, which comprises tasks designed to fully exploit long-range dependencies, PowerAttention consistently achieves higher scores than the baseline static patterns (sliding window, stride slash, dilated attention, LongNet). This performance is attributed to its ability to balance sparsity (for computational efficiency) against completeness of the receptive field.
The paper provides a detailed efficiency analysis both at the kernel level and in end-to-end latency. For instance, in a 128K context setting, PowerAttention is reported to be 3.0× faster than full attention during the prefill phase and shows significant improvements in the decoding phase as well. Additionally, its per-forward-pass cost scales as O(N log N), nearly linear in sequence length, making it competitive with sliding window methods while providing superior context coverage.
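As a rough sanity check of the O(N log N) claim (my own back-of-envelope count, ignoring any local window or block granularity the real kernel may use), the number of query-key scores at a 128K context differs from dense attention by nearly four orders of magnitude:

```python
N = 128 * 1024                          # 131,072-token context

full_scores = N * N                     # dense causal attention: O(N^2) query-key scores
pow2_offsets = (N - 1).bit_length()     # allowed gaps 1, 2, 4, ..., 2^16 -> 17 keys per query
pow2_scores = N * pow2_offsets          # O(N log N)

print(f"full attention : {full_scores:,}")                    # 17,179,869,184
print(f"power-of-2     : {pow2_scores:,}")                    # 2,228,224
print(f"reduction      : {full_scores / pow2_scores:,.0f}x")  # 7,710x
```

That the measured prefill speedup is 3.0× rather than thousands of times is expected, since end-to-end latency also includes the MLP blocks, memory movement, and kernel overheads that sparsifying attention does not remove.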
A probing analysis with linear classifiers is conducted to trace the propagation of passkey information across layers. The results visually demonstrate that in PowerAttention, information can leap to distant token positions at specific layers, in contrast to the gradual, linear spread observed with sliding window attention. This analysis confirms that PowerAttention not only extends the receptive field theoretically but also yields effective information propagation in practice.
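The probing methodology can be pictured with a generic sketch like the one below; it is not the authors' exact protocol, and the features are random stand-ins for real per-layer hidden states at the probed token position:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: one feature vector per (layer, example) taken from the probed
# position, and a binary label encoding which of two passkeys the prompt held.
num_examples, num_layers, hidden_dim = 200, 12, 256
hidden_states = rng.normal(size=(num_layers, num_examples, hidden_dim))
labels = rng.integers(0, 2, size=num_examples)

# One linear probe per layer: high held-out accuracy at layer l means the
# passkey information has already propagated to this position by depth l.
split = num_examples // 2
for layer in range(num_layers):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[layer, :split], labels[:split])
    accuracy = probe.score(hidden_states[layer, split:], labels[split:])
    print(f"layer {layer:2d}: probe accuracy {accuracy:.2f}")
```

With real hidden states, a sharp jump in probe accuracy at a particular layer is what distinguishes PowerAttention's leaps from the gradual spread seen with sliding windows.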
Summary of Main Strengths:
- Provides a rigorous theoretical framework that connects sparse attention patterns with effective receptive field growth.
- Demonstrates that by attending to tokens at power-of-2 intervals, one can achieve exponential growth in context coverage with minimal overhead.
- Offers comprehensive experimental validation across language modeling, synthetic retrieval tasks, and real-world long-context benchmarks.
- Delivers significant efficiency improvements in both prefill and decoding stages of transformer inference, which are critical for LLMs processing ultra-long sequences.
Overall, the innovative architectural design of PowerAttention addresses the inherent limitations of static sparse attention patterns by ensuring complete and efficient long-range dependency modeling while being computationally tractable for practical deployment in modern LLMs.