
Sparser Block-Sparse Attention via Token Permutation (2510.21270v1)

Published 24 Oct 2025 in cs.CL, cs.AI, and cs.CV

Abstract: Scaling the context length of LLMs offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (\textbf{PBS-Attn}), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn

Summary

  • The paper introduces PBS-Attn, which uses token permutation to enhance block sparsity and reduce redundant self-attention computations.
  • It implements segmented permutation and query-aware key sorting to maintain causality while clustering key tokens efficiently.
  • Experiments on long-context tasks show up to a 2.75× speedup without degrading accuracy, validating its practical efficiency improvements.

Sparser Block-Sparse Attention via Token Permutation

The paper introduces "Permuted Block-Sparse Attention (PBS-Attn)", a novel approach to enhance the computational efficiency of block-sparse attention mechanisms in LLMs. The primary goal of PBS-Attn is to leverage token permutation to improve block-level sparsity, reducing computational redundancy associated with self-attention mechanisms in transformers.

Motivation and Background

The paper targets the inefficiencies introduced by the O(N^2) computational complexity of self-attention in transformers. Block-sparse attention mitigates this cost by partitioning the sequence into blocks and computing attention only for selected query-key block pairs, skipping the rest. However, traditional block-sparse methods often suffer from sub-optimal sparsity patterns: the important key tokens for the queries in a given block may be scattered across many other blocks, so many blocks must still be computed even though only a few of their tokens matter, leading to redundant computation.
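
To make the block-sparse idea concrete, here is a minimal, illustrative sketch (not code from the paper; the block size, the top-k block budget, and the mean-pooled importance estimate are assumptions, and causal masking is omitted) showing how attention can be computed for only a few key blocks per query block while all other blocks are skipped:

```python
import torch

def block_sparse_attention(q, k, v, block_size=64, keep_blocks=4):
    # Toy block-sparse attention: for each query block, attend only to the
    # `keep_blocks` key blocks with the highest estimated importance.
    # Causal masking is omitted for brevity; real kernels fuse all of this.
    n, d = q.shape
    nb = n // block_size                                   # assumes n divisible by block_size
    scale = d ** -0.5
    # Cheap block-level importance estimate: mean-pooled queries vs. mean-pooled keys.
    q_pool = q.view(nb, block_size, d).mean(dim=1)         # (nb, d)
    k_pool = k.view(nb, block_size, d).mean(dim=1)         # (nb, d)
    block_scores = q_pool @ k_pool.T * scale               # (nb, nb)
    topk = block_scores.topk(keep_blocks, dim=-1).indices  # (nb, keep_blocks)

    out = torch.zeros_like(q)
    for i in range(nb):
        q_blk = q[i * block_size:(i + 1) * block_size]
        # Gather only the selected key/value blocks; every other block is skipped entirely.
        idx = torch.cat([torch.arange(j * block_size, (j + 1) * block_size)
                         for j in topk[i].tolist()])
        attn = torch.softmax(q_blk @ k[idx].T * scale, dim=-1)
        out[i * block_size:(i + 1) * block_size] = attn @ v[idx]
    return out

q, k, v = (torch.randn(256, 64) for _ in range(3))
print(block_sparse_attention(q, k, v).shape)  # torch.Size([256, 64])
```

The accuracy of such a scheme depends entirely on whether the important keys for a query block actually fall inside the selected blocks, which is the gap PBS-Attn addresses.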

Prior efforts to reduce attention complexity include architectural changes such as linear transformers, hardware-aware optimizations such as FlashAttention, and block-sparse techniques that prune interactions using masks. While effective, these methods are constrained by the model's inherent attention patterns and cannot fully capitalize on the sparsity present in natural-language sequences.

Permuted Block-Sparse Attention (PBS-Attn)

PBS-Attn introduces a permutation-driven strategy that reorders query and key sequences to increase block-level sparsity. Key innovations include the following:

  1. Symmetry Exploitation via Permutation: Attention is invariant to a joint permutation of key-value pairs and equivariant to a permutation of the queries: reordering keys and values together leaves each query's output unchanged, while reordering queries simply reorders the outputs. PBS-Attn exploits these properties to reorder tokens and increase block-level sparsity without altering the model's output (see the verification sketch after this list).
  2. Segmented Permutation: Since maintaining causality is crucial, particularly in autoregressive models, PBS-Attn uses segmented permutation. This technique allows for intra-segment permutation while ensuring inter-segment causality is preserved (Figure 1).

    Figure 1: Illustration of causal attention without (Left) and with (Right) segmented permutation with B=1, S=4. Segmented permutation enhances block-level sparsity via intra-segment permutation while preserving inter-segment causality. By restricting computation to blocks within on-diagonal segments (green blocks), we can safely skip inter-segment blocks (yellow blocks) for block-sparse attention.

  3. Query-aware Key Permutation: Within each segment, keys are sorted by estimated attention scores with respect to the queries, so that important keys are clustered into a small number of blocks that can be computed efficiently (see the permutation sketch after this list).
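
As a sanity check on the permutation properties behind item 1, the following short sketch (illustrative, not the paper's code) verifies with plain scaled dot-product attention that jointly permuting keys and values leaves the output unchanged, while permuting the queries only permutes the output rows:

```python
import torch

def attention(q, k, v):
    # Standard (non-causal) scaled dot-product attention.
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

n, d = 128, 32
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
perm = torch.randperm(n)

out = attention(q, k, v)

# Invariance: permuting keys and values *together* does not change the output.
assert torch.allclose(out, attention(q, k[perm], v[perm]), atol=1e-5)

# Equivariance: permuting the queries permutes the output rows the same way.
assert torch.allclose(out[perm], attention(q[perm], k, v), atol=1e-5)

print("permutation invariance/equivariance verified")
```

With a causal mask these identities only hold for suitably restricted permutations, which is precisely why PBS-Attn confines permutation to segments.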
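
The next sketch conveys the flavor of the segmented, query-aware key permutation in items 2-3: keys are reordered only within each segment, sorted by an estimated importance score so that the most relevant keys concentrate in a few blocks. The mean-pooled query estimator, segment size, and block size here are illustrative assumptions, not the paper's actual scoring rule or its Triton implementation:

```python
import torch

def segmented_key_permutation(q, k, segment_size=512):
    # Reorder keys *within* each segment by an estimated importance score.
    # Cross-segment order is untouched, so segment-level causality is preserved.
    # The scoring rule (mean-pooled queries) is an illustrative assumption.
    n, d = k.shape
    perm = torch.arange(n)
    q_pool = q.mean(dim=0)                       # crude query summary, shape (d,)
    for start in range(0, n, segment_size):
        end = min(start + segment_size, n)
        seg_scores = k[start:end] @ q_pool       # estimated importance per key
        order = torch.argsort(seg_scores, descending=True)
        perm[start:end] = start + order
    return perm                                   # apply as k[perm], v[perm]

n, d = 2048, 64
q, k = torch.randn(n, d), torch.randn(n, d)
perm = segmented_key_permutation(q, k)
# After permutation, high-scoring keys sit at the front of each segment,
# so whole key blocks at the tail of a segment can often be skipped.
```

In a full implementation the same permutation would be applied to the values and tracked inside the attention kernel so that the final output order is unchanged.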

Implementation and Experimentation

PBS-Attn is implemented with custom permuted-FlashAttention kernels written in Triton, targeting the prefill stage of inference; in the experiments, tensor parallelism is used to accommodate longer contexts under memory constraints. Extensive evaluations with Llama-3.1-8B and Qwen-2.5-7B-1M on the LongBench and LongBenchv2 benchmarks show that PBS-Attn maintains accuracy close to the full-attention baseline while achieving significant computational speedups.
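
For reference, end-to-end prefill speedups of this kind are typically reported as time to first token (TTFT). A minimal measurement harness might look as follows; this is an illustration using Hugging Face Transformers, not the paper's benchmarking code, and the model identifier in the comments is an assumption:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def time_to_first_token(model, tokenizer, prompt, device="cuda"):
    # Measure prefill latency (time until the first generated token) for one prompt.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=1, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# Example usage (illustrative; checkpoint name and loading details may differ):
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct",
#                                              torch_dtype=torch.bfloat16).to("cuda")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# print(time_to_first_token(model, tokenizer, long_prompt))
```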

Main Results

  • PBS-Attn outperformed other sparsity-driven approaches on multiple long-context tasks, achieving up to a 2.75× end-to-end speedup in LLM prefilling.
  • The proposed permutation strategy increases block-level sparsity by clustering important key tokens, which translates into strong performance on real tasks as well as measurable prefilling speedups (Figure 2).

    Figure 2: Speedup of various methods relative to FlashAttention, measured by time to first token (TTFT) on LongBenchv2 across various sequence lengths. To accommodate longer sequences under memory constraints, we employ tensor parallelism with tp_size of 2 and 8 for the 256K and 512K contexts, respectively.

Ablation Studies

The ablation studies confirmed the role of permutation in improving block sparsity and computational efficiency. Varying the permutation target and the segment size showed that the approach provides robust control over the density-performance trade-off.

Conclusion

PBS-Attn offers a novel methodology for improving the computational efficiency of LLMs. By intelligently permuting key and query tokens, the method strategically increases block sparsity, thus enhancing the scalability and speed of processing without degrading model accuracy. This paper not only extends the capabilities of block-sparse attention mechanisms but also lays the groundwork for more efficient implementations of transformer models in handling ultra-long sequences. Future work can explore integrating PBS-Attn with other efficiency-oriented methods, such as low-rank and quantization techniques, to further optimize resource use.
