XAttention: Block Sparse Attention with Antidiagonal Scoring (2503.16428v1)

Published 20 Mar 2025 in cs.CL and cs.CV

Abstract: Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.

Overview

The paper "XAttention: Block Sparse Attention with Antidiagonal Scoring" (Xu et al., 20 Mar 2025) presents a plug-and-play framework designed to alleviate the quadratic complexity of conventional full attention mechanisms in long-context Transformer models. The authors propose a block-sparse attention mechanism driven by a novel antidiagonal scoring approach. This technique identifies and prunes non-essential blocks in the attention matrix while preserving critical information, attaining attention computation acceleration of up to 13.5x compared to full attention implementations such as FlashAttention.

Block Sparse Attention with Antidiagonal Scoring

The core innovation lies in the observation that the sum along the antidiagonals (i.e., lower-left to upper-right traversals) of attention blocks serves as an effective proxy for the importance of those blocks. This process can be summarized as follows:

  1. Block Partitioning and Scoring: The attention matrix is partitioned into fixed-size blocks. For each block, a score is computed by summing the values that lie on antidiagonals sampled at a stride S. Because the union of these strided antidiagonals touches every row and column of the block, every query and key token contributes to the score, and because antidiagonals cut across both vertical and slash (diagonal) attention patterns, the score does not favor one structural pattern over another.
  2. Thresholding and Block Selection: A threshold τ (fixed in the basic setting, or predicted per attention head with a dynamic-programming procedure) is applied to the antidiagonal scores. Blocks whose scores exceed τ are retained, on the grounds that they carry substantial attention mass, while the remaining blocks are pruned from computation. This thresholding balances sparsity against information retention.
  3. Efficient Attention Computation: Attention is then computed only within the selected blocks, so the cost for long sequences scales with the number of retained blocks rather than with the full quadratic attention map. This block-level selection is both computationally efficient and adaptive to long-range dependencies.

Below is a simplified, runnable sketch of the block-selection step; a block-sparse attention kernel is assumed to consume the returned mask of retained blocks:

import numpy as np

def xattention_block_mask(attention_scores, block_size, stride, threshold):
    # Partition the score matrix into block_size x block_size tiles and keep
    # only those whose strided antidiagonal sum exceeds the threshold.
    # (Assumes the sequence length is a multiple of block_size.)
    num_blocks = attention_scores.shape[0] // block_size
    block_mask = np.zeros((num_blocks, num_blocks), dtype=bool)

    for i in range(num_blocks):
        for j in range(num_blocks):
            block = attention_scores[i * block_size:(i + 1) * block_size,
                                     j * block_size:(j + 1) * block_size]
            # Score the block by summing values on its strided antidiagonals
            score = compute_antidiagonal_sum(block, stride)
            # Retain the block only if its score exceeds the threshold
            block_mask[i, j] = score > threshold

    # A block-sparse attention kernel would then compute attention only
    # inside the retained (True) blocks.
    return block_mask

def compute_antidiagonal_sum(block, stride):
    # Cells with the same (row + col) lie on one antidiagonal (lower-left to
    # upper-right); keep every stride-th antidiagonal and sum its values.
    n = block.shape[0]
    antidiag_id = np.add.outer(np.arange(n), np.arange(n))  # row + col per cell
    return float(block[antidiag_id % stride == 0].sum())
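
As a quick illustration of how this sketch might be exercised (the random matrix, block size, stride, and threshold below are arbitrary values chosen for the toy example, not the paper's settings):

import numpy as np

# Toy stand-in for real attention values, purely to exercise the sketch above.
rng = np.random.default_rng(0)
scores = rng.random((512, 512))

mask = xattention_block_mask(scores, block_size=64, stride=8, threshold=256.0)
print(f"retained {int(mask.sum())} of {mask.size} blocks ({mask.mean():.1%} density)")

In a real deployment the block scores are obtained far more cheaply than by materializing the full attention matrix (that cheap importance measurement is the point of the method); the dense matrix here is used only to keep the sketch readable.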

Note that advanced implementations include a dynamic programming approach to individually set thresholds per attention head, thereby better adapting to heterogeneous importance distributions within the matrix.
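
The paper's dynamic-programming predictor is not reproduced here. As a rough illustration of per-head calibration, the following sketch picks, for each head, the smallest score cutoff whose retained blocks cover a target fraction of that head's total antidiagonal score mass; the function name, the target_mass parameter, and the heuristic itself are assumptions for illustration only, not the paper's method.

import numpy as np

def calibrate_per_head_cutoffs(per_head_block_scores, target_mass=0.9):
    # per_head_block_scores: one (num_blocks x num_blocks) score grid per head,
    # e.g. filled with compute_antidiagonal_sum values.
    cutoffs = []
    for head_scores in per_head_block_scores:
        flat = np.sort(head_scores.ravel())[::-1]   # largest block scores first
        covered = np.cumsum(flat) / flat.sum()      # cumulative fraction of score mass
        k = int(np.searchsorted(covered, target_mass))
        cutoffs.append(flat[min(k, flat.size - 1)])  # weakest block still retained
    return cutoffs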

Experimental Results and Numerical Performance

The experimental validation of XAttention was performed across several demanding benchmarks:

  • RULER (Natural Language): With configurations using strides of S=8 and S=16, XAttention achieved accuracy on par with full attention while outperforming baselines such as FlexPrefill and MInference, particularly at very long sequence lengths.

  • LongBench (Real-World NLP): The method obtained the highest average scores across multiple tasks, underscoring its viability for practical, long-context natural language processing.

  • VideoMME (Video Understanding) and VBench (Video Generation): For video understanding, XAttention consistently outperformed the full-attention baseline (FlashAttention), while in video generation it maintained high visual fidelity. A "warmup" phase (full attention for the initial denoising steps) further improves generation quality, and sparsity exceeding 50% was achieved without significant compromises in quality.

  • Speedup Considerations: The demonstrated acceleration of up to 13.5x at 256K tokens marks a substantial reduction in computational load and inference latency. Detailed breakdowns show that the cost of the block-selection phase is negligible compared to the savings realized in the attention computation itself.

Implications for Long-Context Transformer Models

XAttention directly addresses the quadratic complexity in attention computation, which is a bottleneck in contemporary long-context Transformer architectures. Its design carries several key implications:

  • Scalability: By significantly reducing computational overhead, XAttention facilitates the deployment of Long-Context Transformer Models (LCTMs) in real-world applications where long sequence lengths and multimodal inputs (e.g., video and language) are prevalent.

  • Efficiency: The antidiagonal scoring method retains the most informative portions of the attention matrix, preserving critical dependencies across tokens while sidestepping redundant computation.

  • Adaptivity: Per-head dynamic thresholding lets the model adaptively balance sparsity and accuracy, an essential feature for handling heterogeneous data modalities and varying sequence characteristics.

  • Democratization of LCTMs: By lowering resource requirements, XAttention makes it feasible to deploy these models in computationally constrained environments without compromising performance.

Implementation Considerations and Deployment

For practitioners, several technical aspects are crucial when integrating XAttention into existing systems:

  • Compatibility: XAttention is designed as a plug-and-play module, making it straightforward to integrate into standard Transformer architectures, particularly those that already support block-wise operations and sparse matrix computations.

  • Threshold Calibration: The dynamic-programming approach to threshold prediction adds computational overhead during model initialization but pays off in better sparsity adaptation and inference efficiency.

  • Hardware Utilization: XAttention's effectiveness depends on GPU-accelerated sparse matrix operations; implementers should leverage optimized libraries (e.g., CUDA kernels for block-sparse attention) to realize the speedup on high-throughput inference systems.

  • Scaling and Memory Trade-offs: While the method dramatically reduces computation time, memory access patterns and the overheads of block extraction and sparse tensor management deserve attention. Profiling on the target hardware is recommended to balance memory latency against throughput gains.

  • Benchmarking: Thorough benchmarking on the target domain (e.g., NLP, video understanding, or generation) is advised to determine the best block size, stride, and threshold settings; the paper's reported numbers provide a strong baseline that can be tuned to specific deployment constraints. A minimal parameter-sweep sketch follows this list.
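
As a starting point for such tuning, the sketch below sweeps block size, stride, and threshold on a sample score matrix and reports the resulting block density. It reuses the xattention_block_mask helper sketched earlier, and the candidate values are illustrative placeholders, not the paper's recommended settings; in practice each configuration should also be paired with task accuracy on a held-out set.

import itertools
import numpy as np

# Illustrative sweep over (block_size, stride, threshold) on one sample matrix.
rng = np.random.default_rng(0)
sample_scores = rng.random((1024, 1024))

for block_size, stride in itertools.product([64, 128], [4, 8, 16]):
    cells_per_block = block_size * block_size // stride  # cells summed per block
    for frac in (0.45, 0.50, 0.55):
        threshold = frac * cells_per_block  # scaled to the expected block score
        mask = xattention_block_mask(sample_scores, block_size, stride, threshold)
        print(f"B={block_size:4d} S={stride:2d} tau={threshold:8.1f} "
              f"density={mask.mean():.2%}")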

Conclusion

XAttention offers a rigorous and computationally efficient framework for reducing the cost of long-context attention in Transformer models. Its antidiagonal scoring mechanism uniquely balances accuracy and efficiency through a strategic block selection process, as validated by extensive empirical evaluations across diverse benchmarks. The quantitative improvements, including up to 13.5x acceleration and over 50% sparsity without significant degradation in performance, provide compelling evidence for its adoption in scalable LCTMs. For researchers and practitioners seeking to deploy long-context models in resource-constrained environments, integrating XAttention can provide substantial computational gains while maintaining high fidelity in performance outcomes.

Authors (5)
  1. Ruyi Xu (4 papers)
  2. Guangxuan Xiao (16 papers)
  3. Haofeng Huang (14 papers)
  4. Junxian Guo (6 papers)
  5. Song Han (155 papers)