Rectified SpaAttn: Sparse Attention Rectification
- The paper rectifies systematic biases in block-wise sparse attention by aligning critical block weights with implicit full-attention references.
- Rectified SpaAttn employs pooled representations of queries and keys to approximate full attention while maintaining temporal and spatial alignment.
- Empirical evaluations demonstrate 2–3× speedups and quality metrics comparable to dense attention, supporting high-resolution multimodal generation.
Rectified SpaAttn is a sparse attention rectification methodology designed to address the systematic biases introduced by block-wise sparse attention mechanisms in large-scale video and image generation models. Its central technical innovation is the recalibration of attention weights against implicit full-attention references, so that sparse attention distributions better approximate their dense counterparts. The method achieves substantial speedups on high-resolution, long-context video and image generation tasks with minimal degradation in sample quality, and is released as open source (Liu et al., 25 Nov 2025).
1. Preliminaries: Full and Block-Sparse Attention
Consider query, key, and value sequences $Q, K, V \in \mathbb{R}^{N \times d}$ with sequence length $N$ and hidden dimension $d$. For Diffusion Transformers (DiTs), $N = N_v + N_t$, denoting concatenated video and text tokens. These sequences are partitioned into non-overlapping blocks of size $B$, so that $Q = [Q_1; \dots; Q_n]$ with $n = \lceil N/B \rceil$ and $Q_i \in \mathbb{R}^{B \times d}$, and similarly for $K$ and $V$ with blocks $K_j, V_j \in \mathbb{R}^{B \times d}$.
Full attention computes, for query block $Q_i$:

$$O_i = \mathrm{softmax}\!\left(\frac{Q_i K^{\top}}{\sqrt{d}}\right) V$$

Block-wise sparse attention introduces a binary mask $M \in \{0,1\}^{n \times n}$ to select “critical” key/value blocks. The sparse computation for $Q_i$ proceeds as:

$$O_i^{\mathrm{spa}} = \mathrm{softmax}\!\left(\frac{Q_i K_{\mathcal{C}_i}^{\top}}{\sqrt{d}}\right) V_{\mathcal{C}_i}, \qquad \mathcal{C}_i = \{\, j : M_{ij} = 1 \,\}$$

Here, “critical” refers to key/value blocks $j$ with $M_{ij} = 1$ and “non-critical” to those with $M_{ij} = 0$.
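To make the full/sparse contrast concrete, the minimal PyTorch sketch below computes both variants with a toy block mask. Shapes and function names (`full_attention`, `block_sparse_attention`) are illustrative assumptions, not the paper's fused kernels.

```python
import torch
import torch.nn.functional as F

def full_attention(Q, K, V):
    # Q, K, V: (N, d) token-level tensors
    scores = Q @ K.T / Q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ V

def block_sparse_attention(Q, K, V, M, B):
    # M: (N//B, N//B) binary block mask; M[i, j] = 1 keeps key/value block j for query block i
    scores = Q @ K.T / Q.shape[-1] ** 0.5
    token_mask = M.bool().repeat_interleave(B, 0).repeat_interleave(B, 1)  # expand block mask to tokens
    scores = scores.masked_fill(~token_mask, float("-inf"))                # drop non-critical blocks
    return F.softmax(scores, dim=-1) @ V            # softmax renormalizes over critical blocks only

# Toy usage
N, d, B = 8, 4, 2
Q, K, V = (torch.randn(N, d) for _ in range(3))
M = torch.rand(N // B, N // B) > 0.5
idx = torch.arange(N // B)
M[idx, idx] = True  # keep each query block's own key block so no query row is empty
out_sparse = block_sparse_attention(Q, K, V, M, B)
out_full = full_attention(Q, K, V)
```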
2. Systematic Biases in Sparse Attention
Sparse masks, by truncating attention distributions, introduce two types of systematic error:
- Amplification bias: For critical blocks, the sparse attention weights $A^{\mathrm{spa}}_{ij}$ are consistently greater than the true (full) attention weights $A_{ij}$, because the softmax normalizes over the reduced critical set: $A^{\mathrm{spa}}_{ij} = A_{ij} / \sum_{k \in \mathcal{C}_i} A_{ik} \geq A_{ij}$.
- Mass-loss bias: All non-critical blocks receive zero attention ($A^{\mathrm{spa}}_{ij} = 0$ for $M_{ij} = 0$), whereas their true weights $A_{ij}$ under dense attention remain nonzero.
These biases, if unrectified, lead to degraded sample diversity and reduced synchronization between spatial and temporal modalities in generative video and image models.
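A small numerical check (an assumed toy setup, not from the paper) makes both biases visible: restricting the softmax to a kept subset of keys renormalizes the retained weights upward and discards the remaining mass entirely.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 16)
K = torch.randn(64, 16)
scores = (q @ K.T) / 16 ** 0.5

A_full = F.softmax(scores, dim=-1)              # dense attention over all 64 keys
keep = A_full.argsort(descending=True)[0, :16]  # treat the top-16 keys as the "critical" set
A_sparse = F.softmax(scores[:, keep], dim=-1)   # sparse softmax over critical keys only

print(A_sparse.sum().item())                    # 1.0: sparse weights renormalize to full mass
print(A_full[0, keep].sum().item())             # < 1.0: true dense mass on the critical keys
# Every retained weight is inflated by 1 / (true critical mass) -> amplification bias;
# every non-critical key loses its nonzero dense attention      -> mass-loss bias.
```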
3. Implicit Full-Attention Reference via Pooled QK
Rectified SpaAttn circumvents the computational intractability of full attention by constructing an implicit full-attention reference from block-pooled representations $\bar{Q}_i = \mathrm{pool}(Q_i)$ and $\bar{K}_j = \mathrm{pool}(K_j)$, each aggregating a block of $B$ tokens into a single vector.
Mixed key banks are assembled as $\bar{K}^{\mathrm{mix}} = [\bar{K}^{v}; K^{t}]$ (pooled video key blocks concatenated with unpooled text keys), yielding block-pooled scores and softmax:

$$\bar{A} = \mathrm{softmax}\!\left(\frac{\bar{Q}\,(\bar{K}^{\mathrm{mix}})^{\top}}{\sqrt{d}}\right)$$

$\bar{A}$ serves as an efficient estimator of the true block-level attention, forming an “implicit full attention” reference.
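A minimal sketch of the pooled reference, assuming mean pooling over blocks and omitting the text-key branch for brevity (function names are illustrative):

```python
import torch
import torch.nn.functional as F

def block_pool(X, B):
    # (N, d) -> (N // B, d): mean over each non-overlapping block of B tokens (assumed pooling)
    N, d = X.shape
    return X.reshape(N // B, B, d).mean(dim=1)

def implicit_reference(Q, K, B):
    Q_pool = block_pool(Q, B)                 # (n, d) pooled query blocks
    K_pool = block_pool(K, B)                 # (n, d) pooled key blocks
    scores = Q_pool @ K_pool.T / Q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1)          # (n, n): entry [i, j] estimates the dense attention
                                              # mass that query block i places on key block j
```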
4. Isolated-Pooling Attention Reallocation (IPAR) and Rectification
Given the heterogeneity of text tokens, the pooling process isolates text blocks: video keys are pooled, text keys are retained at token resolution, and mixed pools are constructed as $\bar{K}^{\mathrm{mix}} = [\mathrm{pool}(K^{v}); K^{t}]$.
Block-wise softmax yields pooled attention scores, which are normalized and concatenated into a distribution-aligned reference $\bar{A}$. The rectification factor for the critical blocks of query block $i$ is defined as:

$$R_i = \sum_{j \in \mathcal{C}_i} \bar{A}_{ij}$$

The amplified sparse attention output is then rescaled:

$$O_i^{\mathrm{cri}} = R_i \, O_i^{\mathrm{spa}}$$
This procedure aligns critical block attention with dense-reference distributions, reducing amplification error.
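A sketch of this rescaling step, consistent with the pseudocode in Section 6 (the block-to-token broadcasting via `repeat_interleave` is an assumed simplification):

```python
import torch

def ipar_rescale(O_sparse, A_pool, M, B):
    # O_sparse: (N, d) output of the block-sparse kernel
    # A_pool:   (n, n) pooled reference attention; M: (n, n) binary critical-block mask
    R = (A_pool * M).sum(dim=1)                             # reference mass covered by critical blocks
    return O_sparse * R.repeat_interleave(B).unsqueeze(-1)  # scale each query block's tokens by R_i

# Because the sparse softmax sums to 1 over critical blocks only, multiplying by R_i <= 1
# shrinks the critical-block contribution back toward its dense-attention share.
```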
5. Gain-Aware Pooling Rectification (GAPR) for Non-Critical Tokens
For non-critical blocks, attention mass recovery leverages the pooled reference $\bar{A}$, subject to a rectification constraint. Define:
- Attention gain $G_{ij}$: the attention mass recovered by compensating non-critical block $j$ with its pooled estimate $\bar{A}_{ij}$, relative to the zero weight assigned by plain sparse attention.
- Pooling error $E_{ij}$: the approximation error incurred by replacing the token-level attention of block $j$ with its block-pooled estimate.
Compensation applies only when $|G_{ij}| > |E_{ij}|$; a mask $M^{c}_{ij} = \mathbb{1}[\,|G_{ij}| > |E_{ij}|\,]$ selects such blocks. Non-critical, compensated blocks contribute:

$$O_i^{\mathrm{ncri}} = \sum_{j \notin \mathcal{C}_i,\; M^{c}_{ij} = 1} \bar{A}_{ij}\, \bar{V}_j$$

where $\bar{V}_j$ denotes the block-pooled value of block $j$.
This conservative strategy ensures recovered attention mass does not increase overall error.
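A sketch of the gain-aware compensation, assuming the gain and error tensors $G$ and $E$ have already been computed at block resolution (their exact formulas follow the paper; the tensor layout here is illustrative):

```python
import torch

def gapr_compensate(A_pool, V_pool, M, G, E, B):
    # A_pool, M, G, E: (n, n) block-level tensors; V_pool: (n, d) pooled value blocks
    M_c = G.abs() > E.abs()              # compensate only where the gain outweighs the pooling error
    A_comp = A_pool * (~M.bool() & M_c)  # reference mass on compensated non-critical blocks
    O_ncri = A_comp @ V_pool             # (n, d) recovered contribution per query block
    return O_ncri.repeat_interleave(B, dim=0)  # broadcast block outputs back to token resolution
```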
6. High-Performance Implementation and Pseudocode
Rectified SpaAttn integrates both sparse and full kernels; block-wise sparse attention and pooling operations leverage FlashAttention2 and Triton, supporting video-text multimodal blocks. The method’s main tunable hyperparameters are topk (number of critical blocks) and p (retained attention mass). Overhead is negligible because rectifications operate strictly on pooled vectors.
A concise high-level pseudocode representation:
```python
def RectifiedSpaAttn(Q, K, V, topk, p, adj_mask):
    # Split concatenated video/text tokens
    Qv, Qt = split_video_text(Q)
    Kv, Kt = split_video_text(K)
    Vv, Vt = split_video_text(V)

    # Block-pooled representations of the video tokens
    Qv_pool = block_pool(Qv)
    Kv_pool = block_pool(Kv)
    Vv_pool = block_pool(Vv)

    # Implicit full-attention reference over pooled video keys and raw text keys
    Kmix_pool = concat(Kv_pool, Kt)
    A_mix = softmax(Qv_pool @ Kmix_pool.T / sqrt(d))
    A_pool = reallocate_to_blocks(A_mix, B)

    # Gain-aware compensation mask (GAPR)
    G = compute_gain(A_pool, Qv, Kv)
    E = compute_error(Qv, Kv, Qv_pool, Kv_pool)
    M_c = (abs(G) > abs(E))

    # Critical-block selection and sparse/full kernels
    M_I = topk_blocks(A_pool, topk)
    M_sparse = M_I | adj_mask
    O_v_spa = SparseKernel(Qv, K, V, M_sparse)
    O_t = FullKernel(Qt, K, V)

    # IPAR: rescale critical-block outputs toward the dense reference
    R = (A_pool * M_sparse).sum(dim=1)
    O_v_cri = scale_blocks(O_v_spa, R)

    # GAPR: recover attention mass on compensated non-critical blocks
    O_v_ncri = (A_pool * (~M_sparse & M_c)) @ Vv_pool
    O_v_prime = O_v_cri + O_v_ncri
    return concat(O_v_prime, O_t)
```
7. Empirical Evaluation and Comparative Outcomes
Extensive benchmarks demonstrate the effectiveness of Rectified SpaAttn on large generative models such as HunyuanVideo-T2V-13B, Wan2.1 (T2V & I2V-14B), and Flux.1-dev-12B. Metrics used include sparsity, FLOPs, NVIDIA H100 end-to-end latency, speedup, and generation quality (VR, VBench for video; FID, CLIP, SSIM, PSNR, LPIPS for images).
Summary results:
| Model | Sparsity (%) | Speedup (×) | Quality |
|---|---|---|---|
| HunyuanVideo-T2V | 79.7–88.95 | 2.50–3.33 | VBench 83.13 |
| Flux.1-dev | 74.05 | 1.40 | FID 5.37 |
| Wan2.1-T2V/I2V | ~74.9–79.7 | 1.68–1.80 | VBench 83.72 |
| +TeaCache | N/A | up to 8.97 | N/A |
Rectified SpaAttn maintains quality levels statistically indistinguishable from full-attention baselines, even at sparsity rates approaching 90%. In contrast, previous sparse methods exhibit sharp degradation.
8. Significance and Position within Sparse Attention Research
Rectified SpaAttn advances sparse attention methodology by (1) accurately reallocating amplified mass on critical blocks, (2) conditionally recovering lost mass on non-critical blocks with rigorous error controls, and (3) achieving 2–3× pure sparse kernel speedups and up to 9× end-to-end gains when integrated with caching schemes. The technical principles are distinct from periodic dense rectification approaches like Rectified Sparse Attention (ReSA) for LLMs (Sun et al., 4 Jun 2025), which refresh KV caches via periodic dense passes.
Rectified SpaAttn is particularly suited for high-fidelity video generation and can be generalized to other block-sparse, multimodal scenarios where dense attention computation is prohibitive but sample quality is critical.