Rectified SpaAttn: Sparse Attention Rectification

Updated 2 December 2025
  • The paper rectifies systematic biases in block-wise sparse attention by aligning critical block weights with implicit full-attention references.
  • Rectified SpaAttn employs pooled representations of queries and keys to approximate full attention while maintaining temporal and spatial alignment.
  • Empirical evaluations demonstrate 2–3× speedups and quality metrics comparable to dense attention, supporting high-resolution multimodal generation.

Rectified SpaAttn is a sparse attention rectification methodology designed to address the systematic biases introduced by block-wise sparse attention mechanisms in large-scale video and image generation models. Its central technical innovation is the recalibration of sparse attention weights against implicit full-attention references, so that sparse attention distributions better approximate their dense counterparts. The method improves computational efficiency on high-resolution, long-context video and image generation tasks, achieving substantial speedups with minimal degradation in sample quality, and is released as open source (Liu et al., 25 Nov 2025).

1. Preliminaries: Full and Block-Sparse Attention

Consider query, key, and value sequences $Q, K, V \in \mathbb{R}^{T \times d}$ with sequence length $T$ and hidden dimension $d$. For Diffusion Transformers (DiTs), $T = T_v + T_t$, denoting concatenated video and text tokens. These sequences are partitioned into non-overlapping blocks of size $B$, so that $Q = [Q_1, ..., Q_N]$ with $Q_n \in \mathbb{R}^{B \times d}$ and $N = T/B$; $K$ and $V$ are partitioned analogously into $M$ blocks (with $M = N$ for self-attention).

Full attention computes, for query block $Q_n$:

$$S_n = \tfrac{1}{\sqrt{d}}\, Q_n K^\top \in \mathbb{R}^{B \times (MB)}$$

$$A_n = \mathrm{softmax}(S_n) \in \mathbb{R}^{B \times (MB)}$$

$$O_n^{\mathrm{full}} = A_n V \in \mathbb{R}^{B \times d}$$

Block-wise sparse attention introduces a binary mask $\widehat{M} \in \{0,1\}^{N \times M}$ to select “critical” blocks. The sparse computation for $Q_n$ proceeds as:

$$S_n^{\mathrm{spa}} = \tfrac{1}{\sqrt{d}}\, (Q_n K^\top) \odot \widehat{M}_n$$

$$A_n^{\mathrm{spa}} = \mathrm{softmax}(S_n^{\mathrm{spa}})$$

$$O_n^{\mathrm{spa}} = A_n^{\mathrm{spa}} V$$

Here, “critical” refers to entries where $\widehat{M}_{n,m} = 1$ and “non-critical” to entries where $\widehat{M}_{n,m} = 0$; non-critical blocks are excluded from the softmax normalization.
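
As a concrete reference for these definitions, the following NumPy sketch (illustrative only; the shapes, block size, and the convention of masking non-critical scores to $-\infty$ are assumptions rather than the paper's kernel implementation) computes attention for a single query block in both the full and block-sparse settings.

import numpy as np

def block_attention(Qn, K, V, block_mask=None):
    """Attention for one query block Qn (B x d) against full K, V (T x d).

    block_mask: optional boolean vector over the M = T // B key blocks;
    non-critical blocks are excluded by setting their scores to -inf.
    """
    B, d = Qn.shape
    S = Qn @ K.T / np.sqrt(d)                      # (B, T) scaled scores
    if block_mask is not None:
        token_mask = np.repeat(block_mask, B)      # expand block mask to tokens
        S = np.where(token_mask, S, -np.inf)
    A = np.exp(S - S.max(axis=-1, keepdims=True))  # stable row-wise softmax
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V                                   # (B, d) block output

# Example: T = 16 tokens, block size B = 4, hence M = 4 key blocks.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
O_full = block_attention(Q[:4], K, V)                                # dense
O_spa = block_attention(Q[:4], K, V, np.array([1, 1, 0, 0], bool))   # sparse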

2. Systematic Biases in Sparse Attention

Sparse masks, by truncating attention distributions, introduce two types of systematic error:

  • Amplification bias: For critical blocks, attention weights $A_{n,m}^{\mathrm{spa}}$ are consistently greater than the true (full) attention $A_{n,m}$, due to normalization over a reduced set:

$$A_{n,m}^{\mathrm{spa}} = \frac{\exp(S_{n,m})}{\sum_{m'=1}^{M} \exp(S_{n,m'}) \cdot \widehat{M}_{n,m'}}$$

  • Mass-loss bias: All non-critical blocks receive zero attention ($A_{n,m}^{\mathrm{spa}} = 0$), whereas their true values $A_{n,m}$ in dense attention remain nonzero.

These biases, if unrectified, lead to degraded sample diversity and reduced synchronization between spatial and temporal modalities in generative video and image models.
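
A toy numeric example (hypothetical scores, not taken from the paper) makes both biases concrete for a single query row over $M = 4$ key blocks, two of which are critical:

import numpy as np

# Block-level scores S_{n,m} for one query against M = 4 key blocks.
S = np.array([2.0, 1.0, 0.5, -1.0])
mask = np.array([1, 1, 0, 0], dtype=bool)      # blocks 0 and 1 are critical

A_full = np.exp(S) / np.exp(S).sum()           # dense reference
A_spa = np.zeros_like(S)
A_spa[mask] = np.exp(S[mask]) / np.exp(S[mask]).sum()

print(A_full)  # ~[0.61, 0.22, 0.14, 0.03]
print(A_spa)   # ~[0.73, 0.27, 0.00, 0.00]
# Amplification bias: critical weights are inflated (0.73 > 0.61, 0.27 > 0.22).
# Mass-loss bias: the ~0.17 of true mass on non-critical blocks is dropped.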

3. Implicit Full-Attention Reference via Pooled QK

Rectified SpaAttn circumvents the computational intractability of full attention by constructing an implicit full-attention reference using block-pooled representations:

$$q_n^{\mathrm{pool}} = \frac{1}{B}\sum_{i \in \mathrm{block}\ n} q_i$$

$$k_m^{\mathrm{pool}} = \frac{1}{B}\sum_{j \in \mathrm{block}\ m} k_j$$

The pooled key bank $K^{\mathrm{pool}} = [k_1^{\mathrm{pool}}, ..., k_M^{\mathrm{pool}}]$ yields block-pooled scores and a block-level softmax:

$$S_n^{\mathrm{pool}} = \tfrac{1}{\sqrt{d}}\, q_n^{\mathrm{pool}} (K^{\mathrm{pool}})^\top$$

$$A_n^{\mathrm{pool}} = \mathrm{softmax}(S_n^{\mathrm{pool}})$$

$A_n^{\mathrm{pool}}$ serves as an efficient estimator for true block-level attention, forming an “implicit full attention” reference.
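
A minimal sketch of this pooled reference, assuming plain mean pooling over contiguous blocks (function names are illustrative, not the paper's API):

import numpy as np

def block_pool(X, B):
    """Mean-pool a (T, d) sequence into (T // B, d) block representations."""
    T, d = X.shape
    return X.reshape(T // B, B, d).mean(axis=1)

def pooled_block_attention(Q, K, B):
    """Implicit full-attention reference A^pool over blocks, shape (N, M)."""
    q_pool = block_pool(Q, B)                          # (N, d)
    k_pool = block_pool(K, B)                          # (M, d)
    S_pool = q_pool @ k_pool.T / np.sqrt(Q.shape[-1])  # block-level scores
    S_pool -= S_pool.max(axis=-1, keepdims=True)       # numerical stability
    A_pool = np.exp(S_pool)
    return A_pool / A_pool.sum(axis=-1, keepdims=True)

# A_pool[n, m] estimates the total attention mass that query block n
# places on key block m under full attention.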

4. Isolated-Pooling Attention Reallocation (IPAR) and Rectification

Given the heterogeneity of text tokens, the pooling process isolates text blocks: video keys $K_v$ are pooled, text keys $K_t$ are retained, and the mixed pool is constructed as $K_{\mathrm{mix}}^{\mathrm{pool}} = [K_v^{\mathrm{pool}};\, K_t]$.

Block-wise softmax yields $A_v^{\mathrm{pool}}$ and $A_t$, which are normalized and concatenated into a distribution-aligned $A^{\mathrm{pool}}$. The rectification factor for critical blocks is defined as:

$$R_n = \sum_{m:\,\widehat{M}_{n,m}=1} A_{n,m}^{\mathrm{pool}}$$

The amplified sparse attention is then rescaled:

$$A_{n,m}^{\mathrm{spa}'} = A_{n,m}^{\mathrm{spa}} \times R_n \approx A_{n,m}$$

This procedure aligns critical block attention with dense-reference distributions, reducing amplification error.
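
Concretely, the critical-block rectification reduces to a per-block rescaling of the sparse output, sketched below at the tensor level (shapes and function name are illustrative; this is not the paper's fused kernel):

import numpy as np

def ipar_rescale(O_spa, A_pool, M_hat):
    """Rescale each query block's sparse output by its factor R_n (IPAR).

    O_spa:  (N, B, d) sparse-attention outputs per query block.
    A_pool: (N, M)    pooled block-level attention reference.
    M_hat:  (N, M)    boolean critical-block mask.
    """
    # R_n: pooled attention mass falling on the critical blocks of query block n.
    R = (A_pool * M_hat).sum(axis=1)                   # (N,)
    # Since O_n^spa = A_n^spa V, scaling the output by R_n is equivalent to
    # rescaling the amplified weights: A^spa' = A^spa * R_n ≈ A (full).
    return O_spa * R[:, None, None]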

5. Gain-Aware Pooling Rectification (GAPR) for Non-Critical Tokens

For non-critical blocks, attention mass recovery leverages $A^{\mathrm{pool}}$, subject to a rectification constraint. Define:

  • Attention gain: $G_{n,m} = \sum_{i \in B_n}\sum_{j \in B_m} \left(a_{i,j}^{\mathrm{pool}} - a_{i,j}^{\mathrm{spa}}\right)$
  • Pooling error: $E_{n,m} = \sum_{i \in B_n}\sum_{j \in B_m} \left|a_{i,j} - a_{i,j}^{\mathrm{pool}}\right|$

Compensation applies only when $|G_{n,m}| > |E_{n,m}|$; a mask $M_c$ selects such blocks. Non-critical, compensated blocks are assigned:

$$A_{n,m}^{\mathrm{spa}'} = A_{n,m}^{\mathrm{pool}} \approx A_{n,m}$$

This conservative strategy ensures recovered attention mass does not increase overall error.
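
A tensor-level sketch of this compensation step, assuming the block-summed gain G and pooling error E are available (names and shapes are illustrative, not the paper's implementation):

import numpy as np

def gapr_compensate(O_rect, A_pool, V_pool, M_hat, G, E):
    """Recover attention mass on non-critical blocks when it is beneficial (GAPR).

    O_rect: (N, B, d) rectified sparse outputs (after IPAR).
    A_pool: (N, M)    pooled block-level attention; V_pool: (M, d) pooled values.
    M_hat:  (N, M)    critical-block mask; G, E: (N, M) gain and pooling error.
    """
    # Compensate only non-critical blocks whose gain outweighs the pooling error.
    M_c = (~M_hat) & (np.abs(G) > np.abs(E))
    O_ncri = (A_pool * M_c) @ V_pool               # (N, d) recovered pooled mass
    # Broadcast the block-level correction over the B tokens of each query block.
    return O_rect + O_ncri[:, None, :]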

6. High-Performance Implementation and Pseudocode

Rectified SpaAttn integrates both sparse and full kernels; block-wise sparse attention and pooling operations leverage FlashAttention2 and Triton, supporting video-text multimodal blocks. The method’s main tunable hyperparameters are topk (number of critical blocks) and p (retained attention mass). Overhead is negligible because rectifications operate strictly on pooled vectors.

A concise high-level pseudocode representation:

# High-level pseudocode: helper routines (split_video_text, block_pool, softmax,
# SparseKernel, FullKernel, ...) stand in for the fused Triton/FlashAttention2
# kernels; d denotes the head dimension and B the block size.
def RectifiedSpaAttn(Q, K, V, topk, p, adj_mask):
    # topk: number of critical blocks; p: retained attention mass
    # (see the hyperparameter description above).
    # Separate video and text tokens; only the video stream is pooled.
    Qv, Qt = split_video_text(Q)
    Kv, Kt = split_video_text(K)
    Vv, Vt = split_video_text(V)

    # Block-pooled (mean) representations of the video stream.
    Qv_pool = block_pool(Qv)
    Kv_pool = block_pool(Kv)
    Vv_pool = block_pool(Vv)

    # Implicit full-attention reference: pooled video keys + raw text keys (IPAR).
    Kmix_pool = concat(Kv_pool, Kt)
    A_mix = softmax(Qv_pool @ Kmix_pool.T / sqrt(d))
    A_pool = reallocate_to_blocks(A_mix, B)

    # Gain-aware mask (GAPR): compensate only where the attention gain
    # outweighs the pooling error.
    G = compute_gain(A_pool, Qv, Kv)
    E = compute_error(Qv, Kv, Qv_pool, Kv_pool)
    M_c = (abs(G) > abs(E))

    # Critical-block mask: top-k blocks under the pooled reference, plus
    # always-retained adjacent blocks.
    M_I = topk_blocks(A_pool, topk)
    M_sparse = M_I | adj_mask

    # Sparse attention for video queries; full attention for text queries.
    O_v_spa = SparseKernel(Qv, K, V, M_sparse)
    O_t = FullKernel(Qt, K, V)

    # Rectify amplification bias: rescale each critical block's output by R_n.
    R = (A_pool * M_sparse).sum(dim=1)
    O_v_cri = scale_blocks(O_v_spa, R)
    # Recover lost mass on compensated non-critical blocks from the pooled values.
    O_v_ncri = (A_pool * (~M_sparse & M_c)) @ Vv_pool

    O_v_prime = O_v_cri + O_v_ncri
    return concat(O_v_prime, O_t)

7. Empirical Evaluation and Comparative Outcomes

Extensive benchmarks demonstrate the effectiveness of Rectified SpaAttn on large generative models such as HunyuanVideo-T2V-13B, Wan2.1 (T2V & I2V-14B), and Flux.1-dev-12B. Metrics used include sparsity, FLOPs, NVIDIA H100 end-to-end latency, speedup, and generation quality (VR, VBench for video; FID, CLIP, SSIM, PSNR, LPIPS for images).

Summary results:

Model            | Sparsity (%) | Speedup (×) | Quality
HunyuanVideo-T2V | 79.7–88.95   | 2.50–3.33   | VBench 83.13
Flux.1-dev       | 74.05        | 1.40        | FID 5.37
Wan2.1-T2V/I2V   | ~74.9–79.7   | 1.68–1.80   | VBench 83.72
+TeaCache        | N/A          | up to 8.97  | N/A

Rectified SpaAttn maintains quality levels statistically indistinguishable from full-attention baselines, even at sparsity rates approaching 90%. In contrast, previous sparse methods exhibit sharp degradation.

8. Significance and Position within Sparse Attention Research

Rectified SpaAttn advances sparse attention methodology by (1) accurately reallocating amplified mass on critical blocks, (2) conditionally recovering lost mass on non-critical blocks with rigorous error controls, and (3) achieving 2–3× pure sparse kernel speedups and up to 9× end-to-end gains when integrated with caching schemes. The technical principles are distinct from periodic dense rectification approaches like Rectified Sparse Attention (ReSA) for LLMs (Sun et al., 4 Jun 2025), which refresh KV caches via periodic dense passes.

Rectified SpaAttn is particularly suited for high-fidelity video generation and can be generalized to other block-sparse, multimodal scenarios where dense attention computation is prohibitive but sample quality is critical.
