
Bidirectional Sparse Attention (BSA) Overview

Updated 4 July 2026
  • Bidirectional Sparse Attention (BSA) is a technique that adaptively reduces the quadratic cost of dense attention by jointly sparsifying both queries and key–value tokens.
  • It employs dynamic methods such as cosine similarity-based block selection and bidirectional co-clustering to retain content-adaptive semantic information.
  • Empirical results in video diffusion and long-context language modeling demonstrate significant speedups with minimal quality loss, validating its practical efficiency.

Bidirectional Sparse Attention (BSA) denotes a family of attention-reduction strategies that preserve broad contextual modeling while avoiding the quadratic cost of dense attention. In the recent literature, the term does not have a single canonical meaning. In video diffusion Transformers, BSA can denote the simultaneous dynamic sparsification of Queries and Key–Value pairs within 3D full attention (Zhan et al., 1 Sep 2025). In training-free video generation, it can denote online bidirectional co-clustering that jointly partitions queries and keys before block selection (Luo et al., 19 Mar 2026). In long-context language modeling, closely related formulations use bidirectional alignment between sparse and full attention streams, even when the paper does not explicitly adopt the name BSA (Shen et al., 25 Nov 2025). A common thread across these works is the attempt to reduce the effective attention domain without discarding the content-adaptive structure that dense attention would otherwise model.

1. Terminology and conceptual scope

A useful way to read the BSA literature is to separate the object being made bidirectional from the mechanism used to induce sparsity. The term is therefore best understood as polysemous rather than fully standardized.

| Paper | What “bidirectional” denotes | Domain |
|---|---|---|
| (Zhan et al., 1 Sep 2025) | Joint reduction of active queries and retained Key–Value blocks/tokens | Video diffusion training |
| (Luo et al., 19 Mar 2026) | Joint query–key partitioning via bidirectional co-clustering | Training-free video generation |
| (Shen et al., 25 Nov 2025) | Symmetric alignment between sparse-attention and full-attention outputs | Long-context language modeling |

This terminological variation matters technically. In "Bidirectional Sparse Attention for Faster Video Diffusion Training" (Zhan et al., 1 Sep 2025), bidirectionality is not about left-to-right versus right-to-left context; it refers to Q-side dynamic sparsification and K/V-side dynamic sparsification performed together within 3D full attention. In SVOO, the BSA core is the coupling of query and key block assignments, so that block partitioning is not performed independently on the two sides (Luo et al., 19 Mar 2026). In SSA, the authors explicitly use the phrase bidirectional alignment rather than BSA, and the provided terminology mapping presents SSA as a close analogue rather than a paper that formally names the method BSA (Shen et al., 25 Nov 2025).

Related work broadens the historical context. "Combiner: Full Attention Transformer with Sparse Computation Cost" treats self-attention as a conditional expectation and preserves full attention capability with sub-quadratic cost, including in bidirectional MLM settings where $\Omega_i = [L]$ (Ren et al., 2021). "Efficient Long-Context Modeling in Diffusion LLMs via Block Approximate Sparse Attention" positions block-wise, content-adaptive sparse attention as especially relevant for bidirectional diffusion LLMs, although the accompanying description frames its operator as a design blueprint grounded in a normalization perturbation lemma rather than a fully specified empirical account (Zhang et al., 19 May 2026).

2. Computational motivation

The motivating bottleneck is dense attention’s quadratic scaling. For a single head with sequence length $L$ and head dimension $d$, the standard formulation is

$$S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$$

In the video DiT setting, latent tensors of shape $(T, H, W)$ are flattened to a 1D sequence of length $L = T \times H \times W$, and the dominant per-head FLOP cost is approximately $4L^2 d$; with $h$ heads, it is approximately $4hL^2 d$ (Zhan et al., 1 Sep 2025). Memory is similarly dominated by the $O(L^2)$ score or attention matrix. The same basic bottleneck appears in training-free video generation, where per-head attention is described as LL0 in 3D token spaces, with LL1 (Luo et al., 19 Mar 2026).
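To make the scaling concrete, the following is a minimal NumPy sketch of single-head dense attention together with the approximate FLOP count quoted above. The function names `dense_attention` and `attention_flops` are illustrative and do not come from any of the cited papers. The factor of 4 is read here as roughly two FLOPs per multiply-add in each of the two $L \times L \times d$ matrix products ($QK^\top$ and $AV$).

```python
import numpy as np

def dense_attention(Q, K, V):
    """Single-head dense attention: S = QK^T / sqrt(d), A = softmax(S), O = AV."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    S = S - S.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    A = np.exp(S)
    A /= A.sum(axis=-1, keepdims=True)     # each row of A sums to 1
    return A @ V, A

def attention_flops(T, H, W, d, heads=1):
    """Approximate dense-attention FLOPs, 4 * h * L^2 * d, with L = T * H * W."""
    L = T * H * W
    return 4 * heads * L**2 * d

# Doubling T, H and W multiplies L by 8 and the FLOP estimate by 64.
ratio = attention_flops(4, 4, 4, d=64) / attention_flops(2, 2, 2, d=64)
```

The quadratic term is why a modest increase in resolution or clip length quickly dominates the cost of the whole model.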

In video diffusion Transformers, this scaling is especially severe because both resolution and duration enlarge the token lattice. The BSA paper states that, in DiTs, attention often accounts for more than LL2 of training cost, making dense 3D attention the principal bottleneck (Zhan et al., 1 Sep 2025). SVOO makes the same structural diagnosis on the inference side: attention is repeatedly applied across diffusion steps and layers, so even modest per-layer savings compound across generation (Luo et al., 19 Mar 2026).

This suggests that the central design problem is not merely to reduce the number of computed interactions, but to do so adaptively. Fixed sparse patterns such as local windows, striding, or static top-$k$ rules are described as suboptimal because attention distributions vary across time, space, heads, layers, samples, and training steps (Zhan et al., 1 Sep 2025). The modern BSA formulations therefore emphasize content-aware selection rather than static geometry.

3. Dynamic bidirectional sparsification in video diffusion Transformers

The formulation in "Bidirectional Sparse Attention for Faster Video Diffusion Training" is a trainable, hardware-aligned sparse attention mechanism for Video Diffusion Transformers that simultaneously sparsifies the Query side and the Key–Value side in 3D full attention (Zhan et al., 1 Sep 2025). Its two components are complementary.

On the query side, the video latent is partitioned into 3D blocks of size LL4, with block size LL5. Within each block, a representative center query is chosen, and token selection is driven by semantic similarity to that center. The similarity metric instantiated in the paper is cosine similarity,

LL6

and the retained sparse query set is written as

LL7

The paper also introduces a window-based refinement in which each block is subdivided into windows of size LL8 and local centers are used instead of a single block center; this is reported to preserve fine-grained semantics better at the same sparsity (Zhan et al., 1 Sep 2025).
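The query-side step can be sketched for one flattened block as follows. This is an illustration under stated assumptions, not the paper's implementation: the center is taken to be the middle token of the block, the kept fraction is a free parameter, and the direction of the ranking (keeping the tokens least or most similar to the center) is exposed as the hypothetical flag `keep_similar`, because the text above does not fix it.

```python
import numpy as np

def cosine_to_center(block_q, center_idx):
    """Cosine similarity of every query in a block to the chosen center query."""
    center = block_q[center_idx]
    num = block_q @ center
    den = np.linalg.norm(block_q, axis=1) * np.linalg.norm(center) + 1e-12
    return num / den

def select_queries(block_q, keep_fraction, center_idx=None, keep_similar=False):
    """Boolean mask over the queries of one block (True = retained).

    Assumptions: the center defaults to the middle token, and the center
    itself is always retained in addition to the ranked selection.
    """
    n = block_q.shape[0]
    if center_idx is None:
        center_idx = n // 2                  # assumed choice of center
    sim = cosine_to_center(block_q, center_idx)
    n_keep = max(1, int(round(keep_fraction * n)))
    order = np.argsort(sim)                  # ascending similarity
    if keep_similar:
        order = order[::-1]                  # descending similarity
    mask = np.zeros(n, dtype=bool)
    mask[order[:n_keep]] = True
    mask[center_idx] = True                  # the center is always kept
    return mask

rng = np.random.default_rng(0)
block = rng.standard_normal((64, 8))         # 64 query tokens, head dimension 8
mask = select_queries(block, keep_fraction=0.25)
```

A hard boolean mask of this kind is what lets the remaining attention computation run only on the retained rows, with outputs later scattered back to the original layout.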

On the Key–Value side, tensors are grouped into aligned spatiotemporal blocks, and inter-block saliency scores LL9 are computed before full attention materialization. The dynamic threshold is

dd0

where dd1 is the quantile function and dd2 is derived from the sparsity schedule. For a query block dd3, the retained KV set is the minimal set satisfying a cumulative probability target,

dd4

The retained tokens are then used in sparse attention

dd5

The resulting complexity depends on the retained query fraction dd6 and KV retention fraction dd7. The paper states that FLOPs scale as dd8 relative to full attention, so the speedup is approximately dd9. The empirical rule of thumb given is S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.0 and S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.1, yielding S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.2, or roughly S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.3 FLOP reduction (Zhan et al., 1 Sep 2025).
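The cumulative-probability rule on the Key–Value side can be illustrated with a short sketch. It covers only the "minimal set reaching a target mass" step, not the quantile-based threshold. Mean pooling of tokens into blocks, the softmax over block scores, and the names `block_means` and `select_kv_blocks` are assumptions made for illustration.

```python
import numpy as np

def block_means(X, block):
    """Mean-pool tokens into blocks; the token count must be divisible by `block`."""
    n, d = X.shape
    return X.reshape(n // block, block, d).mean(axis=1)

def select_kv_blocks(Q, K, block, mass_target):
    """For each query block, keep the fewest KV blocks whose block-level
    softmax mass reaches `mass_target`. Returns (keep_mask, block_probs)."""
    qb, kb = block_means(Q, block), block_means(K, block)
    s = qb @ kb.T / np.sqrt(Q.shape[1])
    s = s - s.max(axis=1, keepdims=True)
    p = np.exp(s)
    p /= p.sum(axis=1, keepdims=True)
    order = np.argsort(-p, axis=1)                      # most salient first
    cum = np.cumsum(np.take_along_axis(p, order, axis=1), axis=1)
    # Prefix length = number of partial sums still below the target, plus one.
    n_keep = np.minimum((cum < mass_target).sum(axis=1) + 1, p.shape[1])
    keep = np.zeros_like(p, dtype=bool)
    for i in range(p.shape[0]):
        keep[i, order[i, :n_keep[i]]] = True
    return keep, p

rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 8))
K = rng.standard_normal((32, 8))
keep, p = select_kv_blocks(Q, K, block=4, mass_target=0.9)
kept_fraction = keep.mean()   # share of (query block, KV block) pairs retained
```

Because the number of retained blocks adapts to how peaked each row of block scores is, sharply focused query blocks keep few KV blocks while diffuse ones keep more.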

Several implementation choices are part of the method rather than incidental engineering. Hard binary masks are used for queries, gradients flow only through retained queries, outputs are scattered back to the original token layout, and mask computation overhead is measured as less than S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.4 FLOPs. The sparsity schedule is annealed: training begins with full attention, then every 30 steps sparsity increases by 0.03 until approximately 0.9. Triton custom kernels and block-partitioned masks are used so that GPU SM tiles process or skip whole blocks, and the design is explicitly aligned with FlashAttention-style IO-aware tiling (Zhan et al., 1 Sep 2025).
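The annealing rule described above is simple enough to write down directly. The sketch below assumes the first increment happens at step 30 (no separate warm-up phase) and treats 0.9 as a hard cap; both are readings of the prose rather than confirmed details, and `sparsity_at_step` is a hypothetical name.

```python
def sparsity_at_step(step, interval=30, increment=0.03, max_sparsity=0.9):
    """Step-wise annealed sparsity: 0 at the start of training, raised by
    `increment` every `interval` steps, and capped at `max_sparsity`."""
    return min(max_sparsity, (step // interval) * increment)

# Sparsity at a few training steps under these assumptions.
schedule = [(s, sparsity_at_step(s)) for s in (0, 30, 300, 900, 30000)]
```

Under these defaults the cap is reached after 30 increments, that is, around step 900, so nearly all of a 30,000-step run would be spent at the target sparsity.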

4. Training-free BSA via offline profiling and online bidirectional co-clustering

SVOO realizes a different BSA paradigm for inference-time acceleration in video generation (Luo et al., 19 Mar 2026). The method is explicitly training-free and organized into two stages: offline layer-wise sparsity profiling and online bidirectional co-clustering.

The offline stage estimates intrinsic sparsity for each layer and head. For calibration input S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.5, layer S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.6, and head S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.7, the post-softmax attention matrix is S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.8. For each query row, the smallest index set covering a recall threshold S=QK/d,A=softmax(S),O=AV.S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.9 is found, and the attention density is defined as

(T,H,W)(T, H, W)0

These densities are modeled as Gaussian across calibration samples, (T,H,W)(T, H, W)1, with conservative estimate

(T,H,W)(T, H, W)2

and sparsity schedule

(T,H,W)(T, H, W)3

The paper argues that this generalizes across inputs because attention sparsity is an intrinsic property of each layer, with minor effects across different inputs, and it supports this claim through a stability bound on a pre-softmax logit variance proxy (T,H,W)(T, H, W)4 (Luo et al., 19 Mar 2026).
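The offline profiling idea can be sketched in a few lines. The density computation follows the description above (smallest key set per query row that covers a recall threshold). The conservative estimate is written here as mean plus `z` standard deviations and the sparsity as one minus that density; both forms, and the names `row_density` and `conservative_sparsity`, are assumptions for illustration, since the exact expressions are not reproduced in this text.

```python
import numpy as np

def row_density(A, recall):
    """Mean fraction of keys each query row needs to cover `recall` of its
    post-softmax attention mass (rows of A are assumed to sum to 1)."""
    sorted_a = -np.sort(-A, axis=1)                 # descending per row
    cum = np.cumsum(sorted_a, axis=1)
    n_needed = np.minimum((cum < recall).sum(axis=1) + 1, A.shape[1])
    return float(n_needed.mean() / A.shape[1])

def conservative_sparsity(densities, z=2.0):
    """Fit a Gaussian to per-sample densities and return 1 - (mu + z * sigma),
    clipped so the density never exceeds 1. `z` is a hypothetical safety margin."""
    mu, sigma = np.mean(densities), np.std(densities)
    return float(1.0 - min(1.0, mu + z * sigma))

# Uniform attention needs half of the keys to reach 50% recall;
# one-hot attention needs a single key.
uniform = np.full((8, 8), 0.125)
onehot = np.eye(8)
densities = [row_density(uniform, 0.5), row_density(onehot, 0.5)]
```

Adding a margin above the mean makes the schedule err on the side of keeping more attention entries for layers and heads whose density varies across calibration inputs.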

The online stage performs affinity-driven, alternating bidirectional co-clustering. Query and key tokens are partitioned into (T,H,W)(T, H, W)5 and (T,H,W)(T, H, W)6 blocks, with the default experimental choice (T,H,W)(T, H, W)7 and (T,H,W)(T, H, W)8. Current query anchors induce key-side affinity patterns,

(T,H,W)(T, H, W)9

which are normalized and used to assign each key to the nearest key cluster. Updated key anchors then induce query-side affinity patterns,

L=T×H×WL = T \times H \times W0

which are used to assign queries to clusters. The paper uses L=T×H×WL = T \times H \times W1 co-clustering iterations per recompute. Block-pair saliency is then approximated by centroid dot products

L=T×H×WL = T \times H \times W2

and the fraction of active block pairs is chosen by a rule that balances intrinsic sparsity schedule and recall target:

L=T×H×WL = T \times H \times W3

with L=T×H×WL = T \times H \times W4.
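One possible reading of the alternating procedure is sketched below. It is not SVOO's implementation: the contiguous initial partition, the use of Euclidean distance between normalized affinity patterns, the handling of empty clusters, and all function names are assumptions. The fraction of active block pairs is passed in directly rather than derived from the selection rule.

```python
import numpy as np

def centroids(X, assign, n):
    """Per-cluster mean of the rows of X; empty clusters get a zero vector."""
    sums = np.zeros((n, X.shape[1]))
    np.add.at(sums, assign, X)
    counts = np.bincount(assign, minlength=n).astype(float)
    return sums / np.maximum(counts, 1.0)[:, None]

def l2_normalize(X):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

def nearest(X, protos):
    """Index of the closest prototype (squared Euclidean distance) for each row."""
    d2 = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def cocluster(Q, K, n_q, n_k, iters):
    """Alternate key and query reassignment, each driven by the other side's anchors."""
    q_assign = np.arange(Q.shape[0]) * n_q // Q.shape[0]   # contiguous initial blocks
    k_assign = np.arange(K.shape[0]) * n_k // K.shape[0]
    for _ in range(iters):
        # Query anchors induce an affinity pattern for every key.
        k_pat = l2_normalize(K @ centroids(Q, q_assign, n_q).T)
        k_assign = nearest(k_pat, centroids(k_pat, k_assign, n_k))
        # Updated key anchors induce an affinity pattern for every query.
        q_pat = l2_normalize(Q @ centroids(K, k_assign, n_k).T)
        q_assign = nearest(q_pat, centroids(q_pat, q_assign, n_q))
    return q_assign, k_assign

def active_block_pairs(Q, K, q_assign, k_assign, n_q, n_k, active_fraction):
    """Score block pairs by centroid dot products and keep the top fraction."""
    sal = centroids(Q, q_assign, n_q) @ centroids(K, k_assign, n_k).T
    n_active = max(1, int(round(active_fraction * sal.size)))
    mask = np.zeros(sal.size, dtype=bool)
    mask[np.argsort(-sal, axis=None)[:n_active]] = True
    return mask.reshape(sal.shape)

rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 8))
K = rng.standard_normal((64, 8))
qa, ka = cocluster(Q, K, n_q=4, n_k=8, iters=2)
mask = active_block_pairs(Q, K, qa, ka, 4, 8, active_fraction=0.25)
```

The point of coupling the two sides is that keys are grouped by how they interact with the current query groups, and vice versa, so the resulting block pairs tend to be either uniformly relevant or uniformly skippable.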

The resulting attention complexity is written as L=T×H×WL = T \times H \times W5 with L=T×H×WL = T \times H \times W6, where L=T×H×WL = T \times H \times W7 is the number of active block pairs (Luo et al., 19 Mar 2026). Integration is again hardware-conscious: Triton kernels are used for co-clustering, dynamic block-size FlashInfer kernels are used for block-sparse attention, and clustering assignments are reused every L=T×H×WL = T \times H \times W8 diffusion steps because partitions are empirically stable across steps. First-layer dense attention and warm-up dense diffusion steps are retained: L=T×H×WL = T \times H \times W9 for Wan-series and 4L2d4L^2 d0 for HunyuanVideo-series (Luo et al., 19 Mar 2026).

5. Bidirectional alignment and other generalizations

A broader strand of work uses bidirectionality in supervision rather than in token selection. "SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space" does not use the term BSA, but the provided terminology mapping presents it as a close functional instance of the concept (Shen et al., 25 Nov 2025). SSA maintains two attention paths per layer: a native sparse-attention path and a native full-attention path. The main path is sampled per iteration with probability 4L2d4L^2 d1, and the auxiliary path is the opposite mode. The layerwise alignment losses are

4L2d4L^2 d2

with

4L2d4L^2 d3

and total objective

4L2d4L^2 d4

The paper’s central diagnosis is gradient update deficiency: low-ranked key–value pairs excluded during sparse training receive neither forward contribution nor backward gradients, so they never learn proper suppression. Alternating full and sparse streams, while aligning their outputs symmetrically, is the proposed remedy (Shen et al., 25 Nov 2025).
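The quantity that an alignment term of this kind would penalize can be illustrated without any training loop. The sketch below compares the output of full causal attention with that of a sparse stand-in on the same inputs. A sliding window is used purely as a placeholder sparse pattern, and the mean squared difference is an assumed discrepancy measure; neither is claimed to be SSA's actual sparse operator or loss, and the stop-gradient structure of a real two-stream objective is not modeled in NumPy.

```python
import numpy as np

def causal_attention(Q, K, V, window=None):
    """Causal attention; if `window` is given, each query sees only the
    `window` most recent positions (itself included)."""
    L, d = Q.shape
    S = Q @ K.T / np.sqrt(d)
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    allowed = j <= i
    if window is not None:
        allowed &= (i - j) < window
    S = np.where(allowed, S, -np.inf)       # masked positions get zero weight
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def alignment_gap(Q, K, V, window):
    """Mean squared difference between full and windowed attention outputs."""
    full = causal_attention(Q, K, V)
    sparse = causal_attention(Q, K, V, window)
    return float(np.mean((full - sparse) ** 2))

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
gap_narrow = alignment_gap(Q, K, V, window=1)    # each token attends only to itself
gap_full = alignment_gap(Q, K, V, window=16)     # window covers the whole sequence
```

When the window covers the whole sequence the two paths coincide and the gap is zero; as the window narrows, the gap grows, which is the signal a symmetric alignment objective would use to pull the two modes together.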

Other related formulations situate BSA-like ideas in broader bidirectional modeling. BA-Att is presented for diffusion LLMs, which require globally coherent, bidirectional, and controllable text generation. Its abstract states that the method identifies informative regions in a compact downsampled space, avoids fixed positional priors, achieves up to 4L2d4L^2 d5 acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 4L2d4L^2 d6 sparsity across LLMs, multimodal LLMs, and video generation models (Zhang et al., 19 May 2026). The accompanying description, however, presents the operator as a theoretically motivated blueprint rather than a fully grounded empirical account, with the normalization perturbation lemma serving as the main formal anchor.

An earlier precursor is Combiner, which is not a sparse masking method in the usual sense but is directly relevant to bidirectional sparse-attention discussions because it preserves full attention capability with sub-quadratic cost (Ren et al., 2021). Combiner factors the conditional attention distribution through region abstractions,

4L2d4L^2 d7

and thereby retains full support in bidirectional MLM settings while achieving 4L2d4L^2 d8 or 4L2d4L^2 d9 complexity depending on the partition scheme. This suggests that some lines of research adjacent to BSA aim not to sparsify the support irreversibly, but to reparameterize or approximate full attention through structured intermediates (Ren et al., 2021).

6. Empirical behavior, trade-offs, and limitations

The empirical profile of BSA depends strongly on the formulation.

For trainable video diffusion BSA, experiments are reported with a Wan2.1-1.3B backbone, 300k videos from Vchitect T2V DataVerse, preprocessing including shot segmentation, 5-second truncation, and captions from Tarsier2, over 30,000 training steps on NVIDIA H100 GPUs (Zhan et al., 1 Sep 2025). At approximately 23K tokens, full attention yields Text Consistency hh0, BG Consistency hh1, Image Quality hh2, Subject Consistency hh3, and FLOPs approximately hh4; BSA yields Text Consistency hh5, BG Consistency hh6, Image Quality hh7, Subject Consistency hh8, and FLOPs approximately hh9, corresponding to 4hL2d4hL^2 d0 speedup. At approximately 153K tokens, full attention yields Text 4hL2d4hL^2 d1, BG 4hL2d4hL^2 d2, Image 4hL2d4hL^2 d3, Subject 4hL2d4hL^2 d4, and FLOPs approximately 4hL2d4hL^2 d5; BSA yields Text 4hL2d4hL^2 d6, BG 4hL2d4hL^2 d7, Image 4hL2d4hL^2 d8, Subject 4hL2d4hL^2 d9, and FLOPs approximately O(L2)O(L^2)0, corresponding to O(L2)O(L^2)1 speedup (Zhan et al., 1 Sep 2025). Inference latency on H100 is reduced from 31s to 5s, approximately O(L2)O(L^2)2, with no perceptible quality degradation. The ablations are structurally informative: query-sparse with O(L2)O(L^2)3 gives approximately O(L2)O(L^2)4 speedup; KV-sparse with fixed threshold gives approximately O(L2)O(L^2)5; KV-sparse with statistical dynamic threshold gives approximately O(L2)O(L^2)6; combined query+KV sparsity gives approximately O(L2)O(L^2)7 at sparsity approximately O(L2)O(L^2)8 (Zhan et al., 1 Sep 2025).

For SVOO, the reported quality–speed trade-off is more modest in raw speedup but broad across seven video generation models (Luo et al., 19 Mar 2026). On Wan2.1-T2V-1.3B at 720p and 81 frames, SVOO reports PSNR O(L2)O(L^2)9 dB, SSIM LL00, LPIPS LL01, ImageQual LL02, AesQual LL03, SubjectConsistency LL04, BackgroundConsistency LL05, latency 216 s, and speedup LL06 versus dense attention. HunyuanVideo-T2V reports latency 821 s and speedup LL07, which is the best speedup across the T2V experiments. Ablations show that removing offline profiling reduces efficiency for similar quality, while removing bidirectional co-clustering degrades PSNR and SSIM and raises LPIPS for similar speed (Luo et al., 19 Mar 2026).

For bidirectional alignment in SSA, the strongest evidence concerns sparsity fidelity rather than video generation throughput (Shen et al., 25 Nov 2025). SSA reports the smallest KL divergence between sparse and full modes, at LL08, and the highest attention sparsity, with AttnSparsity LL09 in sparse mode and LL10 in full mode. Under full-attention inference, SSA attains commonsense average LL11 and WikiText perplexity LL12; under sparse attention inference with receptive field 256, it attains average LL13 and perplexity LL14; with receptive field 1024, it attains average LL15 and perplexity LL16 (Shen et al., 25 Nov 2025). Long-context extrapolation is also reported as unusually strong: in Needle-in-a-Haystack under full-attention inference, SSA maintains LL17 at 4k and 8k, LL18 at 16k, and LL19 at 32k.

The limitations are correspondingly specific. In trainable video BSA, failure modes include missing salient tokens if centers are poorly chosen or if LL20 is too low, and KV under-selection if LL21 is too high under unusual score distributions; validation loss remains stable up to approximately LL22 sparsity and degrades only beyond approximately LL23 (Zhan et al., 1 Sep 2025). In SVOO, extremely long sequences with low redundancy may require larger LL24 and higher LL25, unusual inputs may benefit from input-aware schedule adjustments, and high-frequency textures or abrupt motion may need more active blocks (Luo et al., 19 Mar 2026). In SSA, one-sided alignment is reported as unstable, and eliminating either stream harms performance, indicating that the “bidirectional” coupling is not merely auxiliary but structurally necessary (Shen et al., 25 Nov 2025).

Taken together, these results show that BSA is less a single algorithm than a design principle: sparsity should be imposed in a way that respects the bidirectional structure of the attention problem being solved. In video diffusion training, that structure lies in the simultaneous redundancy of queries and Key–Value pairs (Zhan et al., 1 Sep 2025). In training-free video generation, it lies in the coupling between query and key partitions (Luo et al., 19 Mar 2026). In dual-stream long-context training, it lies in reciprocal supervision between sparse and dense pathways (Shen et al., 25 Nov 2025).
