Bidirectional Sparse Attention (BSA) Overview

Updated 4 July 2026

Bidirectional Sparse Attention (BSA) is a technique that adaptively reduces the quadratic cost of dense attention by jointly sparsifying both queries and key–value tokens.
It employs dynamic methods such as cosine similarity-based block selection and bidirectional co-clustering to retain content-adaptive semantic information.
Empirical results in video diffusion and long-context language modeling demonstrate significant speedups with minimal quality loss, validating its practical efficiency.

Bidirectional Sparse Attention (BSA) denotes a family of attention-reduction strategies that preserve broad contextual modeling while avoiding the quadratic cost of dense attention. In the recent literature, the term does not have a single canonical meaning. In video diffusion Transformers, BSA can denote the simultaneous dynamic sparsification of Queries and Key–Value pairs within 3D full attention (Zhan et al., 1 Sep 2025). In training-free video generation, it can denote online bidirectional co-clustering that jointly partitions queries and keys before block selection (Luo et al., 19 Mar 2026). In long-context language modeling, closely related formulations use bidirectional alignment between sparse and full attention streams, even when the paper does not explicitly adopt the name BSA (Shen et al., 25 Nov 2025). A common thread across these works is the attempt to reduce the effective attention domain without discarding the content-adaptive structure that dense attention would otherwise model.

1. Terminology and conceptual scope

A useful way to read the BSA literature is to separate the object being made bidirectional from the mechanism used to induce sparsity. The term is therefore best understood as polysemous rather than fully standardized.

Paper	What “bidirectional” denotes	Domain
(Zhan et al., 1 Sep 2025)	Joint reduction of active queries and retained Key–Value blocks/tokens	Video diffusion training
(Luo et al., 19 Mar 2026)	Joint query–key partitioning via bidirectional co-clustering	Training-free video generation
(Shen et al., 25 Nov 2025)	Symmetric alignment between sparse-attention and full-attention outputs	Long-context language modeling

This terminological variation matters technically. In "Bidirectional Sparse Attention for Faster Video Diffusion Training" (Zhan et al., 1 Sep 2025), bidirectionality is not about left-to-right versus right-to-left context; it refers to Q-side dynamic sparsification and K/V-side dynamic sparsification performed together within 3D full attention. In SVOO, the BSA core is the coupling of query and key block assignments, so that block partitioning is not performed independently on the two sides (Luo et al., 19 Mar 2026). In SSA, the authors explicitly use the phrase bidirectional alignment rather than BSA, so SSA is best read as a close analogue rather than as a paper that formally names its method BSA (Shen et al., 25 Nov 2025).

Related work broadens the historical context. "Combiner: Full Attention Transformer with Sparse Computation Cost" treats self-attention as a conditional expectation and preserves full attention capability with sub-quadratic cost, including in bidirectional MLM settings where $\Omega_i = [L]$ (Ren et al., 2021). "Efficient Long-Context Modeling in Diffusion LLMs via Block Approximate Sparse Attention" positions block-wise, content-adaptive sparse attention as especially relevant for bidirectional diffusion LLMs, although the accompanying description frames its operator as a design blueprint grounded in a normalization perturbation lemma rather than a fully specified empirical account (Zhang et al., 19 May 2026).

2. Computational motivation

The motivating bottleneck is dense attention’s quadratic scaling. For a single head with sequence length $L$ and head dimension $d$, the standard formulation is

$S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$

In the video DiT setting, latent tensors of shape $(T, H, W)$ are flattened to a 1D sequence of length $L = T \times H \times W$, and the dominant per-head FLOP cost is approximately $4L^2 d$; with $h$ heads, it is approximately $4hL^2 d$ (Zhan et al., 1 Sep 2025). Memory is similarly dominated by the $O(L^2)$ score or attention matrix. The same basic bottleneck appears in training-free video generation, where per-head attention is described as $O(L^2)$ in 3D token spaces, with $L = T \times H \times W$ (Luo et al., 19 Mar 2026).
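The cost estimate above can be checked with a few lines of arithmetic. The sketch below counts a multiply-add as two FLOPs, so $QK^\top$ and $AV$ each contribute about $2L^2 d$ per head. The function name and the latent grid in the usage note are illustrative assumptions, not values from any of the cited papers.

```python
def attention_flops(T, H, W, head_dim, num_heads=1):
    """Approximate dense-attention FLOPs for a flattened (T, H, W) latent.

    Counts 2*L^2*d for Q @ K^T plus 2*L^2*d for A @ V per head,
    ignoring the softmax and the linear projections.
    """
    L = T * H * W  # flattened sequence length
    return 4 * num_heads * L * L * head_dim


def score_matrix_entries(T, H, W, num_heads=1):
    """Number of entries in the L x L score matrix across all heads."""
    L = T * H * W
    return num_heads * L * L
```

For a hypothetical latent of shape `(2, 3, 4)` with `head_dim=8`, the sequence length is 24 and `attention_flops(2, 3, 4, 8)` evaluates to 18432. Doubling any one of `T`, `H`, or `W` quadruples both quantities, which is why resolution and duration are both costly.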

In video diffusion Transformers, this scaling is especially severe because both resolution and duration enlarge the token lattice. The BSA paper states that, in DiTs, attention often dominates more than $L$ 2 of training cost, making dense 3D attention the principal bottleneck (Zhan et al., 1 Sep 2025). SVOO makes the same structural diagnosis on the inference side: attention is repeatedly applied across diffusion steps and layers, so even modest per-layer savings compound across generation (Luo et al., 19 Mar 2026).

This suggests that the central design problem is not merely to reduce the number of computed interactions, but to do so adaptively. Fixed sparse patterns such as local windows, striding, or static top-$k$ rules are described as suboptimal because attention distributions vary across time, space, heads, layers, samples, and training steps (Zhan et al., 1 Sep 2025). The modern BSA formulations therefore emphasize content-aware selection rather than static geometry.

3. Dynamic bidirectional sparsification in video diffusion Transformers

The formulation in "Bidirectional Sparse Attention for Faster Video Diffusion Training" is a trainable, hardware-aligned sparse attention mechanism for Video Diffusion Transformers that simultaneously sparsifies the Query side and the Key–Value side in 3D full attention (Zhan et al., 1 Sep 2025). Its two components are complementary.

On the query side, the video latent is partitioned into 3D blocks of size $L$ 4, with block size $L$ 5. Within each block, a representative center query is chosen, and token selection is driven by semantic similarity to that center. The similarity metric instantiated in the paper is cosine similarity,

$\mathrm{sim}(q_i, q_c) = \dfrac{q_i^\top q_c}{\lVert q_i \rVert \, \lVert q_c \rVert},$

and the retained sparse query set is written as

$L$ 7

The paper also introduces a window-based refinement in which each block is subdivided into windows of size $L$ 8 and local centers are used instead of a single block center; this is reported to preserve fine-grained semantics better at the same sparsity (Zhan et al., 1 Sep 2025).
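A minimal NumPy sketch of center-based query selection within one block may help make the query-side step concrete. Several details are assumptions made only for illustration: the center is taken to be the token at the middle index of the flattened block, the number of retained queries is a fixed fraction of the block, and the direction of the ranking (keeping the tokens least or most similar to the center) is exposed as a flag rather than asserted. The function names are hypothetical and this is not the paper's kernel.

```python
import numpy as np


def cosine_to_center(q_block, center_index):
    """Cosine similarity of every query in a block to one center query."""
    q_block = np.asarray(q_block, dtype=float)
    center = q_block[center_index]
    denom = np.linalg.norm(q_block, axis=1) * np.linalg.norm(center)
    return (q_block @ center) / np.maximum(denom, 1e-12)


def select_queries(q_block, keep_fraction, keep_most_similar=False,
                   center_index=None):
    """Return a boolean keep-mask over the queries of one block.

    Assumptions (not taken from the source): middle-index center,
    fixed keep fraction, ranking direction chosen by a flag.
    """
    q_block = np.asarray(q_block, dtype=float)
    n = q_block.shape[0]
    if center_index is None:
        center_index = n // 2
    sims = cosine_to_center(q_block, center_index)
    n_keep = max(1, int(round(keep_fraction * n)))
    order = np.argsort(sims)          # ascending similarity
    if keep_most_similar:
        order = order[::-1]
    mask = np.zeros(n, dtype=bool)
    mask[order[:n_keep]] = True       # hard binary mask over queries
    return mask, sims
```

The window-based refinement described above would correspond to calling `select_queries` once per window with a local center, rather than once per block.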

On the Key–Value side, tensors are grouped into aligned spatiotemporal blocks, and inter-block saliency scores $L$ 9 are computed before full attention materialization. The dynamic threshold is

$d$ 0

where $d$ 1 is the quantile function and $d$ 2 is derived from the sparsity schedule. For a query block $d$ 3, the retained KV set is the minimal set satisfying a cumulative probability target,

$d$ 4

The retained tokens are then used in sparse attention

$d$ 5
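The cumulative-probability rule for the Key–Value side can be sketched as follows for a single query block. The sketch assumes that the inter-block saliency scores are turned into a probability distribution with a softmax and that blocks are added in order of decreasing probability until the target mass is reached; the function and parameter names are hypothetical, and the quantile-based threshold is not modeled here.

```python
import numpy as np


def select_kv_blocks(block_scores, mass_target):
    """Smallest set of KV blocks whose softmax mass reaches `mass_target`.

    `block_scores` holds one saliency score per KV block for a single
    query block. Softmax normalization is an assumption of this sketch.
    """
    scores = np.asarray(block_scores, dtype=float)
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)              # most salient block first
    cumulative = np.cumsum(probs[order])
    # First position at which the running mass reaches the target.
    n_keep = int(np.searchsorted(cumulative, mass_target, side="left")) + 1
    n_keep = min(n_keep, len(order))        # guard against round-off near 1.0
    return np.sort(order[:n_keep])
```

With scores `[2.0, 1.0, 0.0, -1.0]`, the two highest-scoring blocks carry roughly 88% of the mass, so a target of 0.8 retains blocks 0 and 1 while a target of 0.5 retains only block 0. Peaked score distributions therefore yield fewer retained blocks than flat ones at the same target.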

The resulting complexity depends on the retained query fraction $d$ 6 and KV retention fraction $d$ 7. The paper states that FLOPs scale as $d$ 8 relative to full attention, so the speedup is approximately $d$ 9. The empirical rule of thumb given is $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 0 and $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 1, yielding $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 2, or roughly $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 3 FLOP reduction (Zhan et al., 1 Sep 2025).
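As a rough illustration of how the two retention fractions combine, the sketch below assumes that attention cost is proportional to the number of retained (query, key) pairs, so the relative cost is the product of the two fractions and the speedup is its reciprocal. The names `query_keep` and `kv_keep` are hypothetical, and the fractions in the usage note are illustrative rather than the paper's rule-of-thumb values.

```python
def relative_flops(query_keep, kv_keep):
    """Sparse-attention FLOPs as a fraction of dense FLOPs.

    Assumes cost is proportional to retained queries times retained
    keys, and ignores mask-computation overhead.
    """
    if not (0.0 < query_keep <= 1.0 and 0.0 < kv_keep <= 1.0):
        raise ValueError("retention fractions must lie in (0, 1]")
    return query_keep * kv_keep


def approx_speedup(query_keep, kv_keep):
    """Idealized speedup over dense attention under the same assumption."""
    return 1.0 / relative_flops(query_keep, kv_keep)
```

Keeping half of the queries and a fifth of the KV tokens, for example, would give a relative cost of 0.1 and an idealized speedup of about 10. Measured speedups are typically lower than this bound, because selection overhead and memory traffic are not captured.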

Several implementation choices are part of the method rather than incidental engineering. Hard binary masks are used for queries, gradients flow only through retained queries, outputs are scattered back to the original token layout, and mask computation overhead is measured as less than $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 4 FLOPs. The sparsity schedule is annealed: training begins with full attention, then every 30 steps sparsity increases by 0.03 until approximately 0.9. Triton custom kernels and block-partitioned masks are used so that GPU SM tiles process or skip whole blocks, and the design is explicitly aligned with FlashAttention-style IO-aware tiling (Zhan et al., 1 Sep 2025).
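The annealed schedule described above is simple enough to write down directly. The sketch assumes a step-wise schedule in which the first increase happens at step 30 and the cap is exactly 0.9; the function name and the exact boundary behaviour are assumptions.

```python
def sparsity_at_step(step, interval=30, increment=0.03, cap=0.9):
    """Target sparsity at a given training step.

    Starts at 0.0 (full attention) and rises by `increment` every
    `interval` steps until it reaches `cap`.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return min(cap, increment * (step // interval))
```

Under these defaults the schedule stays at 0.0 for steps 0 to 29, moves to 0.03 at step 30, and reaches the cap after 30 increments, that is, at step 900.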

4. Training-free BSA via offline profiling and online bidirectional co-clustering

SVOO realizes a different BSA paradigm for inference-time acceleration in video generation (Luo et al., 19 Mar 2026). The method is explicitly training-free and organized into two stages: offline layer-wise sparsity profiling and online bidirectional co-clustering.

The offline stage estimates intrinsic sparsity for each layer and head. For calibration input $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 5, layer $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 6, and head $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 7, the post-softmax attention matrix is $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 8. For each query row, the smallest index set covering a recall threshold $S = QK^\top / \sqrt{d}, \qquad A = \text{softmax}(S), \qquad O = AV.$ 9 is found, and the attention density is defined as

$(T, H, W)$ 0

These densities are modeled as Gaussian across calibration samples, $(T, H, W)$ 1, with conservative estimate

$(T, H, W)$ 2

and sparsity schedule

$(T, H, W)$ 3

The paper argues that this generalizes across inputs because attention sparsity is an intrinsic property of each layer, with minor effects across different inputs, and it supports this claim through a stability bound on a pre-softmax logit variance proxy $(T, H, W)$ 4 (Luo et al., 19 Mar 2026).
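The offline profiling step can be sketched in NumPy under a few stated assumptions. The density of one attention matrix is taken here to be the mean, over query rows, of the fraction of keys needed to reach the recall threshold. The conservative estimate is modeled as the sample mean plus a multiple `z` of the standard deviation, and the sparsity as one minus that estimate. These specific forms, together with all function and parameter names, are assumptions made for illustration.

```python
import numpy as np


def attention_density(attn, recall):
    """Mean fraction of keys per query row needed to reach `recall` mass.

    `attn` is a post-softmax matrix whose rows sum to 1.
    """
    attn = np.asarray(attn, dtype=float)
    sorted_rows = np.sort(attn, axis=1)[:, ::-1]      # descending per row
    cumulative = np.cumsum(sorted_rows, axis=1)
    # Index of the first position where the running mass reaches recall.
    needed = np.argmax(cumulative >= recall - 1e-12, axis=1) + 1
    return float(needed.mean() / attn.shape[1])


def conservative_sparsity(densities, z=1.0):
    """Sparsity from a conservative density estimate (assumed mean + z*std)."""
    densities = np.asarray(densities, dtype=float)
    estimate = densities.mean() + z * densities.std()
    return float(1.0 - np.clip(estimate, 0.0, 1.0))
```

In use, `attention_density` would be evaluated once per calibration sample for a given layer and head, and the resulting list passed to `conservative_sparsity`. A larger `z` gives a more cautious, less sparse schedule.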

The online stage performs affinity-driven, alternating bidirectional co-clustering. Query and key tokens are partitioned into $(T, H, W)$ 5 and $(T, H, W)$ 6 blocks, with the default experimental choice $(T, H, W)$ 7 and $(T, H, W)$ 8. Current query anchors induce key-side affinity patterns,

$(T, H, W)$ 9

which are normalized and used to assign each key to the nearest key cluster. Updated key anchors then induce query-side affinity patterns,

$L = T \times H \times W$ 0

which are used to assign queries to clusters. The paper uses $L = T \times H \times W$ 1 co-clustering iterations per recompute. Block-pair saliency is then approximated by centroid dot products

$L = T \times H \times W$ 2

and the fraction of active block pairs is chosen by a rule that balances intrinsic sparsity schedule and recall target:

$L = T \times H \times W$ 3

with $L = T \times H \times W$ 4.
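The alternating structure of the online stage can be illustrated with a small NumPy sketch. It is a loose reading of the description above rather than the SVOO kernels: anchors are initialized from random tokens, each token is described by its normalized affinity pattern against the opposite side's anchors, "nearest" is taken to mean highest cosine similarity between patterns, and anchors are refreshed as cluster means. The active-pair fraction is passed in directly instead of being derived from the sparsity schedule and recall target. All names are hypothetical.

```python
import numpy as np


def _normalize(x):
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)


def _centroids(x, assign, n_clusters, fallback):
    out = fallback.copy()                 # empty clusters keep the old anchor
    for c in range(n_clusters):
        members = x[assign == c]
        if len(members):
            out[c] = members.mean(axis=0)
    return out


def co_cluster(Q, K, n_q, n_k, n_iters=2, seed=0):
    """Alternate key-side and query-side assignments; needs n_iters >= 1."""
    Q, K = np.asarray(Q, dtype=float), np.asarray(K, dtype=float)
    rng = np.random.default_rng(seed)
    q_anchor = Q[rng.choice(len(Q), n_q, replace=False)]
    k_anchor = K[rng.choice(len(K), n_k, replace=False)]
    for _ in range(n_iters):
        # Key side: compare each key's affinity pattern over the query
        # anchors with the pattern of every key anchor.
        k_pat = _normalize(K @ q_anchor.T)
        k_ref = _normalize(k_anchor @ q_anchor.T)
        k_assign = np.argmax(k_pat @ k_ref.T, axis=1)
        k_anchor = _centroids(K, k_assign, n_k, k_anchor)
        # Query side: same procedure against the updated key anchors.
        q_pat = _normalize(Q @ k_anchor.T)
        q_ref = _normalize(q_anchor @ k_anchor.T)
        q_assign = np.argmax(q_pat @ q_ref.T, axis=1)
        q_anchor = _centroids(Q, q_assign, n_q, q_anchor)
    saliency = q_anchor @ k_anchor.T      # centroid dot products per block pair
    return q_assign, k_assign, saliency


def active_block_mask(saliency, active_fraction):
    """Mark the top `active_fraction` of block pairs by saliency as active."""
    n_active = max(1, int(np.ceil(active_fraction * saliency.size)))
    top = np.argsort(saliency, axis=None)[::-1][:n_active]
    mask = np.zeros(saliency.shape, dtype=bool)
    mask[np.unravel_index(top, saliency.shape)] = True
    return mask
```

Because each side's assignment is computed from patterns against the other side's anchors, the two partitions are coupled rather than clustered independently, which is the property the co-clustering description emphasizes.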

The resulting attention complexity is written as $L = T \times H \times W$ 5 with $L = T \times H \times W$ 6, where $L = T \times H \times W$ 7 is the number of active block pairs (Luo et al., 19 Mar 2026). Integration is again hardware-conscious: Triton kernels are used for co-clustering, dynamic block-size FlashInfer kernels are used for block-sparse attention, and clustering assignments are reused every $L = T \times H \times W$ 8 diffusion steps because partitions are empirically stable across steps. First-layer dense attention and warm-up dense diffusion steps are retained: $L = T \times H \times W$ 9 for Wan-series and $4L^2 d$ 0 for HunyuanVideo-series (Luo et al., 19 Mar 2026).

5. Bidirectional alignment and other generalizations

A broader strand of work uses bidirectionality in supervision rather than in token selection. "SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space" does not use the term BSA, but it can be read as a close functional instance of the concept (Shen et al., 25 Nov 2025). SSA maintains two attention paths per layer: a native sparse-attention path and a native full-attention path. The main path is sampled per iteration with probability $4L^2 d$ 1, and the auxiliary path is the opposite mode. The layerwise alignment losses are

$4L^2 d$ 2

with

$4L^2 d$ 3

and total objective

$4L^2 d$ 4

The paper’s central diagnosis is gradient update deficiency: low-ranked key–value pairs excluded during sparse training receive neither forward contribution nor backward gradients, so they never learn proper suppression. Alternating full and sparse streams, while aligning their outputs symmetrically, is the proposed remedy (Shen et al., 25 Nov 2025).

Other related formulations situate BSA-like ideas in broader bidirectional modeling. BA-Att is presented for diffusion LLMs, which require globally coherent, bidirectional, and controllable text generation. Its abstract states that the method identifies informative regions in a compact downsampled space, avoids fixed positional priors, achieves up to $4L^2 d$ 5 acceleration over FlashAttention in attention computation, and maintains near full-attention performance at $4L^2 d$ 6 sparsity across LLMs, multimodal LLMs, and video generation models (Zhang et al., 19 May 2026). The accompanying description, however, presents the operator as a theoretically motivated blueprint rather than a fully grounded empirical account, with the normalization perturbation lemma serving as the main formal anchor.

An earlier precursor is Combiner, which is not a sparse masking method in the usual sense but is directly relevant to bidirectional sparse-attention discussions because it preserves full attention capability with sub-quadratic cost (Ren et al., 2021). Combiner factors the conditional attention distribution through region abstractions,

$4L^2 d$ 7

and thereby retains full support in bidirectional MLM settings while achieving $O(L \log L)$ or $O(L\sqrt{L})$ complexity depending on the partition scheme. This suggests that some lines of research adjacent to BSA aim not to sparsify the support irreversibly, but to reparameterize or approximate full attention through structured intermediates (Ren et al., 2021).

6. Empirical behavior, trade-offs, and limitations

The empirical profile of BSA depends strongly on the formulation.

For trainable video diffusion BSA, experiments are reported with a Wan2.1-1.3B backbone, 300k videos from Vchitect T2V DataVerse, preprocessing including shot segmentation, 5-second truncation, and captions from Tarsier2, over 30,000 training steps on NVIDIA H100 GPUs (Zhan et al., 1 Sep 2025). At approximately 23K tokens, full attention yields Text Consistency $h$ 0, BG Consistency $h$ 1, Image Quality $h$ 2, Subject Consistency $h$ 3, and FLOPs approximately $h$ 4; BSA yields Text Consistency $h$ 5, BG Consistency $h$ 6, Image Quality $h$ 7, Subject Consistency $h$ 8, and FLOPs approximately $h$ 9, corresponding to $4hL^2 d$ 0 speedup. At approximately 153K tokens, full attention yields Text $4hL^2 d$ 1, BG $4hL^2 d$ 2, Image $4hL^2 d$ 3, Subject $4hL^2 d$ 4, and FLOPs approximately $4hL^2 d$ 5; BSA yields Text $4hL^2 d$ 6, BG $4hL^2 d$ 7, Image $4hL^2 d$ 8, Subject $4hL^2 d$ 9, and FLOPs approximately $O(L^2)$ 0, corresponding to $O(L^2)$ 1 speedup (Zhan et al., 1 Sep 2025). Inference latency on H100 is reduced from 31s to 5s, approximately $O(L^2)$ 2, with no perceptible quality degradation. The ablations are structurally informative: query-sparse with $O(L^2)$ 3 gives approximately $O(L^2)$ 4 speedup; KV-sparse with fixed threshold gives approximately $O(L^2)$ 5; KV-sparse with statistical dynamic threshold gives approximately $O(L^2)$ 6; combined query+KV sparsity gives approximately $O(L^2)$ 7 at sparsity approximately $O(L^2)$ 8 (Zhan et al., 1 Sep 2025).

For SVOO, the reported quality–speed trade-off is more modest in raw speedup but broad across seven video generation models (Luo et al., 19 Mar 2026). On Wan2.1-T2V-1.3B at 720p and 81 frames, SVOO reports PSNR $O(L^2)$ 9 dB, SSIM $L$ 00, LPIPS $L$ 01, ImageQual $L$ 02, AesQual $L$ 03, SubjectConsistency $L$ 04, BackgroundConsistency $L$ 05, latency 216 s, and speedup $L$ 06 versus dense attention. HunyuanVideo-T2V reports latency 821 s and speedup $L$ 07, which is the best speedup across the T2V experiments. Ablations show that removing offline profiling reduces efficiency for similar quality, while removing bidirectional co-clustering degrades PSNR and SSIM and raises LPIPS for similar speed (Luo et al., 19 Mar 2026).

For bidirectional alignment in SSA, the strongest evidence concerns sparsity fidelity rather than video generation throughput (Shen et al., 25 Nov 2025). SSA reports the smallest KL divergence between sparse and full modes, at $L$ 08, and the highest attention sparsity, with AttnSparsity $L$ 09 in sparse mode and $L$ 10 in full mode. Under full-attention inference, SSA attains commonsense average $L$ 11 and WikiText perplexity $L$ 12; under sparse attention inference with receptive field 256, it attains average $L$ 13 and perplexity $L$ 14; with receptive field 1024, it attains average $L$ 15 and perplexity $L$ 16 (Shen et al., 25 Nov 2025). Long-context extrapolation is also reported as unusually strong: in Needle-in-a-Haystack under full-attention inference, SSA maintains $L$ 17 at 4k and 8k, $L$ 18 at 16k, and $L$ 19 at 32k.

The limitations are correspondingly specific. In trainable video BSA, failure modes include missing salient tokens if centers are poorly chosen or if $L$ 20 is too low, and KV under-selection if $L$ 21 is too high under unusual score distributions; validation loss remains stable up to approximately $L$ 22 sparsity and degrades only beyond approximately $L$ 23 (Zhan et al., 1 Sep 2025). In SVOO, extremely long sequences with low redundancy may require larger $L$ 24 and higher $L$ 25, unusual inputs may benefit from input-aware schedule adjustments, and high-frequency textures or abrupt motion may need more active blocks (Luo et al., 19 Mar 2026). In SSA, one-sided alignment is reported as unstable, and eliminating either stream harms performance, indicating that the “bidirectional” coupling is not merely auxiliary but structurally necessary (Shen et al., 25 Nov 2025).

Taken together, these results show that BSA is less a single algorithm than a design principle: sparsity should be imposed in a way that respects the bidirectional structure of the attention problem being solved. In video diffusion training, that structure lies in the simultaneous redundancy of queries and Key–Value pairs (Zhan et al., 1 Sep 2025). In training-free video generation, it lies in the coupling between query and key partitions (Luo et al., 19 Mar 2026). In dual-stream long-context training, it lies in reciprocal supervision between sparse and dense pathways (Shen et al., 25 Nov 2025).