VMoBA: Video Mixture of Block Attention
- Video Mixture of Block Attention (VMoBA) is a sparse attention method that partitions video tokens into adaptive 1D, 2D, and 3D blocks to capture spatio-temporal dependencies.
- It employs a cyclical, layer-wise block partitioning combined with global and threshold-based adaptive block selection to optimize computational resources.
- Experimental results show that VMoBA significantly reduces FLOPs and latency while maintaining or improving video generation quality in diffusion models.
Video Mixture of Block Attention (VMoBA) is a sparse attention strategy designed to efficiently and adaptively model the spatio-temporal dependencies inherent in high-resolution, long-duration video data—particularly within Video Diffusion Models (VDMs). VMoBA extends the Mixture-of-Block Attention (MoBA) concept to address domain-specific requirements of video, introducing several innovations that enable it to surpass both traditional full attention and prior sparse attention variants in computational efficiency and generative fidelity.
1. Concept and Motivation
Video Mixture of Block Attention (VMoBA) targets the prohibitive quadratic scaling of standard full attention in transformers when applied to large video sequences. Video data, when represented as tokens (e.g., one token per spatio-temporal patch), often yields tens of thousands to hundreds of thousands of tokens per sequence. In this regime, full self-attention becomes computationally infeasible, especially for Video Diffusion Models, which must run the network many times over the full token sequence during iterative denoising.
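To make the quadratic blow-up concrete, here is a back-of-envelope sketch; the latent grid shape and head dimension are illustrative assumptions, not figures from the paper:

```python
# Rough scaling estimate for full attention over video tokens.
# The latent shape below is a hypothetical example, not the paper's setting.
frames, height, width = 16, 45, 80            # assumed VAE latent grid
n_tokens = frames * height * width            # 57,600 tokens
d = 128                                       # assumed per-head dimension

full_attn_flops = 2 * n_tokens**2 * d         # QK^T plus attention-weighted V
print(f"tokens: {n_tokens:,}")                                   # 57,600
print(f"full-attention FLOPs per head: {full_attn_flops:.2e}")   # ~8.5e11
```

At this scale a single attention head already costs nearly a teraFLOP per layer per forward pass, and diffusion sampling repeats such passes dozens of times.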
Earlier sparse attention methods—though effective for accelerating inference—are usually training-free or optimized for natural language, neglecting key video-specific inductive biases such as spatio-temporal locality and the heterogeneous importance of queries across the space-time grid. Native integration of these biases into the training and architecture is crucial: direct application of MoBA to video can undermine local correlation, resulting in degraded output quality. VMoBA addresses this through spatio-temporally adaptive block partitioning and selection strategies.
2. Layer-wise Recurrent Block Partitioning
VMoBA introduces a cyclical, layer-wise block partitioning mechanism tailored to how attention patterns evolve through the network:
- 1D (Temporal) Partitioning: For certain transformer layers, keys are blocked along the temporal axis, grouping spatial patches across adjacent frames.
- 2D (Spatial) Partitioning: Other layers partition blocks spatially, grouping local neighborhoods within each frame.
- 3D (Spatio-temporal) Partitioning: Further layers use blocks that are local cubes in both time and space, capturing more complex local video structures.
The partitioning cycles through these three types with depth (layer index mod 3), matching the observation that attention patterns diversify across network depth, sometimes prioritizing temporal context, sometimes spatial context, and sometimes both jointly. Within each block, a summary key is computed (typically the blockwise mean), and the token-to-block mapping is rearranged accordingly.
This design maintains the local spatio-temporal coherence necessary for high-fidelity video modeling, while dramatically reducing the number of key blocks (and thus compute) compared to full attention or a naive fixed 3D blocking.
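A minimal sketch of the cyclical partitioning, using mean pooling to form block summary keys; the (T, H, W, d) layout and block sizes are assumptions for illustration, not the paper's exact configuration:

```python
# Sketch of cyclical 1D/2D/3D block partitioning with mean-pooled summary keys.
# Layout (T, H, W, d) and block sizes bt, bh, bw are illustrative assumptions.
import torch

def block_summaries(k: torch.Tensor, layer: int,
                    bt: int = 2, bh: int = 4, bw: int = 4) -> torch.Tensor:
    """k: keys on the latent grid, shape (T, H, W, d).
    Returns mean-pooled block summary keys, shape (num_blocks, d)."""
    T, H, W, d = k.shape
    p = layer % 3                                  # cycle 1D -> 2D -> 3D with depth
    if p == 0:                                     # 1D temporal: slabs of bt frames
        return k.reshape(T // bt, bt * H * W, d).mean(dim=1)
    if p == 1:                                     # 2D spatial: bh x bw windows per frame
        tiles = k.reshape(T, H // bh, bh, W // bw, bw, d)
        return tiles.permute(0, 1, 3, 2, 4, 5).reshape(-1, bh * bw, d).mean(dim=1)
    cubes = k.reshape(T // bt, bt, H // bh, bh, W // bw, bw, d)  # 3D: bt x bh x bw cubes
    return cubes.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, bt * bh * bw, d).mean(dim=1)

k = torch.randn(8, 16, 16, 64)                     # toy latent grid: T=8, H=W=16, d=64
for layer in range(3):
    print(layer, tuple(block_summaries(k, layer).shape))
# 0 (4, 64)   1 (128, 64)   2 (64, 64)
```

Reshape-and-permute suffices here because the latent grid is regular; a production kernel would fuse this pooling with the attention computation itself.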
3. Global and Adaptive Block Selection
VMoBA incorporates two selection principles to maximize model expressivity with minimal computation:
- Global Block Selection: Instead of assigning a fixed number of key blocks to every query (as in classic MoBA), VMoBA computes similarity scores between all queries and all key blocks within each attention head and globally selects the highest-scoring query-block pairs. This concentrates computational bandwidth on the most salient interactions, reflecting the variable importance of queries across the space-time grid.
- Threshold-based Adaptive Block Count: The number of blocks each head attends to is not a fixed hyperparameter. Instead, VMoBA applies a cumulative similarity threshold, selecting as many query-block pairs as needed for each head to reach a specified fraction (e.g., 25%) of the total similarity mass. Sharply focused heads select few blocks, while more diffuse heads are allotted proportionally more. This adaptivity mirrors observed head specialization and yields a more flexible, content-aware allocation of compute.
Together, these mechanisms replace the rigid, uniform block selection of prior models, offering head-specific, query-key pair-specific granularity.
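The two principles combine naturally into one selection routine. A minimal single-head sketch, assuming softmax-normalized similarity scores and ignoring tie-breaking and batching:

```python
# Sketch of global, threshold-based block selection for one head.
# Softmax normalization of scores is a simplifying assumption.
import torch

def select_pairs(q: torch.Tensor, k_blocks: torch.Tensor, tau: float = 0.25):
    """q: (Nq, d) queries; k_blocks: (Nb, d) block summary keys.
    Returns a boolean (Nq, Nb) mask of selected query-block pairs."""
    scores = (q @ k_blocks.T).softmax(dim=-1)   # (Nq, Nb) similarity mass
    flat = scores.flatten()
    order = flat.argsort(descending=True)       # rank all pairs globally
    csum = flat[order].cumsum(dim=0)
    # smallest K whose top-K pairs cover a tau fraction of the total mass
    K = int((csum < tau * flat.sum()).sum().item()) + 1
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[order[:K]] = True
    return mask.view_as(scores)

q = torch.randn(100, 64)
kb = torch.randn(20, 64)
mask = select_pairs(q, kb, tau=0.25)
print(mask.sum().item(), "of", mask.numel(), "query-block pairs selected")
```

Because the mask is built from a global ranking, sharply focused heads cross the τ threshold with few pairs while diffuse heads receive more, which is exactly the head-specific adaptivity described above.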
4. Computational Efficiency and Generation Quality
VMoBA offers substantial efficiency improvements over full attention and static sparse methods:
- Training Efficiency: Experiments demonstrate a reduction in FLOPs by up to 2.92× and a latency speedup of up to 1.48× during training for long video sequences. Speedup grows as sequence length increases.
- Inference Acceleration: For high-resolution videos (e.g., 720p), VMoBA achieves up to 2.40× FLOPs reduction and up to 1.35× lower latency with competitive or improved output quality.
- Scalability: For short video sequences, the benefits are muted by the proportionally higher overhead of managing sparse attention maps, but VMoBA's efficiency gains become more pronounced as sequence length and resolution grow.
Despite the aggressive reduction in computation, VMoBA delivers comparable or superior video generation quality, as quantified by standard benchmark metrics such as VBench (TextConsis, ImageQual, BGConsis, SubConsist) and by perceptual assessments. In several cases, VMoBA achieves slightly better alignment with textual prompts and produces outputs with enhanced semantic fidelity.
5. Technical Formulation
VMoBA's core mechanisms are mathematically formalized as follows:
- Block Partition: For input tokens $X \in \mathbb{R}^{N \times d}$ at layer $l$, the partition type cycles with depth, $p(l) = l \bmod 3$, selecting 1D (temporal), 2D (spatial), or 3D (spatio-temporal) blocking of the keys into blocks $\{B_j\}$ of size $B$. Block summary keys are obtained as blockwise means, $\bar{k}_j = \tfrac{1}{|B_j|} \sum_{i \in B_j} k_i$.
- Global Block Selection: For attention head $h$, similarity scores $s^{(h)}_{i,j} = q^{(h)}_i \cdot \bar{k}^{(h)}_j$ are computed between every query $i$ and every key block $j$, and the highest-scoring pairs are selected globally, $\mathcal{S}^{(h)} = \operatorname{TopK}_{(i,j)}\bigl(s^{(h)}_{i,j}\bigr)$, where $(i,j) \in \mathcal{S}^{(h)}$ indicates selected query-block pairs.
- Thresholded Block Count: The number of selected pairs per head is computed dynamically as the smallest count whose scores reach a cumulative similarity fraction $\tau$: $K^{(h)} = \min\bigl\{K : \sum_{(i,j) \in \operatorname{TopK}} s^{(h)}_{i,j} \ge \tau \sum_{i,j} s^{(h)}_{i,j}\bigr\}$.
- Overall Complexity: With $N$ tokens, hidden size $d$, block size $B$, and mean selected blocks per query $\bar{k}$, the resulting computational complexity is $O\bigl(N d\,(N/B + \bar{k} B)\bigr)$, compared to $O(N^2 d)$ for full attention.
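A quick numeric check of this complexity ratio; the values of $N$, $B$, and $\bar{k}$ below are illustrative, not the paper's measured settings:

```python
# Theoretical FLOPs ratio of full attention vs. VMoBA-style block attention.
# All values below are illustrative assumptions.
N, d, B, k_bar = 57_600, 128, 576, 8     # tokens, hidden size, block size, mean blocks/query
full = N * N * d                         # O(N^2 d) full attention
vmoba = N * d * (N // B + k_bar * B)     # query-block scoring + sparse attention
print(f"theoretical reduction: {full / vmoba:.1f}x")   # ~12.2x
```

Realized wall-clock speedups are smaller than this theoretical ratio (Section 4), since selection overhead and kernel efficiency eat into the gains.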
6. Comparative Results and Ablation
Experimental benchmarks underline VMoBA's superiority:
| Aspect | Full Attention | MoBA | VMoBA |
|---|---|---|---|
| Attention Pattern | All-pairs, quadratic | 1D blocks, static | 1D/2D/3D (layer-cycled), spatio-temporally local |
| Query Importance | Uniform | Uniform per query | Global, dynamic, head-specific |
| Head Adaptivity | None | None | Thresholded, head-specific |
| FLOPs / Latency | High | Moderate | Lowest |
| Generation Quality | State-of-the-art | Degraded for video | Matches or exceeds full attention |
Ablation studies confirm the necessity of each VMoBA component; removing block cycling, global pair selection, or adaptive thresholding impairs either efficiency or quality.
7. Practical Implications and Future Directions
VMoBA stands as a pivotal development for the training and deployment of Video Diffusion Models at scale, permitting both longer sequences and higher resolutions within the constraints of available hardware. The principles underlying its design—cyclical spatio-temporal blocking, globally adaptive selection, and head-specific control—suggest broader applicability for any transformer-based video modeling or generation task requiring scalable yet precise dependency modeling.
Notably, for short videos, practical speedup may be limited by current implementations of sparse attention (kernel overhead, hardware utilization). Advances in kernel engineering (e.g., further optimized CUDA primitives) are likely to bring realized speedups closer to the theoretical FLOPs reduction. The concepts introduced in VMoBA may also inspire future adaptive attention regimes for other data modalities with inherently structured dependencies, such as multimodal transformers or hierarchical token graphs.
References and implementation: The VMoBA approach is described in detail and evaluated in "VMoBA: Mixture-of-Block Attention for Video Diffusion Models" (arXiv:2506.23858), with code available at https://github.com/KwaiVGI/VMoBA.