Block-Sparse Diffusion Transformer (DiT)

Updated 12 January 2026
  • The surveyed papers introduce Transformer frameworks that exploit block sparsity via MoE and block skipping to aggressively reduce computation and memory while maintaining high-quality generation.
  • They employ structured approaches such as hierarchical attention and multi-step distillation pipelines to efficiently scale text-to-image and video diffusion models under hardware constraints.
  • Empirical results demonstrate up to a 60% reduction in activated parameters and significant speedups, validating the trade-off between computational savings and minimal quality loss.

Block-sparse Diffusion Transformers (DiTs) constitute a class of generative diffusion models that exploit block-level structured sparsity—principally block-sparse mixture-of-experts (MoE), block-sparse attention, and block reuse/skipping—to substantially reduce computational and memory requirements while preserving generative quality. These methods are central to scaling text-to-image and video diffusion models to high resolutions or large backbone sizes under practical hardware constraints. Key variants include Dense2MoE’s blockwise MoE and block-skipping (Zheng et al., 10 Oct 2025), Switch Diffusion Transformers’ MoE-based denoising task routing (Park et al., 2024), efficient block-sparse and hierarchical attention for long-context DiTs (Zhou et al., 18 Dec 2025, Yang et al., 2024), unified sparse kernel engines (Qiao et al., 29 Sep 2025), and training-free block-skipping via feature similarity (Chen et al., 1 Aug 2025).

1. Block-Sparse Mixture-of-Experts in DiT

The Mixture-of-Experts paradigm in DiT replaces each Transformer feed-forward network (FFN) with a set of expert MLPs, of which only a subset is activated per input or denoising step. Dense2MoE (Zheng et al., 10 Oct 2025) systematically sparsifies DiT by introducing two levels of block sparsity:

  • FFN-level MoE: Each FFN is decomposed into one shared expert (expansion $r_s$) and $n$ normal experts (expansion $r_n$ each), with a lightweight gating network $g_\alpha(x) = \mathrm{softmax}(W_g x)$. At inference, only the top-$k$ scoring normal experts (plus the shared expert) are evaluated (a minimal sketch follows this list):

$$y = f_s(x) + \sum_{i \in \mathrm{TopK}(g_\alpha(x),\,k)} g_i(x) \cdot f_n^{(i)}(x)$$

This reduces the number of activated FFN parameters by $1 - \frac{r_a}{r}$, with $r_a = r_s + k r_n$ and $r = r_s + n r_n$. Empirically, settings like $(r_s=1, r_n=0.25, n=12, k=2)$ yield a 62.5% reduction in parameter activation.

  • Block-level MoE (Mixture of Blocks, MoB): Groups of $m$ consecutive blocks are equipped with a block router. At each denoising step and for each sample, only $\kappa$ of the $m$ blocks in a group are executed; blocks are selected based on a score that depends on the current block activations and the global condition. The total sparsity factor across a MoB group is roughly $\kappa/m$, compounding with FFN-level sparsity for an overall 60% reduction in parameter activation.
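
For concreteness, the following is a minimal PyTorch sketch of the FFN-level routing described above, using the $(r_s=1, r_n=0.25, n=12, k=2)$ configuration; the module names, GELU expert MLPs, and per-token routing loop are illustrative assumptions rather than the Dense2MoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Shared expert plus top-k routed normal experts (illustrative sketch)."""

    def __init__(self, d, r_s=1.0, r_n=0.25, n=12, k=2):
        super().__init__()
        self.k = k
        # Shared expert with expansion ratio r_s, always evaluated.
        self.shared = nn.Sequential(
            nn.Linear(d, int(r_s * d)), nn.GELU(), nn.Linear(int(r_s * d), d))
        # n normal experts with expansion ratio r_n each.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d, int(r_n * d)), nn.GELU(),
                          nn.Linear(int(r_n * d), d))
            for _ in range(n)])
        # Lightweight gating network g_alpha(x) = softmax(W_g x).
        self.gate = nn.Linear(d, n)

    def forward(self, x):                          # x: (tokens, d)
        y = self.shared(x)
        scores = F.softmax(self.gate(x), dim=-1)   # per-token routing scores
        topv, topi = scores.topk(self.k, dim=-1)   # top-k normal experts per token
        for slot in range(self.k):
            idx, w = topi[:, slot], topv[:, slot]
            for e in idx.unique().tolist():        # evaluate only the selected experts
                m = idx == e
                y[m] = y[m] + w[m].unsqueeze(-1) * self.experts[e](x[m])
        return y

# Activated vs. total expansion for (r_s=1, r_n=0.25, n=12, k=2):
#   r_a = 1 + 2 * 0.25 = 1.5,  r = 1 + 12 * 0.25 = 4.0
#   1 - r_a / r = 0.625  ->  62.5% fewer activated FFN parameters.
```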

Switch Diffusion Transformer (Park et al., 2024) generalizes FFN-level block sparsity to establish synergy across denoising tasks (specific timestep intervals of the diffusion process) via a sparse MoE layer in each block. Each block maintains $M=3$ experts; at each diffusion step $t$, a gating MLP (input: the diffusion timestep embedding) selects $K=2$ experts, enforcing block-sparse routing. The model introduces a diffusion-prior loss that encourages adjacent denoising steps to share expert usage, thus capturing both inter-task correlation and parameter isolation.
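
A sketch of this timestep-conditioned routing follows, under the assumption that the gate consumes a precomputed timestep embedding and that the diffusion prior is supplied as a per-timestep target distribution; the class and function names are illustrative, not Switch-DiT's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimestepGate(nn.Module):
    """Selects K of M experts per diffusion step from the timestep embedding (sketch)."""

    def __init__(self, t_dim, M=3, K=2):
        super().__init__()
        self.K = K
        self.mlp = nn.Sequential(nn.Linear(t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, M))

    def forward(self, t_emb):                      # t_emb: (batch, t_dim)
        probs = F.softmax(self.mlp(t_emb), dim=-1) # routing distribution over M experts
        topv, topi = probs.topk(self.K, dim=-1)    # block-sparse: only K experts are run
        return topv, topi, probs                   # probs also feeds the prior-alignment loss

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence used to align routing probs p with a per-timestep prior q."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```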

Both approaches demonstrate that block-sparse MoE, with task-conditional or sample-conditional expert routing, enables aggressive parameter and computation savings without commensurate quality loss.

2. Block-Sparse and Hierarchical Attention

Memory and compute demands in attention layers dominate DiT models at high sequence lengths. Recent approaches replace quadratic full attention by block-sparse schemes:

  • Single-level block-sparse Top-K attention selects, for each block, a subset ($K$ out of $T$ blocks) to attend to using a similarity metric. This reduces attention cost from $O(N^2)$ to $O(NK)$, where $N$ is the total number of tokens (a sketch follows this list).
  • Log-linear Sparse Attention (LLSA) (Zhou et al., 18 Dec 2025) overcomes the scalability limits of the single-level approach by introducing hierarchical coarse-to-fine Top-K block selection. At each compression level $l$ (spanning from full to coarsest block granularity), Top-K selection is performed recursively, propagating candidate sets down the hierarchy. The selection and final sparse attention cost is $O(NK \log N)$, with global context restored via Hierarchical KV Enrichment (inserting top-K keys/values from each level). LLSA achieves $28.27\times$ inference and $6.09\times$ training acceleration on $256\times256$ pixel-token DiTs, without FID degradation.
  • Unidirectional block attention (UniBA) in Inf-DiT (Yang et al., 2024) structures attention as a raster-order directed acyclic graph at the block level. Each block attends only to itself and to neighboring blocks encountered earlier in raster order (a mask-construction sketch appears after this section's summary). This strategy reduces time and memory complexity to $O(N)$ (compared to $O(N^2)$), enabling upsampling to $4096\times4096$ resolution with a $5\times$ memory reduction over UNet baselines.
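
The single-head sketch below illustrates the single-level Top-K scheme from the first bullet: queries and keys are mean-pooled per block, a block-to-block score matrix picks $K$ key blocks per query block, and attention runs only inside the selected pairs. The block size, pooling operator, and similarity metric are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def block_sparse_topk_attention(q, k, v, block=64, K=8):
    """Single-level block-sparse Top-K attention (illustrative, single head).

    q, k, v: (N, d) with N divisible by `block`; batching and heads are omitted.
    """
    N, d = q.shape
    T = N // block
    qb, kb, vb = (t.view(T, block, d) for t in (q, k, v))

    # Block-level similarity via mean pooling (one possible choice of metric).
    scores = qb.mean(dim=1) @ kb.mean(dim=1).T     # (T, T) block-to-block scores
    topk = scores.topk(min(K, T), dim=-1).indices  # K key blocks per query block

    out = torch.empty_like(qb)
    for i in range(T):                             # cost O(N * K * block) instead of O(N^2)
        sel_k = kb[topk[i]].reshape(-1, d)         # gather only the selected key blocks
        sel_v = vb[topk[i]].reshape(-1, d)
        attn = F.softmax(qb[i] @ sel_k.T / d ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out.view(N, d)
```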

A unified theme in these approaches is the use of block partitioning combined with structured block masking or expert routing to realize linear- or log-linear-complexity DiT attention layers.
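
As referenced in the UniBA bullet above, the following sketch builds a block-level mask consistent with the raster-order description: each block attends to itself and to its left, upper, and upper-left neighbors, all earlier in raster order. The exact neighborhood used by Inf-DiT is an assumption here.

```python
import torch

def uniba_block_mask(rows, cols):
    """Block-level mask for unidirectional (raster-order) block attention (sketch).

    Entry (i, j) is True if block i may attend to block j. Each block sees itself
    and its left, upper, and upper-left neighbors, all earlier in raster order,
    so the block dependency graph is a DAG and complexity stays linear in the blocks.
    """
    T = rows * cols
    mask = torch.zeros(T, T, dtype=torch.bool)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in [(0, 0), (0, -1), (-1, 0), (-1, -1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    mask[i, rr * cols + cc] = True
    return mask

# Example: a 3x3 grid of blocks; the center block 4 attends to blocks {0, 1, 3, 4}.
print(uniba_block_mask(3, 3)[4].nonzero().flatten().tolist())  # -> [0, 1, 3, 4]
```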

3. Block-Sparse Inference: Skipping and Reuse

Block-sparse inference strategies reduce runtime by skipping computation for blocks whose activations evolve slowly across steps:

  • Feature-Based Block Skipping: Sortblock (Chen et al., 1 Aug 2025) implements a training-free, similarity-aware policy. At each denoising step, for each Transformer block $l$, the cosine similarity $S_l$ between the block's feature changes at consecutive steps is computed:

$$S_l = \frac{(\Delta_l^t)^\top \Delta_l^{t+1}}{\|\Delta_l^t\|\,\|\Delta_l^{t+1}\|}$$

Blocks with $S_l \approx 1$ (i.e., whose changes are nearly parallel across steps) are considered inactive and skipped via feature reuse or lightweight linear prediction (a sketch of this test follows the list). The dynamic recomputation ratio $\rho(t)$ is estimated from polynomial fits to the global feature-evolution metric. As a result, Sortblock achieves a $2\times$ speedup (latency reduced from 23 s to roughly 12 s) on Flux.1-dev and comparable accelerations on Wan2.1 and HunyuanVideo, with negligible degradation in CLIP, IR, PSNR, LPIPS, and VBench metrics.

  • Unified Inference Kernels: FlashOmni (Qiao et al., 29 Sep 2025) introduces a generic kernel capable of supporting arbitrary block-skipping, feature-caching, and multi-granularity sparsity patterns by encoding skip/cache policies as compact 8-bit-per-block "sparse symbols" (a hypothetical encoding sketch follows this subsection). This enables dynamic block-pair skipping and cache reuse in attention without retraining, realizing near-linear ($1/(1-s)$) speedup at up to 90% sparsity and up to $1.9\times$ end-to-end acceleration on 33K-token video DiTs, without visible quality degradation.
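
A minimal sketch of the Sortblock-style test referenced above, assuming the caller caches each block's residual deltas from the two most recent steps; the threshold, caching scheme, and reuse rule are illustrative, and the polynomial schedule for $\rho(t)$ is omitted.

```python
import torch

def block_change_similarity(delta_prev, delta_curr, eps=1e-8):
    """Cosine similarity S_l between a block's residual change at consecutive steps."""
    num = (delta_prev.flatten() * delta_curr.flatten()).sum()
    den = delta_prev.norm() * delta_curr.norm() + eps
    return (num / den).item()

def maybe_skip_block(block_fn, x, cache, tau=0.95):
    """Reuse the cached block output when its feature evolution is nearly parallel (sketch)."""
    if cache.get("delta_prev") is not None and cache.get("delta_curr") is not None:
        if block_change_similarity(cache["delta_prev"], cache["delta_curr"]) > tau:
            return x + cache["delta_curr"]   # skip: reuse (or linearly extrapolate) the delta
    out = block_fn(x)                        # otherwise recompute the block
    cache["delta_prev"], cache["delta_curr"] = cache.get("delta_curr"), out - x
    return out
```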

This blockwise adaptivity complements block-sparse design at the architectural level, allowing for inference-time control over the speed-quality tradeoff.
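
FlashOmni's actual symbol layout is not described here, so the snippet below is a purely hypothetical illustration of the idea: one byte per block packs a skip flag, a cache-reuse flag, and a small granularity code that a fused kernel could branch on.

```python
# Hypothetical illustration only: FlashOmni's real symbol layout is not documented here.
SKIP  = 0b0000_0001   # skip this block's computation entirely
CACHE = 0b0000_0010   # reuse this block's cached activations
GRAN  = 0b0001_1100   # 3-bit granularity code (e.g., token / block / layer level)

def encode_symbol(skip: bool, cache: bool, gran: int) -> int:
    """Pack a per-block skip/cache policy into one byte (assumed layout)."""
    assert 0 <= gran < 8
    return (SKIP if skip else 0) | (CACHE if cache else 0) | (gran << 2)

def decode_symbol(sym: int):
    return bool(sym & SKIP), bool(sym & CACHE), (sym & GRAN) >> 2

sym = encode_symbol(skip=False, cache=True, gran=1)
print(decode_symbol(sym))   # (False, True, 1)
```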

4. Distillation Pipelines and Training Protocols

Block-sparse DiTs employing MoE or block-skipping necessitate specialized training or conversion pipelines to retain generative fidelity:

  • Multi-Step Distillation (Dense2MoE) (Zheng et al., 10 Oct 2025):

    1. Taylor-Metric Expert Initialization: Compute first-order Taylor importance scores for the FFN weights and optimally partition them into shared and normal experts.
    2. Shared-Only Distillation: Pre-train a model using only the shared expert for each FFN, matching teacher outputs and features.
    3. KD with Load Balancing: The full MoE network is assembled and trained using knowledge distillation, additional feature-alignment terms, and an explicit load-balancing loss to promote expert diversity (see the sketch after this list).
    4. Group Feature Loss: For block groups (MoB), the student's group outputs are matched to the teacher's outputs at the endpoints of each group.
  • Diffusion-Prior Loss (Switch-DiT) (Park et al., 2024): A hybrid objective combines standard DDPM noise loss with a Jensen–Shannon divergence that aligns the learned gating patterns to a known per-timestep prior, promoting both task-specific and global semantic pathways.
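
As a sketch of the load-balancing term referenced in step 3, the following implements the standard Switch-Transformer-style auxiliary loss over gate probabilities and realized expert assignments; the exact loss used by Dense2MoE may differ.

```python
import torch

def load_balancing_loss(router_probs, expert_index, n_experts):
    """Auxiliary balance loss n * sum_e f_e * p_e (illustrative).

    router_probs: (tokens, n_experts) softmax gate outputs.
    expert_index: (tokens,) index of the expert each token was routed to.
    """
    # f_e: fraction of tokens dispatched to each expert.
    f = torch.bincount(expert_index, minlength=n_experts).float() / expert_index.numel()
    # p_e: mean routing probability assigned to each expert.
    p = router_probs.mean(dim=0)
    return n_experts * (f * p).sum()
```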

Notably, in training-free approaches such as Sortblock (Chen et al., 1 Aug 2025), block-skipping policies are derived entirely at inference, requiring no architectural modification or retraining.

5. Empirical Results and Comparative Performance

Block-sparse DiTs consistently preserve generative capability relative to dense baselines, outperforming unstructured pruning and other naive parameter reduction schemes:

  • Dense2MoE (Zheng et al., 10 Oct 2025): Up to 60% reduction in parameter activation for text-to-image generation (Flux.1-MoE-L to XS: 5.15–2.64B activated params, >35% FLOPs saved, latency reductions up to 16%). Under equivalent parameter budgets, MoE/MoB models match or surpass baselines on CLIP, IR, and GenEval metrics, while MLP/pruning baselines incur pronounced quality losses.
  • Switch-DiT (Park et al., 2024): Block-sparse SMoE achieves FID reductions (ImageNet: DiT-B 27.96 → Switch-DiT-B 16.21; DiT-XL 9.40 → 8.76), and accelerates convergence by 2–3$\times$ under the same data and optimizer settings.
  • LLSA (Zhou et al., 18 Dec 2025): Attains a $28.27\times$ speedup in attention and $6.09\times$ faster DiT training, and maintains sample quality as measured by FID, outperforming single-level block-sparse or Top-K compressed attention baselines.
  • Sortblock (Chen et al., 1 Aug 2025): Achieves a 2.0–2.4$\times$ reduction in generation time across various tasks and models, with negligible FID or PSNR/SSIM loss.
  • FlashOmni (Qiao et al., 29 Sep 2025): Delivers $>1.5\times$ acceleration in large-scale multi-modal DiTs through efficient block-level skipping and caching via its unified kernel abstraction.

6. Trade-offs, Limitations, and Future Directions

Block-sparse DiTs offer a clear trade-off curve: increasing sparsity proportionally reduces runtime and activation cost, but can precipitate marginal metric degradation. MoE and block-skipping approaches robustly outperform magnitude pruning at equivalent sparsity levels. Crucially, distillation pipelines and loss balancing are necessary for high-sparsity fidelity.

Current limitations include the static nature of most block routers (per-sample, per-timestep adaptivity is largely unexplored), the need for joint optimization of block and expert routers, and hardware implementation efficiency at ultra-large scales (Zheng et al., 10 Oct 2025). Future avenues of research include dynamic per-sample routing, adaptive or learned pooling in hierarchical attention, exploitation of cross-modal block sparsity, and further integration of block-skipping kernels with quantization or kernel-based approximations.

A plausible implication is that block-sparse DiTs will play a major role in deploying diffusion models in real-time, high-resolution, or resource-constrained settings, as these structured sparsification methods scale more gracefully than dense or unstructured counterparts.
