Block-Sparse Diffusion Transformer (DiT)

Updated 12 January 2026
  • The surveyed papers introduce Transformer frameworks that exploit block sparsity via MoE and block skipping to aggressively reduce computation and memory while maintaining high-quality generation.
  • They employ structured approaches such as hierarchical attention and multi-step distillation pipelines to efficiently scale text-to-image and video diffusion models under hardware constraints.
  • Empirical results demonstrate up to a 60% reduction in activated parameters and significant speedups, validating the trade-off between computational savings and minimal quality loss.

Block-sparse Diffusion Transformers (DiTs) constitute a class of generative diffusion models that exploit block-level structured sparsity—principally block-sparse mixture-of-experts (MoE), block-sparse attention, and block reuse/skipping—to substantially reduce computational and memory requirements while preserving generative quality. These methods are central to scaling text-to-image and video diffusion models to high resolutions or large backbone sizes under practical hardware constraints. Key variants include Dense2MoE’s blockwise MoE and block-skipping (Zheng et al., 10 Oct 2025), Switch Diffusion Transformers’ MoE-based denoising task routing (Park et al., 2024), efficient block-sparse and hierarchical attention for long-context DiTs (Zhou et al., 18 Dec 2025, Yang et al., 2024), unified sparse kernel engines (Qiao et al., 29 Sep 2025), and training-free block-skipping via feature similarity (Chen et al., 1 Aug 2025).

1. Block-Sparse Mixture-of-Experts in DiT

The Mixture-of-Experts paradigm in DiT replaces each Transformer feed-forward network (FFN) with a set of expert MLPs, of which only a subset is activated per input or denoising step. Dense2MoE (Zheng et al., 10 Oct 2025) systematically sparsifies DiT by introducing two levels of block sparsity:

  • FFN-level MoE: Each FFN is decomposed into one shared expert (expansion $r_s$) and $n$ normal experts (expansion $r_n$ each), with a lightweight gating network $g_\alpha(x) = \mathrm{softmax}(W_g x)$. At inference, only the top-$k$ scoring normal experts (plus the shared expert) are evaluated (a minimal sketch follows this list):

$$y = f_s(x) + \sum_{i \in \mathrm{TopK}(g_\alpha(x),\,k)} g_i(x) \cdot f_n^{(i)}(x)$$

This reduces the number of activated FFN parameters by $1 - \frac{r_a}{r}$, with $r_a = r_s + k r_n$ and $r = r_s + n r_n$. Empirically, settings like $(r_s=1, r_n=0.25, n=12, k=2)$ yield a 62.5% reduction in parameter activation.

  • Block-level MoE (Mixture of Blocks, MoB): Groups of $m$ consecutive blocks are equipped with a block router. At each denoising step and for each sample, only $\kappa$ of the $m$ blocks in a group are executed; blocks are selected based on a score that depends on the current block activations and the global condition. The total sparsity factor across a MoB group is roughly $\kappa/m$, compounding with FFN-level sparsity for an overall 60% reduction in parameter activation.
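
For concreteness, the following is a minimal PyTorch sketch of the FFN-level routing described above, using the $(r_s=1, r_n=0.25, n=12, k=2)$ configuration; the module names, GELU expert MLPs, and per-token routing loop are illustrative assumptions rather than the Dense2MoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Shared expert plus top-k routed normal experts (illustrative sketch)."""

    def __init__(self, d, r_s=1.0, r_n=0.25, n=12, k=2):
        super().__init__()
        self.k = k
        # Shared expert with expansion ratio r_s, always evaluated.
        self.shared = nn.Sequential(
            nn.Linear(d, int(r_s * d)), nn.GELU(), nn.Linear(int(r_s * d), d))
        # n normal experts with expansion ratio r_n each.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d, int(r_n * d)), nn.GELU(),
                          nn.Linear(int(r_n * d), d))
            for _ in range(n)])
        # Lightweight gating network g_alpha(x) = softmax(W_g x).
        self.gate = nn.Linear(d, n)

    def forward(self, x):                          # x: (tokens, d)
        y = self.shared(x)
        scores = F.softmax(self.gate(x), dim=-1)   # per-token routing scores
        topv, topi = scores.topk(self.k, dim=-1)   # top-k normal experts per token
        for slot in range(self.k):
            idx, w = topi[:, slot], topv[:, slot]
            for e in idx.unique().tolist():        # evaluate only the selected experts
                m = idx == e
                y[m] = y[m] + w[m].unsqueeze(-1) * self.experts[e](x[m])
        return y

# Activated vs. total expansion for (r_s=1, r_n=0.25, n=12, k=2):
#   r_a = 1 + 2 * 0.25 = 1.5,  r = 1 + 12 * 0.25 = 4.0
#   1 - r_a / r = 0.625  ->  62.5% fewer activated FFN parameters.
```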

Switch Diffusion Transformer (Park et al., 2024) generalizes FFN-level block sparsity to establish synergy across denoising tasks (specific timestep intervals of the diffusion process) via a sparse MoE layer in each block. Each block maintains $M=3$ experts; at each diffusion step $t$, a gating MLP (input: the diffusion timestep embedding) selects $K=2$ experts, enforcing block-sparse routing. The model introduces a diffusion-prior loss that encourages adjacent denoising steps to share expert usage, thus capturing both inter-task correlation and parameter isolation.
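
A sketch of this timestep-conditioned routing follows, under the assumption that the gate consumes a precomputed timestep embedding and that the diffusion prior is supplied as a per-timestep target distribution; the class and function names are illustrative, not Switch-DiT's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimestepGate(nn.Module):
    """Selects K of M experts per diffusion step from the timestep embedding (sketch)."""

    def __init__(self, t_dim, M=3, K=2):
        super().__init__()
        self.K = K
        self.mlp = nn.Sequential(nn.Linear(t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, M))

    def forward(self, t_emb):                      # t_emb: (batch, t_dim)
        probs = F.softmax(self.mlp(t_emb), dim=-1) # routing distribution over M experts
        topv, topi = probs.topk(self.K, dim=-1)    # block-sparse: only K experts are run
        return topv, topi, probs                   # probs also feeds the prior-alignment loss

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence used to align routing probs p with a per-timestep prior q."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```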

Both approaches demonstrate that block-sparse MoE, with task-conditional or sample-conditional expert routing, enables aggressive parameter and computation savings without commensurate quality loss.

2. Block-Sparse and Hierarchical Attention

Memory and compute demands in attention layers dominate DiT models at high sequence lengths. Recent approaches replace quadratic full attention by block-sparse schemes:

  • Single-level block-sparse Top-K attention selects, for each block, a subset ($K$ out of $T$ blocks) to attend to using a similarity metric. This reduces attention cost from $O(N^2)$ to $O(NK)$, where $N$ is the total number of tokens (a sketch follows this list).
  • Log-linear Sparse Attention (LLSA) (Zhou et al., 18 Dec 2025) overcomes the scalability limits of the single-level approach by introducing hierarchical coarse-to-fine Top-K block selection. At each compression level $l$ (spanning from full to coarsest block granularity), Top-K selection is performed recursively, propagating candidate sets down the hierarchy. The selection and final sparse attention cost is $O(NK \log N)$, with global context restored via Hierarchical KV Enrichment (inserting top-K keys/values from each level). LLSA achieves $28.27\times$ inference and $6.09\times$ training acceleration on $256\times256$ pixel-token DiTs, without FID degradation.
  • Unidirectional block attention (UniBA) in Inf-DiT (Yang et al., 2024) structures attention as a raster-order directed acyclic graph at the block level. Each block attends only to itself and to neighboring blocks encountered earlier in raster order (a mask-construction sketch appears after this section's summary). This strategy reduces time and memory complexity to $O(N)$ (compared to $O(N^2)$), enabling upsampling to $4096\times4096$ resolution with a $5\times$ memory reduction over UNet baselines.
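
The single-head sketch below illustrates the single-level Top-K scheme from the first bullet: queries and keys are mean-pooled per block, a block-to-block score matrix picks $K$ key blocks per query block, and attention runs only inside the selected pairs. The block size, pooling operator, and similarity metric are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def block_sparse_topk_attention(q, k, v, block=64, K=8):
    """Single-level block-sparse Top-K attention (illustrative, single head).

    q, k, v: (N, d) with N divisible by `block`; batching and heads are omitted.
    """
    N, d = q.shape
    T = N // block
    qb, kb, vb = (t.view(T, block, d) for t in (q, k, v))

    # Block-level similarity via mean pooling (one possible choice of metric).
    scores = qb.mean(dim=1) @ kb.mean(dim=1).T     # (T, T) block-to-block scores
    topk = scores.topk(min(K, T), dim=-1).indices  # K key blocks per query block

    out = torch.empty_like(qb)
    for i in range(T):                             # cost O(N * K * block) instead of O(N^2)
        sel_k = kb[topk[i]].reshape(-1, d)         # gather only the selected key blocks
        sel_v = vb[topk[i]].reshape(-1, d)
        attn = F.softmax(qb[i] @ sel_k.T / d ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out.view(N, d)
```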

A unified theme in these approaches is the use of block partitioning combined with structured block masking or expert routing to realize linear- or log-linear-complexity DiT attention layers.
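
As referenced in the UniBA bullet above, the following sketch builds a block-level mask consistent with the raster-order description: each block attends to itself and to its left, upper, and upper-left neighbors, all earlier in raster order. The exact neighborhood used by Inf-DiT is an assumption here.

```python
import torch

def uniba_block_mask(rows, cols):
    """Block-level mask for unidirectional (raster-order) block attention (sketch).

    Entry (i, j) is True if block i may attend to block j. Each block sees itself
    and its left, upper, and upper-left neighbors, all earlier in raster order,
    so the block dependency graph is a DAG and complexity stays linear in the blocks.
    """
    T = rows * cols
    mask = torch.zeros(T, T, dtype=torch.bool)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in [(0, 0), (0, -1), (-1, 0), (-1, -1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    mask[i, rr * cols + cc] = True
    return mask

# Example: a 3x3 grid of blocks; the center block 4 attends to blocks {0, 1, 3, 4}.
print(uniba_block_mask(3, 3)[4].nonzero().flatten().tolist())  # -> [0, 1, 3, 4]
```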

3. Block-Sparse Inference: Skipping and Reuse

Block-sparse inference strategies reduce runtime by skipping computation for blocks whose activations evolve slowly across steps:

  • Feature-Based Block Skipping: Sortblock (Chen et al., 1 Aug 2025) implements a training-free, similarity-aware policy. At each denoising step, for each Transformer block $l$, the cosine similarity $S_l$ between the block's feature changes at consecutive steps is computed:

$$S_l = \frac{(\Delta_l^t)^\top \Delta_l^{t+1}}{\|\Delta_l^t\|\,\|\Delta_l^{t+1}\|}$$

Blocks with $S_l \approx 1$ (i.e., whose changes are nearly parallel across steps) are considered inactive and skipped via feature reuse or lightweight linear prediction (a sketch of this test follows the list). The dynamic recomputation ratio $\rho(t)$ is estimated from polynomial fits to the global feature-evolution metric. As a result, Sortblock achieves a $2\times$ speedup (latency reduced from 23 s to roughly 12 s) on Flux.1-dev and comparable accelerations on Wan2.1 and HunyuanVideo, with negligible degradation in CLIP, IR, PSNR, LPIPS, and VBench metrics.

  • Unified Inference Kernels: FlashOmni (Qiao et al., 29 Sep 2025) introduces a generic kernel capable of supporting arbitrary block-skipping, feature-caching, and multi-granularity sparsity patterns by encoding skip/cache policies as compact 8-bit-per-block "sparse symbols" (a hypothetical encoding sketch follows this subsection). This enables dynamic block-pair skipping and cache reuse in attention without retraining, realizing near-linear ($1/(1-s)$) speedup at up to 90% sparsity and up to $1.9\times$ end-to-end acceleration on 33K-token video DiTs, without visible quality degradation.
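
A minimal sketch of the Sortblock-style test referenced above, assuming the caller caches each block's residual deltas from the two most recent steps; the threshold, caching scheme, and reuse rule are illustrative, and the polynomial schedule for $\rho(t)$ is omitted.

```python
import torch

def block_change_similarity(delta_prev, delta_curr, eps=1e-8):
    """Cosine similarity S_l between a block's residual change at consecutive steps."""
    num = (delta_prev.flatten() * delta_curr.flatten()).sum()
    den = delta_prev.norm() * delta_curr.norm() + eps
    return (num / den).item()

def maybe_skip_block(block_fn, x, cache, tau=0.95):
    """Reuse the cached block output when its feature evolution is nearly parallel (sketch)."""
    if cache.get("delta_prev") is not None and cache.get("delta_curr") is not None:
        if block_change_similarity(cache["delta_prev"], cache["delta_curr"]) > tau:
            return x + cache["delta_curr"]   # skip: reuse (or linearly extrapolate) the delta
    out = block_fn(x)                        # otherwise recompute the block
    cache["delta_prev"], cache["delta_curr"] = cache.get("delta_curr"), out - x
    return out
```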

This blockwise adaptivity complements block-sparse design at the architectural level, allowing for inference-time control over the speed-quality tradeoff.
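
FlashOmni's actual symbol layout is not described here, so the snippet below is a purely hypothetical illustration of the idea: one byte per block packs a skip flag, a cache-reuse flag, and a small granularity code that a fused kernel could branch on.

```python
# Hypothetical illustration only: FlashOmni's real symbol layout is not documented here.
SKIP  = 0b0000_0001   # skip this block's computation entirely
CACHE = 0b0000_0010   # reuse this block's cached activations
GRAN  = 0b0001_1100   # 3-bit granularity code (e.g., token / block / layer level)

def encode_symbol(skip: bool, cache: bool, gran: int) -> int:
    """Pack a per-block skip/cache policy into one byte (assumed layout)."""
    assert 0 <= gran < 8
    return (SKIP if skip else 0) | (CACHE if cache else 0) | (gran << 2)

def decode_symbol(sym: int):
    return bool(sym & SKIP), bool(sym & CACHE), (sym & GRAN) >> 2

sym = encode_symbol(skip=False, cache=True, gran=1)
print(decode_symbol(sym))   # (False, True, 1)
```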

4. Distillation Pipelines and Training Protocols

Block-sparse DiTs employing MoE or block-skipping necessitate specialized training or conversion pipelines to retain generative fidelity:

  • Multi-Step Distillation (Dense2MoE) (Zheng et al., 10 Oct 2025):

    1. Taylor-Metric Expert Initialization: Compute first-order Taylor importance scores for the FFN weights and optimally partition them into shared and normal experts.
    2. Shared-Only Distillation: Pre-train a model using only the shared expert for each FFN, matching teacher outputs and features.
    3. KD with Load Balancing: The full MoE network is assembled and trained using knowledge distillation, additional feature-alignment terms, and an explicit load-balancing loss to promote expert diversity (see the sketch after this list).
    4. Group Feature Loss: For block groups (MoB), the student's group outputs are matched to the teacher's outputs at the endpoints of each group.
  • Diffusion-Prior Loss (Switch-DiT) (Park et al., 2024): A hybrid objective combines standard DDPM noise loss with a Jensen–Shannon divergence that aligns the learned gating patterns to a known per-timestep prior, promoting both task-specific and global semantic pathways.
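
As a sketch of the load-balancing term referenced in step 3, the following implements the standard Switch-Transformer-style auxiliary loss over gate probabilities and realized expert assignments; the exact loss used by Dense2MoE may differ.

```python
import torch

def load_balancing_loss(router_probs, expert_index, n_experts):
    """Auxiliary balance loss n * sum_e f_e * p_e (illustrative).

    router_probs: (tokens, n_experts) softmax gate outputs.
    expert_index: (tokens,) index of the expert each token was routed to.
    """
    # f_e: fraction of tokens dispatched to each expert.
    f = torch.bincount(expert_index, minlength=n_experts).float() / expert_index.numel()
    # p_e: mean routing probability assigned to each expert.
    p = router_probs.mean(dim=0)
    return n_experts * (f * p).sum()
```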

Notably, in training-free approaches such as Sortblock (Chen et al., 1 Aug 2025), block-skipping policies are derived entirely at inference, requiring no architectural modification or retraining.

5. Empirical Results and Comparative Performance

Block-sparse DiTs consistently preserve generative capability relative to dense baselines, outperforming unstructured pruning and other naive parameter reduction schemes:

  • Dense2MoE (Zheng et al., 10 Oct 2025): Up to 60% reduction in parameter activation for text-to-image generation (Flux.1-MoE-L to XS: 5.15–2.64B activated params, >35% FLOPs saved, latency reductions up to 16%). Under equivalent parameter budgets, MoE/MoB models match or surpass baselines on CLIP, IR, and GenEval metrics, while MLP/pruning baselines incur pronounced quality losses.
  • Switch-DiT (Park et al., 2024): Block-sparse SMoE achieves FID reductions (ImageNet: DiT-B 27.96 → Switch-DiT-B 16.21; DiT-XL 9.40 → 8.76), and accelerates convergence by 2–3$\times$ under the same data and optimizer settings.
  • LLSA (Zhou et al., 18 Dec 2025): Attains a $28.27\times$ speedup in attention and $6.09\times$ faster DiT training, and maintains sample quality as measured by FID, outperforming single-level block-sparse or Top-K compressed attention baselines.
  • Sortblock (Chen et al., 1 Aug 2025): Achieves a 2.0–2.4$\times$ reduction in generation time across various tasks and models, with negligible FID or PSNR/SSIM loss.
  • FlashOmni (Qiao et al., 29 Sep 2025): Delivers $>1.5\times$ acceleration in large-scale multi-modal DiTs through efficient block-level skipping and caching via its unified kernel abstraction.

6. Trade-offs, Limitations, and Future Directions

Block-sparse DiTs offer a clear trade-off curve: increasing sparsity proportionally reduces runtime and activation cost, but can precipitate marginal metric degradation. MoE and block-skipping approaches robustly outperform magnitude pruning at equivalent sparsity levels. Crucially, distillation pipelines and loss balancing are necessary for high-sparsity fidelity.

Current limitations include the static nature of most block routers (per-sample, per-timestep adaptivity is largely unexplored), the need for joint optimization of block and expert routers, and hardware implementation efficiency at ultra-large scales (Zheng et al., 10 Oct 2025). Future avenues of research include dynamic per-sample routing, adaptive or learned pooling in hierarchical attention, exploitation of cross-modal block sparsity, and further integration of block-skipping kernels with quantization or kernel-based approximations.

A plausible implication is that block-sparse DiTs will play a major role in deploying diffusion models in real-time, high-resolution, or resource-constrained settings, as these structured sparsification methods scale more gracefully than dense or unstructured counterparts.
