Block-Sparse Diffusion Transformer (DiT)
- Block-sparse DiTs exploit block-level structured sparsity, via mixture-of-experts routing, block-sparse attention, and block skipping, to aggressively reduce computation and memory while maintaining high-quality generation.
- They rely on structured techniques such as hierarchical attention and multi-step distillation pipelines to scale text-to-image and video diffusion models efficiently under hardware constraints.
- Empirical results demonstrate up to 60% reduction in activated parameters and significant speedups, validating the trade-off between computational savings and minimal quality loss.
Block-sparse Diffusion Transformers (DiTs) constitute a class of generative diffusion models that exploit block-level structured sparsity—principally block-sparse mixture-of-experts (MoE), block-sparse attention, and block reuse/skipping—to substantially reduce computational and memory requirements while preserving generative quality. These methods are central in scaling text-to-image and video diffusion models to high resolutions or large backbone sizes under practical hardware constraints. Key variants include Dense2MoE’s blockwise MoE and block-skipping (Zheng et al., 10 Oct 2025), Switch Diffusion Transformers’ MoE-based denoising task routing (Park et al., 2024), efficient block-sparse and hierarchical attention for long-context DiTs (Zhou et al., 18 Dec 2025, Yang et al., 2024), unified sparse kernel engines (Qiao et al., 29 Sep 2025), and training-free block-skipping via feature similarity (Chen et al., 1 Aug 2025).
1. Block-Sparse Mixture-of-Experts in DiT
The Mixture-of-Experts paradigm in DiT replaces each Transformer feed-forward network (FFN) with a set of expert MLPs, of which only a subset is activated per input or denoising step. Dense2MoE (Zheng et al., 10 Oct 2025) systematically sparsifies DiT by introducing two levels of block sparsity:
- FFN-level MoE: Each FFN is decomposed into one shared expert (with a larger expansion ratio) and a pool of normal experts (with a smaller expansion ratio each), together with a lightweight gating network $g$. At inference, only the top-$k$ scoring normal experts (plus the always-active shared expert) are evaluated:

$$\mathrm{FFN}(x) = E_{\mathrm{shared}}(x) + \sum_{i \in \mathrm{TopK}(g(x),\,k)} g_i(x)\, E_i(x).$$

This reduces the number of activated FFN parameters in proportion to the selected fraction of expert capacity; the configuration reported in the paper yields a 62.5% reduction in activated FFN parameters (a hypothetical configuration reproducing this figure is worked out after this list).
- Block-level MoE (Mixture of Blocks, MoB): Groups of consecutive Transformer blocks are equipped with a block router. At each denoising step and for each sample, only a subset of the blocks in each group is executed; blocks are selected based on a score that depends on the current block activations and the global condition. The resulting block-level sparsity compounds with FFN-level sparsity to reach an overall 60% reduction in parameter activation.
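To make the parameter accounting concrete, the activated fraction of an FFN-level MoE layer can be bookkept generically as

$$\text{activated fraction} = \frac{r_s + k\, r_e}{r_s + N\, r_e},$$

where $r_s$ is the shared expert's expansion ratio, $r_e$ the expansion ratio of each of the $N$ normal experts, and $k$ the number of experts activated per token (these symbols are generic, not Dense2MoE's notation). As a purely hypothetical illustration, $r_s = 0$ (no shared expert), $N = 8$ equal experts, and top-$k = 3$ give an activated fraction of $3/8 = 37.5\%$, i.e., a 62.5% reduction, matching the figure reported above; the paper's actual configuration is not reproduced here.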
Switch Diffusion Transformer (Park et al., 2024) generalizes FFN-level block sparsity to synergize denoising tasks (specific intervals of the diffusion timestep schedule) via a sparse MoE layer in each block. Each block maintains a pool of experts; at each diffusion timestep, a gating MLP (whose input is the diffusion timestep embedding) selects a sparse subset of them, enforcing block-sparse routing. The model introduces a diffusion-prior loss to encourage adjacent denoising steps to share expert usage, thus capturing inter-task correlation while preserving parameter isolation.
Both approaches demonstrate that block-sparse MoE, with task-conditional or sample-conditional expert routing, enables aggressive parameter and computation savings without commensurate quality loss.
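As a concrete reference, the following is a minimal PyTorch sketch of the shared-expert, top-$k$ routed FFN pattern described above; the class name, dimensions, and the dense routing loop are illustrative assumptions rather than the Dense2MoE or Switch-DiT implementations. The routing input can be the token features themselves (sample-conditional routing, as in Dense2MoE) or a diffusion-timestep embedding (task-conditional routing, as in Switch-DiT).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockSparseMoEFFN(nn.Module):
    """Shared-expert + top-k routed FFN (generic sketch, not the papers' exact code)."""

    def __init__(self, dim, shared_hidden, expert_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Always-active shared expert with a larger hidden width.
        self.shared = nn.Sequential(
            nn.Linear(dim, shared_hidden), nn.GELU(), nn.Linear(shared_hidden, dim)
        )
        # Pool of smaller "normal" experts; only the top-k are evaluated per token.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, expert_hidden), nn.GELU(), nn.Linear(expert_hidden, dim)
            )
            for _ in range(num_experts)
        )
        # Lightweight gate; its input may be token features (sample-conditional
        # routing) or a diffusion-timestep embedding (task-conditional routing).
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x, routing_input=None):
        # x: (tokens, dim); routing defaults to conditioning on the tokens themselves.
        routing_input = x if routing_input is None else routing_input
        scores = F.softmax(self.gate(routing_input), dim=-1)     # (tokens, E)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)         # (tokens, k)

        shared_out = self.shared(x)
        expert_out = torch.zeros_like(x)
        # Reference loop for clarity; efficient kernels gather tokens per expert.
        for slot in range(self.top_k):
            idx, w = top_idx[:, slot], top_w[:, slot : slot + 1]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    expert_out[mask] += w[mask] * expert(x[mask])
        return shared_out + expert_out
```

An efficient deployment would replace the per-expert Python loop with grouped GEMMs or a fused MoE kernel; the loop here only makes the routing semantics explicit.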
2. Block-Sparse and Hierarchical Attention
Memory and compute demands in attention layers dominate DiT models at high sequence lengths. Recent approaches replace quadratic full attention by block-sparse schemes:
- Single-level block-sparse Top-K attention partitions the sequence into blocks and, for each query block, selects a small subset of key/value blocks to attend to using a block-level similarity metric. This reduces attention cost from quadratic in the total number of tokens to a cost proportional to the number of selected blocks per query block.
- Log-linear Sparse Attention (LLSA) (Zhou et al., 18 Dec 2025) overcomes the scalability limits of the single-level approach by introducing a hierarchical coarse-to-fine Top-K block selection. At each compression level (spanning from full resolution to the coarsest block granularity), Top-K selection is performed recursively, propagating candidate sets down the hierarchy. The selection and final sparse attention together cost $O(T \log T)$ in the sequence length $T$, with global context restored via Hierarchical KV Enrichment (inserting top-K keys/values from each level). LLSA accelerates both inference and training on pixel-token DiTs without FID degradation.
- Unidirectional block attention (UniBA) in Inf-DiT (Yang et al., 2024) structures attention as a raster-order directed acyclic graph at the block level. Each block attends only to itself and to neighboring blocks that precede it in raster order. This reduces time and memory complexity from quadratic to linear in the number of blocks, enabling upsampling to much higher resolutions with a substantial memory reduction over UNet baselines.
A unified theme in these approaches is the use of block partitioning combined with structured block masking or expert routing to realize linear- or log-linear-complexity DiT attention layers.
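A minimal sketch of the single-level block-wise Top-K scheme helps make the mechanism concrete. It assumes mean-pooled block representations as the similarity metric and applies the mask through a dense softmax for readability; function and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F


def blockwise_topk_attention(q, k, v, block_size=64, top_k=4):
    """Single-level block-sparse Top-K attention (illustrative, dense-masked).

    q, k, v: (T, d) tensors with T divisible by block_size.  Each query block
    attends only to the top_k key blocks ranked by pooled-representation similarity.
    """
    T, d = q.shape
    nb = T // block_size
    # Coarse block representations via mean pooling (one assumed similarity metric).
    qb = q.view(nb, block_size, d).mean(dim=1)                    # (nb, d)
    kb = k.view(nb, block_size, d).mean(dim=1)                    # (nb, d)
    block_scores = qb @ kb.t()                                    # (nb, nb)
    top_idx = block_scores.topk(min(top_k, nb), dim=-1).indices   # (nb, top_k)

    # Token-level mask admitting only the selected key blocks per query block.
    block_mask = torch.zeros(nb, nb, dtype=torch.bool, device=q.device)
    block_mask[torch.arange(nb, device=q.device).unsqueeze(1), top_idx] = True
    token_mask = block_mask.repeat_interleave(block_size, dim=0)
    token_mask = token_mask.repeat_interleave(block_size, dim=1)  # (T, T)

    attn = (q @ k.t()) / d ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v                            # (T, d)
```

Practical systems fuse the masking into block-sparse kernels so that only the selected blocks are materialized, which is what yields the reduction from quadratic to block-linear cost; LLSA additionally applies the same Top-K selection recursively over progressively pooled levels.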
3. Block-Sparse Inference: Skipping and Reuse
Block-sparse inference strategies reduce runtime by skipping computation for blocks whose activations evolve slowly across steps:
- Feature-Based Block Skipping: Sortblock (Chen et al., 1 Aug 2025) implements a training-free, similarity-aware policy. At each denoising step $t$ and for each Transformer block $l$, the cosine similarity between the block's feature changes at adjacent steps is computed:

$$s_l^{(t)} = \cos\!\big(\Delta h_l^{(t)},\, \Delta h_l^{(t-1)}\big), \qquad \Delta h_l^{(t)} = h_l^{(t)} - h_l^{(t-1)},$$

where $h_l^{(t)}$ denotes the output features of block $l$ at step $t$. Blocks whose similarity exceeds a threshold (i.e., whose changes are nearly parallel across steps) are considered inactive and skipped via feature reuse or lightweight linear prediction. The dynamic recomputation ratio is estimated from polynomial fits to the global feature-evolution metric. As a result, Sortblock achieves nearly 2× speedup (latency reduced from 23 s to 12 s) on Flux.1-dev and comparable accelerations on Wan2.1 and HunyuanVideo, with negligible losses in CLIP, IR, PSNR, LPIPS, and VBench metrics.
- Unified Inference Kernels: FlashOmni (Qiao et al., 29 Sep 2025) introduces a generic kernel capable of supporting arbitrary block-skipping, feature-caching, and multi-granularity sparsity patterns by encoding skip/cache policies as compact 8-bit-per-block "sparse symbols." This enables dynamic block-pair skipping and cache reuse in attention without retraining, realizing near-linear ($1/(1-s)$) speedup at up to 90% sparsity and up to 1.9× end-to-end acceleration on 33K-token video DiTs, without visible quality degradation.
This blockwise adaptivity complements block-sparse design at the architectural level, allowing for inference-time control over the speed-quality tradeoff.
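The skip-or-recompute decision can be illustrated with a short sketch, assuming each block caches its previous output and its two most recent feature deltas; the fixed similarity threshold and the simple delta extrapolation used here stand in for Sortblock's dynamically estimated recomputation ratio and lightweight linear prediction, and all names are illustrative.

```python
import torch
import torch.nn.functional as F


def step_with_block_skipping(blocks, x, cache, threshold=0.95):
    """One denoising step with similarity-based block skipping (illustrative sketch).

    blocks:    list of callable Transformer blocks.
    x:         hidden states entering the block stack at this step.
    cache:     dict keyed by block index, holding the block's previous output
               ('out') and its two most recent feature deltas ('d1', 'd0').
    threshold: cosine similarity above which a block is skipped this step
               (a fixed assumed value; Sortblock instead derives a dynamic
               recomputation ratio from polynomial fits).
    """
    h = x
    for l, block in enumerate(blocks):
        s = cache.setdefault(l, {})
        d1, d0, prev_out = s.get("d1"), s.get("d0"), s.get("out")

        # Skip if the block's recent changes were nearly parallel across steps.
        if d1 is not None and d0 is not None and \
           F.cosine_similarity(d1.flatten(), d0.flatten(), dim=0) > threshold:
            out = prev_out + d1                      # reuse: linear extrapolation
        else:
            out = block(h)                           # recompute the block
            if prev_out is not None:                 # update the delta history
                s["d0"], s["d1"] = s.get("d1"), out - prev_out
        s["out"] = out
        h = out
    return h
```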
4. Distillation Pipelines and Training Protocols
Block-sparse DiTs employing MoE or block-skipping necessitate specialized training or conversion pipelines to retain generative fidelity:
- Multi-Step Distillation (Dense2MoE) (Zheng et al., 10 Oct 2025):
  - Taylor-Metric Expert Initialization: Compute a first-order Taylor importance score for each FFN weight and partition the weights into shared and normal experts accordingly.
  - Shared-Only Distillation: Pre-train a model that uses only the shared expert in each FFN, matching the teacher's outputs and intermediate features.
  - KD with Load Balancing: Assemble the full MoE network and train it with knowledge distillation, additional feature-alignment terms, and an explicit load-balancing loss that promotes expert diversity (generic sketches of these auxiliary routing losses follow at the end of this subsection).
  - Group Feature Loss: For each block group (MoB), match the group's outputs to the teacher's outputs at the group endpoints.
- Diffusion-Prior Loss (Switch-DiT) (Park et al., 2024): A hybrid objective combines the standard DDPM noise-prediction loss with a Jensen–Shannon divergence term that aligns the learned gating patterns with a known per-timestep prior, promoting both task-specific and global semantic pathways.
Notably, in training-free approaches such as Sortblock (Chen et al., 1 Aug 2025), block-skipping policies are derived entirely at inference, requiring no architectural modification or retraining.
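For reference, the following sketch shows generic forms of the two auxiliary routing losses referenced in this subsection: a Switch-Transformer-style load-balancing term and a Jensen–Shannon divergence against a per-timestep gating prior. Both are standard formulations assumed for illustration; the exact loss weights, top-k handling, and Switch-DiT's construction of the diffusion prior are not reproduced here.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_probs, expert_indices, num_experts):
    """Switch-Transformer-style balancing term (top-1 routing shown for simplicity).

    router_probs:   (tokens, num_experts) softmax gate probabilities.
    expert_indices: (tokens,) long tensor with the expert chosen for each token.
    """
    # Fraction of tokens dispatched to each expert.
    dispatch_frac = F.one_hot(expert_indices, num_experts).float().mean(dim=0)
    # Mean gate probability assigned to each expert.
    prob_frac = router_probs.mean(dim=0)
    # Minimized when both are uniform, i.e. experts are used evenly.
    return num_experts * torch.sum(dispatch_frac * prob_frac)


def diffusion_prior_loss(gate_probs, prior_probs, eps=1e-8):
    """Jensen-Shannon divergence between learned per-timestep gate distributions
    and a fixed prior (generic JSD; the paper's prior construction is not shown).

    gate_probs, prior_probs: (timesteps, num_experts) categorical distributions.
    """
    m = 0.5 * (gate_probs + prior_probs)

    def kl(p, q):
        return torch.sum(p * (torch.log(p + eps) - torch.log(q + eps)), dim=-1)

    return (0.5 * kl(gate_probs, m) + 0.5 * kl(prior_probs, m)).mean()
```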
5. Empirical Results and Comparative Performance
Block-sparse DiTs consistently preserve generative capability relative to dense baselines, outperforming unstructured pruning and other naive parameter reduction schemes:
- Dense2MoE (Zheng et al., 10 Oct 2025): Up to 60% reduction in parameter activation for text-to-image generation (Flux.1-MoE-L to XS: 5.15–2.64B activated params, 35% FLOPs saved, latency reductions up to 16%). Under equivalent parameter budgets, MoE/MoB models match or surpass the dense baseline on CLIP, IR, and GenEval metrics, while MLP/pruning baselines incur pronounced quality losses.
- Switch-DiT (Park et al., 2024): Block-sparse SMoE achieves FID reductions (ImageNet: DiT-B 27.96→Switch-DiT-B 16.21; DiT-XL 9.40→8.76) and accelerates convergence by 2–3× under the same data and optimizer settings.
- LLSA (Zhou et al., 18 Dec 2025): Delivers substantial attention speedups and faster end-to-end DiT training while maintaining sample quality as measured by FID, outperforming single-level block-sparse and Top-K compressed-attention baselines.
- Sortblock (Chen et al., 1 Aug 2025): Achieves a 2.0–2.4× reduction in generation time across various tasks and models, with negligible FID or PSNR/SSIM loss.
- FlashOmni (Qiao et al., 29 Sep 2025): Delivers up to 1.9× end-to-end acceleration in large-scale multi-modal DiTs through efficient block-level skipping and caching via its unified kernel abstraction.
6. Trade-offs, Limitations, and Future Directions
Block-sparse DiTs offer a clear trade-off curve: increasing sparsity proportionally reduces runtime and activation cost, but can introduce marginal metric degradation at high sparsity. MoE and block-skipping approaches robustly outperform magnitude pruning at equivalent sparsity levels. Crucially, distillation pipelines and loss balancing are necessary to maintain fidelity at high sparsity.
Current limitations include the static nature of most block routers (per-sample, per-timestep adaptivity is largely unexplored), the need for joint optimization of block and expert routers, and hardware implementation efficiency at ultra-large scales (Zheng et al., 10 Oct 2025). Future avenues of research include dynamic per-sample routing, adaptive or learned pooling in hierarchical attention, exploitation of cross-modal block sparsity, and further integration of block-skipping kernels with quantization or kernel-based approximations.
A plausible implication is that block-sparse DiTs will play a major role in deploying diffusion models in real-time, high-resolution, or resource-constrained settings, as these structured sparsification methods scale more gracefully than dense or unstructured counterparts.