Sparse-Expert Diffusion Transformer (DiT-MoE)

Updated 27 February 2026
  • The paper introduces DiT-MoE, a diffusion Transformer that uses sparse Mixture-of-Experts to scale capacity and lower inference costs.
  • It employs top-k routing and auxiliary balancing losses to achieve expert specialization and competitive FID metrics compared to dense architectures.
  • The architecture supports dual-stream processing for multimodal fusion, efficiently handling tasks in images, text, and long sequential domains.

A Sparse-Expert Diffusion Transformer (DiT-MoE) is a Transformer-based generative diffusion model that employs sparse Mixture-of-Experts (MoE) architectures within its feed-forward or block structure to scale model capacity, improve parameter efficiency, and reduce inference cost. DiT-MoE models have become a dominant methodology for scaling image and multimodal generative models to tens of billions of parameters while maintaining competitive or superior sample quality, as evidenced by state-of-the-art FID and GenEval metrics at substantially reduced computational cost compared to densely activated DiT architectures (Cai et al., 28 May 2025, Fei et al., 2024, Sun et al., 2024). Their primary innovation is the explicit application of top-$k$ routing or expert-choice strategies in diffusion Transformers, coupled with refinements in balancing losses, dual-stream fusion, and knowledge distillation pipelines.

1. Architectural Foundations and Dual-Stream Design

DiT-MoE extends the standard Diffusion Transformer pipeline by sparsifying feed-forward sub-layers via a dynamic MoE structure. In archetypal models such as HiDream-I1, the architecture begins with a dual-stream (decoupled) processing phase: a text stream (hybrid-encoded using CLIP, T5, and LLM encoders) and an image stream (patchified, VAE-encoded latents) are each processed separately by $L_{\mathrm{dual}}$ layers of sparse DiT blocks with MoE FFNs. After independent processing, the token sequences are merged and fed to $L_{\mathrm{single}}$ layers of single-stream sparse DiT blocks to enable cross-modal interaction and generation.
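
The following PyTorch sketch illustrates this dual-stream-then-merge layout. It is schematic rather than HiDream-I1's actual implementation: dense `nn.TransformerEncoderLayer` blocks stand in for sparse DiT blocks, and timestep conditioning and MoE FFNs are omitted.

```python
import torch
import torch.nn as nn

class DualStreamDiT(nn.Module):
    """Schematic dual-stream -> single-stream token flow (illustrative only)."""
    def __init__(self, d: int, n_dual: int, n_single: int):
        super().__init__()
        def block():
            # Placeholder for a sparse DiT block with an MoE FFN.
            return nn.TransformerEncoderLayer(d, nhead=8, dim_feedforward=4 * d,
                                              batch_first=True)
        self.text_stream = nn.ModuleList(block() for _ in range(n_dual))
        self.image_stream = nn.ModuleList(block() for _ in range(n_dual))
        self.joint_stream = nn.ModuleList(block() for _ in range(n_single))

    def forward(self, text_tok: torch.Tensor, img_tok: torch.Tensor) -> torch.Tensor:
        # Phase 1: L_dual layers process each modality independently.
        for t_blk, i_blk in zip(self.text_stream, self.image_stream):
            text_tok, img_tok = t_blk(text_tok), i_blk(img_tok)
        # Phase 2: merge the token sequences, then L_single joint layers
        # enable cross-modal interaction.
        tokens = torch.cat([text_tok, img_tok], dim=1)
        for blk in self.joint_stream:
            tokens = blk(tokens)
        return tokens

# Example: 16 text tokens and 64 image-latent tokens at width 512.
model = DualStreamDiT(d=512, n_dual=2, n_single=2)
out = model(torch.randn(1, 16, 512), torch.randn(1, 64, 512))
```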

The core DiT-MoE block replaces the conventional dense FFN $x \mapsto W_2 \cdot \mathrm{GELU}(W_1 x)$ with a MoE layer: for a token $x \in \mathbb{R}^{d}$, the router (gating network) produces $g(x) = \mathrm{Softmax}(W_g x) \in \mathbb{R}^{E}$, and only the top-$k$ entries are used, yielding $\operatorname{MoE}(x)=\sum_{e \in \mathrm{TopK}(x)} g_e(x) \cdot \operatorname{FFN}_e(x)$ with $k \ll E$ (Cai et al., 28 May 2025, Fei et al., 2024).
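
A minimal PyTorch sketch of this top-$k$ routed MoE FFN, following the equations above; the per-expert Python loop is for clarity (production implementations batch tokens per expert with fused kernels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Top-k routed MoE FFN: MoE(x) = sum over top-k experts of g_e(x) * FFN_e(x)."""
    def __init__(self, d: int, hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts, bias=False)  # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d). g(x) = Softmax(W_g x) over E experts.
        gates = F.softmax(self.router(x), dim=-1)          # (n, E)
        topk_g, topk_idx = gates.topk(self.k, dim=-1)      # (n, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue  # no tokens routed to this expert
            out[rows] += topk_g[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# Example: 64 tokens of width 512, 8 experts, top-2 routing.
layer = MoEFFN(d=512, hidden=2048, n_experts=8, k=2)
y = layer(torch.randn(64, 512))
```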

This design is found across major models including HiDream-I1 (32 experts, dual+single stream), EC-DiT (adaptive expert-choice, capacity-factor constrained), Switch-DiT (SMoE via timestep routing), Race-DiT (global Expert Race router), Dense2MoE (block-level MoE and block-level routing), and MoE-DiffuSeq (text diffusion with sparse attention and MoE FFNs).

2. Expert Routing, Balancing, and Specialization

Expert routing in DiT-MoE is implemented via various strategies, all with the goal of mapping each token (or token group) to a small subset of highly specialized experts per layer:

  • Top-$k$ routing: The standard approach is to activate the $k$ experts with the largest routing probabilities $g(x)$ for each token, reducing FFN compute complexity from $O(E d^2)$ to $O(k d^2)$. Typical choices are $k=1$ or $k=2$.
  • Capacity constraints: Each expert's throughput is limited to a maximum capacity $C$, with overflow tokens redirected to a shared expert or dropped. This ensures deterministic GPU memory use and batch efficiency (Cai et al., 28 May 2025).
  • Auxiliary balancing losses: To prevent mode collapse or expert starvation, auxiliary losses are employed (see the sketch after this list). For example, HiDream-I1 uses:
    • A load-balancing loss: $L_\mathrm{load} = \|\mathrm{Importance}\|_2^2$ with $\mathrm{Importance}_e = \frac{1}{B} \sum_{i=1}^B g_e(x_i)$.
    • A capacity loss penalizing experts whose load exceeds $C$.
  • Shared experts: Some models (e.g., DiT-MoE (Fei et al., 2024, Cai et al., 28 May 2025)) include $n_s$ shared experts that are always active for every token, capturing dataset-common transformations and further reducing redundancy.
  • Task- or modality-specific routing: HiDream-I1 uses independent routers per stream in dual-stream mode, while Switch-DiT employs timestep-based gating, and ProMoE (Wei et al., 28 Oct 2025) introduces a two-step router (conditional and prototypical). Batch-level token pooling and dynamic capacity prediction are used in DiffMoE (Shi et al., 18 Mar 2025).
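
As a concrete reference for the balancing losses above, here is a short PyTorch sketch. The load-balancing term follows the HiDream-I1 formula given in the list; the exact form of the capacity loss is not reproduced here, so the version below is one plausible differentiable surrogate (an assumption, not the paper's definition):

```python
import torch

def load_balancing_loss(gates: torch.Tensor) -> torch.Tensor:
    # gates: (B, E) softmax routing probabilities g(x_i) for a batch.
    importance = gates.mean(dim=0)     # Importance_e = (1/B) sum_i g_e(x_i)
    return importance.pow(2).sum()     # L_load = ||Importance||_2^2

def capacity_loss(gates: torch.Tensor, capacity: float) -> torch.Tensor:
    # Assumed soft surrogate: penalize expected per-expert load above C.
    load = gates.sum(dim=0)            # expected number of tokens per expert
    return torch.relu(load - capacity).pow(2).mean()
```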

Empirical analyses reveal that expert specialization maps onto distinct spatial and temporal axes: early MoE layers and early diffusion steps focus on coarse/low-frequency structure with concentrated expert usage; later layers/steps require broader expert activation for high-frequency details (Fei et al., 2024).

3. Training Objectives, Losses, and Initialization

The backbone DiT-MoE network is trained with standard diffusion objectives (noise-prediction, velocity/flow matching, or Rectified Flow), i.e.,

$$\mathcal{L}_\mathrm{FM} = \mathbb{E}_{x_0, t, \epsilon} \left\| v_\theta(x_t, t; y) - (x_t - x_0) \right\|^2$$

with additional losses layered for MoE stability:

| Loss Term | Purpose | Found in Models |
| --- | --- | --- |
| Load-balancing loss | Uniform expert utilization | HiDream-I1, DiT-MoE |
| Capacity loss | Prevent expert overload | HiDream-I1 |
| Routing contrastive loss | Intra-/inter-expert diversity | ProMoE |
| Diffusion-prior loss | Expert-timestep alignment | Switch-DiT |
| Distillation loss | Dense-to-MoE migration | Dense2MoE |

Initialization is sometimes enhanced via Taylor-metric importance sampling to seed shared and normal experts, and knowledge distillation is used to effectively transform dense DiT models into performant MoE variants (e.g., in Dense2MoE) (Zheng et al., 10 Oct 2025).
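
Putting these pieces together, a hedged sketch of one training step: the flow-matching target follows the equation above (velocity-target conventions differ across papers), while the model interface, loss weights, and capacity value are illustrative assumptions:

```python
import torch

def training_step(model, x0, y, lambda_load=0.01, lambda_cap=0.01, capacity=256.0):
    # Sample a timestep and noise, form the interpolated latent x_t.
    t = torch.rand(x0.size(0), device=x0.device).view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * eps
    # Assumed interface: the model returns the velocity prediction and the
    # routing probabilities of every MoE layer.
    v_pred, gate_list = model(xt, t.flatten(), y)
    # Flow-matching term, with the target written as in the equation above.
    loss = (v_pred - (xt - x0)).pow(2).mean()
    # Auxiliary MoE losses, layered per MoE layer for stability.
    for gates in gate_list:                 # gates: (n_tokens, E)
        importance = gates.mean(dim=0)
        loss = loss + lambda_load * importance.pow(2).sum()
        load = gates.sum(dim=0)             # expected tokens per expert
        loss = loss + lambda_cap * torch.relu(load - capacity).pow(2).mean()
    return loss
```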

4. Computational Complexity, Scaling, and Efficiency

Sparse-expert routing yields substantial reductions in per-token computational cost and wall-clock inference time, while unlocking the ability to scale total model parameter count far beyond what is feasible for dense Transformers:

  • Per-block cost: Self-attention remains $O(B n^2 d)$; the MoE FFN costs $O(B n k d h)$ per block, with $k \ll E$ (see the arithmetic sketch after this list).
  • Model scaling: HiDream-I1 reaches 17B params (Full), and DiT-MoE demonstrates competitive or better wall-clock inference versus dense models of far smaller total size, as only a fractional slice (~20%) of the parameters is active per example (Fei et al., 2024, Cai et al., 28 May 2025).
  • Latency: HiDream-I1 measures 3.2 s / 50 steps ('Full'), 0.9 s / 14 steps ('Fast') on A100 GPUs (Cai et al., 28 May 2025).
  • Activated parameter and FLOPs reduction: Dense2MoE demonstrates a 57–62% reduction in activated parameters and 16–34% inference FLOP reduction versus vanilla dense DiTs, with competitive sample quality (Zheng et al., 10 Oct 2025).
  • Scalability: EC-DiT-M scales to 97B parameters with a +23% inference overhead versus an 8B dense baseline (Sun et al., 2024).
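
The per-block costs in the first bullet reduce to simple arithmetic; the helper below mirrors the $O(B n^2 d)$ attention and $O(B n k d h)$ MoE-FFN terms with constant factors omitted, and the example numbers are purely illustrative:

```python
def per_block_cost(B: int, n: int, d: int, h: int, k: int) -> dict:
    """Rough per-block operation counts (constant factors dropped)."""
    attention = B * n * n * d      # self-attention: O(B n^2 d)
    moe_ffn = B * n * k * d * h    # k active experts of hidden width h
    return {"attention": attention, "moe_ffn": moe_ffn}

# Example: batch 1, 4096 latent tokens, d=1024, h=4096, top-2 routing.
print(per_block_cost(B=1, n=4096, d=1024, h=4096, k=2))
```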

5. Empirical Performance and Ablation Studies

DiT-MoE architectures consistently outperform or match their dense counterparts on common generative benchmarks:

| Model Variant | FID (↓) | Inference Latency | Activated Params | Reference |
| --- | --- | --- | --- | --- |
| Dense DiT (no MoE) | 12.5 | 150 ms | 100% | HiDream-I1 (512²) |
| DiT + MoE (32E, 2A) | 9.8 | 110 ms (–27%) | 60% | HiDream-I1 |
| Dense2MoE (FLUX.1-MoE-L) | 0.5702† | 17.8 s (1024²) | 5.15B (–57%) | FLUX.1 vs FLUX.1-MoE-L |
| DiT-MoE-XL/2-8E2A (4.1B) | 1.72 | -- | 1.5B | DiT-MoE (Fei et al., 2024) |
| ProMoE-XL (14E) | 2.59 | -- | 675M act. | ProMoE (Wei et al., 28 Oct 2025) |

† Note: GenEval, not FID; refer to original tables for full metric context.

  • Load and capacity balancing losses improve FID by ~6%. Increasing top-$k$ beyond 2 shows marginal benefit but increases latency (Cai et al., 28 May 2025).
  • Ablations on expert count, MoE group sizing, and expert-sharing consistently show that balanced, diversified expert utilization and small group sizes yield optimal performance (Fei et al., 2024, Zheng et al., 10 Oct 2025).
  • ProMoE achieves 20–30% FID reduction and substantial IS gains by explicit semantic routing and contrastive objectives (e.g., FID=2.59 vs. 3.23 for Dense-DiT-XL on ImageNet 256²) (Wei et al., 28 Oct 2025).

6. Modalities, Multimodality, and Extensions

Sparse-Expert DiT-MoE can be readily adapted beyond standard image diffusion to multimodal, text, and long-form domains:

  • Multimodal fusion: HiDream-I1 utilizes a dual-stream MoE structure for independent token processing, then joint single-stream layers to enable efficient fusion for both text-to-image generation and instruction-based editing (Cai et al., 28 May 2025).
  • Long sequential domains: MoE-DiffuSeq extends sparse-expert DiT to text diffusion, combining sparse attention and MoE FFNs to handle long documents with reduced memory and computation, reporting significant gains in perplexity, coherence, and ROUGE (Christoforos et al., 23 Dec 2025).
  • Instruction and image editing: HiDream-E1 leverages MoE-based DiT blocks conditioned on both original and instruction-provided latent/image tokens for high-precision image editing (Cai et al., 28 May 2025).

7. Practical Lessons and Deployment Considerations

Implementation of DiT-MoE requires careful architectural and training choices:

  • MoE configuration: Small $k$ (typically 2), moderate $E$ (8–32), and inclusion of shared experts are robust choices. Top-$k$ routing strategies must be GPU/batch-aligned for efficiency.
  • Auxiliary objectives: Load/capacity balancing, or in some frameworks, projection-based or contrastive losses, are needed for robust training (Wei et al., 28 Oct 2025, Cai et al., 28 May 2025).
  • Distillation for migration: Knowledge distillation pipelines with staged parameter transfer, Taylor-metric initialization, and group-level feature matching provide effective means to adapt existing dense DiT models to MoE with negligible performance loss (Zheng et al., 10 Oct 2025).
  • Batch-level pooling: For maximum expert specialization and scaling, batch/global token pooling and global routing (as in DiffMoE) outperform local/token-only MoE strategies (Shi et al., 18 Mar 2025).
  • Compute budgeting: Activated-parameter and FLOP budgets can be tuned directly by targeting specific $k$, $E$, and block sparsity rates, enabling direct control of deployment-time efficiency (see the helper below).
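
For instance, a small helper (illustrative; standard two-matrix FFN shapes assumed) makes the activated-versus-total parameter trade-off explicit:

```python
def moe_ffn_params(d: int, h: int, E: int, k: int, n_shared: int = 0):
    """Total vs. activated parameters of one MoE FFN layer (sketch)."""
    per_expert = 2 * d * h                        # W1 (d x h) + W2 (h x d)
    router = d * E                                # W_g
    total = (E + n_shared) * per_expert + router
    activated = (k + n_shared) * per_expert + router
    return total, activated

# Example: d=1024, h=4096, E=32 experts, top-2 routing, 1 shared expert.
total, active = moe_ffn_params(1024, 4096, 32, 2, 1)
print(f"total={total/1e6:.1f}M, activated={active/1e6:.1f}M "
      f"({100 * active / total:.0f}% active)")
```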

Overall, the Sparse-Expert Diffusion Transformer paradigm has established itself as the dominant approach for scaling, specialization, and efficient computation within large-scale diffusion-based generative models across image, text, and multimodal domains (Cai et al., 28 May 2025, Fei et al., 2024, Sun et al., 2024, Zheng et al., 10 Oct 2025, Shi et al., 18 Mar 2025, Wei et al., 28 Oct 2025, Christoforos et al., 23 Dec 2025).
