MoE dLLM: Diffusion Language Model

Updated 24 April 2026
  • Mixture-of-Experts (MoE) dLLM is a neural language model that blends sparse expert routing from MoE layers with discrete diffusion denoising to enable parallel text generation.
  • It employs blockwise bidirectional attention and a structured three-phase training protocol (WSD) to enhance speed, throughput, and scalability for models up to 100B parameters.
  • TEAM strategies, such as delayed caching and speculative decoding, reduce redundant expert computation while preserving output quality with minimal trade-offs.

A Mixture-of-Experts (MoE) Discrete Diffusion LLM (dLLM) is a neural language modeling paradigm that synthesizes the computational sparsity of MoE architectures with the parallel generation capabilities of discrete diffusion-based LLMs. Built upon pre-trained autoregressive (AR) MoE backbones, these models leverage blockwise bidirectional attention and non-sequential denoising steps to enable efficient, scalable, and high-throughput text generation, with specialized frameworks addressing inherent inefficiencies in expert activation during parallel decoding (Wei et al., 9 Feb 2026, Bie et al., 10 Dec 2025).

1. Architectural Foundations and Integration of MoE with dLLMs

MoE dLLMs commence from a Transformer backbone augmented with Mixture-of-Experts (MoE) layers, where standard dense feed-forward layers are replaced with sets of $E$ parallel experts. Each expert is a two-layer MLP, and a gating network $g(x)\in\mathbb{R}^E$ computes routing logits, enabling tokenwise sparse expert selection—typically top-2 per token with softmax-weighted outputs (Bie et al., 10 Dec 2025). To facilitate scaling, all non-MoE parameters, including token and positional embeddings, self-attention, and normalization weights, are directly inherited from a pre-trained AR checkpoint. The experts themselves are initialized via direct cloning of AR FFN weights, while gating networks are freshly initialized.
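The tokenwise top-2 routing described above can be sketched compactly. The following is a minimal NumPy illustration, not the paper's implementation; dimensions, initialization scale, and the ReLU activation are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Sparse MoE feed-forward layer: each expert is a two-layer MLP, and a
    gating network g(x) in R^E selects the top-k experts per token."""
    def __init__(self, d_model, d_hidden, n_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.w_gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w1 = rng.standard_normal((n_experts, d_model, d_hidden)) * 0.02
        self.w2 = rng.standard_normal((n_experts, d_hidden, d_model)) * 0.02
        self.top_k = top_k

    def forward(self, x):
        # x: (n_tokens, d_model); routing logits g(x): (n_tokens, E)
        logits = x @ self.w_gate
        topk = np.argsort(logits, axis=-1)[:, -self.top_k:]  # per-token expert ids
        weights = softmax(np.take_along_axis(logits, topk, axis=-1))
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for slot in range(self.top_k):
                e = topk[t, slot]
                h = np.maximum(x[t] @ self.w1[e], 0.0)  # expert MLP, ReLU
                out[t] += weights[t, slot] * (h @ self.w2[e])
        return out, topk
```

The routing indices returned here are what TEAM later caches and restricts; production kernels would batch tokens by expert rather than loop.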

In the dLLM framework, sequence generation proceeds by initializing all positions with the mask token, partitioning the output into $B$ blocks of size $L$, and running iterative denoising. Each block is processed via bidirectional attention, and at each position $k$ in block $i$, the model predicts

$$\hat y_i^k = \arg\max_v\, p_\theta(y_i^k = v \mid P, Y_{\leq i}), \qquad c_k = p_\theta\big(y_i^k = \hat y_i^k \mid P, Y_{\leq i}\big),$$

where $c_k$ serves as the acceptance confidence. Tokens with $c_k > \tau$ (a confidence threshold) are unmasked, advancing towards the final output (Wei et al., 9 Feb 2026, Bie et al., 10 Dec 2025).
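This threshold-based parallel unmasking loop can be sketched as follows; `predict_probs`, the fallback rule, and the iteration cap are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

MASK = -1  # placeholder id for still-masked positions

def decode_block(predict_probs, length, tau=0.9, max_iters=50):
    """Iterative parallel denoising for one block: predict every position,
    then unmask those whose confidence c_k = p(y_k = yhat_k) exceeds tau."""
    y = np.full(length, MASK)
    for _ in range(max_iters):
        masked = np.where(y == MASK)[0]
        if masked.size == 0:
            break                                     # block fully decoded
        probs = predict_probs(y)                      # (length, vocab) posterior
        yhat = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        accept = masked[conf[masked] > tau]
        if accept.size == 0:                          # guarantee progress:
            accept = masked[[conf[masked].argmax()]]  # take the single best token
        y[accept] = yhat[accept]
    return y
```

The number of tokens passing the threshold per iteration is exactly the TPF quantity that the TEAM strategies in Section 4 try to maximize.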

2. Discrete Diffusion Language Modeling and Training Protocols

The discrete diffusion paradigm, as formalized in Block Diffusion LLMs (BDLM), operates by progressively corrupting (masking) and reconstructing tokens over a finite number of denoising steps. At each step, individual tokens are independently masked with probability $1-\alpha_t$ or left unchanged with probability $\alpha_t$; denoising is performed per token using the learnable reverse process. Block-wise diffusion further accelerates training and inference by splitting sequences into contiguous blocks and training the model to reconstruct only the currently noised block, conditioned on both clean and noisy context (Bie et al., 10 Dec 2025).
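The forward corruption step is a one-liner; the following sketch assumes a generic reserved mask id (the actual vocabulary layout is model-specific):

```python
import numpy as np

MASK_ID = 0  # assumed id of the reserved [MASK] token

def mask_step(x0, alpha_t, rng):
    """Forward corruption q(x_t | x_0) of the discrete diffusion process:
    each token is independently kept with probability alpha_t and replaced
    by [MASK] with probability 1 - alpha_t."""
    keep = rng.random(x0.shape) < alpha_t
    return np.where(keep, x0, MASK_ID)
```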

LLaDA2.0 introduced a three-phase block-level Warmup–Stable–Decay (WSD) training scheme:

  • Warm-up: Gradual increase of block size $L$ from 1 (AR block) to the full sequence length, so that AR models can adapt to block-level diffusion.
  • Stable: Full sequence denoising with fixed block size, imposing global context consistency.
  • Decay: Stepwise reduction of block size to restore efficient KV-cache utilization and variable-length generation (Bie et al., 10 Dec 2025).
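The three phases above imply a block-size schedule roughly like the following sketch; the linear ramps, phase lengths, and endpoint values are illustrative assumptions, as the source does not specify the exact schedule shape:

```python
def wsd_block_size(step, warmup_steps, stable_steps, decay_steps,
                   seq_len, final_block):
    """Illustrative block-size schedule for the three WSD phases:
    warm-up grows L from 1 (AR-like) to the full sequence length,
    stable holds it there, decay steps it back down to final_block."""
    if step < warmup_steps:                      # warm-up: 1 -> seq_len
        frac = step / max(warmup_steps - 1, 1)
        return max(1, round(1 + frac * (seq_len - 1)))
    if step < warmup_steps + stable_steps:       # stable: full-sequence denoising
        return seq_len
    # decay: seq_len -> final_block, restoring KV-cache-friendly blocks
    d = min(step - warmup_steps - stable_steps, decay_steps - 1)
    frac = d / max(decay_steps - 1, 1)
    return max(final_block, round(seq_len - frac * (seq_len - final_block)))
```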

Following pre-training, post-training alignment—comprising instruction-tuned supervised fine-tuning (SFT) and direct preference optimization (DPO)—further refines the generation quality for practical deployments. Auxiliary objectives, such as confidence-aware parallel loss, combine with soft acceptance masking to maximize token reconstruction per batch.

3. MoE Routing Dynamics and Fundamental Bottlenecks in dLLM Decoding

A core challenge in integrating MoE architectures with blockwise diffusion decoding is the inefficiency arising from expert routing under the parallel, non-sequential denoising pattern. In each denoising iteration, most tokens do not cross the acceptance threshold and remain masked for subsequent iterations, yet their MoE routing is freshly computed each time. Empirical findings highlight key properties:

  • Temporal Consistency: Once a token is accepted, its hidden state stabilizes, but MoE layers redundantly re-activate experts for these tokens in future iterations.
  • Spatial Consistency: Routing decisions among masked tokens concentrate on a small expert subset, with spatially proximal tokens often selecting identical or overlapping experts.
  • Temporal-Spatial Locality: Token unmasking order follows an approximately AR (autoregressive) pattern, so that nearby tokens are jointly likely to become accepted (Wei et al., 9 Feb 2026).

Consequently, a single pass can activate substantially more experts (often 50+) than the nominal per-token top-$k$, despite only a handful of new tokens being accepted (e.g., three), leading to an "activated experts per decoded token" (APT) that is much greater than $k$.
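The APT accounting can be made concrete with a small helper; the function names are illustrative, but the arithmetic matches the definitions in the text (APF and TPF averaged per forward pass, APT as their ratio):

```python
import numpy as np

def apt_metrics(expert_sets_per_iter, accepted_per_iter):
    """Activated-experts-per-token accounting over a decode:
    APF = mean size of the union of experts activated per forward pass,
    TPF = mean tokens newly accepted per forward pass,
    APT = APF / TPF (experts 'paid for' per decoded token)."""
    apf = np.mean([len(s) for s in expert_sets_per_iter])
    tpf = np.mean(accepted_per_iter)
    return apf, tpf, apf / tpf
```

With the example from the text (54 experts activated in a pass that accepts only three tokens), APT comes out to 18 experts per decoded token, far above a top-2 budget.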

4. TEAM: Temporal-Spatial Consistency Guided Expert Activation

The TEAM framework addresses the MoE-diffusion mismatch by reducing redundant expert computation and maximizing tokens accepted per step. It does so via three key strategies:

  1. Delayed Caching for Decoded Tokens (DCD): Caches the key–value pairs and expert activations for all tokens accepted up to the current iteration, avoiding unnecessary re-routing of already-decoded tokens. Only masked tokens and those newly accepted at the immediately previous iteration are routed through the MoE; older tokens reuse cached states, enabled by the inherent stability in block diffusion (Wei et al., 9 Feb 2026).
  2. Limited Activation for Cold Tokens (LAC): Masked tokens far from decoded tokens and with low confidence ("cold tokens") are routed only over the expert set activated by the union of newly decoded and "hot" tokens. Routing involves two steps:
    • Route the newly decoded and hot tokens over all experts to obtain the union of their top-$k$ experts.
    • Route cold tokens using only that restricted expert set, ensuring activation is limited and controlled (Wei et al., 9 Feb 2026).
  3. Speculative Exploration for Hot Tokens (SEH): For "hot" tokens—those likely to be accepted, either due to high confidence or spatial proximity to already-accepted tokens—multiple candidate predictions are explored per token, resulting in speculative parallel branches. Because these branches share most context, expert activation overhead grows sublinearly with the number of candidates. This speculative decoding increases tokens accepted per forward pass (TPF); reported averages rise from 3.14 to 5.00 tokens per pass (Wei et al., 9 Feb 2026).
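The two-step LAC routing can be sketched directly from the description above; this is a NumPy illustration under the assumption that routing logits for hot and cold tokens are available as dense arrays:

```python
import numpy as np

def lac_route(logits_hot, logits_cold, top_k=2):
    """Limited Activation for Cold tokens, in two steps:
    1) newly decoded and hot tokens route over ALL experts; the union of
       their top-k choices forms the allowed expert set;
    2) cold tokens pick their top-k only from that restricted set."""
    # step 1: union of per-token top-k over the full expert set
    top_hot = np.argsort(logits_hot, axis=-1)[:, -top_k:]
    allowed = np.unique(top_hot)
    # step 2: mask out disallowed experts for cold tokens, then take top-k
    masked = np.full_like(logits_cold, -np.inf)
    masked[:, allowed] = logits_cold[:, allowed]
    top_cold = np.argsort(masked, axis=-1)[:, -top_k:]
    return allowed, top_cold
```

A cold token whose globally preferred expert lies outside the allowed set is thus rerouted to its best expert inside it, which is what bounds APF per pass.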

Collectively, these mechanisms reduce activated experts per forward pass (APF), increase TPF, and bound overall APT near or even below the per-token top-k.

5. Mathematical Formulation and Routing Algorithms

The dLLM decoding objective for a sequence partitioned into $B$ blocks factorizes as

$$p_\theta(Y \mid P) = \prod_{i=1}^{B} p_\theta\big(Y_i \mid P, Y_{<i}\big),$$

with per-token predictions and confidence

$$\hat y_i^k = \arg\max_v\, p_\theta(y_i^k = v \mid P, Y_{\leq i}), \qquad c_k = p_\theta\big(y_i^k = \hat y_i^k \mid P, Y_{\leq i}\big).$$

Acceptance operates via a threshold: position $k$ is unmasked iff $c_k > \tau$. Hot tokens are the masked positions likely to be accepted next,

$$\mathcal{T}_{\text{hot}} = \{\, k : c_k > \tau_h \ \text{or}\ \min_{j \in \mathcal{A}} |k - j| \le d \,\},$$

where $\mathcal{A}$ is the set of already-accepted positions, $\tau_h$ a hotness threshold, and $d$ a locality radius.

For a token set $\mathcal{T}$ and expert set $\mathcal{E}$, top-$k$ selection over the routing logits $g(x)$ yields a sparse weight matrix, and only the union of per-token top-$k$ experts is activated.

Algorithmically, the LAC routing can be expressed as

$$\mathcal{E}_{\text{hot}} = \bigcup_{x \in \mathcal{T}_{\text{hot}}} \operatorname{TopK}\big(g(x)\big), \qquad \mathcal{E}(x_c) = \operatorname{TopK}\big(g(x_c)\big|_{\mathcal{E}_{\text{hot}}}\big)\ \text{for each cold token}\ x_c$$

(Wei et al., 9 Feb 2026).
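The hot-token criterion (high confidence, or proximity to already-accepted positions, per the SEH description in Section 4) can be written as a small predicate; the threshold and radius values here are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def hot_tokens(conf, accepted, tau_hot=0.7, dist=2):
    """Hot-token selection sketch: masked positions that are likely to be
    accepted soon, either because their confidence exceeds tau_hot or
    because they sit within `dist` positions of an already-accepted token
    (tau_hot and dist are assumed hyperparameters)."""
    idx = np.arange(conf.shape[0])
    acc = np.where(accepted)[0]
    near = np.zeros(conf.shape[0], dtype=bool)
    if acc.size:
        near = np.min(np.abs(idx[:, None] - acc[None, :]), axis=1) <= dist
    return (~accepted) & ((conf > tau_hot) | near)
```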

6. Scalability, Performance, and Efficiency Benchmarks

LLaDA2.0 demonstrates that MoE dLLMs scale efficiently to 100B+ parameters by leveraging AR model inheritance, three-phase WSD training, and confidence-augmented decoding. Key reported metrics:

  • Throughput: LLaDA2.0-flash (100B) yields 383 TPS with dInfer, rising to 535 TPS with CAP, saturating at roughly 2.1× the 256 TPS of AR baselines.
  • Speed: Blockwise denoising achieves up to 2× speed advantage relative to sequential AR decoding; TEAM achieves further 1.64–2.20× speedup over vanilla MoE dLLMs (Wei et al., 9 Feb 2026, Bie et al., 10 Dec 2025).
  • Efficiency: TEAM reduces APF by 35–39% and APT by >60% (e.g., HumanEval task: from 18.33 to 6.80 experts/token); token per forward pass increases from 3.14 to 5.00 (avg.), with minimal Δscore (≤0.6% absolute) [(Wei et al., 9 Feb 2026), Table 1 below].
  • Quality: Quality degradation is negligible, with benchmark scores identical or within ±0.6 points of vanilla MoE dLLM.
| Benchmark | Vanilla Score | TEAM Score | APF ↓ | TPF ↑ | APT ↓ | Speedup |
|-----------|---------------|------------|-------|-------|-------|---------|
| HumanEval | 79.27 | 79.88 | 53.3 → 34.5 | 2.91 → 5.07 | 18.33 → 6.80 | ×2.20 |
| MBPP | 65.76 | 65.76 | 49.6 → 30.9 | 2.74 → 4.56 | 18.10 → 6.78 | ×2.08 |
| GSM8K | 90.60 | 90.30 | 59.1 → 36.2 | 3.16 → 4.79 | 18.71 → 7.56 | ×1.83 |
| Math-500 | 76.00 | 75.40 | 57.9 → 36.3 | 3.74 → 5.57 | 15.48 → 6.52 | ×1.64 |
| Average | 77.91 | 77.84 | 55.0 → 34.5 | 3.14 → 5.00 | 17.66 → 6.92 | ×1.94 |
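As a consistency check on these numbers, APT is simply APF amortized over TPF (experts activated per forward pass divided by tokens decoded per pass), and the per-benchmark TEAM columns line up under that relation:

```python
# TEAM columns of the benchmark table: (APF, TPF, APT)
team = {
    "HumanEval": (34.5, 5.07, 6.80),
    "MBPP":      (30.9, 4.56, 6.78),
    "GSM8K":     (36.2, 4.79, 7.56),
    "Math-500":  (36.3, 5.57, 6.52),
}
for name, (apf, tpf, apt) in team.items():
    # APT ≈ APF / TPF on every row
    assert abs(apf / tpf - apt) < 0.05, name
```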

On the hardware side, cuDNN-fused block-diffusion attention modules provide a further 1.3× speedup and reduce attention memory consumption by >90% compared to unfused variants. The model maintains robust context handling to 32K tokens, with moderate quality drop up to 64K when using RoPE scaling (Bie et al., 10 Dec 2025).

7. Research Trajectory and Open Implications

Effective application of MoE in dLLMs requires overcoming the inherent mismatch between parallel tokenwise denoising and sparse per-token expert routing. Innovations such as TEAM that leverage temporal and spatial consistency in expert routing, cached activation reuse, and speculative decoding provide significant efficiency gains without sacrificing output quality (Wei et al., 9 Feb 2026). The paradigm allows for scaling to frontier model sizes (≥100B) with competitive or superior benchmark performance relative to AR baselines, while enabling substantial speed and throughput advantages in inference (Bie et al., 10 Dec 2025).

A plausible implication is that future research may focus on further reducing expert activation overhead, generalizing speculative decoding strategies, and exploring finer-grained control over confidence and masking, particularly for ultra-long context or streaming settings. The foundational methodology unifies architectural (MoE + blockwise Transformer), training (WSD + alignment), and inference (TEAM strategies, CAP) components for optimal efficiency-quality trade-offs at scale.

