Papers
Topics
Authors
Recent
Search
2000 character limit reached

dMoE: dLLMs with Learnable Block Experts

Published 29 May 2026 in cs.CL | (2605.30876v1)

Abstract: Diffusion LLMs (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14$\times$ to 1.66$\times$ end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE

Summary

  • The paper introduces a block-level routing strategy that aggregates token-level scores into adaptive core sets to control expert activation in diffusion LLMs.
  • It reduces the mean unique expert count from 69.5 to 14.6 (a 79% decrease) while maintaining over 99% of baseline accuracy on reasoning tasks.
  • The method achieves a 76-80% drop in memory usage and up to 1.66ร— faster inference, enabling scalable deployments on resource-limited platforms.

dMoE: Block-Level Expert Routing for Diffusion LLMs

Introduction and Motivation

Diffusion LLMs (dLLMs) represent a major paradigm shift from traditional autoregressive LLMs, leveraging iterative masked-token denoising to facilitate parallel token generation within a block. This architecture unlocks significant efficiency potential at test time, particularly when paired with Mixture-of-Experts (MoE) techniques for scalable capacity expansion. However, integrating MoEโ€™s token-level routingโ€”inherited from autoregressive modelsโ€”into the block-parallel regime of dLLMs results in a substantial increase in the number of uniquely activated experts per forward pass, intensifying memory bottlenecks during inference. Figure 1

Figure 1: Comparison between the original LLaDA2.0-mini and dMoE, where dMoE replaces token-level expert routing with block-level routing, sharply reducing unique expert activation without loss in performance.

Empirical Bottleneck Analysis

The authors conduct extensive empirical analyses using LLaDA2.0-mini, focusing on observed inference bottlenecks. Their studies demonstrate that in the MoE-enabled dLLMs, MoE execution latency dominates total inference time, scaling nearly linearly with the number of unique experts activated per block-parallel step. Figure 2

Figure 2: (a) Additional empirical guidance (b) Latency breakdown showing MoE as the primary contributor to inference time. (c) Linear relationship between unique expert count and MoE kernel latency.

Further diagnostic analysis establishes two critical observations:

  1. Token-level router weights are predictive of expert importance, evidenced by a positive correlation between router confidences and loss increases upon expert masking.
  2. Expert concentration is highly variable across denoising steps/blocks, meaning the active expert set changes adaptively during inference. Figure 3

Figure 3

Figure 3: (a) Correlation between router weights and loss increase. (b) Distribution of unique expert count during inference, highlighting high variance.

dMoE Methodology

Block-Level Routing with Adaptive Coreset

dMoEโ€™s core contribution is a transition from token-level to block-level expert routing. For each noisy block processed, token-level expert affinities are aggregated into block-level scores, after which a top-p selection criterion adaptively determines a coresetโ€”a minimal expert subset sufficient to cover inference needs for that block. Subsequent token-level expert selection is strictly restricted to this coreset, drastically limiting memory usage. Figure 4

Figure 4: dMoE overview: token-level router scores are aggregated, top-p coreset selection is applied, and final token-level routing occurs within this adaptive expert subset. Training uses self-distillation with CSE loss.

The implementation maintains the same number of experts per token but sharply reduces the distinct experts used in any block. Coreset size is automatically tuned based on the blockโ€™s expert score concentration, yielding hardware- and context-adaptive efficiency.

Training and Inference Regimen

dMoE is trained on self-distilled data collected from the base model using public datasets, echoing the original routing process in the forward pass to ensure alignment between training and inference. Critical hyperparameters such as block size and top-p threshold are systematically ablated, showing robustness and tunability.

Experimental Results

The authors benchmark dMoE on multiple reasoning tasks (MATH500, GSM8K, ARC-C, MMLU) using LLaDA2.0-mini as the base model. The principal findings are:

  • dMoE reduces the mean unique expert count per layer from 69.5 to 14.6, a 79.04% decrease, while preserving 99.11% of the original aggregate benchmark accuracy.
  • Memory usage drops by 76.64โ€“79.84% and end-to-end latency improves by factors of 1.14ร—โ€“1.66ร—, attributable to a pronounced reduction in expert activation and thus memory accesses. Figure 5

Figure 5

Figure 5: (a) Memory usage comparison across baselines and dMoE. (b) End-to-end and MoE-specific latency, highlighting speedup achieved by dMoE.

dMoE consistently outperforms system-level adaptive baselines such as DES-S and DES-V, especially in regimes of extreme expert compression, where the baseline models exhibit marked performance collapse but dMoE maintains high accuracy.

Extensive ablation demonstrates that dMoEโ€™s performance-efficiency trade-off can be finely modulated via the cumulative probability threshold pp and block size, enabling deployment optimization tailored to specific hardware or application constraints. Figure 6

Figure 6: Performance-efficiency trade-off of dMoE versus baselines across GSM8K, MATH500, and ARC-C, confirming superior retention of accuracy at reduced expert counts.

Theoretical and Practical Implications

dMoE extends block-level routing to the diffusion model context, where traditional MoE paradigms falter due to massive blockwise expert explosions. The block-level aggregation and adaptive coreset selection provide an elegant solution, offering a hardware-friendly path toward scalable, efficient inference in densely routed diffusion models.

Practically, dMoE dramatically lowers the resource footprint of dLLMs, alleviating deployment challenges on resource-constrained or memory-bound platforms. In theoretical terms, the methodology situates expert routing as a dynamic, context-sensitive process rather than a fixed per-token operation, motivating further research into block-level and coarse-grained adaptivity in large-scale architectures.

The authors suggest multiple future advances: compressing further by harmonizing expert groups across a block, reducing per-token expert counts, and transferring the methodology to other modalities (e.g., vision, multimodal reasoning).

Conclusion

dMoE introduces a block-level, learnable MoE routing strategy specifically for diffusion LLMs, enabling substantial reductions in unique expert activation, memory usage, and inference latency with virtually no loss in task performance. The proposed method establishes a robust foundation for scalable, efficient MoE routing in parallel-generative architectures and opens several avenues for accelerating and compressing future dLLM deployments.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.