- The paper introduces a novel fusion of sparse MoE and masked diffusion modeling that activates only a fraction (1.4B) of its 7B parameters per inference step.
- It employs a Transformer backbone whose MoE layers contain 64 experts, with top-8 expert routing per token, improving both computational efficiency and expert specialization.
- Empirical results show that LLaDA-MoE outperforms prior dense MDMs and rivals the autoregressive Qwen2.5-3B-Instruct on tasks spanning knowledge, code generation, and mathematical reasoning.
LLaDA-MoE: A Sparse MoE Diffusion LLM
Introduction and Motivation
LLaDA-MoE presents a novel integration of the Mixture-of-Experts (MoE) architecture into the Masked Diffusion Model (MDM) paradigm for large language modeling. The work addresses the computational inefficiency of dense MDMs by leveraging sparse expert routing, activating only a fraction of the total parameters per token during inference. This approach is motivated by the empirical success of MoE in autoregressive (AR) models, where sparse activation yields competitive performance with reduced resource requirements. LLaDA-MoE is trained from scratch on 20T tokens, maintaining a 7B parameter pool but activating only 1.4B parameters per inference step.
Model Architecture and Generation Process
LLaDA-MoE employs a Transformer backbone with RMSNorm, SwiGLU activations, rotary positional embeddings, and QK-layernorm in multi-head attention. The MoE layers consist of 64 experts, with the router selecting the top-8 experts per token. This design enables efficient computation and expert specialization.
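As a concrete illustration, the sketch below shows how a top-8 routed MoE layer of this kind can be implemented. The module names, dimensions, and the SwiGLU expert definition are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One feed-forward expert with a SwiGLU activation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class TopKMoELayer(nn.Module):
    """Sparse MoE layer: route each token to its top-k experts and
    combine their outputs with the (renormalized) router weights."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 64, k: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); a batch of shape (batch, seq, d_model)
        # would first be flattened to (batch * seq, d_model).
        logits = self.router(x)                              # (n_tokens, n_experts)
        probs = logits.softmax(dim=-1)
        topk_w, topk_idx = probs.topk(self.k, dim=-1)        # (n_tokens, k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no token routed to this expert in this batch
            out[token_ids] += topk_w[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Only the k selected experts run for each token, which is what keeps the activated parameter count (1.4B of 7B here) small relative to the full expert pool.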
The generation process follows the MDM paradigm: starting from a fully masked sequence, the model iteratively predicts masked tokens and unmasks them, progressing from t=1 (fully masked) to t=0 (fully unmasked). At each MoE layer, the router dynamically selects experts for each token, and the token's output is a router-weighted combination of the selected experts' outputs.
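In LLaDA-style MDMs this corresponds to a forward process that masks each token independently with probability equal to the noise level t, and a reverse process that fills masked positions back in. A standard formulation (the notation below is assumed, not quoted from the paper) is:

```latex
% Forward process: each token of x_0 is independently replaced by the mask token M
% with probability t and kept unchanged with probability 1 - t.
q(x_t \mid x_0) \;=\; \prod_{i=1}^{L}
  \Big[\, t\,\mathbf{1}[x_t^i = \mathrm{M}] \;+\; (1 - t)\,\mathbf{1}[x_t^i = x_0^i] \,\Big]

% Reverse step from noise level t to s < t: each still-masked position is unmasked
% with probability (t - s)/t, its value drawn from the mask predictor p_\theta(\cdot \mid x_t).
```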
Figure 1: Overview of the iterative masked diffusion generation process and the MoE architecture (top-2 expert routing shown for illustration; the model itself routes each token to its top-8 experts).
Training Pipeline and Objectives
The training pipeline consists of multiple stages: large-scale pretraining, an annealing phase that extends the context window, and supervised fine-tuning (SFT) for instruction following.
The pretraining objective is a variational lower bound on the log-likelihood, reconstructing masked tokens from partially observed context. Variable-length training is employed for 1% of steps to reduce the train–test context-length mismatch. MoE routing is regularized with auxiliary load-balancing and Z-losses to prevent expert collapse and ensure balanced expert utilization.
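Written out, the standard masked-diffusion bound used in LLaDA-style pretraining (the notation below is assumed, not quoted from the paper) samples a noise level t, masks x_0 accordingly, and scores the model only on the masked positions:

```latex
\mathcal{L}(\theta) \;=\;
  -\,\mathbb{E}_{\,t \sim \mathcal{U}(0,1),\; x_0,\; x_t}
  \left[ \frac{1}{t} \sum_{i=1}^{L}
    \mathbf{1}\!\left[x_t^i = \mathrm{M}\right]
    \log p_\theta\!\left(x_0^i \mid x_t\right) \right]
```

Minimizing this loss maximizes a variational lower bound on the data log-likelihood.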

Figure 3: Training dynamics of auxiliary losses (Z-loss and load-balancing loss) over the first 1T tokens, showing rapid stabilization.
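The two regularizers tracked in Figure 3 are standard MoE auxiliary losses. The sketch below follows the common Switch-Transformer-style load-balancing loss and router z-loss rather than the paper's exact formulation; the function and argument names are assumptions:

```python
import torch
import torch.nn.functional as F

def moe_auxiliary_losses(router_logits: torch.Tensor, k: int = 8):
    """router_logits: (n_tokens, n_experts) pre-softmax scores from the router.

    Returns the load-balancing loss (encourages uniform expert utilization)
    and the z-loss (keeps router logits from growing without bound)."""
    n_tokens, n_experts = router_logits.shape
    probs = router_logits.softmax(dim=-1)

    # Fraction of tokens dispatched to each expert (over its top-k choices) ...
    topk_idx = probs.topk(k, dim=-1).indices                       # (n_tokens, k)
    dispatch = F.one_hot(topk_idx, n_experts).sum(dim=1).float()   # (n_tokens, n_experts)
    load_fraction = dispatch.mean(dim=0) / k                       # sums to 1 over experts
    # ... and the mean routing probability assigned to each expert.
    prob_fraction = probs.mean(dim=0)

    # Switch-Transformer-style balance loss: minimized when both are uniform.
    balance_loss = n_experts * (load_fraction * prob_fraction).sum()

    # Router z-loss: penalize large log-partition values of the router.
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()

    return balance_loss, z_loss
```

During training, small multiples of both terms (the coefficients are hyperparameters not given here) would be added to the masked-diffusion objective above.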
Inference and Sampling Strategies
Inference begins with a fully masked sequence; the model iteratively reduces the noise level, sampling tokens at masked positions with the mask predictor. Semi-autoregressive blockwise sampling is adopted: the sequence is partitioned into blocks that are generated left to right, with masked positions within each block decoded in parallel. Low-confidence remasking is used to refine outputs, improving sample quality.
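A simplified sketch of such a sampler is shown below. The `mask_predictor` interface, the `MASK_ID` placeholder, and the block/step counts are illustrative assumptions; the paper's exact schedule may differ:

```python
import torch

MASK_ID = 0  # placeholder id of the [MASK] token (model-specific)

@torch.no_grad()
def blockwise_diffusion_sample(mask_predictor, prompt_ids, gen_len=256,
                               block_len=32, steps_per_block=32):
    """Semi-autoregressive blockwise sampling with low-confidence remasking.

    Blocks are generated left to right; within a block, all masked positions
    are predicted in parallel each step, and only the most confident
    predictions are kept (the rest stay masked for later steps)."""
    x = torch.cat([prompt_ids, torch.full((gen_len,), MASK_ID)])
    prompt_len = prompt_ids.numel()

    for b_start in range(prompt_len, prompt_len + gen_len, block_len):
        b_end = min(b_start + block_len, prompt_len + gen_len)
        for step in range(steps_per_block):
            masked = (x == MASK_ID)
            masked[:b_start] = False           # earlier blocks are frozen
            masked[b_end:] = False             # later blocks stay fully masked
            if not masked.any():
                break
            logits = mask_predictor(x.unsqueeze(0)).squeeze(0)   # (seq_len, vocab)
            conf, pred = logits.softmax(dim=-1).max(dim=-1)

            # Commit roughly an equal share of the remaining positions each step.
            remaining_steps = steps_per_block - step
            n_keep = max(1, int(torch.ceil(masked.sum() / remaining_steps)))

            # Low-confidence remasking: keep only the n_keep most confident
            # predictions among the currently masked positions.
            conf = conf.masked_fill(~masked, float("-inf"))
            keep = conf.topk(n_keep).indices
            x[keep] = pred[keep]
    return x[prompt_len:]
```

Each block is completed before the next begins, which gives the sampler its semi-autoregressive character while still decoding many positions in parallel within a block.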
Empirical Results
LLaDA-MoE is evaluated on a comprehensive suite of benchmarks covering knowledge, reasoning, mathematics, coding, agent, and alignment tasks. Despite activating only 1.4B parameters, LLaDA-MoE consistently outperforms prior dense 8B MDMs (LLaDA, LLaDA 1.5, Dream) and achieves performance comparable to Qwen2.5-3B-Instruct, an AR model.
Figure 4: Benchmark results showing LLaDA-MoE outperforming larger MDMs and matching Qwen2.5-3B-Instruct across diverse tasks with fewer activated parameters.
Notably, LLaDA-MoE-7B-A1B-Instruct achieves strong results in knowledge understanding, code generation, mathematical reasoning, and agent tasks, trailing Qwen2.5-3B-Instruct by only a small margin. The model demonstrates robust generalization and parameter efficiency, with average scores exceeding those of larger dense MDMs.
Implementation Considerations
- Resource Efficiency: Sparse MoE activation reduces memory and compute requirements, enabling deployment on hardware with limited resources.
- Expert Specialization: The MoE architecture facilitates expert specialization, potentially improving performance on heterogeneous tasks.
- Training Stability: Auxiliary losses (load-balancing, Z-loss) are critical for stable expert routing and preventing collapse.
- Context Window: Expansion to an 8k context during annealing supports long-sequence modeling, but SFT is limited to 4k due to the length distribution of the SFT samples.
- Sampling Strategy: Semi-autoregressive blockwise sampling and remasking enhance inference efficiency and output quality.
Implications and Future Directions
The integration of MoE into MDMs establishes a new direction for efficient large language modeling, combining the strengths of diffusion-based generation and sparse expert routing. The results suggest that MoE architectures can be effectively adapted to non-autoregressive paradigms, opening avenues for further scaling and specialization.
Future work may explore:
- Scaling LLaDA-MoE to larger parameter pools and more experts.
- Advanced expert routing mechanisms (dynamic, hierarchical).
- Multimodal extensions leveraging MoE for cross-domain tasks.
- Optimized inference strategies for real-time and low-latency applications.
Conclusion
LLaDA-MoE demonstrates that sparse MoE architectures can be successfully integrated into masked diffusion LLMs, yielding strong performance with reduced active parameter budgets. The model surpasses prior dense MDMs and matches AR baselines across diverse tasks, establishing MoE as a viable foundation for efficient diffusion-based language modeling. This work opens a broad design space for future research in scalable, resource-efficient LLMs.