MoE-DiffuSeq: MoE-Augmented Diffusion for Long Text
- The paper demonstrates that integrating MoE modules with sparse attention in every Transformer layer significantly improves training efficiency and sample quality on long documents.
- MoE-DiffuSeq employs a novel diffusion process with a soft absorbing state, reducing computational overhead and enabling rapid denoising of token sequences.
- Evaluation results show MoE-DiffuSeq outperforms previous diffusion models on tasks such as summarization, paraphrasing, and commonsense conversations.
MoE-DiffuSeq is a mixture-of-experts (MoE)–augmented diffusion sequence model that addresses the computational and memory inefficiencies of diffusion-based text generation when applied to long documents. By integrating a customized sparse attention mechanism and a mixture-of-experts block within each Transformer layer, MoE-DiffuSeq enables efficient and scalable modeling for extended sequences. Notably, a soft absorbing state is introduced to accelerate sequence reconstruction and improve sample quality. MoE-DiffuSeq demonstrates substantial gains in training efficiency, sampling speed, and evaluation metrics across long-form generation tasks, setting new benchmarks for diffusion-based text generation models in extended document scenarios (Christoforos et al., 23 Dec 2025).
1. Architectural Innovations
MoE-DiffuSeq is constructed as an extension of the DiffuSeq diffusion-seq2seq backbone, incorporating two principal architectural enhancements at each of its 12 Transformer layers: (1) a Longformer-style sparse attention mechanism and (2) an MoE feed-forward stage. The overall data flow in each layer is:
- Input token representations
- Sparse self-attention (sliding and dilated window with global tokens)
- LayerNorm
- MoE block: gating network computes weights $g_i$ over experts $E_1, \dots, E_K$, yielding $h' = \sum_i g_i \, E_i(h)$
- LayerNorm
- Output to next layer
The sparse attention restricts each query to a window of size $w$ (with optional dilation and infrequent global tokens), reducing computational overhead from quadratic to nearly linear in sequence length. The MoE feed-forward block employs a lightweight gating network that computes logits $\ell = W_g h$ and normalizes them with a softmax $g = \mathrm{softmax}(\ell)$. The hidden state is then aggregated as a weighted sum over expert MLPs: $h' = \sum_{i=1}^{K} g_i \, E_i(h)$. This modular design provides scalability for long sequences and improved representational power without prohibitive cost (Christoforos et al., 23 Dec 2025).
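As an illustration, here is a minimal pure-Python sketch of the gated expert aggregation described above. The toy dimensions, random expert weights, and ReLU experts are illustrative assumptions, not the paper's implementation:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def make_expert(d, seed):
    """A toy expert: one random linear map followed by ReLU."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]
    def expert(h):
        return [max(0.0, sum(W[i][j] * h[j] for j in range(d))) for i in range(d)]
    return expert

def moe_ffn(h, experts, W_g):
    """Dense softmax gating: h' = sum_i g_i * E_i(h)."""
    d, K = len(h), len(experts)
    logits = [sum(W_g[i][j] * h[j] for j in range(d)) for i in range(K)]
    g = softmax(logits)                      # gate weights, sum to 1
    outs = [e(h) for e in experts]
    return [sum(g[i] * outs[i][j] for i in range(K)) for j in range(d)]

d, K = 4, 3
experts = [make_expert(d, s) for s in range(K)]
W_g = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(K)]
h = [0.1, -0.2, 0.3, 0.4]
h_out = moe_ffn(h, experts, W_g)
print(len(h_out))  # output dimension matches the input dimension
```

A sparse top-$k$ variant would simply zero all but the $k$ largest gate weights before the weighted sum.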
2. Diffusion Process and Soft Absorbing State
The underlying generative process employs a continuous Gaussian forward diffusion, as in DiffuSeq. At each forward step, the token representation is given by:

$$z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon,$$

with $\varepsilon \sim \mathcal{N}(0, I)$ and $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$. The reverse process is parameterized as:

$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t),\ \sigma_\theta^2(z_t, t)\, I\big),$$

where $f_\theta$, the network producing $\mu_\theta$ and $\sigma_\theta$, is a stacked sparse-attention and MoE Transformer denoiser.
A key addition is the introduction of a soft absorbing state $z_{\mathrm{abs}}$, handled via a probability schedule $\gamma_t$. The forward process becomes:

$$z_t = \begin{cases} z_{\mathrm{abs}} & \text{with probability } \gamma_t, \\ \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon & \text{with probability } 1 - \gamma_t. \end{cases}$$
Once a token enters the absorbing state, it is masked, reducing unnecessary noising and enabling rapid denoising on discrete representations. This facilitates acceleration in generating coherent sequences for long-form text (Christoforos et al., 23 Dec 2025).
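A minimal sketch of one absorbing-state forward step, assuming a per-step absorbing probability `gamma_t` and a fixed absorbing embedding `z_abs` (both illustrative choices, not values from the paper):

```python
import math
import random

def forward_step(z0, alpha_bar_t, gamma_t, z_abs, rng):
    """Sample z_t for one token: with probability gamma_t it collapses to the
    absorbing state; otherwise it follows the Gaussian forward diffusion."""
    if rng.random() < gamma_t:
        return list(z_abs), True           # token is masked from here on
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1 - alpha_bar_t)
    return [a * z + b * rng.gauss(0, 1) for z in z0], False

rng = random.Random(1)
z0 = [0.5, -0.5, 0.25]
z_abs = [0.0, 0.0, 0.0]                    # assumed absorbing embedding
z_t, absorbed = forward_step(z0, alpha_bar_t=0.9, gamma_t=0.1, z_abs=z_abs, rng=rng)
print(absorbed, [round(z, 3) for z in z_t])
```

In a full implementation the `absorbed` flag would persist across steps, so an absorbed token skips further noising entirely.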
3. Sparse Attention and MoE Mechanisms
The model’s sparse attention combines sliding-window and dilated-window strategies. Dense self-attention has $O(n^2)$ time and memory complexity, whereas the sparse sliding-window mechanism reduces this to $O(n \cdot w)$ for window size $w$, a significant advantage for sequences with $n \gg w$. The use of dilation extends the receptive field without an increase in window size, i.e., a dilation factor $d$ yields an effective per-layer receptive field of roughly $d \cdot w$.
Global attention is reserved for a small number of global tokens, such as [CLS], that can attend to and from all positions, enabling information routing across distant parts of the sequence.
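The combined sliding/dilated/global pattern can be sketched as a boolean attention mask. The window half-width `w`, stride `dilation`, and list of global positions are illustrative parameters:

```python
def sparse_attention_mask(n, w, dilation=1, global_tokens=()):
    """mask[i][j] = True iff query i may attend to key j: a local window of
    half-width w with stride `dilation`, plus a few global tokens that
    attend everywhere and are attended to by everyone."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for k in range(-w, w + 1):         # dilated local window
            j = i + k * dilation
            if 0 <= j < n:
                mask[i][j] = True
        for j in global_tokens:            # full row and column per global token
            mask[i][j] = True
            mask[j][i] = True
    return mask

mask = sparse_attention_mask(n=8, w=1, dilation=2, global_tokens=[0])
nonzero = sum(sum(row) for row in mask)
print(nonzero)  # far fewer allowed pairs than the dense n^2 = 64
```

Applied before the softmax (disallowed pairs set to $-\infty$), such a mask yields the $O(n \cdot w)$ cost profile discussed above.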
The MoE module uses a softmax-based gating function, commonly with top-$k$ selection for sparsity, routing each token's hidden representation to a subset of expert MLPs. Each expert is an independent feed-forward block, and the output is a weighted sum over the selected experts: $h' = \sum_{i \in \mathrm{top}\text{-}k} g_i \, E_i(h)$.
While the current version does not define an explicit auxiliary load-balancing loss, such a loss may be incorporated to enforce equitable usage of experts (e.g., a penalty of the kind used in Switch Transformer, proportional to the product of each expert's routing fraction and mean gate probability) (Christoforos et al., 23 Dec 2025).
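A hypothetical top-$k$ gate with such a balancing penalty might look as follows; the loss form follows the common Switch-Transformer recipe and is an assumption, not the paper's definition:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def top_k_gate(logits, k):
    """Keep the k largest logits, renormalize their softmax weights,
    and zero the rest (the experts this token is routed to)."""
    idx = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    p = softmax([logits[i] for i in idx])
    gates = [0.0] * len(logits)
    for i, w in zip(idx, p):
        gates[i] = w
    return gates

def load_balance_loss(all_gates, K):
    """Switch-style penalty K * sum_i f_i * p_i: f_i is the fraction of
    tokens whose largest gate picks expert i, p_i the mean gate weight."""
    n = len(all_gates)
    f, p = [0.0] * K, [0.0] * K
    for g in all_gates:
        f[max(range(K), key=lambda i: g[i])] += 1 / n
        for i in range(K):
            p[i] += g[i] / n
    return K * sum(fi * pi for fi, pi in zip(f, p))

gates = [top_k_gate([2.0, 1.0, 0.1, -1.0], k=2),
         top_k_gate([0.0, 3.0, 0.5, 0.2], k=2)]
print(round(load_balance_loss(gates, K=4), 3))
```

The penalty is minimized when routing is uniform across experts, discouraging a few experts from absorbing all tokens.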
4. Training and Inference Algorithms
Training minimizes a combined loss consisting of a denoising mean squared error (MSE) and a regularizer $R$ (typically an $\ell_2$-norm penalty) on the initial state:

$$\mathcal{L} = \left\| z_0 - f_\theta(z_t, t) \right\|^2 + R(\|z_0\|),$$

where $z_0 = \mathrm{EMB}(w)$ is the token embedding at step $t = 0$ and $z_t$ is the diffused state.
Training Algorithm Outline:
```
for each minibatch of token sequences w_1…w_n:
    z_0 ← EMB(w)
    sample t ∼ Uniform({2,…,T})
    sample ε ∼ 𝒩(0,I)
    z_t ← √ᾱ_t · z_0 + √(1−ᾱ_t) · ε
    ŷ ← MoE-DiffuSeq(z_t, t)
    ℒ ← ‖z_0 − ŷ‖² + R(‖z_0‖)
    backpropagate ℒ → update θ, W_g, expert params
```
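The training loop can be exercised end-to-end with a stub denoiser; everything here (noise schedule, regularizer weight `lam`, the identity stand-in network) is illustrative rather than the paper's configuration:

```python
import math
import random

rng = random.Random(0)

def training_step(z0, alpha_bar, T, denoiser, lam=1e-4):
    """One step of the outlined loop: noise z_0 to a random level t,
    predict z_0 back, score with MSE plus a norm regularizer R."""
    t = rng.randint(2, T)
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1 - alpha_bar[t])
    z_t = [a * z + b * rng.gauss(0, 1) for z in z0]
    y = denoiser(z_t, t)
    mse = sum((zi - yi) ** 2 for zi, yi in zip(z0, y))
    reg = lam * sum(zi * zi for zi in z0)   # R(||z_0||): l2 penalty
    return mse + reg

T = 10
alpha_bar = [1.0] + [0.95 ** t for t in range(1, T + 1)]
identity_denoiser = lambda z_t, t: z_t      # stand-in for the network
loss = training_step([0.3, -0.7, 0.1], alpha_bar, T, identity_denoiser)
print(loss >= 0.0)
```

In practice `denoiser` would be the sparse-attention/MoE Transformer and the loss would be backpropagated through $\theta$, $W_g$, and the expert parameters.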
Sampling Algorithm Outline:

```
z_T ∼ 𝒩(0,I)
for t = T … 1:
    μ, σ ← f_θ(z_t, t)
    z_{t−1} ← μ + σ · ε_t    # ε_t ∼ 𝒩(0,I) if t > 1 else zero
tokens ← DECODER(z_0)
```
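The reverse loop can be sketched with a toy stand-in for $f_\theta$; the shrink-toward-origin denoiser below is purely illustrative:

```python
import random

rng = random.Random(0)

def sample(f_theta, T, d):
    """Ancestral sampling as in the outline: start from Gaussian noise and
    apply the learned reverse step down to t = 1 (no noise at the last step)."""
    z = [rng.gauss(0, 1) for _ in range(d)]
    for t in range(T, 0, -1):
        mu, sigma = f_theta(z, t)
        if t > 1:
            z = [m + sigma * rng.gauss(0, 1) for m in mu]
        else:
            z = mu                          # final step is deterministic
    return z

# Stand-in denoiser that contracts toward the origin with small fixed noise.
toy_f = lambda z, t: ([0.5 * v for v in z], 0.1)
z0 = sample(toy_f, T=50, d=4)
print([round(v, 3) for v in z0])
```

The resulting $z_0$ would then be rounded back to tokens by the decoder (nearest-neighbor lookup in the embedding table in DiffuSeq-style models).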
5. Empirical Results and Evaluation
MoE-DiffuSeq has been extensively evaluated on multiple long-document and sequence tasks, including arXiv abstract summarization, HotpotQA, commonsense conversations, and QQP paraphrasing. The following table summarizes key comparative results:
| Task | Metric | Longformer | DiffuSeq | MoE-DiffuSeq |
|---|---|---|---|---|
| arXiv Summarization | ROUGE-1 | 41.44 | 39.12 | 44.41 |
| arXiv Summarization | ROUGE-2 | 17.52 | 16.43 | 18.73 |
| arXiv Summarization | ROUGE-L | 38.70 | 37.88 | 39.89 |
| HotpotQA (Answer) | EM/F1 | 71.21/82.42 | 70.91/81.43 | 72.88/85.42 |
| Commonsense Conv. | BLEU | 0.030 | 0.022 | 0.049 |
| Commonsense Conv. | ROUGE-L | 0.139 | 0.119 | 0.233 |
| Commonsense Conv. | BERTScore | 0.602 | 0.501 | 0.628 |
| QQP Paraphrasing | Accuracy (%) | 92.3 | 91.7 | 95.3 |
Ablation studies isolate the contribution of each module. For example, removing sparse attention drops ROUGE-1 from 44.41 to 42.52, and increasing the number of diffusion steps beyond 2048 yields only marginal further improvements. Varying the window size likewise modulates the trade-off between sampling speed and text quality.
6. Comparison with Related MoE Diffusion Models
MoE-DiffuSeq is part of a broader movement toward parameter-efficient large-scale diffusion models that incorporate MoE layers, as demonstrated by models such as LLaDA-MoE (Zhu et al., 29 Sep 2025). LLaDA-MoE replaces each MLP feed-forward network in a 16-layer Transformer with a pool of 64 experts, activates only 2 per token, and achieves state-of-the-art performance for its scale across diverse tasks while activating only 1.4B of its 7B parameters per inference step. LLaDA-MoE uses discrete token masking rather than continuous embedding-space diffusion, adds load-balancing regularizers for expert utilization, and employs a semi-autoregressive inference scheme.
These developments indicate that the integration of sparse attention and MoE in the diffusion language modeling regime brings significant efficiency gains and flexible scaling. A plausible implication is that such hybrid architectures will be increasingly critical for practical deployment of diffusion-based sequence generation in settings with massive context windows or document lengths (Christoforos et al., 23 Dec 2025, Zhu et al., 29 Sep 2025).
7. Limitations and Future Directions
Current limitations of MoE-DiffuSeq include fixed expert count and static window sizes, with routing overheads that may increase for extreme scales. Potential enhancements suggested include:
- Deployment of dynamic top-$k$ routing and more efficient load balancing to minimize expert over-specialization.
- Extension to multimodal diffusion by introducing cross-modal sparse attention modules (e.g., for text+image).
- Hierarchical MoE architectures to handle document- versus sentence-level routing decisions.
- Learnable absorbing-state schedules for optimizing denoising speed per token or context.
- Integration of retrieval augmentation for sequences exceeding 100,000 tokens.
This suggests that future research may focus on adaptivity, scaling, and the integration of additional modalities and retrieval mechanisms to further improve the applicability of MoE-augmented diffusion models for large-scale, high-quality sequence generation (Christoforos et al., 23 Dec 2025).