MoE-DiffuSeq: MoE-Augmented Diffusion for Long Text
- The paper demonstrates that integrating MoE modules with sparse attention in every Transformer layer significantly improves training efficiency and sample quality on long documents.
- MoE-DiffuSeq employs a novel diffusion process with a soft absorbing state, reducing computational overhead and enabling rapid denoising of token sequences.
- Evaluation results show MoE-DiffuSeq outperforms previous diffusion models on tasks such as summarization, paraphrasing, and commonsense conversations.
MoE-DiffuSeq is a mixture-of-experts (MoE)–augmented diffusion sequence model that addresses the computational and memory inefficiencies of diffusion-based text generation when applied to long documents. By integrating a customized sparse attention mechanism and a mixture-of-experts block within each Transformer layer, MoE-DiffuSeq enables efficient and scalable modeling for extended sequences. Notably, a soft absorbing state is introduced to accelerate sequence reconstruction and improve sample quality. MoE-DiffuSeq demonstrates substantial gains in training efficiency, sampling speed, and evaluation metrics across long-form generation tasks, setting new benchmarks for diffusion-based text generation models in extended document scenarios (Christoforos et al., 23 Dec 2025).
1. Architectural Innovations
MoE-DiffuSeq is constructed as an extension of the DiffuSeq diffusion-seq2seq backbone, incorporating two principal architectural enhancements at each of its 12 Transformer layers: (1) a Longformer-style sparse attention mechanism and (2) an MoE feed-forward stage. The overall data flow in each layer is:
- Input token representations
- Sparse self-attention (sliding and dilated window with global tokens)
- LayerNorm
- MoE block: gating network computes weights $g_i$ over experts $E_1, \dots, E_K$, yielding $h' = \sum_i g_i \, E_i(h)$
- LayerNorm
- Output to next layer
The sparse attention restricts each query to a window of size $w$ (with optional dilation and infrequent global tokens), reducing computational overhead from quadratic to nearly linear in sequence length. The MoE feed-forward block employs a lightweight gating network that computes logits $\ell = W_g h$ and normalizes them with a softmax $g = \mathrm{softmax}(\ell)$. The hidden state is then aggregated as a weighted sum over expert MLPs: $h' = \sum_{i=1}^{K} g_i \, E_i(h)$. This modular design provides scalability for long sequences and improved representational power without prohibitive cost (Christoforos et al., 23 Dec 2025).
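As an illustration, here is a minimal pure-Python sketch of the gated expert aggregation described above. The toy dimensions, random expert weights, and ReLU experts are illustrative assumptions, not the paper's implementation:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def make_expert(d, seed):
    """A toy expert: one random linear map followed by ReLU."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]
    def expert(h):
        return [max(0.0, sum(W[i][j] * h[j] for j in range(d))) for i in range(d)]
    return expert

def moe_ffn(h, experts, W_g):
    """Dense softmax gating: h' = sum_i g_i * E_i(h)."""
    d, K = len(h), len(experts)
    logits = [sum(W_g[i][j] * h[j] for j in range(d)) for i in range(K)]
    g = softmax(logits)                      # gate weights, sum to 1
    outs = [e(h) for e in experts]
    return [sum(g[i] * outs[i][j] for i in range(K)) for j in range(d)]

d, K = 4, 3
experts = [make_expert(d, s) for s in range(K)]
W_g = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(K)]
h = [0.1, -0.2, 0.3, 0.4]
h_out = moe_ffn(h, experts, W_g)
print(len(h_out))  # output dimension matches the input dimension
```

A sparse top-$k$ variant would simply zero all but the $k$ largest gate weights before the weighted sum.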
2. Diffusion Process and Soft Absorbing State
The underlying generative process employs a continuous Gaussian forward diffusion, as in DiffuSeq. At each forward step, the token representation is given by:

$$z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon,$$

with $\varepsilon \sim \mathcal{N}(0, I)$ and $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$. The reverse process is parameterized as:

$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t),\ \sigma_\theta^2(z_t, t)\, I\big),$$

where $f_\theta$, the network producing $\mu_\theta$ and $\sigma_\theta$, is a stacked sparse-attention and MoE Transformer denoiser.
A key addition is the introduction of a soft absorbing state $z_{\mathrm{abs}}$, handled via a probability schedule $\gamma_t$. The forward process becomes:

$$z_t = \begin{cases} z_{\mathrm{abs}} & \text{with probability } \gamma_t, \\ \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon & \text{with probability } 1 - \gamma_t. \end{cases}$$
Once a token enters the absorbing state, it is masked, reducing unnecessary noising and enabling rapid denoising on discrete representations. This facilitates acceleration in generating coherent sequences for long-form text (Christoforos et al., 23 Dec 2025).
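A minimal sketch of one absorbing-state forward step, assuming a per-step absorbing probability `gamma_t` and a fixed absorbing embedding `z_abs` (both illustrative choices, not values from the paper):

```python
import math
import random

def forward_step(z0, alpha_bar_t, gamma_t, z_abs, rng):
    """Sample z_t for one token: with probability gamma_t it collapses to the
    absorbing state; otherwise it follows the Gaussian forward diffusion."""
    if rng.random() < gamma_t:
        return list(z_abs), True           # token is masked from here on
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1 - alpha_bar_t)
    return [a * z + b * rng.gauss(0, 1) for z in z0], False

rng = random.Random(1)
z0 = [0.5, -0.5, 0.25]
z_abs = [0.0, 0.0, 0.0]                    # assumed absorbing embedding
z_t, absorbed = forward_step(z0, alpha_bar_t=0.9, gamma_t=0.1, z_abs=z_abs, rng=rng)
print(absorbed, [round(z, 3) for z in z_t])
```

In a full implementation the `absorbed` flag would persist across steps, so an absorbed token skips further noising entirely.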
3. Sparse Attention and MoE Mechanisms
The model’s sparse attention combines sliding-window and dilated-window strategies. Dense self-attention has $O(n^2)$ time and memory complexity, whereas the sparse sliding-window mechanism reduces this to $O(n \cdot w)$ for window size $w$, a significant advantage for sequences with $n \gg w$. The use of dilation extends the receptive field without an increase in window size, i.e., a dilation factor $d$ yields an effective per-layer receptive field of roughly $d \cdot w$.
Global attention is reserved for a small number of global tokens, such as [CLS], that can attend to and from all positions, enabling information routing across distant parts of the sequence.
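The combined sliding/dilated/global pattern can be sketched as a boolean attention mask. The window half-width `w`, stride `dilation`, and list of global positions are illustrative parameters:

```python
def sparse_attention_mask(n, w, dilation=1, global_tokens=()):
    """mask[i][j] = True iff query i may attend to key j: a local window of
    half-width w with stride `dilation`, plus a few global tokens that
    attend everywhere and are attended to by everyone."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for k in range(-w, w + 1):         # dilated local window
            j = i + k * dilation
            if 0 <= j < n:
                mask[i][j] = True
        for j in global_tokens:            # full row and column per global token
            mask[i][j] = True
            mask[j][i] = True
    return mask

mask = sparse_attention_mask(n=8, w=1, dilation=2, global_tokens=[0])
nonzero = sum(sum(row) for row in mask)
print(nonzero)  # far fewer allowed pairs than the dense n^2 = 64
```

Applied before the softmax (disallowed pairs set to $-\infty$), such a mask yields the $O(n \cdot w)$ cost profile discussed above.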
The MoE module uses a softmax-based gating function, commonly with top-$k$ selection for sparsity, routing each token's hidden representation to a subset of expert MLPs. Each expert is an independent feed-forward block, and the output is a weighted sum over the selected experts: $h' = \sum_{i \in \mathrm{top}\text{-}k} g_i \, E_i(h)$.
While the current version does not define an explicit auxiliary load-balancing loss, such a loss may be incorporated to enforce equitable usage of experts (e.g., a penalty of the kind used in Switch Transformer, proportional to the product of each expert's routing fraction and mean gate probability) (Christoforos et al., 23 Dec 2025).
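A hypothetical top-$k$ gate with such a balancing penalty might look as follows; the loss form follows the common Switch-Transformer recipe and is an assumption, not the paper's definition:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def top_k_gate(logits, k):
    """Keep the k largest logits, renormalize their softmax weights,
    and zero the rest (the experts this token is routed to)."""
    idx = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    p = softmax([logits[i] for i in idx])
    gates = [0.0] * len(logits)
    for i, w in zip(idx, p):
        gates[i] = w
    return gates

def load_balance_loss(all_gates, K):
    """Switch-style penalty K * sum_i f_i * p_i: f_i is the fraction of
    tokens whose largest gate picks expert i, p_i the mean gate weight."""
    n = len(all_gates)
    f, p = [0.0] * K, [0.0] * K
    for g in all_gates:
        f[max(range(K), key=lambda i: g[i])] += 1 / n
        for i in range(K):
            p[i] += g[i] / n
    return K * sum(fi * pi for fi, pi in zip(f, p))

gates = [top_k_gate([2.0, 1.0, 0.1, -1.0], k=2),
         top_k_gate([0.0, 3.0, 0.5, 0.2], k=2)]
print(round(load_balance_loss(gates, K=4), 3))
```

The penalty is minimized when routing is uniform across experts, discouraging a few experts from absorbing all tokens.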
4. Training and Inference Algorithms
Training minimizes a combined loss consisting of a denoising mean squared error (MSE) and a regularizer $R$ (typically an $\ell_2$-norm penalty) on the initial state:

$$\mathcal{L} = \left\| z_0 - f_\theta(z_t, t) \right\|^2 + R(\|z_0\|),$$

where $z_0 = \mathrm{EMB}(w)$ is the token embedding at step $t = 0$ and $z_t$ is the diffused state.
Training Algorithm Outline:
```
for each minibatch of token sequences w_1…w_n:
    z_0 ← EMB(w)
    sample t ∼ Uniform({2,…,T})
    sample ε ∼ 𝒩(0,I)
    z_t ← √ᾱ_t · z_0 + √(1−ᾱ_t) · ε
    ŷ ← MoE-DiffuSeq(z_t, t)
    ℒ ← ‖z_0 − ŷ‖² + R(‖z_0‖)
    backpropagate ℒ → update θ, W_g, expert params
```
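The training loop can be exercised end-to-end with a stub denoiser; everything here (noise schedule, regularizer weight `lam`, the identity stand-in network) is illustrative rather than the paper's configuration:

```python
import math
import random

rng = random.Random(0)

def training_step(z0, alpha_bar, T, denoiser, lam=1e-4):
    """One step of the outlined loop: noise z_0 to a random level t,
    predict z_0 back, score with MSE plus a norm regularizer R."""
    t = rng.randint(2, T)
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1 - alpha_bar[t])
    z_t = [a * z + b * rng.gauss(0, 1) for z in z0]
    y = denoiser(z_t, t)
    mse = sum((zi - yi) ** 2 for zi, yi in zip(z0, y))
    reg = lam * sum(zi * zi for zi in z0)   # R(||z_0||): l2 penalty
    return mse + reg

T = 10
alpha_bar = [1.0] + [0.95 ** t for t in range(1, T + 1)]
identity_denoiser = lambda z_t, t: z_t      # stand-in for the network
loss = training_step([0.3, -0.7, 0.1], alpha_bar, T, identity_denoiser)
print(loss >= 0.0)
```

In practice `denoiser` would be the sparse-attention/MoE Transformer and the loss would be backpropagated through $\theta$, $W_g$, and the expert parameters.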
Sampling Algorithm Outline:

```
z_T ∼ 𝒩(0,I)
for t = T … 1:
    μ, σ ← f_θ(z_t, t)
    z_{t−1} ← μ + σ · ε_t    # ε_t ∼ 𝒩(0,I) if t > 1 else zero
tokens ← DECODER(z_0)
```
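The reverse loop can be sketched with a toy stand-in for $f_\theta$; the shrink-toward-origin denoiser below is purely illustrative:

```python
import random

rng = random.Random(0)

def sample(f_theta, T, d):
    """Ancestral sampling as in the outline: start from Gaussian noise and
    apply the learned reverse step down to t = 1 (no noise at the last step)."""
    z = [rng.gauss(0, 1) for _ in range(d)]
    for t in range(T, 0, -1):
        mu, sigma = f_theta(z, t)
        if t > 1:
            z = [m + sigma * rng.gauss(0, 1) for m in mu]
        else:
            z = mu                          # final step is deterministic
    return z

# Stand-in denoiser that contracts toward the origin with small fixed noise.
toy_f = lambda z, t: ([0.5 * v for v in z], 0.1)
z0 = sample(toy_f, T=50, d=4)
print([round(v, 3) for v in z0])
```

The resulting $z_0$ would then be rounded back to tokens by the decoder (nearest-neighbor lookup in the embedding table in DiffuSeq-style models).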
5. Empirical Results and Evaluation
MoE-DiffuSeq has been extensively evaluated on multiple long-document and sequence tasks, including arXiv abstract summarization, HotpotQA, commonsense conversations, and QQP paraphrasing. The following table summarizes key comparative results:
| Task | Metric | Longformer | DiffuSeq | MoE-DiffuSeq |
|---|---|---|---|---|
| arXiv Summarization | ROUGE-1 | 41.44 | 39.12 | 44.41 |
| arXiv Summarization | ROUGE-2 | 17.52 | 16.43 | 18.73 |
| arXiv Summarization | ROUGE-L | 38.70 | 37.88 | 39.89 |
| HotpotQA (Answer) | EM/F1 | 71.21/82.42 | 70.91/81.43 | 72.88/85.42 |
| Commonsense Conv. | BLEU | 0.030 | 0.022 | 0.049 |
| Commonsense Conv. | ROUGE-L | 0.139 | 0.119 | 0.233 |
| Commonsense Conv. | BERTScore | 0.602 | 0.501 | 0.628 |
| QQP Paraphrasing | Accuracy (%) | 92.3 | 91.7 | 95.3 |
Ablation studies isolate the contribution of each module. For example, removing sparse attention drops ROUGE-1 from 44.41 to 42.52, and increasing the number of diffusion steps beyond 2048 yields only marginal further improvements. Varying the window size likewise modulates the trade-off between sampling speed and text quality.
6. Comparison with Related MoE Diffusion Models
MoE-DiffuSeq is part of a broader movement toward parameter-efficient large-scale diffusion models that incorporate MoE layers, as demonstrated by models such as LLaDA-MoE (Zhu et al., 29 Sep 2025). LLaDA-MoE replaces each MLP feed-forward network in a 16-layer Transformer with a pool of 64 experts, activates only 2 per token, and achieves state-of-the-art performance for its scale across diverse tasks while activating only 1.4B of its 7B parameters per inference step. LLaDA-MoE uses discrete token masking rather than continuous embedding-space diffusion, adds load-balancing regularizers for expert utilization, and employs a semi-autoregressive inference scheme.
These developments indicate that the integration of sparse attention and MoE in the diffusion language modeling regime brings significant efficiency gains and flexible scaling. A plausible implication is that such hybrid architectures will be increasingly critical for practical deployment of diffusion-based sequence generation in settings with massive context windows or document lengths (Christoforos et al., 23 Dec 2025, Zhu et al., 29 Sep 2025).
7. Limitations and Future Directions
Current limitations of MoE-DiffuSeq include fixed expert count and static window sizes, with routing overheads that may increase for extreme scales. Potential enhancements suggested include:
- Deployment of dynamic top-$k$ routing and more efficient load balancing to minimize expert over-specialization.
- Extension to multimodal diffusion by introducing cross-modal sparse attention modules (e.g., for text+image).
- Hierarchical MoE architectures to handle document- versus sentence-level routing decisions.
- Learnable absorbing-state schedules for optimizing denoising speed per token or context.
- Integration of retrieval augmentation for sequences exceeding 100,000 tokens.
This suggests that future research may focus on adaptivity, scaling, and the integration of additional modalities and retrieval mechanisms to further improve the applicability of MoE-augmented diffusion models for large-scale, high-quality sequence generation (Christoforos et al., 23 Dec 2025).