
MoE-DiffuSeq: MoE-Augmented Diffusion for Long Text

Updated 30 December 2025
  • The paper demonstrates that integrating MoE modules with sparse attention in every Transformer layer significantly improves training efficiency and sample quality on long documents.
  • MoE-DiffuSeq employs a novel diffusion process with a soft absorbing state, reducing computational overhead and enabling rapid denoising of token sequences.
  • Evaluation results show MoE-DiffuSeq outperforms previous diffusion models on tasks such as summarization, paraphrasing, and commonsense conversations.

MoE-DiffuSeq is a mixture-of-experts (MoE)–augmented diffusion sequence model that addresses the computational and memory inefficiencies of diffusion-based text generation when applied to long documents. By integrating a customized sparse attention mechanism and a mixture-of-experts block within each Transformer layer, MoE-DiffuSeq enables efficient and scalable modeling for extended sequences. Notably, a soft absorbing state is introduced to accelerate sequence reconstruction and improve sample quality. MoE-DiffuSeq demonstrates substantial gains in training efficiency, sampling speed, and evaluation metrics across long-form generation tasks, setting new benchmarks for diffusion-based text generation models in extended document scenarios (Christoforos et al., 23 Dec 2025).

1. Architectural Innovations

MoE-DiffuSeq is constructed as an extension of the DiffuSeq diffusion-seq2seq backbone, incorporating two principal architectural enhancements at each of its 12 Transformer layers: (1) a Longformer-style sparse attention mechanism and (2) an MoE feed-forward stage. The overall data flow in each layer is:

  1. Input token representations x
  2. Sparse self-attention (sliding and dilated windows with global tokens): x → x'
  3. LayerNorm(x + x')
  4. MoE block: gating over experts E_1, …, E_K, yielding x''
  5. LayerNorm(x + x'')
  6. Output to the next layer

The sparse attention restricts each query to a window of size w (with optional dilation d and infrequent global tokens), reducing computational overhead from quadratic to nearly linear in sequence length. The MoE feed-forward block employs a lightweight gating network that computes logits u = W_g h and normalizes them with a softmax, p = softmax(u). The hidden state is then aggregated as a weighted sum over expert MLPs: ŷ = Σ_i p_i E_i(h). This modular design provides scalability for long sequences and improved representational power without prohibitive cost (Christoforos et al., 23 Dec 2025).
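
The gating computation can be sketched in NumPy. This is an illustrative soft-routing version: the experts here are toy linear maps standing in for the expert MLPs, and the names (`moe_block`, `softmax`) are ours, not the paper's.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over the last axis."""
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_block(h, W_g, experts):
    """Soft MoE feed-forward: logits u = W_g h, weights p = softmax(u),
    output y = sum_i p_i * E_i(h). `experts` is a list of callables."""
    u = h @ W_g.T                                     # (batch, K) gating logits
    p = softmax(u)                                    # (batch, K) expert weights
    outs = np.stack([E(h) for E in experts], axis=1)  # (batch, K, d)
    return (p[..., None] * outs).sum(axis=1)          # (batch, d)

rng = np.random.default_rng(0)
d, K = 8, 4
h = rng.normal(size=(2, d))
W_g = rng.normal(size=(K, d))
# Toy experts: independent linear maps standing in for the expert MLPs
Ws = [rng.normal(size=(d, d)) for _ in range(K)]
experts = [lambda x, W=W: x @ W for W in Ws]
y = moe_block(h, W_g, experts)
```

A production layer would route each token to only a few experts (see Section 3); the dense weighted sum above matches the aggregation formula as written.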

2. Diffusion Process and Soft Absorbing State

The underlying generative process employs a continuous Gaussian forward diffusion, as in DiffuSeq. At each forward step, the token representation x_t is given by:

x_t = \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I),

with \overline{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s). The reverse process is parameterized as:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, f_\theta(x_t, t),\, \sigma_t^2 I\big)

where f_\theta is the stacked sparse-attention/MoE Transformer denoiser.
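
The forward noising step can be illustrated with a small NumPy sketch. The linear β schedule, the number of steps, and all dimensions are assumptions for illustration, not values from the paper.

```python
import numpy as np

# Assumed toy schedule: linear betas over T steps
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # ᾱ_t = Π_{s<=t} (1 - β_s)

def q_sample(x0, t, rng):
    """One-shot forward diffusion: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 8))         # toy token embeddings
x_t = q_sample(x0, t=50, rng=rng)     # noised state halfway through the chain
```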

A key addition is the introduction of a soft absorbing state m, handled via a probability schedule \gamma_t. The forward process becomes:

q(x_t \mid x_{t-1}) = (1 - \gamma_t)\, \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\big) + \gamma_t\, \delta_m(x_t)

Once a token enters the absorbing state, it is masked, reducing unnecessary noising and enabling rapid denoising on discrete representations. This facilitates acceleration in generating coherent sequences for long-form text (Christoforos et al., 23 Dec 2025).
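
A minimal NumPy sketch of this mixture kernel, assuming a zero vector for the absorbing embedding m and arbitrary β_t, γ_t constants (none of these values come from the paper):

```python
import numpy as np

def absorbing_forward_step(x_prev, absorbed, beta_t, gamma_t, m, rng):
    """One forward step of the mixture kernel: with probability γ_t a token
    jumps to the absorbing state m (and stays there); otherwise it takes the
    usual Gaussian step sqrt(1-β_t)·x_{t-1} + sqrt(β_t)·ε."""
    n, d = x_prev.shape
    jump = rng.random(n) < gamma_t
    absorbed = absorbed | jump                     # absorption is permanent
    gauss = (np.sqrt(1.0 - beta_t) * x_prev
             + np.sqrt(beta_t) * rng.normal(size=(n, d)))
    x_t = np.where(absorbed[:, None], m, gauss)    # absorbed rows pinned to m
    return x_t, absorbed

rng = np.random.default_rng(0)
n, d = 32, 8
m = np.zeros(d)                                    # assumed absorbing embedding
x = rng.normal(size=(n, d))
absorbed = np.zeros(n, dtype=bool)
for t in range(10):
    x, absorbed = absorbing_forward_step(x, absorbed,
                                         beta_t=0.02, gamma_t=0.1, m=m, rng=rng)
```

Tokens that enter the absorbing state need no further Gaussian noising, which is the source of the claimed speedup.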

3. Sparse Attention and MoE Mechanisms

The model’s sparse attention combines sliding-window and dilated-window strategies. Dense self-attention has O(n²) time and memory complexity, whereas the sparse sliding-window mechanism reduces this to O(nw) for window size w, a significant advantage for sequences with n ≫ w. Dilation extends the receptive field without increasing the window size: ReceptiveField = L · d · w across the network's L layers with dilation d.
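
The sliding/dilated window plus global tokens can be expressed as a boolean attention mask. This NumPy sketch (with illustrative sizes) shows why each query row touches O(w) keys rather than O(n):

```python
import numpy as np

def sparse_mask(n, w, d=1, global_idx=()):
    """Boolean attention mask: query i attends to keys j = i + k·d for
    k in [-w, w] (a dilated sliding window), plus designated global tokens
    that attend to and from every position."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for k in range(-w, w + 1):
            j = i + k * d
            if 0 <= j < n:
                mask[i, j] = True
    for g in global_idx:
        mask[g, :] = True      # global token attends everywhere
        mask[:, g] = True      # and every position attends to it
    return mask

# Token 0 plays the role of a [CLS]-style global token
m = sparse_mask(n=16, w=2, d=1, global_idx=(0,))
```

Each row of `m` has at most 2w + 1 local entries (plus globals), so masked attention costs O(n·w) overall versus O(n²) for the dense case.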

Global attention is reserved for a small number of global tokens, such as [CLS], that can attend to and from all positions, enabling information routing across distant parts of the sequence.

The MoE module uses a softmax-based gating function, commonly with top-k selection for sparsity, routing each token's hidden representation to a subset of expert MLPs. Each expert is an independent feed-forward block, and the output is a weighted sum over selected experts:

\mathrm{MoE}(h) = \sum_{i=1}^{K} p_i \, E_i(h)

While the current version does not define an explicit auxiliary load-balancing loss, such a loss may be incorporated to enforce equitable usage of experts (e.g., \mathrm{Loss}_{\text{load}} = \sum_i (\mathbb{E}[p_i])^2) (Christoforos et al., 23 Dec 2025).
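
A NumPy sketch of top-k gating together with the hypothetical load-balancing regularizer above; the function names and sizes are illustrative, and the regularizer itself is a suggestion from the text, not part of the published model.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def topk_gate(h, W_g, k):
    """Top-k routing: keep the k largest gate weights per token, renormalize."""
    p = softmax(h @ W_g.T)                       # (n, K) dense gate weights
    drop = np.argsort(p, axis=-1)[:, :-k]        # indices of dropped experts
    sparse = p.copy()
    np.put_along_axis(sparse, drop, 0.0, axis=-1)
    return sparse / sparse.sum(axis=-1, keepdims=True)

def load_balance_loss(p):
    """Hypothetical regularizer sum_i (E[p_i])^2; minimized (at 1/K) when
    average expert usage is uniform."""
    return float((p.mean(axis=0) ** 2).sum())

rng = np.random.default_rng(0)
h = rng.normal(size=(64, 8))
W_g = rng.normal(size=(4, 8))
p = topk_gate(h, W_g, k=2)                       # 2-of-4 experts per token
loss = load_balance_loss(p)
```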

4. Training and Inference Algorithms

Training minimizes a combined loss consisting of a denoising mean squared error (MSE) and a regularizer R (typically L_2) on the initial state:

\mathcal{L} = \sum_{t=2}^{T} \| \mathrm{EMB}(w_t) - f_\theta(z_t, t) \|^2 + R(\| z_0 \|)

where \mathrm{EMB}(w_t) is the token embedding at step t and z_t is the diffused state.

Training Algorithm Outline:

for each minibatch of token sequences w_1 … w_n:
    z_0 ← EMB(w)
    sample t ∼ Uniform({2, …, T})
    sample ε ∼ 𝒩(0, I)
    z_t ← √ᾱ_t · z_0 + √(1 − ᾱ_t) · ε
    ŷ ← MoE-DiffuSeq(z_t, t)
    ℒ ← ‖z_0 − ŷ‖² + R(‖z_0‖)
    backpropagate ℒ; update θ, W_g, expert params
end

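
The training loop above can be made concrete as a single runnable NumPy step. The linear map standing in for the MoE denoiser, the β schedule, and the regularizer weight are all toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, d = 100, 16, 8
betas = np.linspace(1e-4, 0.02, T + 1)          # β_1..β_T (index 0 unused)
alpha_bar = np.cumprod(1.0 - betas)             # ᾱ_t

W = rng.normal(size=(d, d)) * 0.01              # toy linear "denoiser" params

def f_theta(z_t, t):
    return z_t @ W                              # stands in for the MoE Transformer

z0 = rng.normal(size=(n, d))                    # EMB(w): toy embeddings
t = rng.integers(2, T + 1)                      # t ~ Uniform({2, …, T})
eps = rng.normal(size=(n, d))                   # ε ~ N(0, I)
z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
y_hat = f_theta(z_t, t)
lam = 1e-3                                      # assumed regularizer weight
loss = ((z0 - y_hat) ** 2).sum() + lam * (z0 ** 2).sum()  # MSE + L2 on z_0
grad_W = -2.0 * z_t.T @ (z0 - y_hat)            # analytic gradient wrt W
W -= 1e-4 * grad_W                              # SGD update on the toy params
```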
Sampling (Inference) Pseudocode:

z_T ∼ 𝒩(0, I)
for t = T, …, 1:
    μ, σ ← f_θ(z_t, t)
    z_{t−1} ← μ + σ · ε_t    # ε_t ∼ 𝒩(0, I) if t > 1, else zero
end
tokens ← DECODER(z_0)

In practice, sampling is accelerated using DPM-Solver++ to reduce step count (Christoforos et al., 23 Dec 2025).
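
The ancestral sampling loop can be sketched in NumPy with a placeholder denoiser. The shrinkage rule below is purely illustrative (the real f_θ is the trained sparse-attention/MoE Transformer), and this sketch uses plain ancestral steps, not DPM-Solver++.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, d = 50, 8, 4
betas = np.linspace(1e-4, 0.02, T + 1)

def f_theta(z_t, t):
    """Placeholder denoiser returning the predicted mean of p(z_{t-1} | z_t);
    purely illustrative in place of the trained model."""
    return z_t * np.sqrt(1.0 - betas[t])        # toy shrinkage toward 0

z = rng.normal(size=(n, d))                     # z_T ~ N(0, I)
for t in range(T, 0, -1):
    mu = f_theta(z, t)
    sigma = np.sqrt(betas[t])
    eps = rng.normal(size=(n, d)) if t > 1 else 0.0   # no noise at final step
    z = mu + sigma * eps                        # z_{t-1}
z0 = z                                          # would be decoded back to tokens
```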

5. Empirical Results and Evaluation

MoE-DiffuSeq has been evaluated extensively on multiple long-document and sequence tasks, including arXiv abstract summarization, HotpotQA, commonsense conversations, and QQP paraphrasing. The following table summarizes key comparative results:

| Task | Metric | Longformer | DiffuSeq | MoE-DiffuSeq |
|------|--------|------------|----------|--------------|
| arXiv Summarization | ROUGE-1 | 41.44 | 39.12 | 44.41 |
| arXiv Summarization | ROUGE-2 | 17.52 | 16.43 | 18.73 |
| arXiv Summarization | ROUGE-L | 38.70 | 37.88 | 39.89 |
| HotpotQA (Answer) | EM / F1 | 71.21 / 82.42 | 70.91 / 81.43 | 72.88 / 85.42 |
| Commonsense Conv. | BLEU | 0.030 | 0.022 | 0.049 |
| Commonsense Conv. | ROUGE-L | 0.139 | 0.119 | 0.233 |
| Commonsense Conv. | BERTScore | 0.602 | 0.501 | 0.628 |
| QQP Paraphrasing | Accuracy (%) | 92.3 | 91.7 | 95.3 |

Ablation studies isolate the contribution of each module. For example, removing sparse attention reduces ROUGE-1 from 44.41 to 42.52, and increasing the number of diffusion steps beyond 2048 yields only marginal further gains. Varying the window size w likewise modulates the trade-off between speed and text quality.

6. Related MoE Diffusion Language Models

MoE-DiffuSeq is part of a broader movement towards parameter-efficient large-scale diffusion models that incorporate MoE layers, as demonstrated by models such as LLaDA-MoE (Zhu et al., 29 Sep 2025). LLaDA-MoE replaces each MLP feed-forward network in a 16-layer Transformer with a pool of 64 experts, activates only 2 per token, and achieves state-of-the-art performance for its scale across diverse tasks by activating only ~1.4B parameters (of 7B) per inference. LLaDA-MoE uses discrete token masking rather than embedding dequantization, additional load-balancing regularizers for expert utilization, and a semi-autoregressive inference scheme.

These developments indicate that the integration of sparse attention and MoE in the diffusion language modeling regime brings significant efficiency gains and flexible scaling. A plausible implication is that such hybrid architectures will be increasingly critical for practical deployment of diffusion-based sequence generation in settings with massive context windows or document lengths (Christoforos et al., 23 Dec 2025, Zhu et al., 29 Sep 2025).

7. Limitations and Future Directions

Current limitations of MoE-DiffuSeq include fixed expert count and static window sizes, with routing overheads that may increase for extreme scales. Potential enhancements suggested include:

  • Deployment of dynamic top-k routing and more efficient load balancing to minimize expert over-specialization.
  • Extension to multimodal diffusion by introducing cross-modal sparse attention modules (e.g., for text+image).
  • Hierarchical MoE architectures to handle document- versus sentence-level routing decisions.
  • Learnable absorbing-state schedules γ_t for optimizing denoising speed per token or context.
  • Integration of retrieval augmentation for sequences exceeding 100,000 tokens.

This suggests that future research may focus on adaptivity, scaling, and the integration of additional modalities and retrieval mechanisms to further improve the applicability of MoE-augmented diffusion models for large-scale, high-quality sequence generation (Christoforos et al., 23 Dec 2025).
