DiffuApriel: Masked Diffusion LM
- DiffuApriel is a masked diffusion language model that replaces Transformer denoisers with a bidirectional Mamba-2 state-space backbone.
- It achieves linear-time inference with up to 4.4× throughput gains, reducing memory overhead in long-context text generation.
- The hybrid DiffuApriel-H variant interleaves sparse attention layers, balancing global and local context modeling while preserving efficiency.
DiffuApriel is a masked diffusion language model (DLM) architecture that replaces the conventional Transformer-based denoiser with a bidirectional state-space backbone, specifically the Mamba-2 architecture. Diffusion LMs, as opposed to standard autoregressive (AR) LMs, enable parallel denoising of entire corrupted sequences, eliminating the left-to-right decoding bottleneck and attendant quadratic attention costs. DiffuApriel achieves linear-time inference and significantly higher throughput in long-context regimes, offering a scalable, memory-efficient solution for iterative text generation. Additionally, a hybrid variant, DiffuApriel-H, interleaves sparse attention layers to balance global and local context modeling without compromising the core efficiency advantages.
1. Motivation and Theoretical Foundations
Diffusion-based text generation has gained traction as a non-autoregressive alternative to AR LMs. While AR LMs decode sequentially and incur increasing latency and attention costs with longer outputs, diffusion LMs parallelize generation via iterative denoising. Existing diffusion LMs, however, inherit the Transformer backbone's inefficiencies—primarily quadratic computation in attention and memory overhead from key-value (KV) caches.
DiffuApriel introduces a substantial architectural shift by employing a bidirectional Mamba-2 state-space model (SSM) as the denoiser. This backbone executes streaming state recurrences rather than full-sequence attention computations, yielding $O(B \cdot L \cdot d)$ inference complexity for batch size $B$, sequence length $L$, and hidden size $d$. This linear scaling eliminates the Transformer’s quadratic attention bottleneck and obviates the need for full attention matrices or extensive KV caches. For long or infilled sequences, empirical results indicate up to a 4.4× increase in throughput for models with 1.3B parameters, growing to 5.3× in extreme long-context scenarios (Singh et al., 19 Nov 2025).
2. Architecture and Diffusion Objectives
2.1 Bidirectional Mamba Backbone
Each Mamba mixer block processes the input embeddings in both forward and backward directions via SSM recurrences:
- Forward recurrence:
  $$\mathbf{y}^{\mathrm{fwd}}_t = \left(\mathbf{x} * \mathbf{K}^{\mathrm{fwd}}\right)_t \quad \text{for } t = 1, \dots, L.$$
- Backward recurrence:
  $$\mathbf{y}^{\mathrm{bwd}}_t = \left(\mathbf{x} * \mathbf{K}^{\mathrm{bwd}}\right)_t \quad \text{for } t = L, \dots, 1,$$
where $*$ denotes a causal or anti-causal 1D convolution with learned state kernels $\mathbf{K}^{\mathrm{fwd}}$ and $\mathbf{K}^{\mathrm{bwd}}$. The two directions are fused additively:
$$\mathbf{y}_t = \mathbf{y}^{\mathrm{fwd}}_t + \mathbf{y}^{\mathrm{bwd}}_t.$$
Layer normalization and residual connections follow, optionally together with a two-layer MLP whose expansion ratio is reduced to keep the parameter budget comparable.
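For concreteness, here is a minimal PyTorch sketch of such a bidirectional mixer block. It is an illustration only: a depthwise causal 1D convolution stands in for the Mamba-2 selective-scan kernel in each direction, and the class name, kernel size, and expansion ratio are placeholder choices rather than the paper's hyperparameters.

```python
import torch
import torch.nn as nn


class BidirectionalMixerBlock(nn.Module):
    """Sketch of a bidirectional mixer: forward and backward sequence mixing,
    fused additively, with LayerNorm, residual, and an optional MLP sublayer."""

    def __init__(self, d_model: int, kernel_size: int = 4, mlp_ratio: float = 2.0):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Depthwise causal convolutions stand in for the forward/backward SSM scans.
        self.fwd = nn.Conv1d(d_model, d_model, kernel_size,
                             padding=kernel_size - 1, groups=d_model)
        self.bwd = nn.Conv1d(d_model, d_model, kernel_size,
                             padding=kernel_size - 1, groups=d_model)
        hidden = int(d_model * mlp_ratio)  # reduced expansion ratio, per the text
        self.mlp = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, hidden), nn.GELU(),
                                 nn.Linear(hidden, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm(x).transpose(1, 2)                 # (B, d, L) for Conv1d
        L = h.size(-1)
        y_fwd = self.fwd(h)[..., :L]                     # causal: keep first L outputs
        y_bwd = self.bwd(h.flip(-1))[..., :L].flip(-1)   # anti-causal via sequence reversal
        y = (y_fwd + y_bwd).transpose(1, 2)              # additive fusion of both directions
        x = x + y                                        # residual connection
        return x + self.mlp(x)                           # optional MLP sublayer


# Usage: a (batch, seq_len, d_model) tensor goes in, the same shape comes out.
block = BidirectionalMixerBlock(d_model=64)
out = block(torch.randn(2, 128, 64))
```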
2.2 Timestep Conditioning (AdaLN)
The denoiser is conditioned on the instantaneous noise level $t$. The scalar $t$ is mapped via an MLP to an embedding $\mathbf{c}$ and incorporated into both the SSM mixer and MLP sublayers using adaptive layer normalization (AdaLN):
$$\mathrm{AdaLN}(\mathbf{h}, \mathbf{c}) = \boldsymbol{\gamma}(\mathbf{c}) \odot \mathrm{LayerNorm}(\mathbf{h}) + \boldsymbol{\beta}(\mathbf{c}),$$
where $\left(\boldsymbol{\gamma}(\mathbf{c}), \boldsymbol{\beta}(\mathbf{c})\right) = \mathrm{MLP}(\mathbf{c})$.
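A minimal sketch of this conditioning path follows, assuming a simple scalar-to-vector MLP for the timestep embedding and a DiT-style `(1 + gamma)` scale parameterization; both choices are assumptions made for illustration, not details taken from the source.

```python
import torch
import torch.nn as nn


class TimestepEmbedding(nn.Module):
    """Maps a scalar noise level t to a conditioning vector c via a small MLP."""

    def __init__(self, d_cond: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, d_cond), nn.SiLU(),
                                 nn.Linear(d_cond, d_cond))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.mlp(t.view(-1, 1))          # (B,) -> (B, d_cond)


class AdaLN(nn.Module):
    """Adaptive LayerNorm: per-channel gamma/beta are predicted from the embedding c."""

    def __init__(self, d_model: int, d_cond: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_cond, 2 * d_model)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, L, d_model) hidden states; c: (B, d_cond) timestep embedding
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)
        return (1 + gamma.unsqueeze(1)) * self.norm(h) + beta.unsqueeze(1)


# Usage: condition a hidden-state tensor on a batch of noise levels.
emb, ada = TimestepEmbedding(d_cond=32), AdaLN(d_model=64, d_cond=32)
h = ada(torch.randn(2, 128, 64), emb(torch.rand(2)))
```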
2.3 Masked Diffusion Objective
Utilizing the absorbing-state masked diffusion process from DiffusionBERT, the forward noise process replaces each token $x_0^i$ by $[\mathrm{MASK}]$ with probability $1 - \alpha_t$:
$$q\!\left(x_t^i \mid x_0^i\right) = \alpha_t\, \delta_{x_t^i,\, x_0^i} + \left(1 - \alpha_t\right) \delta_{x_t^i,\, [\mathrm{MASK}]}.$$
Training minimizes the proxy objective
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_t}\!\left[\, w(t) \sum_{i\,:\, x_t^i = [\mathrm{MASK}]} -\log p_\theta\!\left(x_0^i \mid x_t, t\right) \right],$$
a weighted cross-entropy over masked positions, where $w(t)$ is a noise-level-dependent weight.
Inference proceeds via $T$ denoising steps using an MCMC-style reverse process.
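A training-side sketch of this objective is shown below, under stated assumptions: a linear schedule $\alpha_t = 1 - t$, a $1/t$ choice for $w(t)$, and a placeholder `MASK_ID`; none of these specifics are taken from the paper.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder id for the [MASK] token


def mask_tokens(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Absorbing-state forward process: each token is independently replaced by
    [MASK] with probability 1 - alpha_t (here alpha_t = 1 - t, an assumed schedule)."""
    keep_prob = (1.0 - t).unsqueeze(-1)                       # (B, 1)
    keep = torch.rand(x0.shape, device=x0.device) < keep_prob
    return torch.where(keep, x0, torch.full_like(x0, MASK_ID))


def masked_diffusion_loss(denoiser, x0: torch.Tensor) -> torch.Tensor:
    """Reweighted cross-entropy on masked positions, mirroring the proxy objective above."""
    t = torch.rand(x0.size(0), device=x0.device).clamp(min=1e-3)        # noise level per sequence
    xt = mask_tokens(x0, t)
    logits = denoiser(xt, t)                                            # (B, L, vocab): predict clean tokens
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    masked = (xt == MASK_ID).float()
    per_seq = (ce * masked).sum(-1) / masked.sum(-1).clamp(min=1.0)     # average over masked positions
    return (per_seq / t).mean()                                         # 1/t weighting (assumed)
```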
2.4 Hybrid DiffuApriel-H
To capture global cross-token dependencies without forfeiting linear scaling, DiffuApriel-H interleaves a Transformer attention block after every $k$ Mamba mixers. The resulting cost adds an $O\!\left(\tfrac{1}{k} B L^2 d\right)$ attention term to the backbone’s $O(B L d)$ cost, so the quadratic contribution is amortized over $k$ linear-cost mixers and the linear term dominates for moderate sequence lengths, with global attention capacity retained through sparse quadratic steps.
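As a sketch of the interleaving pattern (not the paper's implementation), the stack below reuses the `BidirectionalMixerBlock` sketch from Section 2.1 and a standard `nn.TransformerEncoderLayer` as the periodic attention block; the period `k` is left symbolic.

```python
import torch.nn as nn


def build_hybrid_stack(n_layers: int, k: int, d_model: int, n_heads: int = 8) -> nn.ModuleList:
    """One attention block after every k mixer blocks; all other layers are linear-cost mixers."""
    layers = nn.ModuleList()
    for i in range(n_layers):
        if (i + 1) % (k + 1) == 0:
            # Sparse, periodic bidirectional attention for global context.
            layers.append(nn.TransformerEncoderLayer(d_model, n_heads,
                                                     batch_first=True, norm_first=True))
        else:
            # Linear-cost bidirectional mixer (see the sketch in Section 2.1).
            layers.append(BidirectionalMixerBlock(d_model))
    return layers
```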
3. Complexity and Inference Throughput
Transformer diffusion LMs require $O(B L^2 d)$ computation per denoising step due to attention, resulting in rapidly degraded throughput under long contexts. In contrast, DiffuApriel’s pure-Mamba backbone yields $O(B L d)$ per step, allowing performance to scale nearly linearly with sequence length. DiffuApriel-H introduces a secondary cost from periodic full attention but maintains dominant linear scaling, with up to 2.6× throughput gains.
Empirical benchmarks, measured on NVIDIA H100 hardware (bf16, PyTorch, CUDA Graphs; batch size 1), confirm this scalability:
- Transformer DLM: roughly 2k tokens/sec at short sequence lengths, falling below 200 tok/s at the longest tested contexts.
- DiffuApriel: peaks near 10k tok/s and sustains around 8k tok/s at the longest tested contexts (4.4–5.3× faster).
- DiffuApriel-H: Delivers 2.6–2.8× speedup across tested lengths.
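The scaling gap behind these throughput numbers can be illustrated with a back-of-the-envelope comparison of the two per-step cost models from above; the hidden size and sequence lengths in this sketch are illustrative, not measured values.

```python
def per_step_cost(L: int, d: int, quadratic: bool) -> int:
    """Relative compute per denoising step: ~L^2 * d with full attention, ~L * d on the SSM path."""
    return L * L * d if quadratic else L * d


for L in (1024, 8192, 32768):
    ratio = per_step_cost(L, 2048, quadratic=True) / per_step_cost(L, 2048, quadratic=False)
    print(f"L={L:>6}: attention / SSM per-step cost ratio ~ {ratio:,.0f}x")
```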
4. Comparative Empirical Performance
4.1 Validation Perplexity
Validation perplexity (PPL) for 1.3B-parameter models under Chinchilla and Quokka data budgets is summarized:
| Model | Chinchilla | Quokka |
|---|---|---|
| Transformer-DLM | 25.01 | 22.72 |
| DiffuApriel | 23.36 | 21.29 |
| DiffuApriel+MLP | 22.89 | 20.17 |
The DiffuApriel+MLP variant (Mamba mixers with the optional MLP sublayers) achieves lower perplexity than both the pure-Mamba DiffuApriel and the Transformer DLM.
4.2 Zero-Shot Generalization
On PTB, WikiText, LM1B, Lambada, AG News, PubMed, and ArXiv datasets, DiffuApriel+MLP yields the lowest perplexity, with the standard DiffuApriel consistently outperforming Transformer DLMs.
4.3 Commonsense and Reasoning Tasks
Performance on OpenBookQA, HellaSwag, PIQA, LogiQA, and ARC:
| Model | Avg. Acc. |
|---|---|
| Transformer-DLM | 33.8 % |
| DiffuApriel | 37.9 % |
| DiffuApriel+MLP | 38.2 % |
DiffuApriel+MLP shows an absolute gain of roughly 4 percentage points (38.2% vs. 33.8%) over the Transformer baseline on these tasks.
4.4 Ablation and Sensitivity Studies
- MLP adapters consistently reduce perplexity by 0.5–1 point.
- Hybrid architectures deliver the largest accuracy gains at 0.5B and 1.3B parameters.
- All benchmarks use a fixed number of denoising steps; analogous quality and efficiency trade-offs hold at both smaller (240M) and larger scales.
5. Limitations and Practical Implications
Performance remains suboptimal versus attention-based baselines for very short contexts, suggesting that dynamic mixtures of local convolution and SSM modules may further improve efficiency for shorter sequences. DiffuApriel’s linear inference cost and modest memory footprint make it particularly suitable for resource-constrained deployments, on-device inference, and long-context denoising or infilling.
Future research directions include integrating block-diffusion or grouped sampling methods to reduce denoising steps, scaling SSM-based models to larger multi-billion-parameter sizes for competitive open-ended generation, and combining DiffuApriel’s Mamba backbone with advanced approximate KV cache schemes for hybrid models.
6. Significance and Outlook
DiffuApriel demonstrates that bidirectional SSMs can fully supplant Transformers as denoisers in masked diffusion LMs, obviating the “quadratic attention tax.” The approach exemplifies how state-space architectures facilitate high-throughput, memory-efficient, and scalable iterative text generation. Hybridization with sparse attention blocks further enhances global context modeling without returning to full Transformer inefficiencies. These results point toward a new foundation for diffusion-based language generation, especially in workflows demanding long contexts, efficient infilling, and cost-sensitive deployment (Singh et al., 19 Nov 2025).