DiffuApriel: Masked Diffusion LM
- DiffuApriel is a masked diffusion language model that replaces Transformer denoisers with a bidirectional Mamba-2 state-space backbone.
- It achieves linear-time inference with up to 4.4× throughput gains, reducing memory overhead in long-context text generation.
- The hybrid DiffuApriel-H variant interleaves sparse attention layers, balancing global and local context modeling while preserving efficiency.
DiffuApriel is a masked diffusion language model (DLM) architecture that replaces the conventional Transformer-based denoiser with a bidirectional state-space backbone, specifically the Mamba-2 architecture. Diffusion LMs, as opposed to standard autoregressive (AR) LMs, enable parallel denoising of entire corrupted sequences, eliminating the left-to-right decoding bottleneck and attendant quadratic attention costs. DiffuApriel achieves linear-time inference and significantly higher throughput in long-context regimes, offering a scalable, memory-efficient solution for iterative text generation. Additionally, a hybrid variant, DiffuApriel-H, interleaves sparse attention layers to balance global and local context modeling without compromising the core efficiency advantages.
1. Motivation and Theoretical Foundations
Diffusion-based text generation has gained traction as a non-autoregressive alternative to AR LMs. While AR LMs decode sequentially and incur increasing latency and attention costs with longer outputs, diffusion LMs parallelize generation via iterative denoising. Existing diffusion LMs, however, inherit the Transformer backbone's inefficiencies—primarily quadratic computation in attention and memory overhead from key-value (KV) caches.
DiffuApriel introduces a substantial architectural shift by employing a bidirectional Mamba-2 state-space model (SSM) as the denoiser. This backbone executes streaming state recurrences rather than full-sequence attention computations, yielding $O(B \cdot L \cdot d)$ inference complexity for batch size $B$, sequence length $L$, and hidden size $d$. This linear scaling eliminates the Transformer’s quadratic attention bottleneck and obviates the need for full attention matrices or extensive KV caches. For long or infilled sequences, empirical results indicate up to a 4.4× increase in throughput for models with 1.3B parameters, growing to 5.3× in extreme long-context scenarios (Singh et al., 19 Nov 2025).
2. Architecture and Diffusion Objectives
2.1 Bidirectional Mamba Backbone
Each Mamba mixer block processes the input embeddings in both forward and backward directions via SSM recurrences:
- Forward recurrence:
  $$\mathbf{y}^{\mathrm{fwd}}_t = \left(\mathbf{x} * \mathbf{K}^{\mathrm{fwd}}\right)_t \quad \text{for } t = 1, \dots, L.$$
- Backward recurrence:
  $$\mathbf{y}^{\mathrm{bwd}}_t = \left(\mathbf{x} * \mathbf{K}^{\mathrm{bwd}}\right)_t \quad \text{for } t = L, \dots, 1,$$
where $*$ denotes a causal or anti-causal 1D convolution with learned state kernels $\mathbf{K}^{\mathrm{fwd}}$ and $\mathbf{K}^{\mathrm{bwd}}$. The two directions are fused additively:
$$\mathbf{y}_t = \mathbf{y}^{\mathrm{fwd}}_t + \mathbf{y}^{\mathrm{bwd}}_t.$$
Layer normalization and residual connections follow, optionally together with a two-layer MLP whose expansion ratio is reduced to keep the parameter budget comparable.
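For concreteness, here is a minimal PyTorch sketch of such a bidirectional mixer block. It is an illustration only: a depthwise causal 1D convolution stands in for the Mamba-2 selective-scan kernel in each direction, and the class name, kernel size, and expansion ratio are placeholder choices rather than the paper's hyperparameters.

```python
import torch
import torch.nn as nn


class BidirectionalMixerBlock(nn.Module):
    """Sketch of a bidirectional mixer: forward and backward sequence mixing,
    fused additively, with LayerNorm, residual, and an optional MLP sublayer."""

    def __init__(self, d_model: int, kernel_size: int = 4, mlp_ratio: float = 2.0):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Depthwise causal convolutions stand in for the forward/backward SSM scans.
        self.fwd = nn.Conv1d(d_model, d_model, kernel_size,
                             padding=kernel_size - 1, groups=d_model)
        self.bwd = nn.Conv1d(d_model, d_model, kernel_size,
                             padding=kernel_size - 1, groups=d_model)
        hidden = int(d_model * mlp_ratio)  # reduced expansion ratio, per the text
        self.mlp = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, hidden), nn.GELU(),
                                 nn.Linear(hidden, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm(x).transpose(1, 2)                 # (B, d, L) for Conv1d
        L = h.size(-1)
        y_fwd = self.fwd(h)[..., :L]                     # causal: keep first L outputs
        y_bwd = self.bwd(h.flip(-1))[..., :L].flip(-1)   # anti-causal via sequence reversal
        y = (y_fwd + y_bwd).transpose(1, 2)              # additive fusion of both directions
        x = x + y                                        # residual connection
        return x + self.mlp(x)                           # optional MLP sublayer


# Usage: a (batch, seq_len, d_model) tensor goes in, the same shape comes out.
block = BidirectionalMixerBlock(d_model=64)
out = block(torch.randn(2, 128, 64))
```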
2.2 Timestep Conditioning (AdaLN)
The denoiser is conditioned on the instantaneous noise level $t$. The scalar $t$ is mapped via an MLP to an embedding $\mathbf{c}$ and incorporated into both the SSM mixer and MLP sublayers using adaptive layer normalization (AdaLN):
$$\mathrm{AdaLN}(\mathbf{h}, \mathbf{c}) = \boldsymbol{\gamma}(\mathbf{c}) \odot \mathrm{LayerNorm}(\mathbf{h}) + \boldsymbol{\beta}(\mathbf{c}),$$
where $\left(\boldsymbol{\gamma}(\mathbf{c}), \boldsymbol{\beta}(\mathbf{c})\right) = \mathrm{MLP}(\mathbf{c})$.
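A minimal sketch of this conditioning path follows, assuming a simple scalar-to-vector MLP for the timestep embedding and a DiT-style `(1 + gamma)` scale parameterization; both choices are assumptions made for illustration, not details taken from the source.

```python
import torch
import torch.nn as nn


class TimestepEmbedding(nn.Module):
    """Maps a scalar noise level t to a conditioning vector c via a small MLP."""

    def __init__(self, d_cond: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, d_cond), nn.SiLU(),
                                 nn.Linear(d_cond, d_cond))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.mlp(t.view(-1, 1))          # (B,) -> (B, d_cond)


class AdaLN(nn.Module):
    """Adaptive LayerNorm: per-channel gamma/beta are predicted from the embedding c."""

    def __init__(self, d_model: int, d_cond: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_cond, 2 * d_model)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, L, d_model) hidden states; c: (B, d_cond) timestep embedding
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)
        return (1 + gamma.unsqueeze(1)) * self.norm(h) + beta.unsqueeze(1)


# Usage: condition a hidden-state tensor on a batch of noise levels.
emb, ada = TimestepEmbedding(d_cond=32), AdaLN(d_model=64, d_cond=32)
h = ada(torch.randn(2, 128, 64), emb(torch.rand(2)))
```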
2.3 Masked Diffusion Objective
Utilizing the absorbing-state masked diffusion process from DiffusionBERT, the forward noise process replaces each token $x_0^i$ by $[\mathrm{MASK}]$ with probability $1 - \alpha_t$:
$$q\!\left(x_t^i \mid x_0^i\right) = \alpha_t\, \delta_{x_t^i,\, x_0^i} + \left(1 - \alpha_t\right) \delta_{x_t^i,\, [\mathrm{MASK}]}.$$
Training minimizes the proxy objective
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_t}\!\left[\, w(t) \sum_{i\,:\, x_t^i = [\mathrm{MASK}]} -\log p_\theta\!\left(x_0^i \mid x_t, t\right) \right],$$
a weighted cross-entropy over masked positions, where $w(t)$ is a noise-level-dependent weight.
Inference proceeds via $T$ denoising steps using an MCMC-style reverse process.
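A training-side sketch of this objective is shown below, under stated assumptions: a linear schedule $\alpha_t = 1 - t$, a $1/t$ choice for $w(t)$, and a placeholder `MASK_ID`; none of these specifics are taken from the paper.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder id for the [MASK] token


def mask_tokens(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Absorbing-state forward process: each token is independently replaced by
    [MASK] with probability 1 - alpha_t (here alpha_t = 1 - t, an assumed schedule)."""
    keep_prob = (1.0 - t).unsqueeze(-1)                       # (B, 1)
    keep = torch.rand(x0.shape, device=x0.device) < keep_prob
    return torch.where(keep, x0, torch.full_like(x0, MASK_ID))


def masked_diffusion_loss(denoiser, x0: torch.Tensor) -> torch.Tensor:
    """Reweighted cross-entropy on masked positions, mirroring the proxy objective above."""
    t = torch.rand(x0.size(0), device=x0.device).clamp(min=1e-3)        # noise level per sequence
    xt = mask_tokens(x0, t)
    logits = denoiser(xt, t)                                            # (B, L, vocab): predict clean tokens
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    masked = (xt == MASK_ID).float()
    per_seq = (ce * masked).sum(-1) / masked.sum(-1).clamp(min=1.0)     # average over masked positions
    return (per_seq / t).mean()                                         # 1/t weighting (assumed)
```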
2.4 Hybrid DiffuApriel-H
To capture global cross-token dependencies without forfeiting linear scaling, DiffuApriel-H interleaves a Transformer attention block after every $k$ Mamba mixers. The resulting cost adds an $O\!\left(\tfrac{1}{k} B L^2 d\right)$ attention term to the backbone’s $O(B L d)$ cost, so the quadratic contribution is amortized over $k$ linear-cost mixers and the linear term dominates for moderate sequence lengths, with global attention capacity retained through sparse quadratic steps.
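As a sketch of the interleaving pattern (not the paper's implementation), the stack below reuses the `BidirectionalMixerBlock` sketch from Section 2.1 and a standard `nn.TransformerEncoderLayer` as the periodic attention block; the period `k` is left symbolic.

```python
import torch.nn as nn


def build_hybrid_stack(n_layers: int, k: int, d_model: int, n_heads: int = 8) -> nn.ModuleList:
    """One attention block after every k mixer blocks; all other layers are linear-cost mixers."""
    layers = nn.ModuleList()
    for i in range(n_layers):
        if (i + 1) % (k + 1) == 0:
            # Sparse, periodic bidirectional attention for global context.
            layers.append(nn.TransformerEncoderLayer(d_model, n_heads,
                                                     batch_first=True, norm_first=True))
        else:
            # Linear-cost bidirectional mixer (see the sketch in Section 2.1).
            layers.append(BidirectionalMixerBlock(d_model))
    return layers
```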
3. Complexity and Inference Throughput
Transformer diffusion LMs require $O(B L^2 d)$ computation per denoising step due to attention, resulting in rapidly degraded throughput under long contexts. In contrast, DiffuApriel’s pure-Mamba backbone yields $O(B L d)$ per step, allowing performance to scale nearly linearly with sequence length. DiffuApriel-H introduces a secondary cost from periodic full attention but maintains dominant linear scaling, with up to 2.6× throughput gains.
Empirical benchmarks, measured on NVIDIA H100 hardware (bf16, PyTorch, CUDA Graphs; batch size 1), confirm this scalability:
- Transformer DLM: roughly 2k tokens/sec at short sequence lengths, falling below 200 tok/s at the longest tested contexts.
- DiffuApriel: peaks near 10k tok/s and sustains around 8k tok/s at the longest tested contexts (4.4–5.3× faster).
- DiffuApriel-H: Delivers 2.6–2.8× speedup across tested lengths.
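The scaling gap behind these throughput numbers can be illustrated with a back-of-the-envelope comparison of the two per-step cost models from above; the hidden size and sequence lengths in this sketch are illustrative, not measured values.

```python
def per_step_cost(L: int, d: int, quadratic: bool) -> int:
    """Relative compute per denoising step: ~L^2 * d with full attention, ~L * d on the SSM path."""
    return L * L * d if quadratic else L * d


for L in (1024, 8192, 32768):
    ratio = per_step_cost(L, 2048, quadratic=True) / per_step_cost(L, 2048, quadratic=False)
    print(f"L={L:>6}: attention / SSM per-step cost ratio ~ {ratio:,.0f}x")
```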
4. Comparative Empirical Performance
4.1 Validation Perplexity
Validation perplexity (PPL) for 1.3B-parameter models under Chinchilla and Quokka data budgets is summarized:
| Model | Chinchilla | Quokka |
|---|---|---|
| Transformer-DLM | 25.01 | 22.72 |
| DiffuApriel | 23.36 | 21.29 |
| DiffuApriel+MLP | 22.89 | 20.17 |
The DiffuApriel+MLP variant (Mamba mixers with the optional MLP sublayers) achieves lower perplexity than both the pure-Mamba DiffuApriel and the Transformer DLM.
4.2 Zero-Shot Generalization
On PTB, WikiText, LM1B, Lambada, AG News, PubMed, and ArXiv datasets, DiffuApriel+MLP yields the lowest perplexity, with the standard DiffuApriel consistently outperforming Transformer DLMs.
4.3 Commonsense and Reasoning Tasks
Performance on OpenBookQA, HellaSwag, PIQA, LogiQA, and ARC:
| Model | Avg. Acc. |
|---|---|
| Transformer-DLM | 33.8 % |
| DiffuApriel | 37.9 % |
| DiffuApriel+MLP | 38.2 % |
DiffuApriel+MLP shows an absolute gain of roughly 4 percentage points (38.2% vs. 33.8%) over the Transformer baseline on these tasks.
4.4 Ablation and Sensitivity Studies
- MLP adapters consistently reduce perplexity by 0.5–1 point.
- Hybrid architectures deliver the largest accuracy gains at 0.5B and 1.3B parameters.
- All benchmarks use a fixed number of denoising steps; analogous quality and efficiency trade-offs hold at both smaller (240M) and larger scales.
5. Limitations and Practical Implications
Performance remains suboptimal versus attention-based baselines for very short contexts, suggesting that dynamic mixtures of local convolution and SSM modules may further improve efficiency for shorter sequences. DiffuApriel’s linear inference cost and modest memory footprint make it particularly suitable for resource-constrained deployments, on-device inference, and long-context denoising or infilling.
Future research directions include integrating block-diffusion or grouped sampling methods to reduce denoising steps, scaling SSM-based models to larger multi-billion-parameter sizes for competitive open-ended generation, and combining DiffuApriel’s Mamba backbone with advanced approximate KV cache schemes for hybrid models.
6. Significance and Outlook
DiffuApriel demonstrates that bidirectional SSMs can fully supplant Transformers as denoisers in masked diffusion LMs, obviating the “quadratic attention tax.” The approach exemplifies how state-space architectures facilitate high-throughput, memory-efficient, and scalable iterative text generation. Hybridization with sparse attention blocks further enhances global context modeling without returning to full Transformer inefficiencies. These results point toward a new foundation for diffusion-based language generation, especially in workflows demanding long contexts, efficient infilling, and cost-sensitive deployment (Singh et al., 19 Nov 2025).