
Diffusion Language Models: Iterative Denoising

Updated 1 February 2026
  • Diffusion Language Models are non-autoregressive generative models that recast sequence generation as an iterative denoising process, enabling full bidirectional context.
  • They employ Transformer-based denoisers with parallel token prediction and adaptive decoding strategies to balance speed, quality, and controllability.
  • Recent advances show DLMs matching or surpassing autoregressive models on language, multimodal, and reasoning benchmarks while improving scalability and inference efficiency.

A Diffusion LLM (DLM) is a non-autoregressive generative model for text or multimodal sequences that recasts sequence modeling as an iterative denoising (diffusion) problem. Rather than generating tokens sequentially in a left-to-right manner, DLMs iteratively refine a noisy or partially observed sequence using a learned reverse process, enabling full bidirectional context, extensive parallelism in generation, and flexible controllability. Recent research has established DLMs as competitive with autoregressive (AR) LLMs across a range of language, reasoning, and multimodal tasks, with advantages in speed, context modeling, and controllability, and ongoing improvements in scalability and practical deployment.

1. Foundational Principles and Mathematical Formulation

A DLM operates via two coupled stochastic processes:

  • Forward (noising) process: A clean token sequence $x_0$ is progressively corrupted by randomly replacing or masking tokens; in embedding-based models, Gaussian noise is added at each step, while in discrete DLMs, tokens are replaced or masked according to a fixed schedule.

    q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big) \quad \text{(for embeddings)}

    q(x_t \mid x_{t-1}) = \mathrm{Cat}\big(x_t;\ \alpha_t x_{t-1} + (1-\alpha_t)\pi\big) \quad \text{(for discrete tokens)}

    where $\pi$ is typically a mask token or a uniform distribution over the vocabulary in discrete models (2305.14671, Li et al., 14 Aug 2025, He et al., 2022, Zhu et al., 27 Oct 2025).

  • Reverse (denoising) process: A neural network (typically Transformer-based) parameterizes the reverse chain, predicting the clean sequence from the corrupted input at each step. Training minimizes a variational lower bound (ELBO) on the data likelihood, often reduced to a denoising loss (e.g., cross-entropy on unmasked positions for discrete models, MSE on noise prediction for continuous models):

    \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2

    \mathcal{L}_{\text{discrete}} = -\,\mathbb{E}_{t, x_0, x_t}\!\left[ \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}[x_t^i = M]\, \log p_\theta(x_0^i \mid x_t) \right]

    This probabilistic framework enables exact marginalization over diffusion steps, efficient sampling, and tight control over the trade-off between quality and inference speed (2305.14671, Tae et al., 19 Feb 2025, Li et al., 14 Aug 2025).

DLMs admit both continuous (embedding-space Gaussian) and discrete (categorical/absorbing state) variants. Hybrid discrete–continuous chains and hierarchical semantic coarse-graining have also been proposed to better capture linguistic structure (Jo et al., 17 Feb 2025, Zhou et al., 8 Oct 2025).
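The discrete (absorbing-state) forward process and its masked cross-entropy loss can be sketched in a few lines of NumPy. This is a toy sketch under assumed conventions (token 0 as the mask symbol, masking probability equal to $t$, loss averaged over length); the random logits stand in for a real denoiser's output.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK, L = 100, 0, 8                    # toy vocabulary; token 0 is the [MASK] symbol

def forward_mask(x0, t):
    """Absorbing-state forward process: mask each token independently with probability t."""
    mask = rng.random(x0.shape) < t
    return np.where(mask, MASK, x0), mask

def discrete_loss(logits, x0, mask, t):
    """Cross-entropy on masked positions only, weighted by 1/t and averaged over length
    (a per-token-averaged variant of the discrete objective)."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(x0)), x0] + 1e-12)
    return (nll * mask).sum() / (t * len(x0))

x0 = rng.integers(1, VOCAB, size=L)           # clean sequence (no mask tokens)
t = 0.5
xt, mask = forward_mask(x0, t)
logits = rng.normal(size=(L, VOCAB))          # stand-in for a denoiser's output
print(discrete_loss(logits, x0, mask, t))
```

In training, the denoiser sees $x_t$ and is supervised only where tokens were masked, which is exactly what the indicator $\mathbf{1}[x_t^i = M]$ in the objective encodes.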

2. Architectural Design and Inference Mechanisms

The state-of-the-art in DLMs is dominated by Transformer-based denoisers, with key innovations for efficiency and expressiveness:

  • Bidirectional Attention: Unlike standard AR Transformers, DLMs apply full (or model-specific) bidirectional attention at each step, so every position is predicted with access to both past and future context, which benefits infilling and context-dependent decoding (Zhou et al., 24 Jul 2025, Li et al., 14 Aug 2025).
  • Parallel Prediction: All tokens (except fixed/masked) are predicted in parallel at each step, and decoding can update tokens in arbitrary or blockwise orders. Advanced block or dynamic parallelization schemes (e.g., Seed Diffusion, CARD, FreeCache) support near-linear speedups over left-to-right AR decoding with marginal quality loss (Song et al., 4 Aug 2025, Ruan et al., 29 Jan 2026, Hu et al., 27 May 2025).
  • Adaptive Decoding: Strategies such as confidence-based unmasking, beam search, or AR-verifier–guided token selection enable flexible trade-offs between speed and sequence-level consistency. Recent work has derived formal lower bounds (the "bits-to-rounds principle") for achievable decoding efficiency and introduced exploration-based methods (ETE) to approach these bounds (Fu et al., 26 Nov 2025, Hu et al., 27 May 2025).
  • Guidance and Control: DLMs can incorporate reward, classifier, or AR guidance at inference, using gradients to steer generation toward desired goals (e.g., reward optimization, constraint satisfaction), often without additional finetuning (Tae et al., 19 Feb 2025, Hu et al., 27 May 2025).
  • Self-conditioning and Intermediate Soft States: Passing intermediate model predictions (softmax logits/distributions) as additional context in later iterations has been shown to improve both sample quality and convergence, e.g., "self-conditioning" in TESS 2 (Tae et al., 19 Feb 2025).
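The confidence-based unmasking strategy above can be sketched as a toy decoding loop: start from an all-mask sequence and, at each round, commit only the most confident predictions at still-masked positions. The `denoiser` below is a hypothetical stand-in returning random logits, not any particular model's API.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK, L, STEPS = 50, 0, 12, 4

def denoiser(xt):
    """Hypothetical stand-in for a bidirectional Transformer: random per-position logits."""
    return rng.normal(size=(len(xt), VOCAB))

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def decode(steps=STEPS):
    xt = np.full(L, MASK)
    per_step = -(-L // steps)                 # ceil(L / steps) tokens committed per round
    for _ in range(steps):
        probs = softmax(denoiser(xt))
        probs[:, MASK] = 0.0                  # never predict the mask symbol itself
        conf, pred = probs.max(-1), probs.argmax(-1)
        conf[xt != MASK] = -np.inf            # only still-masked positions compete
        for i in np.argsort(conf)[::-1][:per_step]:
            if xt[i] == MASK:                 # commit the most confident predictions
                xt[i] = pred[i]
        if not np.any(xt == MASK):
            break
    return xt

print(decode())
```

With `STEPS` rounds instead of `L` sequential token emissions, the loop makes the parallelism/step-count trade-off explicit: fewer rounds commit more tokens per round, at greater risk of inconsistent parallel choices.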

3. Training Strategies and Scalability

Training DLMs typically involves:

  • Pretraining: Either from scratch on large corpora or by diffusive adaptation of pretrained denoising or autoregressive LLMs (e.g., LLaDA, TESS 2, DiffusionBERT, Flan-XLM-R), leveraging their representations and scaling properties (Ye et al., 2023, Tae et al., 19 Feb 2025, He et al., 2022).
  • Noise Scheduling: Linear, cosine, or information-aware (e.g., spindle, mutual information) noise schedules manage the noising trajectory and balance between stepwise refinement and tractable denoising (He et al., 2022, Jin et al., 27 Dec 2025).
  • Objective Simplification: Recently, losses focusing only on corrupted positions or including contrastive terms (e.g., SDDLM) have been shown to stabilize training, match ELBO-level performance, and enable scaling to very large models and few-step regimes (Zhu et al., 27 Oct 2025).
  • Hierarchical and Semantically Informed Diffusion: Incorporation of intermediate semantic clusters allows for multi-scale, progressive denoising and improves both single-step quality and overall generation perplexity (Zhou et al., 8 Oct 2025).
  • Block-wise and Curriculum-Based Training: Reward-guided and on-policy curricula can accelerate convergence and inference, especially in large-model or code-generation settings (Song et al., 4 Aug 2025, Tae et al., 19 Feb 2025).
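To illustrate the schedule shapes mentioned above, the sketch below evaluates a linear and a cosine masking-probability schedule over diffusion time $t \in [0, 1]$; the exact cosine parameterization varies across papers, so this form is only an assumed example.

```python
import numpy as np

def linear_schedule(t):
    """Mask probability grows linearly with diffusion time t in [0, 1]."""
    return np.asarray(t, dtype=float)

def cosine_schedule(t):
    """Slow corruption early, fast corruption late; one assumed cosine parameterization."""
    return 1.0 - np.cos(0.5 * np.pi * np.asarray(t, dtype=float))

ts = np.linspace(0.0, 1.0, 5)
print(np.round(linear_schedule(ts), 3))
print(np.round(cosine_schedule(ts), 3))
```

Both schedules run from 0 (clean) to 1 (fully corrupted); the cosine variant spends more of the trajectory at low corruption, which shifts denoising effort toward the easier, mostly-clean regime.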

4. Empirical Performance, Applications, and Benchmarking

Modern DLMs match or exceed open-source AR baselines on several major language, multimodal, and code-generation benchmarks, and approach or sometimes surpass commercial systems in specific tasks:

| Benchmark | Diffusion LM | Score/Accuracy | AR Baseline | Relative Standing |
|---|---|---|---|---|
| MMSU (language understanding) | DIFFA | 56.0% | Qwen2-Audio 53.3% | Outperforms the largest open-source AR models |
| MMAU (audio understanding) | DIFFA | 49.7% | SALMONN 34.9% | Substantially better than AR |
| VoiceBench | DIFFA | 48.2% | AR cascade 87% | Below the best AR systems, but competitive among open-source models |
| LM1B / OpenWebText (generative PPL) | HDLM, SDDLM, ADLM | ≤20–25 | GPT-2 Base >22 | Parity or better at similar scale |
| AlpacaEval (instruction following) | TESS 2 | 62.2% | Mistral AR 63.3% | Comparable on general instructions |
| GSM8k (math reasoning, finetuned) | TESS 2 | 66.6% | AR baseline 51% | Exceeds the AR baseline in some domains |
| Code generation (MBXP, HumanEval) | Seed Diffusion | 72.6% | CodeLlama 7B 71% | Faster, with similar code quality |

Notably, recent models like ADLM surpass AR systems in text similarity (MAUVE), ranking as more "human-like" for long-form text generation (Rout et al., 24 May 2025). DLMs are prevalent in:

  • Text (open-ended generation, infilling, paraphrase, summarization, style transfer)
  • Multimodal (spoken language/ASR: DIFFA; code: Seed Diffusion; cross-lingual: XDLM)
  • Reasoning and chain-of-thought tasks, where DLMs serve as parallel thought-proposers and can outperform AR LMs in test-time collaborative frameworks (Shao et al., 31 Oct 2025, Zhou et al., 24 Jul 2025).

5. Theoretical and Structural Trade-Offs

DLMs present unique fundamental trade-offs:

  • Discreteness versus Smoothness: Discrete DLMs ensure valid token sequences but corrupt information in non-infinitesimal steps, whereas continuous models allow for smooth corruption but require post-hoc discretization, possibly introducing artifacts (Jin et al., 27 Dec 2025, Jo et al., 17 Feb 2025).
  • Parallelism–Dependency Gap: Parallel generation incurs a risk of coherence errors, as token-wise marginal sampling cannot enforce sequence-level structure. Hybrid blockwise schemes, context-adaptive masking, and sequence-level objectives are active areas of research to mitigate this (Fu et al., 26 Nov 2025, Jin et al., 27 Dec 2025).
  • Inference Cost: While parallel updates reduce wall-clock time, each step may remain computationally large ($O(N^2)$ for sequence length $N$) due to bidirectional attention unless explicit KV-caching (CARD, FreeCache) is used. Blockwise and guided diffusion dramatically improve efficiency, up to 34× over naïve DLM sampling (Hu et al., 27 May 2025, Ruan et al., 29 Jan 2026).
  • Controllability versus Latency: DLMs uniquely admit run-time tunability—using more or fewer denoising steps as needed per application—supporting dynamic compute/quality trade-offs (Tae et al., 19 Feb 2025, Song et al., 4 Aug 2025, Hu et al., 27 May 2025).
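The inference-cost point can be made concrete with a back-of-envelope count of attention position pairs, comparing $T$ full-sequence bidirectional passes against $N$ causal steps with a KV cache. This rough model ignores MLP cost and constant factors; the numbers chosen are illustrative.

```python
def dlm_cost(n, steps):
    """(total attention position pairs, sequential rounds) for T bidirectional denoising passes."""
    return steps * n * n, steps

def ar_cost(n):
    """(total attention position pairs, sequential rounds) for causal decoding with a KV cache."""
    return n * (n + 1) // 2, n

n, steps = 1024, 32
dlm_pairs, dlm_rounds = dlm_cost(n, steps)
ar_pairs, ar_rounds = ar_cost(n)
print(f"total attention work, DLM/AR: {dlm_pairs / ar_pairs:.1f}x")    # DLMs do more raw work...
print(f"sequential depth, AR/DLM: {ar_rounds / dlm_rounds:.0f}x")      # ...in far fewer serial rounds
```

Under this model the DLM's advantage is sequential depth (rounds), not total FLOPs, which is why KV-caching and blockwise schemes matter for closing the raw-work gap.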

6. Extensions to Multimodal, Hierarchical, and Specialized Domains

Recent models have expanded DLMs into:

  • Audio-Language: DIFFA couples a frozen diffusion LM with an audio adapter stack, delivering competitive performance on speech understanding/ASR with minimal supervised data compared to AR baselines (Zhou et al., 24 Jul 2025).
  • Hierarchical and Semantically-Informed DLMs: HDLM replaces direct word-to-mask transitions with progressive semantic clustering, achieving lower validation/generative perplexity and greater flexibility in semantic manipulation (Zhou et al., 8 Oct 2025).
  • Hybrid AR–Diffusion: CARD reintroduces causal masking and KV-caching to recover AR-level training and inference speeds while retaining the intrinsic parallelism of diffusion (Ruan et al., 29 Jan 2026). Hybrid Block/AR-DLMs and approaches like SpecDiff, BD3-LM, combine the strengths of both paradigms (Li et al., 14 Aug 2025).
  • One-Step Generation: DLM-One applies score distillation to enable single-shot sequence generation, achieving up to 500× faster inference with competitive quality (Chen et al., 30 May 2025).
  • Task Specialization/Control: Reward-guided inference, discrete-continuous hybrid noise paths, and anchored token prediction (for rare/structural tokens) provide improved sample quality, reasoning, and generalization (Tae et al., 19 Feb 2025, Rout et al., 24 May 2025).

7. Open Challenges and Research Directions

Key open problems and ongoing research include:

  • Scaling: Progressing beyond 8B–9B public models, tackling the infrastructure and stability requirements for training DLMs at LLM scale (Li et al., 14 Aug 2025).
  • Structural Dependency and Discrete Control: Designing diffusion processes and objectives that explicitly model higher-order and global sequence dependencies, avoiding "marginal trap" errors (Jin et al., 27 Dec 2025, Fu et al., 26 Nov 2025).
  • Long-Sequence and Early Stopping: Dynamic sequence-length modeling, context extrapolation, and termination heuristics remain challenging for current DLMs due to fixed input sizing and global attention costs (Li et al., 14 Aug 2025).
  • Efficient Sampling and Distillation: Accelerating inference via progressive or one-step distillation, caching, and hybrid guided approaches is an active area, with methods such as FreeCache and guided fusion achieving 3×–34× speedups in real-world scenarios (Hu et al., 27 May 2025, Chen et al., 30 May 2025).
  • Unified Multimodal Reasoning and Agents: Integrating visual, audio, and language signals in a single diffusion framework and exploiting DLM bidirectionality for agent-like reasoning and flexible task delegation (Li et al., 14 Aug 2025).
  • Benchmarking and Metrics: Moving beyond perplexity to diversity, controllability, and human-likeness metrics such as MAUVE (Rout et al., 24 May 2025), and developing benchmarks sensitive to the strengths and weaknesses of DLMs relative to AR models.

In sum, diffusion LLMs constitute a rapidly evolving generative modeling paradigm that unifies parallelism, bidirectionality, rigorous probabilistic principles, and controllability. Ongoing advances in noise scheduling, decoding algorithms, training efficiency, and hybrid architectures are narrowing, and in some settings closing, the gap to autoregressive models in both efficiency and performance across high-stakes NLP and multimodal tasks (2305.14671, Li et al., 14 Aug 2025, Ye et al., 2023, Zhou et al., 24 Jul 2025, Ruan et al., 29 Jan 2026, Zhu et al., 27 Oct 2025).
