
Diffusion-based LLMs

Updated 22 July 2025
  • Diffusion-based LLMs are a class of neural sequence models that generate text by iteratively refining entire spans using bidirectional attention and a discrete diffusion process.
  • They leverage techniques adapted from masked language models and specialized training protocols to enable high-throughput, editable, and controllable text generation.
  • Recent advances focus on accelerating inference and ensuring safe, structured output, positioning dLLMs as a robust alternative to autoregressive models in various applications.

Diffusion-based LLMs (dLLMs) are a class of neural sequence models that generate text through iterative, parallel denoising, leveraging bidirectional context and block-wise or full attention rather than the left-to-right causality of traditional autoregressive (AR) models. dLLMs formalize generation as a discrete diffusion process, allowing entire spans of text to be refined simultaneously and enabling advanced attributes such as editability, fine-grained controllability, and high-throughput parallel decoding. Significant advances in large-scale model training, inference techniques, architectural foundations, downstream applications, and security analysis have positioned dLLMs as a competitive and flexible alternative to the AR paradigm in language modeling and multimodal reasoning.

1. Mathematical Foundations and Parallel Generation

The core of dLLMs lies in the discrete diffusion probabilistic model (D3PM) and its extensions, which model text as a sequence of categorical tokens undergoing a Markovian corruption and denoising process. The forward process iteratively corrupts a data sample (token sequence) $x_0$ over $T$ steps using a transition matrix $Q_t$, resulting in a chain

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

where typically $q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\, p = x_{t-1} Q_t)$. The reverse (generation/denoising) process is parameterized as

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

The model is trained via an evidence lower bound on the data likelihood, often simplified to a weighted cross-entropy loss using absorbing states (mask tokens) for tractability (Yu et al., 16 Jun 2025).
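
In the absorbing-state formulation this objective reduces to a masked cross-entropy reweighted by the corruption level. A minimal PyTorch sketch, assuming a generic bidirectional `denoiser` module and a hypothetical `MASK_ID` (the real id is tokenizer-specific):

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # hypothetical mask-token id; depends on the tokenizer


def masked_diffusion_loss(denoiser, x0):
    """Weighted cross-entropy bound for absorbing-state discrete diffusion.

    x0: (batch, seq_len) clean token ids. `denoiser` is any bidirectional
    transformer returning per-position vocabulary logits.
    """
    b, L = x0.shape
    # Sample a corruption level t ~ U(0, 1] per sequence (continuous-time view).
    t = torch.rand(b, 1, device=x0.device).clamp_min(1e-3)
    # Forward process: each token is independently absorbed into MASK w.p. t.
    masked = torch.rand(b, L, device=x0.device) < t
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    # Reverse model predicts the clean tokens at every position in parallel.
    logits = denoiser(xt)                                   # (b, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (b, L)
    # Only masked positions contribute; the 1/t weight yields an ELBO-style
    # upper bound on negative log-likelihood.
    return ((ce * masked).sum(dim=1) / (t.squeeze(1) * L)).mean()
```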

Unlike AR models, where

$$p(x_1, \ldots, x_L) = \prod_{l=1}^{L} p(x_l \mid x_1, \ldots, x_{l-1})$$

and only past context is used (via strict causal attention), dLLMs operate with full bidirectional attention, iteratively denoising masked positions in parallel. This enables them to generate large segments of output simultaneously and to revise outputs at any position during generation.
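
A minimal sketch of this decoding loop, again assuming a generic bidirectional `denoiser` and a tokenizer-dependent `mask_id`: each round reveals a fixed number of the most confident masked positions, which is the basic mechanism behind the confidence-based acceleration methods in Section 3. Real samplers add temperature, remasking, and block-wise schedules.

```python
import torch


@torch.no_grad()
def parallel_denoise(denoiser, prompt, gen_len=128, steps=16, mask_id=126336):
    """Iterative parallel decoding sketch: start from an all-mask suffix and
    reveal tokens over `steps` rounds in order of model confidence.
    """
    x = torch.cat([
        prompt,
        torch.full((gen_len,), mask_id, dtype=prompt.dtype, device=prompt.device),
    ]).unsqueeze(0)                               # (1, prompt_len + gen_len)
    per_step = max(1, gen_len // steps)           # tokens revealed per round
    for _ in range(steps):
        still_masked = x[0] == mask_id
        if not still_masked.any():
            break
        probs = denoiser(x)[0].softmax(dim=-1)    # (seq, vocab), bidirectional
        conf, pred = probs.max(dim=-1)            # per-position confidence
        conf[~still_masked] = -1.0                # only fill masked slots
        k = min(per_step, int(still_masked.sum()))
        top = conf.topk(k).indices
        x[0, top] = pred[top]                     # commit the confident tokens
    return x[0]
```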

2. Training Paradigms and Model Scaling

To train competitive dLLMs at scale, researchers have adapted and extended modeling and optimization protocols originally developed for BERT-style MLMs and AR LMs. Common strategies include:

  • Pretraining as MLMs: Initializing from masked language models (e.g., BERT, XLM-R) and leveraging the equivalence between MLM objectives and the absorbing-state discrete diffusion objective (Ye et al., 2023, Yu et al., 16 Jun 2025).
  • Diffusive adaptation: After MLM pretraining, reprogramming the model using a diffusion objective with carefully tuned masking schedules and reverse Markov kernels (Ye et al., 2023).
  • Hybrid two-phase training: Early layers or phases are trained autoregressively for stability and global alignment, with later phases or subsequent rounds switching to full diffusion-based denoising. This combination addresses issues of training instability, weak supervision, and length bias observed in pure diffusion training (Yu et al., 22 May 2025).
  • Instruction and task finetuning: To endow models with generalist capabilities and in-context learning, instruction tuning is performed on broad multi-task datasets, sometimes following diffusive adaptation, yielding improved zero-shot and few-shot generalization (Ye et al., 2023).
  • Architectural enhancements: Techniques such as self-conditioning (feeding the model's previous predictions back as additional inputs; sketched below), block-wise attention masking (Han et al., 2023), and staged denoising with specialized “sharded” denoisers for different diffusion phases improve efficiency and quality.
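
As one concrete example, a self-conditioning wrapper can be sketched as below; the class name, the summed embedding streams, and the `trunk` argument are illustrative assumptions rather than a specific published architecture. During training, roughly half of the batches would first run a detached pass to produce `prev_pred`, so the model learns to exploit its own estimates:

```python
import torch.nn as nn


class SelfConditionedDenoiser(nn.Module):
    """Self-conditioning sketch: embed both the corrupted tokens and the
    model's own previous clean-token estimate, summing the two streams
    before a bidirectional trunk. Published variants differ in where and
    how the extra signal is injected.
    """

    def __init__(self, vocab_size, d_model, trunk):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.prev_embed = nn.Embedding(vocab_size, d_model)
        self.trunk = trunk                        # any bidirectional encoder
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, xt, prev_pred=None):
        h = self.tok_embed(xt)
        if prev_pred is not None:                 # later passes also see their
            h = h + self.prev_embed(prev_pred)    # own earlier argmax estimate
        return self.head(self.trunk(h))
```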

Scaling diffusion LMs from several hundred million to tens of billions of parameters, combined with training data sizes in the range of hundreds of billions of tokens, has brought dLLMs onto a competitive footing with major AR models across a suite of language modeling benchmarks (Han et al., 2023, Ye et al., 2023, Yu et al., 16 Jun 2025).

3. Inference Strategies and Acceleration

The principal practical challenge for dLLMs is inference latency, given the iterative nature of denoising. Recent advancements have focused on several orthogonal solutions:

  • Parallel Decoding and Confidence-based Token Selection: At each denoising step, multiple tokens are unmasked in parallel based on prediction confidence, selectively updating only “high-confidence” tokens and thereby reducing redundant updates (Yu et al., 22 May 2025, Wu et al., 28 May 2025).
  • Adaptive Sampling Algorithms: Algorithms such as Adaptive Parallel Decoding (APD) dynamically adjust the number of tokens sampled per step via a multiplicative mixture of diffusion marginal probabilities and an auxiliary AR joint model, balancing throughput and quality (Israel et al., 31 May 2025).
  • Block-wise and Structure-aware Decoding: Partitioning the output sequence into semantically meaningful blocks, dynamically adjusting block sizes (using reinforcement learning) to increase semantic efficiency and generation quality (Huang et al., 20 May 2025).
  • Hybrid Caching Mechanisms: Block-wise approximate KV caching enables reuse of key/value activations by leveraging the temporal stability of token representations during denoising, leading to up to 27x throughput improvements without substantial loss of accuracy (Wu et al., 28 May 2025, Liu et al., 17 May 2025).
  • SlowFast and Dynamic Sampling: Alternating between cautious “exploratory” and aggressive “accelerated” decoding spans, guided by local certainty and positional clustering properties, further increases speedup (up to 34x in some benchmarks when combined with caching) (Wei et al., 12 Jun 2025).
  • Training-free Long-context Extension: NTK-scaled modification of rotary position embeddings (RoPE) allows context window extension in diffusion models with stable perplexity and consistent retrieval accuracy, countering the sharp failure modes seen in AR models (Liu et al., 17 Jun 2025).
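
The long-context extension in the last item amounts to a small change to the rotary frequency table. A sketch of the NTK-aware rescaling, with an illustrative `scale` value rather than a setting from the cited paper:

```python
import torch


def ntk_scaled_inv_freq(head_dim, base=10000.0, scale=4.0):
    """NTK-aware RoPE sketch: rescale the rotary base so a model trained at
    context length L can run at roughly scale * L without finetuning.
    """
    # NTK scaling: base' = base * scale^(d / (d - 2)) for head dimension d.
    new_base = base * scale ** (head_dim / (head_dim - 2))
    exponents = torch.arange(0, head_dim, 2).float() / head_dim
    return 1.0 / (new_base ** exponents)          # one frequency per 2-dim pair
```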

4. Controllability, Structure, and Multimodal Extensions

The inherent bidirectional attention and parallel decoding of dLLMs are well suited for controllable and structured text generation:

  • Global Context and Planning: dLLMs plan globally, enabling revision of prior and future output for better structural adherence and reducing error propagation typical in left-to-right AR generation (Xiong et al., 6 Jul 2025, Gong et al., 25 Jun 2025).
  • Self-Adaptive Schema Scaffolding (S³): By injecting fixed schema structures (e.g., JSON formats) into the output, dLLMs recast generation as a fill-in-the-blank problem in which mask tokens mark the variable fields, substantially improving structural validity and content fidelity while reducing hallucination rates in structured generation tasks (Xiong et al., 6 Jul 2025); a minimal sketch follows this list.
  • Classifier-Guided and Reinforcement-based Control: Conditioning generation on classifier feedback or reinforcement rewards enables post-hoc controllability (e.g., sentiment, factuality) without retraining the dLLM, employing classifier guidance at each denoising step or classifier-free approaches tailored for diffusion (Huang et al., 20 May 2025).
  • Multimodal Integration: Recent dMLLMs, such as DEEM and Dimple, combine vision encoders, diffusion-based global denoisers, and LLM decoders. These systems use the generative feedback of diffusion models to regularize and enhance the structure and detail fidelity of image features, improving cross-domain robustness and perception (Luo et al., 24 May 2024, Yu et al., 22 May 2025, Yu et al., 16 Jun 2025).
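
As noted above, schema scaffolding can be sketched in a few lines; the `MASK` placeholder string and the `slot_len` per-field mask budget are illustrative assumptions, not values from the paper:

```python
MASK = "<mask>"  # placeholder; the real mask token is tokenizer-specific


def build_schema_scaffold(fields, slot_len=8):
    """S3-style scaffolding sketch: the JSON skeleton (keys, quotes, braces)
    is fixed text, and only the value slots are mask tokens for the dLLM to
    fill in parallel.
    """
    lines = ["{"]
    for i, field in enumerate(fields):
        slot = " ".join([MASK] * slot_len)
        comma = "," if i < len(fields) - 1 else ""
        lines.append(f'  "{field}": "{slot}"{comma}')
    lines.append("}")
    return "\n".join(lines)


# The model only ever predicts tokens inside the quoted slots, so the
# output parses as JSON by construction.
print(build_schema_scaffold(["name", "date", "location"]))
```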

5. Reasoning, RL Optimization, and Performance

dLLMs have demonstrated the ability to approach or surpass AR models in a range of language, reasoning, and multimodal tasks; however, specialized techniques are needed to unlock their reasoning capabilities:

  • Coarse-to-fine Reasoning and Chain-of-Thought: Unlike AR models that lock into a fixed reasoning path, dLLMs can backtrack, revise, and discover non-sequential reasoning orders, revealing “draft-and-revise” patterns correlated with causal dependencies among reasoning steps (Ye et al., 2023).
  • Masked SFT and Policy Gradient RL: Supervised finetuning (SFT) with explicit reasoning examples, followed by critic-free policy gradient RL adapted to masked decoding via group-relative policy optimization (GRPO) or coupling-based sampling, enables dLLMs to learn and reinforce more effective reasoning policies. Specialized RL objectives, such as diffu-GRPO and coupled-GRPO, address sampling variance and lack of direct log-likelihoods in the diffusion denoising process (Zhao et al., 16 Apr 2025, Gong et al., 25 Jun 2025).
  • Weighted Likelihood Optimization (wd1): To overcome the challenge of intractable sequence-level likelihoods and bias in importance-sampled policy ratios, weighted likelihood objectives (wd1) reinforce completions with high group-relative advantage without supervised data, offering higher accuracy and substantially reduced training cost compared with earlier RL methods (Tang et al., 7 Jul 2025); a minimal sketch follows this list.
  • Empirical Benchmarks: dLLM variants such as LLaDA, Dream, DiffuCoder, and Mercury have proven competitive on benchmarks for reasoning (GSM8K, MATH), code generation (HumanEval, MBPP), text comprehension, and even long-context retrieval tasks (LongBench, Needle-in-a-Haystack) (Zhao et al., 16 Apr 2025, Gong et al., 25 Jun 2025, Liu et al., 17 Jun 2025, Labs et al., 17 Jun 2025).
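
The common core of these critic-free objectives, group-relative advantages driving a weighted likelihood term, can be sketched as follows; the tensor names and shapes are assumptions, and real implementations estimate `per_token_logps` from the masked denoising model rather than an exact sequence likelihood:

```python
import torch


def group_relative_advantages(rewards):
    """GRPO-style critic-free advantages: normalize each completion's reward
    against the mean and std of its own sampling group. rewards: (G,)."""
    return (rewards - rewards.mean()) / rewards.std().clamp_min(1e-6)


def weighted_likelihood_loss(per_token_logps, completion_mask, advantages):
    """wd1-style sketch: reinforce completions in proportion to their
    group-relative advantage by weighting a masked log-likelihood proxy,
    sidestepping both the intractable sequence likelihood and biased
    importance ratios. Shapes: per_token_logps, completion_mask (G, seq);
    advantages (G,).
    """
    seq_logp = (per_token_logps * completion_mask).sum(-1) / (
        completion_mask.sum(-1).clamp_min(1)
    )
    return -(advantages.detach() * seq_logp).mean()
```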

6. Security, Alignment, and Open Challenges

The distinctive bidirectional and parallel nature of dLLMs creates new alignment and safety risks not present in sequential AR models:

  • Emergent Safety Vulnerabilities: The bidirectional infilling mechanism means dLLMs can be coerced into contextually completing harmful prompts—even when the unmasked prompt contains explicit unsafe content—because the model must preserve global context consistency. Parallel decoding also precludes dynamic filtering and stepwise rejection sampling (Wen et al., 15 Jul 2025).
  • DIJA Attack Framework: Adversarial “interleaved mask-text” prompts have been shown to successfully jailbreak safety-aligned dLLMs by freezing harmful context in unmasked segments while forcing the model to recover malicious completions in masked regions. Benchmarks show up to 100% attack success rates in some cases, far exceeding prior methods for AR models (Wen et al., 15 Jul 2025).
  • Alignment Implications: Existing AR-centric defenses (content filters, rejection sampling, self-reminders) are inadequate for bidirectional models. There is an urgent research need for new alignment and decoding strategies that incorporate global safety checks, context-aware risk assessment, and robust adversarial prompt handling specifically tailored for dLLMs.

7. Emerging Directions and Future Prospects

dLLMs and their multimodal extensions are experiencing rapid development; key future research paths identified in the literature include:

  • Standardization and Scaling: The lack of a unified, robust infrastructure for large-scale dLLM training and evaluation is seen as a bottleneck. Leveraging established AR pipelines can accelerate the dissemination of standardized benchmarks and pretrained diffusion-based models (Yu et al., 16 Jun 2025).
  • Inference Acceleration and Latent Diffusion: Ongoing work seeks to further improve decoding throughput through block-efficient attention, latent-space diffusion, and dynamic step scheduling.
  • Architectural Advancement: Architectures built de novo for discrete diffusion, rather than borrowed from AR LLMs, may enhance performance, controllability, and scalability (Yu et al., 16 Jun 2025).
  • Broadening Application Scope: dLLMs are extending beyond language to unified reasoning across vision, code, and biology, handling tokenized multimodal representations and complex interleaved tasks (Yu et al., 22 May 2025, Luo et al., 24 May 2024).
  • Security and Privacy: Research into privacy-preserving training (e.g., differential privacy applied to diffusion models), robust alignment, and adversarial robustness is prioritized given the unique threat surfaces dLLMs introduce (Wen et al., 15 Jul 2025).
  • Unified Models and Long-Context Capabilities: Techniques such as NTK-based RoPE scaling, dynamic structure priors, and modular prompt expansion empower dLLMs with longer effective context windows, structured output control, and compositional generation across modalities (Liu et al., 17 Jun 2025, Yu et al., 22 May 2025).

In sum, diffusion-based LLMs represent an increasingly mature and versatile family of generative models capable of parallel, editable, and controllable sequence generation. With advancements in scaling, RL-based reasoning, efficient inference, and systematized alignment, dLLMs are positioned to play a major role in the future of language AI—provided that pressing challenges in safety, infrastructure, and multi-domain integration are addressed.
