A Survey on Diffusion Language Models (2508.10875v1)

Published 14 Aug 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

Summary

The paper presents an in-depth taxonomy of diffusion language models, categorizing them into continuous, discrete, and hybrid types.
It reviews training and inference methods for DLMs, including parallel decoding and iterative denoising, and how they reduce latency while preserving output quality.
The survey highlights practical applications in NLP, code generation, and biological modeling while addressing scalability and infrastructure challenges.

Comprehensive Survey of Diffusion Language Models

Introduction and Historical Context

Diffusion Language Models (DLMs) have emerged as a non-autoregressive alternative to the dominant autoregressive (AR) paradigm in language modeling. DLMs generate tokens in parallel via iterative denoising, offering inherent advantages in inference latency and bidirectional context modeling. The survey systematically traces the evolution of DLMs, from early continuous-space models inspired by image diffusion to discrete-space and multimodal extensions. The historical trajectory reveals a shift from continuous DLMs to discrete DLMs, with recent years marked by rapid growth in both research activity and practical deployments.

Figure 1: Timeline of Diffusion Language Models, illustrating the transition from continuous to discrete and multimodal DLMs.

Figure 2: Trend of diffusion language model papers, showing a marked increase in research interest, especially in discrete DLMs.

Modeling Paradigms: Continuous, Discrete, and Hybrid

DLMs are categorized by the space in which the diffusion process operates:

Continuous DLMs: Tokens are mapped to embeddings, and denoising occurs in continuous space. Early models (Diffusion-LM, SED) leveraged continuous diffusion for controllable text generation and infilling. Recent advances (TESS, TESS-2) diffuse over logit simplex representations and scale up via adaptation from AR models.
Discrete DLMs: The diffusion process is defined directly on token space. D3PM introduced structured token corruption, while masked DLMs (LLaDA, Dream) employ iterative mask-predict denoising with cross-entropy loss over masked positions. These models achieve competitive performance with AR baselines, especially when scaled to billions of parameters.
Hybrid AR-Diffusion Models: Block-wise semi-autoregressive models (BD3-LM) combine AR and diffusion, generating blocks autoregressively and tokens within blocks in parallel. This design enables variable-length generation and efficient caching.

Figure 3: Overview of training and inference procedures across AR, continuous, discrete, and block-wise diffusion paradigms.
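
The mask-predict objective used by masked DLMs can be illustrated with a small sketch. This is a toy under stated assumptions, not the training code of LLaDA or Dream: `MASK`, `VOCAB`, `corrupt`, and `toy_model` are hypothetical names, the denoiser is a uniform stand-in, and the 1/t weighting follows the common masked-diffusion formulation rather than anything stated in this summary. The point it shows is that only masked positions contribute to the cross-entropy.

```python
import math
import random

MASK = "<mask>"                         # hypothetical mask symbol
VOCAB = ["the", "cat", "sat", "down"]   # hypothetical toy vocabulary


def corrupt(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def toy_model(noisy_tokens):
    """Stand-in denoiser: a uniform distribution over VOCAB at every position.
    A real model would condition on the unmasked (bidirectional) context."""
    p = 1.0 / len(VOCAB)
    return [{tok: p for tok in VOCAB} for _ in noisy_tokens]


def masked_diffusion_loss(tokens, t, rng, model=toy_model):
    """Cross-entropy summed over masked positions only, weighted by 1/t
    (the weighting is an assumption, see the lead-in)."""
    noisy = corrupt(tokens, t, rng)
    probs = model(noisy)
    nll = 0.0
    for clean, seen, dist in zip(tokens, noisy, probs):
        if seen == MASK:                # unmasked positions contribute nothing
            nll -= math.log(dist[clean])
    return nll / t, noisy


rng = random.Random(0)
sentence = ["the", "cat", "sat", "down"]
t = rng.uniform(0.1, 1.0)               # noise level sampled per training example
loss, noisy = masked_diffusion_loss(sentence, t, rng)
```

With the uniform stand-in, the loss reduces to the number of masked positions times log(len(VOCAB)), divided by t, which makes the sketch easy to sanity-check before swapping in a real denoiser.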

Training and Post-Training Strategies

DLMs are typically pretrained using strategies analogous to AR models or image diffusion models. Initialization from pretrained AR models (DiffuLLaMA, Dream) accelerates training and enables competitive performance. Supervised fine-tuning (SFT) and reinforcement learning (RL) are adapted for DLMs, with unique challenges due to the non-factorized likelihood and parallel generation.

Post-training for reasoning capabilities is a focal area:

Chain-of-Thought Parallelization: DoT adapts CoT reasoning to parallel denoising, enabling self-correction and outperforming larger AR models on reasoning benchmarks.
Policy Gradient Methods: SEPO, diffu-GRPO, coupled-GRPO, and UniGRPO introduce efficient log-probability estimation and structured masking for RL fine-tuning, overcoming the intractability of sequence likelihood in DLMs.
Preference Optimization: VRPO adapts DPO to DLMs via variance reduction techniques, yielding strong improvements in alignment and reasoning.
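
To make the preference-optimization idea concrete, here is a minimal sketch of the standard DPO loss applied to estimated sequence log-likelihoods. It assumes that, because exact DLM likelihoods are intractable, each log-likelihood is replaced by a noisy Monte Carlo estimate; averaging several estimates is used here as one simple way to lower variance and is not claimed to be VRPO's specific technique. The function names and the numbers in the usage lines are hypothetical.

```python
import math
from statistics import mean


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(beta * margin)), on (estimated)
    sequence log-likelihoods of the chosen and rejected responses."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z))
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def averaged_estimate(samples):
    """Average several noisy Monte Carlo log-likelihood estimates."""
    return mean(samples)


# Hypothetical noisy estimates, e.g. from different random maskings.
policy_chosen = averaged_estimate([-12.1, -11.7, -12.4])
policy_rejected = averaged_estimate([-15.3, -14.8, -15.9])
ref_chosen = averaged_estimate([-13.0, -13.2, -12.8])
ref_rejected = averaged_estimate([-14.1, -13.9, -14.3])
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
```

Because the loss is a nonlinear function of the estimates, noise in them biases and destabilizes the gradient, which is why variance reduction matters more here than in AR models with exact likelihoods.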

Inference Techniques and Efficiency

Inference in DLMs is characterized by parallel decoding, adaptive unmasking/remasking, guidance, and caching:

Parallel Decoding: Confidence-aware and adaptive strategies (Fast-dLLM, APD, SlowFast Sampling, SpecDiff) achieve up to 34× speed-ups with minimal quality loss.
Unmasking/Remasking: Selective refinement of low-confidence tokens improves coherence and convergence.
Guidance: Classifier-free guidance and structural constraints steer generation toward desired attributes.
Caching and Distillation: KV and feature caches, along with step distillation (DLM-One), reduce per-step and total inference cost, bringing DLM latency close to AR models.

Figure 4: Inference techniques for DLMs, including parallel decoding, unmasking/remasking, guidance, KV/feature cache, and step distillation.
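
The confidence-aware parallel decoding described above can be sketched as a simple loop. This is an illustrative toy, not the algorithm of Fast-dLLM or any other named method: `parallel_decode`, `toy_predict`, `TARGET`, and the confidence formula are all hypothetical. At each step, every masked position whose predicted confidence clears a threshold is committed at once, and the single most confident position is always committed so that decoding makes progress.

```python
MASK = None  # hypothetical marker for a still-masked position


def parallel_decode(predict, length, threshold=0.9, max_steps=50):
    """Iteratively fill masked positions, committing every prediction whose
    confidence clears `threshold` (and at least one position per step)."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:
        preds = predict(seq)  # {position: (token, confidence)} for masked slots
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        best = max(masked, key=lambda i: preds[i][1])
        for i in masked:
            if preds[i][1] >= threshold or i == best:
                seq[i] = preds[i][0]
        steps += 1
    return seq, steps


TARGET = ["diffusion", "models", "decode", "tokens", "in", "parallel"]


def toy_predict(seq):
    """Hypothetical denoiser: proposes the right token at each masked slot,
    with confidence that grows as more of the sequence is revealed."""
    revealed = sum(tok is not MASK for tok in seq)
    return {
        i: (TARGET[i], min(1.0, 0.5 + 0.08 * i + 0.1 * revealed))
        for i, tok in enumerate(seq)
        if tok is MASK
    }


decoded, steps = parallel_decode(toy_predict, len(TARGET), threshold=0.9)
```

The threshold exposes the trade-off directly: a threshold of 0 commits everything in one step, while a threshold no prediction can reach falls back to one token per step.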

Multimodal and Unified DLMs

DLMs have been extended to multimodal and unified architectures:

Vision-Language Models: LLaDA-V, LaViDa, and Dimple integrate vision encoders and project image features into token space, supporting visual instruction tuning and multimodal reasoning.
Unified Multimodal Generation: MMaDA, D-DiT, UniDisc, Fudoki, and Muddit tokenize images and text into a shared vocabulary, enabling joint modeling and cross-modal inpainting. These models demonstrate competitive or superior performance to AR-based VLMs and image generators.

Performance Analysis

Empirical results across benchmarks (GenEval, MME, CQA, HellaSwag, PIQA, HumanEval, GSM8K, MMMU) indicate that DLMs match or exceed AR models of similar scale in math, science, code, and multimodal tasks. Notably, LLaDA and Dream outperform AR baselines on GSM8K and HumanEval, while MMaDA and LLaDA-V surpass AR-based VLMs in multimodal reasoning.

Figure 5: Performance comparison on eight benchmarks, showing DLMs (orange) competitive with AR models (blue) across tasks and scales.

Applications in NLP, Code, and Biology

DLMs have been applied to a wide range of downstream tasks:

NLP: Text classification, NER, sentiment analysis, summarization, style transfer, constrained generation, and machine translation.
Code Generation: DiffuCoder and Mercury Coder demonstrate strong performance and throughput, leveraging parallel planning and coupled sampling.
Biological Sequence Modeling: TransDLM, TGM-DLM, DRAKES, ForceGen, MeMDLM, DPLM, DPLM2, and CFP-GEN apply DLMs to molecular optimization, protein design, and multimodal sequence-structure co-generation.

Challenges and Future Directions

Key challenges for DLMs include:

Parallelism–Performance Trade-off: Increasing parallelism can degrade output coherence because tokens decoded in the same step are predicted without conditioning on one another, a failure mode referred to as the parallel decoding curse.
Infrastructure: Lack of mature, open-source libraries and serving frameworks hinders practical deployment.
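Long Sequence and Dynamic-Length Generation: Fixed-length training and cubic inference complexity, which arises when quadratic full attention is recomputed at each of a number of denoising steps that grows with sequence length, limit scalability for long-context tasks.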
Scalability: DLMs have not yet been scaled to the parameter counts of leading AR models.

Figure 6: Generation results of LLaDA and MMaDA under different denoising step settings, highlighting the trade-off between parallelism and output quality.
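
A small worked example helps show why parallelism can hurt coherence. The sketch below assumes a hypothetical two-token joint distribution with only two coherent outputs (the city names are illustrative and not taken from the survey) and computes what happens when both positions are sampled independently from their marginals, as in a single fully parallel decoding step.

```python
from itertools import product

# Hypothetical joint distribution over two-token outputs: only two are coherent.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint):
    """Per-position marginal distributions of a joint over token tuples."""
    n = len(next(iter(joint)))
    out = [dict() for _ in range(n)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            out[i][tok] = out[i].get(tok, 0.0) + p
    return out


def independent_sampling_distribution(joint):
    """Distribution obtained when every position is sampled independently
    from its own marginal, ignoring dependencies between positions."""
    margs = marginals(joint)
    dist = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        p = 1.0
        for _, q in combo:
            p *= q
        dist[seq] = p
    return dist


parallel = independent_sampling_distribution(joint)
incoherent = sum(p for seq, p in parallel.items() if seq not in joint)
print(incoherent)  # 0.5
```

In this toy case half of the probability mass lands on outputs such as ("New", "Angeles") that the joint distribution never produces. Decoding one token first and conditioning the second on it would remove the problem, at the cost of an extra step.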

Future research directions include improving training efficiency, exploring quantization and pruning, advancing multimodal unified reasoning, and developing DLM-based agents.

Conclusion

This survey provides a comprehensive taxonomy and analysis of diffusion language models, covering foundational principles, training and inference strategies, multimodal extensions, performance, and applications. DLMs offer a compelling alternative to AR models, with strong empirical results and unique capabilities in parallel generation and bidirectional context modeling. Addressing current challenges in scalability, infrastructure, and long-sequence handling will be critical for realizing the full potential of DLMs in real-world AI systems.
