
A Survey on Diffusion Language Models (2508.10875v1)

Published 14 Aug 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.


Summary

  • The paper introduces diffusion language models as an alternative to autoregressive methods by leveraging iterative denoising for parallel token generation.
  • The paper details training and inference strategies, including complementary masking, caching, and step distillation, that achieve competitive results on language, code, and multimodal benchmarks.
  • The paper outlines key open challenges such as scalability, infrastructure development, and long-sequence handling that must be addressed for broader adoption.

A Comprehensive Survey of Diffusion Language Models

Introduction and Motivation

Diffusion language models (DLMs) have emerged as a compelling alternative to the autoregressive (AR) paradigm for language generation, leveraging iterative denoising processes to enable parallel token generation and bidirectional context modeling. This survey systematically reviews the evolution, taxonomy, training and inference strategies, multimodal extensions, empirical performance, and open challenges of DLMs, providing a technical synthesis for researchers and practitioners.

Figure 1: Timeline of diffusion language models, highlighting the shift from continuous to discrete and multimodal DLMs.

Evolution and Taxonomy of DLMs

The development of DLMs can be categorized into three main groups: continuous DLMs, discrete DLMs, and multimodal DLMs. Early research focused on continuous-space models, where diffusion operates in the embedding or logit space. Discrete DLMs, which define the diffusion process directly over token vocabularies, have gained traction due to their scalability and compatibility with large-scale language modeling. Recent advances have extended DLMs to multimodal domains, enabling unified modeling of text, images, and other modalities.
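To make the taxonomy concrete, the sketch below contrasts the two forward (corruption) processes in PyTorch: discrete DLMs absorb tokens into a reserved [MASK] state, while continuous DLMs add Gaussian noise in embedding space. The mask id, the linear schedule, and the shapes are illustrative assumptions, not the formulation of any particular model surveyed.

```python
import torch

MASK_ID = 0  # hypothetical id for a reserved [MASK] token

def discrete_forward(tokens: torch.LongTensor, t: float) -> torch.LongTensor:
    """Absorbing-state corruption: each token is replaced by [MASK] with probability t."""
    hide = torch.rand(tokens.shape) < t
    return torch.where(hide, torch.full_like(tokens, MASK_ID), tokens)

def continuous_forward(embeddings: torch.Tensor, t: float) -> torch.Tensor:
    """Gaussian corruption in embedding space: interpolate between data and noise."""
    return (1.0 - t) * embeddings + t * torch.randn_like(embeddings)

# At t = 0.5, roughly half of the tokens are masked, while embeddings sit
# halfway between the data point and pure noise.
tokens = torch.randint(1, 1000, (1, 16))
noisy_tokens = discrete_forward(tokens, t=0.5)
noisy_embeds = continuous_forward(torch.randn(1, 16, 64), t=0.5)
```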

Figure 2: Research trend showing the increasing number of DLM papers, especially in discrete and multimodal settings.

Modeling Paradigms and Architectural Distinctions

DLMs are positioned within the broader landscape of language modeling paradigms, which include masked language models (MLMs), AR models, permutation language models, and sequence-to-sequence architectures. DLMs distinguish themselves by their iterative, non-sequential generation process, which allows for parallelism and bidirectional context utilization. Continuous DLMs operate in embedding or logit spaces, while discrete DLMs employ token-level corruption and denoising, often using masking strategies.

Figure 4: Overview of training and inference procedures across AR, continuous DLM, discrete DLM, and block-wise hybrid models.
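The block-wise hybrid scheme in the figure can be summarized in a few lines: blocks are produced left to right as in AR decoding, while tokens within each block are denoised in parallel. The sketch below assumes a `denoise_block` callable that stands in for any within-block DLM sampler; it is an illustration, not a specific model's decoder.

```python
import torch

def blockwise_generate(denoise_block, prompt, n_blocks, block_len, mask_id):
    """Block-wise hybrid generation sketch: autoregressive across blocks,
    parallel (diffusion-style) denoising inside each block."""
    seq = prompt
    for _ in range(n_blocks):
        masked = torch.full((seq.size(0), block_len), mask_id,
                            dtype=torch.long, device=seq.device)
        # denoise_block sees the committed prefix plus a fully masked block and
        # returns the sequence with that block filled in (an assumed interface).
        filled = denoise_block(torch.cat([seq, masked], dim=1))
        seq = torch.cat([seq, filled[:, -block_len:]], dim=1)
    return seq
```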

Training and Post-Training Strategies

Pre-training

DLMs are typically pre-trained using objectives analogous to those in AR or image diffusion models. Discrete DLMs often initialize from AR model weights (e.g., LLaMA, Qwen2.5), facilitating efficient adaptation and reducing training cost. Continuous DLMs may leverage pretrained image diffusion backbones for multimodal tasks.
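For illustration, a common masking-based pre-training objective can be written as a reweighted cross-entropy over masked positions, with the mask ratio sampled per sequence. The sketch below assumes PyTorch, a model that maps token ids to logits, and a 1/t weighting; exact schedules and weightings vary across the surveyed models.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id, eps=1e-3):
    """One training step of a masking-based DLM objective (a common ELBO-style form).

    Sample a corruption level t, mask tokens independently with probability t,
    then score only the masked positions with a 1/t weight. This is a sketch,
    not a specific model's implementation.
    """
    b, seq_len = tokens.shape
    t = torch.rand(b, 1, device=tokens.device).clamp(min=eps)     # per-sequence mask ratio
    is_masked = torch.rand(b, seq_len, device=tokens.device) < t  # positions to hide
    corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)

    logits = model(corrupted)                                     # (b, seq_len, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    per_token = ce * is_masked.float() / t                        # reweight by 1/t
    return per_token.sum() / is_masked.float().sum().clamp(min=1.0)
```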

Supervised Fine-Tuning and RL Alignment

Supervised fine-tuning (SFT) in DLMs mirrors AR approaches but must address the inefficiency of loss computation due to partial masking. Techniques such as complementary masking and improved scheduling have been proposed to enhance gradient flow and data utilization.
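A minimal sketch of the complementary-masking idea, assuming each response is duplicated with two masks that partition its positions so every token contributes to the SFT loss exactly once (details differ across papers):

```python
import torch

def complementary_masks(resp_len: int, ratio: float = 0.5):
    """Return two boolean masks that partition the response positions.

    Each token is hidden in exactly one of the two copies, so no token is
    wasted during supervised fine-tuning. This is an illustrative sketch.
    """
    first = torch.rand(resp_len) < ratio
    return first, ~first

m1, m2 = complementary_masks(8)
assert torch.all(m1 ^ m2)  # every position is masked in exactly one copy
```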

Post-training for reasoning capabilities is a critical area, with methods such as Diffusion-of-Thought (DoT), DCoLT, and various policy gradient adaptations (e.g., diffu-GRPO, UniGRPO, coupled-GRPO) enabling DLMs to perform complex reasoning and alignment tasks. Preference optimization methods (e.g., VRPO) have also been adapted to the diffusion setting, addressing the high variance of ELBO-based log-likelihood approximations.
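As context for why variance matters here, the sequence log-likelihood that preference-optimization methods need is typically only available as a Monte Carlo ELBO estimate. The sketch below shows one such estimator under the masking formulation; the sample count, weighting, and averaging scheme are assumptions for illustration rather than any specific method's recipe.

```python
import torch

@torch.no_grad()
def elbo_logprob(model, tokens, mask_id, n_samples=8):
    """Monte Carlo lower bound on log p(tokens) for a masking-based DLM.

    Each sample draws a mask ratio and a random mask, then averages the
    reweighted masked-token log-probabilities. Averaging several samples
    reduces the estimator variance that preference optimization must contend
    with. A sketch of the general recipe, not a paper's exact estimator.
    """
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(()).clamp(min=1e-3)
        is_masked = torch.rand(tokens.shape, device=tokens.device) < t
        corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
        logp = model(corrupted).log_softmax(-1)                  # (b, L, vocab)
        tok_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        estimates.append((tok_logp * is_masked.float()).sum(-1) / t)
    return torch.stack(estimates).mean(0)                        # (b,)
```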

Inference Techniques and Efficiency Optimizations

Inference in DLMs is characterized by a rich set of strategies aimed at balancing quality, controllability, and efficiency:

  • Parallel Decoding: Confidence-aware and adaptive parallel decoding methods enable substantial speed-ups (up to 34×) with minimal quality loss; a minimal sketch appears after this list.
  • Unmasking/Remasking: Adaptive policies for token selection and remasking improve both convergence and output coherence.
  • Guidance: Classifier-free guidance and structural constraints steer generation toward desired attributes, with extensions for semantic and syntactic control.
  • Caching and Step Distillation: Innovations in KV and feature caching, as well as step distillation, have closed much of the inference latency gap with AR models, achieving up to 500× acceleration in some cases.

    Figure 3: Inference techniques for DLMs, including parallel decoding, unmasking/remasking, guidance, caching, and step distillation.
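A minimal sketch of confidence-aware parallel decoding, assuming a masking-based DLM that scores all positions at once; the fixed step budget and top-k commitment rule are simplifications of the adaptive schedules and remasking policies discussed above.

```python
import torch

@torch.no_grad()
def confidence_parallel_decode(model, prompt, gen_len, mask_id, steps=8):
    """Start from an all-[MASK] response and, at each step, commit the
    highest-confidence predictions in parallel, leaving the rest masked for
    later refinement. Real decoders add adaptive schedules and remasking."""
    resp = torch.full((prompt.size(0), gen_len), mask_id,
                      dtype=torch.long, device=prompt.device)
    per_step = max(1, gen_len // steps)

    for _ in range(steps):
        seq = torch.cat([prompt, resp], dim=1)
        logits = model(seq)[:, prompt.size(1):]            # scores for response slots
        conf, pred = logits.softmax(-1).max(-1)            # per-position confidence
        conf = conf.masked_fill(resp != mask_id, -1.0)     # prefer still-masked slots
        top = conf.topk(per_step, dim=-1).indices
        resp.scatter_(1, top, pred.gather(1, top))         # commit confident tokens in parallel
        if not (resp == mask_id).any():
            break
    return resp
```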

Multimodal and Unified DLMs

Recent work has extended DLMs to multimodal and unified architectures, supporting both understanding and generation across text and vision. Approaches include:

  • Vision Encoders + DLMs: Models like LLaDA-V and LaViDa integrate vision encoders with DLM backbones, employing complementary masking and KV-caching for efficient training and inference; a structural sketch follows below.
  • Unified Token Spaces: MMaDA and UniDisc tokenize all modalities into a shared space, enabling joint modeling and cross-modal reasoning.
  • Hybrid Training: Dimple employs an autoregressive-then-diffusion training regime to stabilize multimodal learning and enable parallel decoding.

These models demonstrate competitive or superior performance to AR-based multimodal models, particularly in cross-modal reasoning and generation.
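The vision-encoder route can be summarized structurally as follows; the module names, shapes, and the choice to compute losses only on masked text positions are assumptions meant to illustrate the general recipe, not the architecture of LLaDA-V or LaViDa specifically.

```python
import torch
import torch.nn as nn

class VisionConditionedDLM(nn.Module):
    """Minimal sketch of the vision-encoder + DLM recipe (names/shapes are assumptions).

    Image features are projected into the text embedding space, prepended to the
    (partially masked) text sequence, and a bidirectional denoiser predicts the
    masked text tokens.
    """

    def __init__(self, vision_encoder, denoiser, vision_dim, model_dim, vocab_size):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a frozen ViT returning patch features
        self.proj = nn.Linear(vision_dim, model_dim)  # map patch features to the text embedding space
        self.embed = nn.Embedding(vocab_size, model_dim)
        self.denoiser = denoiser                      # bidirectional transformer, no causal mask
        self.lm_head = nn.Linear(model_dim, vocab_size)

    def forward(self, image, corrupted_text_tokens):
        vis = self.proj(self.vision_encoder(image))            # (b, n_patches, model_dim)
        txt = self.embed(corrupted_text_tokens)                 # (b, n_text, model_dim)
        hidden = self.denoiser(torch.cat([vis, txt], dim=1))    # full bidirectional attention
        return self.lm_head(hidden[:, vis.size(1):])            # logits for text positions only
```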

Empirical Performance and Benchmarking

DLMs have achieved performance on par with, and in some cases exceeding, AR models of similar scale across a range of benchmarks, including language understanding (PIQA, HellaSwag), code generation (HumanEval), mathematical reasoning (GSM8K), and multimodal tasks (MME, MMMU). Notably, DLMs exhibit stronger performance in math and science-related benchmarks and demonstrate superior throughput in code generation and multimodal settings.

Figure 5: Performance comparison on eight benchmarks, showing DLMs (orange) competitive with AR models (blue) across tasks and scales.

Trade-offs, Limitations, and Open Challenges

Despite their promise, DLMs face several unresolved challenges:

  • Parallelism–Performance Trade-off: Increased parallelism can degrade output coherence due to inter-token dependency issues, especially with aggressive unmasking schedules.

    Figure 6: Generation results illustrating the trade-off between parallelism and output quality in DLMs.

  • Infrastructure and Ecosystem: Lack of mature, open-source libraries and deployment frameworks hinders practical adoption.
  • Long-Sequence and Dynamic-Length Generation: DLMs are limited in context length and dynamic output sizing; because each of the roughly O(L) denoising steps recomputes bidirectional attention over the full sequence at O(L²) cost, naive inference scales cubically with sequence length.
  • Scalability: Public DLMs remain significantly smaller than state-of-the-art AR models, and scaling laws for DLMs are not yet fully established.

Applications and Implications

DLMs have been successfully applied to a broad spectrum of tasks, including robust text classification, NER, summarization, style transfer, code generation, and computational biology. Their global planning and iterative refinement capabilities are particularly advantageous for structured and logic-heavy domains. In code generation, DLMs have demonstrated competitive HumanEval performance and superior throughput compared to AR baselines.

Theoretically, DLMs offer a unified framework for generative modeling across modalities, with inherent advantages in controllability and bidirectional context. Practically, their parallelism and efficiency optimizations position them as viable candidates for latency-sensitive and large-scale applications, contingent on further advances in infrastructure and scalability.

Future Directions

Key areas for future research include:

  • Improving training efficiency and data utilization.
  • Adapting quantization, pruning, and distillation techniques to the diffusion paradigm.
  • Advancing unified multimodal reasoning and DLM-based agent architectures.
  • Developing robust infrastructure and deployment frameworks for DLMs.

Conclusion

This survey provides a comprehensive technical synthesis of the DLM landscape, highlighting the paradigm's modeling innovations, empirical strengths, and open challenges. DLMs have established themselves as a credible alternative to AR models, particularly in settings demanding parallelism, bidirectional context, and unified multimodal reasoning. Continued research into scalability, efficiency, and infrastructure will be critical for realizing the full potential of diffusion-based language modeling.
