
DPad: Efficient Diffusion Language Models with Suffix Dropout (2508.14148v2)

Published 19 Aug 2025 in cs.CL and cs.LG

Abstract: Diffusion-based LLMs (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to 61.4× speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.


Summary

  • The paper introduces DPad, which uses suffix dropout to reduce computational redundancy in diffusion language models while maintaining output accuracy.
  • It combines a sliding window with distance-decay dropout, achieving up to 61.4× speedup over vanilla dLLMs in evaluations on the LLaDA-1.5 and Dream models.
  • DPad integrates seamlessly with existing architectures without requiring retraining, highlighting its potential for scalable and efficient LLM deployment.

DPad: Efficient Diffusion LLMs with Suffix Dropout

Overview

This paper introduces the Diffusion Scratchpad (DPad), an efficient approach for diffusion-based LLMs (dLLMs) that leverages suffix dropout to reduce computational redundancy. The core premise of DPad is to streamline the denoising process intrinsic to dLLMs by focusing on a limited subset of suffix tokens that act as a "scratchpad," thereby reducing computational overhead without sacrificing model accuracy.

Methodological Insights

Diffusion LLMs (dLLMs)

Unlike conventional autoregressive models, dLLMs relax strict token-by-token decoding by framing text generation as a parallel denoising process. This allows several tokens to be generated in parallel, but it is computationally expensive: at each step the model predicts all future suffix tokens while retaining only a small fraction of those predictions.
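To make the redundancy concrete, the following is a rough back-of-the-envelope sketch, not a measurement from the paper. It assumes block-wise decoding with one denoising pass per block, where each pass predicts the current block plus every suffix position after it. The function name `prediction_redundancy` and the lengths used (1024 generated tokens, blocks of 32) are illustrative assumptions.

```python
def prediction_redundancy(gen_len: int, block_size: int) -> tuple[int, int, float]:
    """Count masked positions predicted vs. kept under block-wise decoding.

    Simplifying assumption (not from the paper): one denoising pass per
    block, and each pass predicts the current block plus all suffix
    positions that come after it.
    """
    if gen_len % block_size != 0:
        raise ValueError("gen_len must be a multiple of block_size")
    num_blocks = gen_len // block_size
    # Pass i still has (gen_len - i * block_size) undecoded positions to predict.
    predicted = sum(gen_len - i * block_size for i in range(num_blocks))
    # Every generated position is ultimately kept exactly once.
    kept = gen_len
    return predicted, kept, kept / predicted


predicted, kept, fraction = prediction_redundancy(gen_len=1024, block_size=32)
print(f"predicted={predicted}, kept={kept}, kept fraction={fraction:.3f}")
```

Under these assumptions only about 6% of the predicted positions are retained. Real dLLMs run several denoising steps per block, so the retained fraction in practice would be smaller still.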

Scratchpad Mechanism

Suffix tokens in dLLMs serve as an information reservoir, collecting signals from prefix tokens. This paper likens the function of these tokens to a "scratchpad," providing contextual cues that assist in generating the current block. The redundancy observed in suffix tokens increases with their distance from the current block.

DPad: Efficiency Enhancements

DPad proposes two strategies to efficiently utilize suffix attention:

  1. Sliding Window: This maintains a fixed-length suffix window, ensuring only nearby suffix tokens are considered, thus bounding the computational effort required.
  2. Distance-decay Dropout: This prunes distant suffix tokens using a Gaussian sampling process, so that tokens farther from the current block are more likely to be dropped. Pruning happens before attention is computed, so the removed tokens incur no attention cost.
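The two strategies above can be sketched as a single selection step that decides which suffix positions remain visible to attention. This is a minimal illustration, not the authors' implementation: the function name `select_suffix_positions`, the parameters `window` and `sigma`, and the half-Gaussian keep probability are all assumptions, since this summary does not specify the exact sampler.

```python
import math
import random


def select_suffix_positions(block_end, seq_len, window, sigma, seed=0):
    """Return the suffix positions kept for attention (hypothetical sketch).

    block_end: index one past the current block; the suffix starts here.
    window:    sliding-window length; positions beyond it are always dropped.
    sigma:     assumed width of the Gaussian decay over distance.
    """
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    window_end = min(seq_len, block_end + window)
    kept = []
    for pos in range(block_end, window_end):
        distance = pos - block_end
        # Keep probability decays with distance from the current block.
        keep_prob = math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
        if rng.random() < keep_prob:
            kept.append(pos)
    return kept


kept = select_suffix_positions(block_end=160, seq_len=1152, window=128, sigma=32.0)
print(f"kept {len(kept)} of {1152 - 160} suffix positions")
```

Because the selection runs before attention, only the kept positions would need to be gathered into the model's input for that step; the remaining suffix tokens are simply never processed.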

Both strategies complement existing optimization techniques, such as prefix caching, to deliver substantial performance improvements.
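As a rough illustration of why a bounded suffix helps, the sketch below compares the number of query-key pairs in one self-attention pass with the full suffix versus a small retained suffix. All numbers are hypothetical (a 128-token prompt, 1024 generated tokens, blocks of 32, 64 retained suffix tokens), and the model is deliberately simplified: it counts only quadratic attention pairs and ignores prefix caching, feed-forward layers, and the number of denoising steps, so it does not reproduce the paper's reported speedups.

```python
def attention_pairs(prompt_len, decoded_len, block_size, suffix_len):
    """Query-key pairs for one full self-attention pass over the visible tokens."""
    num_tokens = prompt_len + decoded_len + block_size + suffix_len
    return num_tokens * num_tokens


# Hypothetical setting: first block of a 1024-token generation.
prompt_len, gen_len, block_size = 128, 1024, 32
decoded_len = 0
full_suffix = gen_len - decoded_len - block_size  # every remaining masked token
retained_suffix = 64                              # assumed size after suffix dropout

vanilla = attention_pairs(prompt_len, decoded_len, block_size, full_suffix)
reduced = attention_pairs(prompt_len, decoded_len, block_size, retained_suffix)
print(f"vanilla pairs={vanilla}, reduced pairs={reduced}, ratio={vanilla / reduced:.1f}")
```

The ratio grows with generation length, because the full suffix scales with the sequence while the retained suffix stays bounded by the window.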

Figure 1: Comparison of (a) autoregressive LLMs, (b) block-wise diffusion LLMs, and (c) our DPad. DPad restricts suffix attention via: (i) Sliding Window.

Evaluative Metrics and Results

The paper evaluates DPad across several benchmarks using models like LLaDA-1.5 and Dream with notable findings:

  • Speed Improvements: DPad achieves up to 61.4× speedup over vanilla dLLMs while maintaining comparable accuracy.
  • Architecture Compatibility: By integrating seamlessly with existing architectures, DPad delivers considerable efficiency without the need for retraining.
  • Accuracy Maintenance: Despite significant speed improvements, DPad maintains the integrity of the model's output accuracy.

Figure 2: Attention score maps illustrating the Scratchpad mechanism in dLLMs. The maps were generated by the LLaDA-1.5 model.

Implications and Future Directions

The implications of DPad are significant for the scalability and deployment of dLLMs in practical applications. Its compatibility with current optimizations makes it a potent tool for enhancing text generation efficiency. Future research could focus on integrating similar dropout strategies during the training phase to naturally align model training and inference conditions, further enhancing accuracy and efficiency.

In conclusion, DPad contributes an important component towards efficient, scalable language modeling, mitigating one of the major bottlenecks of diffusion-based approaches by exploiting the redundancy of suffix tokens.
