Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs (2508.14896v1)

Published 20 Aug 2025 in cs.CL and cs.AI

Abstract: Recent advances in diffusion LLMs (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based LLMs. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.

Summary

  • The paper demonstrates that addressing activation outliers is crucial for effective post-training quantization of diffusion LLMs.
  • The paper shows that weight-only methods like GPTQ maintain minimal performance loss at 4-bit precision, while lower bit-widths lead to notable accuracy drops in complex tasks.
  • The paper finds that instruction-tuned models exhibit enhanced robustness against quantization, offering practical pathways for efficient deployment of dLLMs.

Quantization of Diffusion LLMs: Analysis and Insights

The paper "Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs" investigates the feasibility and effectiveness of post-training quantization (PTQ) on diffusion-based LLMs (dLLMs). This exploration is pivotal as dLLMs, leveraging bidirectional context encoding and denoising-based decoding strategies, offer an alternative to autoregressive LLMs but face deployment challenges due to their significant computational resource requirements.

Activation Outliers in dLLMs

A critical finding of the paper is the presence of activation outliers in dLLMs, characterized by abnormally large activation values (Figure 1). These outliers complicate low-bit quantization by skewing the dynamic range.

Figure 1: Visualizations of activation outliers in LLaDA-8B-Base (1) and LLaDA-8B-Instruct (2). Outliers are observed at the inputs of various linear layers and can be classified as Normal Outliers (a(1)–c(1)/a(2)–c(2)), and Massive Outliers (d(1), d(2)).

In LLaDA and Dream-7B models, these outliers are pervasive across layers, affecting the inputs of linear layers. The paper categorizes them into normal and massive outliers, the latter presenting significant challenges for quantization due to their extreme values. These findings indicate that effective quantization of dLLMs requires explicit outlier-handling mechanisms.
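
To make the problem concrete, consider symmetric per-tensor quantization, where the single largest magnitude in a tensor sets the scale for every value. The NumPy sketch below is not from the paper; the outlier magnitude and tensor size are illustrative assumptions. It shows that one massive outlier consumes the dynamic range, so the typical values are rounded away, and the damage worsens as the bit-width drops.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activation tensor: mostly well-behaved values plus one massive outlier
# (the value 500 is an illustrative assumption, not taken from the paper).
x = rng.normal(size=4096)
x[0] = 500.0

def symmetric_quantize(a, bits):
    """Round-trip symmetric per-tensor quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(a).max() / qmax          # the outlier alone sets the scale
    return np.clip(np.round(a / scale), -qmax, qmax) * scale

for bits in (8, 4):
    x_hat = symmetric_quantize(x, bits)
    # Error measured only on the typical (non-outlier) values.
    err = np.abs(x[1:] - x_hat[1:]).mean()
    print(f"INT{bits}: mean |error| on typical values = {err:.3f} "
          f"(typical magnitude ~ {np.abs(x[1:]).mean():.3f})")
```

In this toy setting the mean error on the typical values is roughly their own magnitude, i.e., almost all of their information is lost, which mirrors why the paper stresses explicit outlier handling.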

Quantization Techniques: Evaluations and Recommendations

The paper evaluates multiple PTQ techniques, including weight-only quantization methods like GPTQ and AWQ, and weight-activation methods such as SmoothQuant, QuaRot, and DuQuant. Here are the insights:

  • Weight-Only Quantization: GPTQ consistently outperforms AWQ, offering minimal performance degradation at 4-bit precision across tasks. However, 3-bit configurations result in notable accuracy losses, suggesting 4-bit as the optimal compromise between compression and performance; a round-to-nearest sketch of group-wise weight quantization follows this list.
  • Weight-Activation Quantization: At 8-bit, both QuaRot and DuQuant maintain near-original performance, surpassing SmoothQuant, especially under aggressive 4-bit configurations where the latter collapses. This indicates the superiority of rotation-based techniques in preserving model capabilities under stringent quantization conditions; a sketch of the rotation idea also appears below.

    Figure 2: Visualizations of activation outliers in Dream-7B-Base.
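
To give a feel for the bit-width trade-off in the weight-only setting, the sketch below implements only the round-to-nearest (RTN), group-wise symmetric baseline that GPTQ and AWQ refine; it is not a reproduction of either method, and the toy weight matrix and group size of 128 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-layer weight matrix (out_features x in_features); values are synthetic.
W = rng.normal(size=(256, 1024)) * 0.05

def rtn_groupwise_quant(w, bits, group_size=128):
    """Round-to-nearest, group-wise symmetric weight quantization.

    GPTQ additionally compensates rounding error with second-order information,
    and AWQ rescales salient channels; this is only the shared RTN baseline.
    """
    qmax = 2 ** (bits - 1) - 1
    w_hat = np.empty_like(w)
    for start in range(0, w.shape[1], group_size):
        block = w[:, start:start + group_size]
        # One scale per output row per group of input channels.
        scale = np.abs(block).max(axis=1, keepdims=True) / qmax
        w_hat[:, start:start + group_size] = (
            np.clip(np.round(block / scale), -qmax, qmax) * scale
        )
    return w_hat

for bits in (4, 3):
    rel_err = np.linalg.norm(W - rtn_groupwise_quant(W, bits)) / np.linalg.norm(W)
    print(f"INT{bits} group-wise RTN: relative weight error = {rel_err:.4f}")
```

Even in this simplified setting, dropping from 4-bit to 3-bit roughly doubles the relative weight error, consistent with the sharp accuracy losses the paper reports below 4-bit.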

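The advantage of rotation-based methods stems from how they redistribute outlier energy before quantization. The following sketch (plain NumPy, not the paper's code) illustrates the core identity they exploit: multiplying activations by an orthogonal matrix and counter-rotating the weights leaves the layer output mathematically unchanged while spreading a massive outlier channel across all channels, shrinking the per-tensor quantization scale. The random orthogonal matrix and synthetic outlier are assumptions for illustration; QuaRot and DuQuant use structured rotations such as Hadamard transforms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations with one outlier channel, plus a weight matrix (all synthetic).
X = rng.normal(size=(128, 512))
X[:, 7] *= 80.0                      # channel 7 dominates the dynamic range
W = rng.normal(size=(512, 512)) * 0.02

# Random orthogonal rotation (the actual methods use structured rotations).
Q, _ = np.linalg.qr(rng.normal(size=(512, 512)))

# Rotating activations and counter-rotating weights leaves the output unchanged:
#   (X Q)(Q^T W) = X W
X_rot, W_rot = X @ Q, Q.T @ W
assert np.allclose(X_rot @ W_rot, X @ W)

def int8_relative_error(a):
    """Relative error of symmetric per-tensor INT8 round-trip quantization."""
    scale = np.abs(a).max() / 127.0
    a_hat = np.clip(np.round(a / scale), -127, 127) * scale
    return np.linalg.norm(a - a_hat) / np.linalg.norm(a)

print("activation INT8 error, original:", int8_relative_error(X))
print("activation INT8 error, rotated :", int8_relative_error(X_rot))
```

Because the rotation preserves the layer's output exactly, the outlier suppression comes essentially for free at inference time, which is why such methods hold up better under aggressive weight-activation quantization.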
Task-Based Performance Analysis

Quantization impact varies significantly across task categories. The paper reveals that:

  • General QA Tasks: Less sensitive to quantization, with methods like GPTQ showing negligible degradation at 4-bit precision.
  • Math and Code Generation Tasks: These are heavily impacted by quantization, showing substantial accuracy drops at lower bit-widths. The paper attributes this to the complex reasoning and dependency structure these tasks require, which makes them more sensitive to quantization-induced perturbations.

Model Type Robustness

Instruction-tuned models such as LLaDA-8B-Instruct demonstrate greater resilience to quantization than their base counterparts. This robustness manifests as smaller performance declines across settings and is potentially attributable to instruction tuning enhancing the models' tolerance of the perturbations introduced by compression.

Conclusion

This systematic study underscores the challenges and opportunities that post-training quantization presents for diffusion-based LLMs. Effective quantization of dLLMs demands addressing activation outliers and leveraging robust techniques such as GPTQ and rotation-based methods. The paper's insights provide a foundational understanding for future research on efficient dLLM deployment, especially under resource constraints, enhancing their applicability across diverse, real-world NLP tasks.
