Papers
Topics
Authors
Recent
Search
2000 character limit reached

On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

Published 22 Apr 2026 in cs.LG and cs.CL | (2604.20079v1)

Abstract: Auto-regressive LLMs achieve strong performance on coding tasks, but incur high memory and inference costs. Diffusion-based LLMs (d-LLMs) offer bounded inference cost via iterative denoising, but their behavior under post-training quantization (PTQ) has been sparsely explored. We investigate the application and robustness of PTQ techniques, specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ) algorithm, on a diffusion-based coding LLM (CoDA) and observe that these methods applied to CoDA exhibit greater robustness at low bitwidths compared to Qwen3-1.7B, its auto-regressive counterpart, under a standardized evaluation pipeline. We find that in our setup, CoDA exhibits greater robustness at low bitwidths (2-4 bits), with smaller accuracy degradation across HumanEval and MBPP benchmarks. Additionally, mixed-precision configurations derived from HAWQ provide smooth trade-offs across accuracy, latency, and memory. The results suggest that diffusion LLMs may offer advantages for efficient deployment due to more quantization-resilience.

Summary

  • The paper demonstrates that diffusion LLMs (CoDA 1.7B) exhibit robust coding task performance under aggressive PTQ, with an accuracy drop of ~8% at 4-bit precision.
  • It compares quantization strategies, showing that GPTQ and modified HAWQ enable effective precision allocation and improved latency tradeoffs over AR LLMs.
  • The study highlights potential for efficient edge deployments and motivates further research into the interplay of architecture, training data, and quantization dynamics.

Quantization Robustness of Diffusion LLMs in Coding Benchmarks

Introduction

The paper "On the Quantization Robustness of Diffusion LLMs in Coding Benchmarks" (2604.20079) presents an empirical study of quantization robustness in diffusion-based LLMs (d-LLMs) as compared to autoregressive (AR) LLMs, specifically within the context of post-training quantization (PTQ) strategies for deployment in coding tasks. The investigation focuses on Salesforce's CoDA 1.7B (a diffusion LLM) and Qwen3-1.7B (an AR model) to assess accuracy, latency, and memory tradeoffs under various quantization schemes, including GPTQ and a modified Hessian-Aware Quantization (HAWQ), across HumanEval and MBPP coding benchmarks.

Background: Quantization in LLMs

Quantization remains a vital tool for LLM deployment, aiming to minimize memory and inference costs by reducing numerical precision of network weights and activations. Existing research in the AR LLM space has established the viability of aggressive quantization (e.g., 4-bit weight-only formats via GPTQ) while maintaining performance [gptq]. Mixed-precision approaches—such as HAWQ—exploit second-order sensitivity to allocate precision across layers, producing stronger accuracy-compression tradeoffs. However, quantization for d-LLMs is less explored, with recent work (e.g., DLLMQuant) targeting diffusion-specific training dynamics but offering limited perspective on mixed-precision and direct AR vs. d-LLM comparison.

Methodology

The authors execute a controlled empirical comparison using CoDA 1.7B and Qwen3-1.7B, holding architecture and evaluation pipeline fixed as much as possible but noting that training data divergence remains (Qwen3 trained on 36T tokens, CoDA on 180B). Quantization is performed via:

  • GPTQ: Per-group, Hessian-compensated weight quantization at 2/3/4/8 bits; model calibration with WikiText.
  • Modified HAWQ: Mixed-precision allocation to weights based on estimated sensitivity via power iteration, with substantial scalability adaptations to support billion-parameter models.

Benchmarks include HumanEval and MBPP (regular and "Plus" versions), evaluating pass@1, alongside per-step/token inference latency on NVIDIA L40S.

Results

Accuracy and Quantization Robustness

Across all quantization regimes, CoDA demonstrates superior robustness to accuracy loss at low precision as compared to Qwen3. At 4-bit precision, CoDA exhibits an average decrease of ~8% in coding task accuracy—contrasting sharply with Qwen3, whose performance collapses with average drops of ~40% or more across benchmarks.

This resilience persists at 3-bit quantization: Qwen3 accuracy falls catastrophically, while CoDA retains partial benchmark functionality. At 2-bit quantization, both architectures lose all function. Figure 1

Figure 1: Graphic of Performance-Precision Tradeoff with HAWQ Performance Labeled; d-LLMs display graceful performance degradation across quantization settings compared to AR LLMs.

The discrepancy corresponds with observations in scaling law literature, suggesting that over-trained models (high data:parameter ratio, as in Qwen3) show larger degradation under PTQ [scaling_laws_for_precision], which could partially explain Qwen3's collapse independent of generation paradigm.

Latency, Memory, and Pareto Tradeoffs

Baseline Qwen3 exhibits slightly lower forward latency than CoDA (26.8 ms vs. 28.3 ms on L40S), likely due to optimized AR kernels and architectural differences. After quantization, especially at 4- and 8-bit with GPTQ, CoDA matches and even outperforms Qwen3 in generation latency due to differences in system and framework-level overheads. Figure 2

Figure 2: Graphic of Latency-Precision Tradeoff between CoDA and Qwen3 GPTQ Models, illustrating latency convergence and eventual inversion at lower bitwidths.

Notably, CoDA's denoising forward computation—though initially denser—becomes relatively more efficient as arithmetic cost drops, while AR path overheads dominate Qwen3's runtime.

Mixed Precision with HAWQ

The modified HAWQ algorithm enables effective allocation of higher precision to sensitivity-critical weights, yielding very smooth accuracy-memory tradeoffs. Pareto frontier experiments indicate that, with two-precision splits (e.g., 50% of blocks at 16-bit, remainder 8-bit or 8/4-bit), CoDA demonstrates minimal performance loss even with substantially reduced memory footprints. Attempts at more complex multi-precision splits led to performance collapse, indicating the need for further post-quantization fine-tuning or QAT to stabilize low-precision models.

Implications and Future Work

The finding that diffusion-based LLMs (as exemplified by CoDA) maintain task performance at much lower precision than equivalent AR models under aggressive PTQ has concrete implications for deployment:

  • Efficient Edge and Resource-Constrained Deployment: Diffusion LLMs are promising candidates where aggressive quantization is needed, e.g., for on-device inference or cost-sensitive settings.
  • Design of Quantization Pipelines: HAWQ-style sensitivity-guided precision allocation is highly effective in d-LLMs; exploring automatic precision allocation and robustness recovery via QAT or supervised fine-tuning is indicated.
  • Isolation of Generation Paradigm Effects: Since differences in robustness may partially arise from training data (Qwen3's massive over-training), future controlled comparisons—identical data/task/architecture, varying only decoding paradigm—are necessary to verify whether the observed quantization resilience is a true function of the diffusion process.

Further, evaluation of quantization effects with code-focused calibration datasets (such as OpenCoder), and larger-scale mixed-precision HAWQ evaluations, could extend understanding of how model and task properties interact with quantization dynamics.

Conclusion

This work provides systematic evidence that diffusion-based LLMs, here represented by CoDA 1.7B, are significantly more robust to post-training quantization than their autoregressive counterparts on coding benchmarks. Quantized CoDA models retain high accuracy at 4-bit and even 3-bit precision, exhibit superior latency scaling under quantization, and achieve smooth accuracy-memory Pareto tradeoffs when optimized with HAWQ-derived mixed-precision. These results position d-LLMs as attractive targets for memory- and latency-constrained deployments. Theoretical implications call for further disambiguation of architecture, data, and algorithmic effects on quantization robustness, motivating future research in both empirical and foundational directions.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.