Q-Diffusion: Quantization for Diffusion Models
- The paper introduces a PTQ method, Q-Diffusion, which uses timestep-aware calibration and split shortcut quantization to compress diffusion models while limiting FID degradation.
- It details a calibration pipeline that selects diverse timesteps and applies per-timestep scale adjustments to manage temporal activation drift and bimodal shortcut statistics.
- The approach enables 4-bit and 8-bit quantization of diffusion models, significantly reducing model size and computational load without retraining.
Q-Diffusion refers to a family of post-training quantization (PTQ) techniques specifically developed for compressing and accelerating deep diffusion models used in generative image synthesis, notably Denoising Diffusion Probabilistic Models (DDPM), Latent Diffusion Models (LDM), and Diffusion Transformers (DiT). While PTQ works reliably for feedforward tasks, its naive application to diffusion architectures fails due to temporal instability in activation distributions and bimodal statistics within skip connections. Q-Diffusion corrects for these distinctive behaviors by introducing timestep-aware calibration and architectural modifications, enabling deployment of 4-bit and 8-bit quantized diffusion models with negligible sample quality degradation and significant computational advantages (Li et al., 2023).
1. Challenges in Quantizing Diffusion Models
Diffusion models present two core quantization challenges:
- Temporal activation drift: Unlike conventional CNNs, activations in diffusion models exhibit radically different distributional statistics at each reverse denoising step, evolving from high-variance noise at $t = T$ to structured content at $t = 0$ (Li et al., 2023, So et al., 2023, Zeng et al., 8 May 2025).
- Bimodal shortcut statistics: U-Net architectures in diffusion employ skip-connection concatenations. The merged shortcuts display markedly bimodal activation distributions, which resist naive lumped quantization schemes (Li et al., 2023).
Conventional PTQ (e.g., min-max scaling, uniform quantization per layer) introduces catastrophic clipping errors for diffusion models. For example, when applied to DDPM and LDM, standard PTQ yields Fréchet Inception Distance (FID) degradation exceeding 100 in low-bit (≤4-bit) settings, compared to full precision (Li et al., 2023, Zeng et al., 8 May 2025).
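A toy experiment makes the failure mode concrete. The sketch below (illustrative only; not code from the paper, and all names are made up) applies naive per-tensor min-max quantization to a synthetic bimodal tensor of the kind produced by shortcut concatenation, and compares the error against a unimodal tensor:

```python
import numpy as np

def minmax_quantize(x, bits=4):
    """Naive per-tensor uniform affine quantize-dequantize using min-max range."""
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return scale * (q - zero_point)

rng = np.random.default_rng(0)
# Synthetic bimodal activations, mimicking concatenated U-Net shortcuts:
# a narrow mode near zero plus a broad, offset mode.
narrow = rng.normal(0.0, 0.05, 10_000)
broad = rng.normal(3.0, 1.0, 10_000)
bimodal = np.concatenate([narrow, broad])

# One shared min-max range must cover both modes, so the quantization
# step becomes far too coarse for the narrow mode.
err_bimodal = float(np.mean((bimodal - minmax_quantize(bimodal)) ** 2))
err_unimodal = float(np.mean((narrow - minmax_quantize(narrow)) ** 2))
print(f"4-bit MSE, bimodal tensor:  {err_bimodal:.5f}")
print(f"4-bit MSE, unimodal tensor: {err_unimodal:.5f}")
```

The shared range stretches across both modes, so the narrow mode collapses onto one or two quantization levels, which is exactly the clipping/resolution failure described above.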
2. Timestep-Aware Calibration
Q-Diffusion leverages timestep-aware calibration, exploiting the observation that activation statistics are highly correlated with the current step in the denoising schedule (Li et al., 2023, So et al., 2023, Ye et al., 2024). The approach proceeds as follows:
- Calibration sample selection: For PTQ, select a calibration set across a representative distribution of timesteps. A common strategy is Gaussian sampling of timesteps, $t \sim \mathcal{N}(\mu, \sigma^2)$ with $\mu \approx T/2$, to focus on intermediate denoising levels.
- Per-timestep scale and zero-point computation: For each quantized operator, compute the scale $s_t$ and zero-point $z_t$ separately for each timestep $t$:
$$s_t = \frac{\max(x_t) - \min(x_t)}{2^b - 1}, \qquad z_t = \mathrm{round}\!\left(-\frac{\min(x_t)}{s_t}\right)$$
using activations $x_t$ sampled from representative steps (Ye et al., 2024, So et al., 2023).
- Temporal dynamic quantization (TDQ): Optionally, predict $s_t$ from $t$ using a small MLP conditioned on Fourier time embeddings (So et al., 2023); the predictor is fit during calibration (no retraining of the diffusion model) and incurs zero inference overhead, since per-timestep scales can be precomputed.
This procedure captures the nonstationary activation ranges and enables robust quantization. Empirical results show that per-timestep or dynamic quantization narrows the FID gap at 4–8 bits by up to two orders of magnitude versus global PTQ (Li et al., 2023, So et al., 2023).
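The per-timestep calibration idea can be sketched as follows (illustrative Python with a toy model of activation drift; `calibrate_per_timestep` is a hypothetical helper, not an API from the paper):

```python
import numpy as np

def calibrate_per_timestep(acts_by_t, bits=8):
    """Compute one (scale, zero_point) pair per denoising timestep from
    calibration activations, instead of a single global pair."""
    qmax = 2 ** bits - 1
    params = {}
    for t, x in acts_by_t.items():
        lo, hi = float(x.min()), float(x.max())
        scale = max(hi - lo, 1e-8) / qmax
        zero_point = int(round(-lo / scale))
        params[t] = (scale, zero_point)
    return params

rng = np.random.default_rng(0)
T = 1000
# Toy drift: activation spread shrinks as denoising proceeds (t -> 0).
acts_by_t = {t: rng.normal(0.0, 0.5 + 3.0 * t / T, 4096)
             for t in (999, 500, 100, 0)}
params = calibrate_per_timestep(acts_by_t)
for t, (s, z) in sorted(params.items(), reverse=True):
    print(f"t={t:4d}  scale={s:.5f}  zero_point={z}")
```

Early (noisy) steps end up with a much larger scale than late (structured) steps, which is why a single global scale either clips early activations or wastes resolution on late ones.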
3. Split Shortcut Quantization
To mitigate quantization errors induced by bimodal activations in skip-connection concatenations, Q-Diffusion applies split shortcut quantization:
- Partition concatenated tensors: Prior to concatenation, independently quantize each component (e.g., skip and forward activations) (Li et al., 2023).
- Calibrate ranges separately: For each component, use its own min/max statistics over all calibration samples, preventing erroneous scaling due to outlier contamination.
This approach corrects for severe clipping artifacts in the middle and output blocks of U-Net models. It is especially critical when quantizing shortcut paths at 4 bits.
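A minimal sketch of the lumped-versus-split comparison (synthetic data and hypothetical helper names, not the paper's implementation):

```python
import numpy as np

def quantize(x, bits=4):
    """Per-tensor uniform affine quantize-dequantize with min-max range."""
    qmax = 2 ** bits - 1
    scale = max(float(x.max() - x.min()), 1e-8) / qmax
    zp = round(-float(x.min()) / scale)
    q = np.clip(np.round(x / scale) + zp, 0, qmax)
    return scale * (q - zp)

rng = np.random.default_rng(1)
skip = rng.normal(0.0, 0.5, 8192)    # skip-path activations near zero
deep = rng.normal(10.0, 0.5, 8192)   # offset forward-path activations
ref = np.concatenate([skip, deep])

# Lumped: one shared range over the concatenated (bimodal) tensor.
lumped_mse = float(np.mean((ref - quantize(ref)) ** 2))
# Split: each branch quantized with its own range, then concatenated.
split = np.concatenate([quantize(skip), quantize(deep)])
split_mse = float(np.mean((ref - split) ** 2))

print(f"lumped MSE: {lumped_mse:.5f}")
print(f"split  MSE: {split_mse:.5f}")
```

Quantizing each branch against its own min/max keeps the step size matched to each mode, which is the effect split shortcut quantization exploits.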
4. Quantization Operator and Configuration
The default quantizer in Q-Diffusion is uniform affine per-tensor:
$$\hat{x} = s \cdot \left(\operatorname{clip}\!\left(\left\lfloor \tfrac{x}{s} \right\rceil + z,\; q_{\min},\; q_{\max}\right) - z\right)$$
where $s$ and $z$ are chosen via L2 minimization over calibration samples, and $q_{\min}$/$q_{\max}$ are determined by the $b$-bit integer format. Per-channel quantization is less common due to added complexity and limited benefit for diffusion models (Ye et al., 2024, Li et al., 2023).
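The quantizer with an L2-minimizing range search can be sketched as below (the range-shrinking grid search is a common PTQ heuristic used here for illustration; the exact optimizer in Q-Diffusion may differ):

```python
import numpy as np

def affine_quant(x, scale, zp, qmin, qmax):
    """Uniform affine quantize-dequantize."""
    q = np.clip(np.round(x / scale) + zp, qmin, qmax)
    return scale * (q - zp)

def l2_search(x, bits=4, n_grid=80):
    """Pick (scale, zero_point) minimizing L2 reconstruction error by
    shrinking the min-max range over a small grid of clipping ratios."""
    qmin, qmax = 0, 2 ** bits - 1
    lo0, hi0 = float(x.min()), float(x.max())
    best = (None, None, np.inf)
    for alpha in np.linspace(0.5, 1.0, n_grid):
        lo, hi = alpha * lo0, alpha * hi0
        scale = max(hi - lo, 1e-8) / qmax
        zp = round(-lo / scale)
        mse = float(np.mean((x - affine_quant(x, scale, zp, qmin, qmax)) ** 2))
        if mse < best[2]:
            best = (scale, zp, mse)
    return best

rng = np.random.default_rng(0)
x = rng.standard_normal(65536)
scale, zp, err = l2_search(x, bits=4)
print(f"scale={scale:.4f} zero_point={zp} mse={err:.5f}")
```

Because clipping a few outliers shrinks the step size for the bulk of the distribution, the L2-optimal range is typically tighter than the raw min-max range.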
Typical configurations:
| Scheme | Weight Bitwidth | Activation Bitwidth | FID Degradation (vs. FP32) | Weight Size Compression |
|---|---|---|---|---|
| W8/A8 PTQ | 8 | 8 | ≈2.3 | ≈4× |
| W4/A8 PTQ | 4 | 8 | ≈2.3–4.0 | ≈8× |
| W4/A32 | 4 | 32 | ≈50 | ≈8× |
Split shortcut quantization and timestep-aware calibration are required for W4/A8 and lower to prevent collapse.
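The size-compression column follows from simple arithmetic over parameter count and bitwidth; the sketch below uses a hypothetical parameter count chosen only so the FP32 size lands near the 143 MB checkpoint cited in the results section:

```python
def model_size_mb(n_params, bits):
    """Ideal checkpoint size in MB for n_params parameters at a given bitwidth."""
    return n_params * bits / 8 / 1e6

n = 35.7e6  # hypothetical parameter count; 35.7M params * 4 bytes ~ 143 MB
fp32, w8, w4 = (model_size_mb(n, b) for b in (32, 8, 4))
print(f"FP32: {fp32:.0f} MB | W8: {w8:.0f} MB ({fp32 / w8:.0f}x) | "
      f"W4: {w4:.0f} MB ({fp32 / w4:.0f}x)")
```

Actual checkpoints carry some metadata overhead, so reported sizes deviate slightly from these ideal figures.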
5. Implementation Pipeline
The PTQ pipeline for Q-Diffusion proceeds as follows (Li et al., 2023, So et al., 2023, Ye et al., 2024):
- Select calibration dataset $\mathcal{D}$: typically ~5,000 samples, distributed across timesteps.
- For each layer $\ell$:
- For each sampled $t$, collect full-precision outputs $f_\ell(x_t)$.
- Minimize the calibration loss $\|f_\ell(x_t) - \hat{f}_\ell(x_t)\|_2^2$ over the quantization parameters $(s_\ell, z_\ell)$, where $\hat{f}_\ell$ denotes the quantized layer.
- For skip-connections, quantize each branch independently.
- Construct quantized model: Apply learned quantization parameters at each step and layer.
- Inference integration: Replace FP32 ops with quantized kernels on hardware supporting INT4/INT8.
No retraining or weight updates are required, making Q-Diffusion a pure PTQ solution.
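The layer-wise calibration loop can be sketched as output-matching on a toy linear layer with timestep-mixed calibration inputs (illustrative only; Q-Diffusion's actual reconstruction objective operates block-wise and is more elaborate):

```python
import numpy as np

def quantize_weights(w, bits=4):
    """Per-tensor uniform affine weight quantize-dequantize (simplified)."""
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = max(hi - lo, 1e-8) / qmax
    zp = round(-lo / scale)
    return scale * (np.clip(np.round(w / scale) + zp, 0, qmax) - zp)

def calibrate_layer(w, calib_inputs, bits=4):
    """Choose the weight clipping ratio minimizing L2 error between the
    full-precision and quantized layer outputs on calibration inputs."""
    ref = calib_inputs @ w
    best_alpha, best_err = 1.0, np.inf
    for alpha in np.linspace(0.6, 1.0, 40):
        w_clipped = np.clip(w, alpha * w.min(), alpha * w.max())
        err = float(np.mean((ref - calib_inputs @ quantize_weights(w_clipped, bits)) ** 2))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)) * 0.1
# Calibration inputs drawn from several timesteps with different spreads,
# mimicking the timestep-distributed calibration set.
calib = np.concatenate([rng.normal(0.0, 0.5 + 3.0 * t / 1000, (256, 64))
                        for t in (999, 500, 0)])
alpha, err = calibrate_layer(w, calib)
print(f"best clipping ratio={alpha:.3f}, output MSE={err:.6f}")
```

Only the quantization parameters are searched; the weights themselves are never updated, matching the pure-PTQ character of the pipeline.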
6. Experimental Results and Benchmarks
Q-Diffusion achieves competitive performance across unconditional and text-guided generation tasks (Li et al., 2023, Ye et al., 2024):
- ImageNet 64×64 (DDPM+, 250 DDIM steps)
- FP32: FID=21.63
- Q-Diffusion (W4/A8): FID=23.97
- Baseline PTQ: FID > 100 at 4-bit
- Stable Diffusion, MS-COCO 512×512 prompts
- Q-Diffusion W4/A8 produces visually realistic samples, matching FP32 in perceptual quality—first such result reported.
- Memory and Throughput Gains
- ≈4× reduction in model size (143 MB → 35 MB at W8/A8)
- Large reduction in BitOps per inference step (bit-operation cost scales with the product of weight and activation bitwidths, roughly 16× lower at W8/A8 than FP32); realized speedup scales with hardware efficiency
No perceptible degradation occurs up to 8-bit settings; at 4 bits, degradation typically remains below 4 FID points with specialized calibration.
7. Extensions and Evolution within Quantized Diffusion
Subsequent works generalize Q-Diffusion’s principles:
- Temporal Dynamic Quantization (TDQ): Dynamic scale prediction from the timestep $t$ with zero inference overhead (So et al., 2023)
- Timestep-Channel Grouping: Distribution-aware quantizer selection per step and channel (Wang et al., 2023, Huang et al., 2024)
- Sampling-Aware Quantization: Calibration is performed on sampling trajectories accounting for numerical integration errors (Zeng et al., 4 May 2025)
- Quantization Noise Correction: Inter- and intra-noise are dynamically estimated and compensated (Chu et al., 2024)
Empirical and benchmark studies continue to support the necessity of temporally adaptive quantization for diffusion architectures (Zeng et al., 8 May 2025).
Q-Diffusion and its descendants constitute the foundation for PTQ of large-scale diffusion models. By combining timestep-aware calibration and split quantization of bimodal shortcut layers, these methods enable low-bit deployment with minimal output degradation, facilitating practical generative model inference on resource-constrained devices (Li et al., 2023, Zeng et al., 8 May 2025).