
Generalized Denoising Diffusion Compression

Updated 24 November 2025
  • gDDCM is a unified framework for image compression and tokenization using diffusion-based generative models and fixed Gaussian codebooks.
  • It replaces continuous noise injection with quantization from a pre-sampled Gaussian codebook, enabling lossless bit-stream representation and efficient discrete tokenization.
  • The model supports tasks like restoration and super-resolution, achieving state-of-the-art perceptual metrics on benchmarks such as CIFAR-10 and LSUN Bedroom.

The Generalized Denoising Diffusion Compression Model (gDDCM) is a unified framework for image compression and tokenization grounded in diffusion generative models. It extends the Denoising Diffusion Codebook Model (DDCM), originally designed for discrete-time denoising diffusion probabilistic models (DDPM), to encompass a broad family of diffusion model variants, including score-based generative models, consistency models, and rectified flow. The core innovation is replacing the continuous stochastic noise injection at each reverse diffusion step with quantization over a fixed, small Gaussian codebook, enabling lossless bit-stream representation and a discrete token sequence per image. This mechanism operates at inference time without retraining the underlying generative model and supports both unconditional compression and conditional image generation tasks such as restoration and super-resolution, with state-of-the-art perceptual metrics across datasets (Ohayon et al., 3 Feb 2025, Kong, 17 Nov 2025).

1. Theoretical Foundation and Model Architecture

gDDCM is built upon diffusion-based generative models whose marginal at step $t$ can be expressed as $x_t = s(t)\,x_0 + \sigma(t)\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ and $s(t), \sigma(t)$ denoting the known scalar schedules specific to the chosen diffusion paradigm (Kong, 17 Nov 2025). The forward process remains standard (e.g., for DDPM: $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_t$).
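For example, unrolling the DDPM recursion above recovers the unified marginal form with

$$s(t) = \sqrt{\bar{\alpha}_t}, \qquad \sigma(t) = \sqrt{1 - \bar{\alpha}_t}, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s),$$

so each supported variant is specified entirely by its own pair $(s(t), \sigma(t))$.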

The key architectural components are:

  • Pre-trained backbone: gDDCM utilizes any diffusion-family network (DDPM/DDIM, score-based SDE, consistency model, rectified flow) without modification or retraining, relying on the original score estimation (e.g., $\epsilon_\theta(x_t, t)$ or $s_\theta(x_t, t)$) learned under standard objectives (Ohayon et al., 3 Feb 2025, Kong, 17 Nov 2025).
  • Fixed Gaussian codebooks: At each of $N$ discrete or continuous time-steps, a codebook $\mathcal{S}_t = \{e^{(1)}, \ldots, e^{(M)}\}$ of i.i.d. Gaussian vectors is pre-sampled and fixed. These serve as the finite set from which noise is selected, replacing fresh $\mathcal{N}(0, I)$ draws at each reverse step (Ohayon et al., 3 Feb 2025, Kong, 17 Nov 2025); see the sketch below.
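Since the codebooks are fixed pseudorandom draws, encoder and decoder can regenerate $\mathcal{S}_t$ from a shared seed rather than transmit it. A minimal NumPy sketch, with function and argument names of our own choosing:

```python
import numpy as np

def codebook_for_step(t: int, codebook_size: int, dim: int, seed: int = 0) -> np.ndarray:
    """Regenerate the fixed codebook S_t = {e^(1), ..., e^(M)} for step t.

    Seeding with (seed, t) gives an independent, reproducible stream per
    step, so encoder and decoder always see identical codebooks.
    """
    rng = np.random.default_rng([seed, t])
    return rng.standard_normal((codebook_size, dim))  # M i.i.d. N(0, I) rows

S_t = codebook_for_step(t=120, codebook_size=256, dim=32 * 32 * 3)  # (256, 3072)
```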

2. Modified Sampling and Compression Mechanism

In classical DDPM reverse dynamics, a fresh noise draw is injected at each iteration:

$$x_{t-1} = \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

gDDCM amends this by substituting $\epsilon$ with the nearest codebook entry $z$:

$$x_{t-1} = \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)\,z, \qquad z = \operatorname*{argmin}_{e \in \mathcal{S}_t} \|\epsilon' - e\|_2$$

where $\epsilon'$ is the estimated noise at step $t$ (e.g., $(x_t - s(t)\,x_0)/\sigma(t)$ or from a model prediction) (Kong, 17 Nov 2025). The index $c_t$ of $z$ is stored, yielding a discrete token sequence.
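A minimal sketch of one such step, assuming flattened vectors and taking the backbone's $\mu_\theta$ and $\Sigma_\theta$ as given (names are ours, not from the papers):

```python
import numpy as np

def quantized_reverse_step(mu, sigma, codebook, eps_est):
    """One gDDCM reverse step: the fresh Gaussian draw of standard DDPM
    sampling is replaced by the nearest fixed codebook entry.

    mu, sigma : mu_theta(x_t, t) and Sigma_theta(x_t, t) from the
                pre-trained backbone (assumed given, flattened vectors).
    codebook  : (M, dim) fixed Gaussian codebook S_t for this step.
    eps_est   : estimated noise eps' at step t, e.g. (x_t - s(t) x0) / sigma(t).
    Returns x_{t-1} and the stored token index c_t.
    """
    # Nearest-neighbor search over the codebook: the only added cost.
    c_t = int(np.argmin(np.sum((codebook - eps_est) ** 2, axis=1)))
    z = codebook[c_t]
    return mu + sigma * z, c_t
```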

For continuous-time or flow-based variants (score-based SDE, consistency, rectified flow), a de-noising and re-noising strategy integrates codebook quantization into the ODE or SDE update, maintaining the same discretization convention via the unified marginal form (Kong, 17 Nov 2025).

3. Bit-Stream Construction and Rate–Distortion Analysis

The compressed representation is the sequence of codebook indices $\{i_1, \ldots, i_N\}$ per image. Each index requires $\log_2 M$ bits; the total bitrate is $N \log_2 M$ bits per image. The bitrate per dimension ($\mathrm{bpd}$) is calculated as:

$$\mathrm{bpd} = \frac{N \log_2 M}{H \cdot W \cdot C}$$

where $H$, $W$, $C$ are height, width, and channel count. Entropy coding (e.g., Huffman, arithmetic) can further reduce bit-cost by leveraging non-uniform token probabilities (Kong, 17 Nov 2025).
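As a concrete check of the formula (a small Python sketch; the helper name is ours):

```python
import math

def bits_per_dimension(n_steps: int, codebook_size: int, h: int, w: int, c: int) -> float:
    """bpd = N * log2(M) / (H * W * C), before any entropy coding."""
    return n_steps * math.log2(codebook_size) / (h * w * c)

# CIFAR-10 configuration reported below: N = 300, M = 256, 32x32x3 images.
print(bits_per_dimension(300, 256, 32, 32, 3))  # 300 * 8 / 3072 = 0.78125
```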

The rate–distortion tradeoff is tunable via the codebook size $M$ and step count $N$:

  • Larger $M$ increases fidelity but also token entropy.
  • Gains saturate beyond $M \approx 512$ and $N \approx 300$ on typical datasets such as CIFAR-10 (Kong, 17 Nov 2025).
  • Matching pursuit and convex codebook combinations permit finer bitrate control at the cost of additional encoding bits (Ohayon et al., 3 Feb 2025).

4. Generalization to Conditional Generation and Tasks

gDDCM generalizes tokenized inference to arbitrary diffusion-based conditional generation by integrating a task-specific loss $\ell(y, x_i, z)$ into the codebook selection:

$$k_i = \operatorname*{argmin}_{1 \leq k \leq K} \ell(y, x_i, z_i^{(k)})$$

where $K$ is the codebook size (denoted $M$ above) and $z_i^{(k)} \in \mathcal{S}_i$. With a suitable $\ell$, gDDCM recovers various tasks, as the sketch after this list illustrates:

  • Lossy compression: Set $y = x_0$ and $\ell$ as a posterior-inspired quadratic (e.g., $\ell_P = \|z - \sigma_i \nabla_{x_i} \log p_i(x_i \mid y)\|^2$).
  • Restoration/Super-resolution: Losses such as $\|(x_i + \sigma_i z) - g(y)\|^2$ bias the selection toward reconstructions consistent with the degraded input (Ohayon et al., 3 Feb 2025). As $K \to \infty$, the quantized noise converges to the posterior score, and the procedure discretizes the probability-flow ODE, yielding approximate posterior sampling (Ohayon et al., 3 Feb 2025).
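A minimal sketch of the selection rule for the restoration case, assuming flattened vectors and a caller-supplied degradation operator `degrade` standing in for $g$ (both assumptions, not an API from the papers):

```python
import numpy as np

def select_token(codebook: np.ndarray, x_i: np.ndarray, sigma_i: float,
                 y: np.ndarray, degrade) -> int:
    """Task-conditioned selection k_i = argmin_k l(y, x_i, z^(k)).

    Uses the restoration-style loss ||(x_i + sigma_i * z) - g(y)||^2
    from the list above; `degrade` plays the role of g.
    """
    target = degrade(y)                      # g(y), same shape as x_i
    candidates = x_i + sigma_i * codebook    # (K, dim): one update per entry
    losses = np.sum((candidates - target) ** 2, axis=1)
    return int(np.argmin(losses))
```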

5. Practical Implementation and Experimental Findings

gDDCM is entirely inference-driven: no retraining, fine-tuning, or auxiliary loss terms are needed. The only computational overhead is per-step nearest neighbor search in the codebook (Ohayon et al., 3 Feb 2025, Kong, 17 Nov 2025). The encoding/decoding protocol involves re-generating Gaussian codebooks from a shared seed, quantizing estimated noise vectors, and reconstructing the reverse process.
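A schematic of this protocol, under the same assumptions as the earlier sketches (flattened vectors, a `posterior(x_t, t)` callable standing in for the backbone, a shared seed for both the initial noise and the codebooks); this is our reading of the procedure, not code from the papers:

```python
import numpy as np

def encode(x0, posterior, estimate_noise, codebooks, seed=0):
    """Encoder: run the quantized reverse process, storing only the index
    of the nearest codebook entry at each step. `posterior` and
    `estimate_noise` are caller-supplied stand-ins for the backbone."""
    rng = np.random.default_rng(seed)          # shared seed: the initial
    x_t = rng.standard_normal(x0.shape)        # x_T costs no bits either
    indices = []
    for t in reversed(range(len(codebooks))):
        mu, sigma = posterior(x_t, t)          # backbone reverse-step stats
        eps = estimate_noise(x0, x_t, t)       # e.g. (x_t - s(t) x0) / sigma(t)
        c_t = int(np.argmin(np.sum((codebooks[t] - eps) ** 2, axis=1)))
        x_t = mu + sigma * codebooks[t][c_t]   # same update as Section 2
        indices.append(c_t)
    return indices                             # the entire bit-stream

def decode(indices, posterior, codebooks, shape, seed=0):
    """Decoder: replay the reverse process, replacing the nearest-neighbor
    search with a lookup at the transmitted indices."""
    rng = np.random.default_rng(seed)
    x_t = rng.standard_normal(shape)
    for t, c_t in zip(reversed(range(len(codebooks))), indices):
        mu, sigma = posterior(x_t, t)
        x_t = mu + sigma * codebooks[t][c_t]
    return x_t
```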

On standard benchmarks:

  • CIFAR-10 ($32 \times 32 \times 3$, $N = 300$, $M = 256$): gDDCM with $p = 0$ achieves FID = 3.2, LPIPS = 0.060, IS = 10.5, SSIM = 0.98. Baseline DDCM with $p = 0.5$ yields FID = 7.7, SSIM = 0.93 (Kong, 17 Nov 2025).
  • LSUN Bedroom ($256 \times 256 \times 3$, $N = 600$, $M = 512$): gDDCM attains LPIPS $\approx 0.052$, SSIM $\approx 0.96$ (Kong, 17 Nov 2025).
  • gDDCM achieves state-of-the-art performance at extreme low bitrates ($\approx 0.05$ BPP), surpassing methods such as BPG, HiFiC, PSC, and PerCo in FID and LPIPS, while performance at higher bitrates is capped by the underlying VAE latent space (Ohayon et al., 3 Feb 2025).

The approach demonstrates uniform applicability across DDPM, score-based SDE (EDM), consistency, and flow-matching models, consistently improving FID and SSIM relative to baseline DDCM.

6. Limitations, Implications, and Open Directions

The only overhead introduced by gDDCM is the codebook search per step. The quality–bitrate tradeoff and codebook efficiency are currently limited by the choice of $M$ and $N$, with no further reduction in bits after saturation. A plausible implication is that per-step adaptive or jointly learned codebooks could further compress image representations. Future research directions explicitly identified include:

  • Joint learning of codebooks to minimize $M$ without loss of quality.
  • Incorporating entropy-regularized or variational token selection.
  • Extension to conditional or structured data generation (e.g., multi-modal, class-conditional tasks).
  • Theoretical rate–distortion analysis specific to diffusion-based codecs (Kong, 17 Nov 2025).
  • Optimizing quantization strategies that align with human perceptual metrics.

gDDCM provides a principled, model-agnostic tokenization methodology for generative compression and conditional generation across the spectrum of diffusion-based models, achieving efficient and high-fidelity discrete representations on challenging visual benchmarks (Ohayon et al., 3 Feb 2025, Kong, 17 Nov 2025).
