Compressed Image Generation with Denoising Diffusion Codebook Models (2502.01189v3)

Published 3 Feb 2025 in eess.IV, cs.AI, cs.CV, cs.IT, eess.SP, and math.IT

Abstract: We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed iid Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selection rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.

Summary

The paper introduces Denoising Diffusion Codebook Models (DDCM) to generate high-quality images while producing losslessly compressed bit-streams via fixed noise codebooks.
It demonstrates that small fixed codebooks (e.g., K=64) can achieve image generation performance comparable to standard diffusion models on benchmarks.
The method extends to compressed conditional generation and image restoration tasks, offering competitive perceptual quality and efficient compression.

This paper introduces Denoising Diffusion Codebook Models (DDCM), a novel generative approach that builds upon Denoising Diffusion Models (DDMs) to produce high-quality image samples along with their losslessly compressed bit-stream representations. The core idea is to replace the standard practice of sampling noise from a continuous Gaussian distribution at each step of the reverse diffusion process with selecting noise vectors from pre-defined, fixed codebooks. Each codebook $\mathcal{C}_i$ contains $K$ pre-sampled i.i.d. Gaussian noise vectors $[z_i^{(1)}, z_i^{(2)}, \ldots, z_i^{(K)}]$. During generation, instead of drawing $z_i \sim \mathcal{N}(0, I)$, an index $k_i$ is chosen and the codebook entry $z_i(k_i) = z_i^{(k_i)}$ from $\mathcal{C}_i$ is used. The sequence of chosen indices $(k_{T+1}, \ldots, k_2)$ forms the lossless bit-stream for the generated image. This modification can be applied to any pre-trained DDM without further training.
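To make the mapping from indices to a bit-stream concrete, here is a minimal sketch, not the authors' implementation. It assumes NumPy, a seed shared by encoder and decoder so that both build identical codebooks, 0-based indices, and $K \geq 2$ a power of two so that each index costs exactly $\log_2(K)$ bits. The helper names `make_codebooks`, `pack_indices`, and `unpack_indices` are hypothetical.

```python
import numpy as np

def make_codebooks(T, K, shape, seed=0):
    """Return {i: C_i} for i = T+1, ..., 2; each C_i holds K i.i.d. N(0, I) vectors."""
    rng = np.random.default_rng(seed)
    return {i: rng.standard_normal((K, *shape)) for i in range(T + 1, 1, -1)}

def pack_indices(indices, K):
    """Concatenate 0-based indices into a bit string, log2(K) bits per index."""
    bits_per_index = K.bit_length() - 1  # exact log2 for a power of two
    return "".join(format(k, f"0{bits_per_index}b") for k in indices)

def unpack_indices(bitstream, K):
    """Inverse of pack_indices."""
    b = K.bit_length() - 1
    return [int(bitstream[j:j + b], 2) for j in range(0, len(bitstream), b)]

T, K, shape = 10, 64, (3, 8, 8)
codebooks = make_codebooks(T, K, shape, seed=123)

# Random index selection corresponds to plain (unconditional) DDCM sampling.
rng = np.random.default_rng(0)
indices = [int(rng.integers(K)) for _ in range(T)]
bitstream = pack_indices(indices, K)
```

Because the codebooks are a deterministic function of the seed, only the indices need to be transmitted; the decoder rebuilds the same codebooks and looks up the same noise vectors.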

The authors demonstrate that DDCM surprisingly retains the sample quality and diversity of standard DDMs even with very small codebook sizes $K$. For instance, experiments show that DDCM with $K=64$ achieves FID scores comparable to standard DDPM (which is equivalent to DDCM with $K=\infty$) on class-conditional ImageNet ($256 \times 256$) and text-conditional Stable Diffusion 2.1 ($768 \times 768$) generation.

Image Compression with DDCM

Leveraging the inherent compressed representation, DDCM is adapted into a highly effective lossy image codec. To compress a real image $x_0$, the noise selection at each timestep $i$ is guided toward reconstructing $x_0$. Specifically, the codebook entry $z_i(k_i)$ is chosen to maximize the inner product with the residual between the target image and the current prediction $\hat{x}_{0|i}$: $k_i = \operatorname*{arg\,max}_{k\in\{1,\ldots,K\}} \langle z_i(k), x_0-\hat{x}_{0|i}\rangle$. The initial codebook $\mathcal{C}_{T+1}$ contains a single vector ($K=1$), so $x_T$ is fixed and costs no bits. Decompression runs the DDCM generation process with the stored sequence of indices.

The bit-stream length is $(T-1)\log_2(K)$ bits, since each of the $T-1$ selected indices costs $\log_2(K)$ bits and the first index is free. For higher bit rates, a matching pursuit (MP) inspired approach is proposed. Instead of selecting a single noise vector, the noise at step $i$ is a convex combination of $M$ elements from $\mathcal{C}_i$, greedily selected to correlate with $x_0-\hat{x}_{0|i}$. The combination uses $M-1$ quantized scalar coefficients, each costing $C$ bits (i.e., drawn from a set of $2^C$ values), which increases the bit-stream length to $(T-1)(\log_2(K)M + C(M-1))$.
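As a quick check of the rate formulas, the sketch below evaluates them for one setting. The values $T=1000$, $K=4096$, and a $256 \times 256$ image are assumptions chosen for illustration; $T=1000$ is a common DDPM schedule length and reproduces the roughly $0.183$ BPP quoted later in this summary for $K=4096$. The function names are hypothetical, and `C` enters exactly as in the formula above.

```python
import math

def ddcm_bits(T, K, M=1, C=0):
    """Bit-stream length (T-1) * (log2(K) * M + C * (M-1)); M=1 is the basic scheme."""
    return (T - 1) * (math.log2(K) * M + C * (M - 1))

def bits_per_pixel(total_bits, height, width):
    """Bits per pixel for a single image of the given spatial size."""
    return total_bits / (height * width)

# Assumed setting: T = 1000 steps, K = 4096, one 256x256 image.
bits = ddcm_bits(T=1000, K=4096)
bpp = bits_per_pixel(bits, 256, 256)
print(bits, round(bpp, 3))  # 11988.0 0.183
```

Doubling the codebook size adds only one bit per step, which is why the matching pursuit variant (larger `M`) is the practical route to substantially higher rates.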

Experiments on Kodak24, DIV2K, ImageNet, and CLIC2020 show that DDCM achieves state-of-the-art perceptual image compression, particularly at lower bit rates. It often surpasses existing methods in FID while maintaining competitive PSNR and LPIPS. ImageNet ($256 \times 256$) is compressed with a pixel-space DDM and the other datasets with Stable Diffusion 2.1 ($512 \times 512$); in both settings DDCM shows superior rate-perception-distortion performance compared to methods like BPG, HiFiC, PSC, ILLM, and PerCo (SD). A limitation noted is that at very high bit rates, the performance of latent space DDCM is capped by the VAE's reconstruction quality.

Compressed Conditional Generation

DDCM is extended to a general framework for compressed conditional generation, where images are generated directly in their compressed form based on a condition $y$. The noise indices $k_i$ are chosen by minimizing a loss function $\mathcal{L}(y, x_i, \tilde{x}_{i-1}, k)$ that encourages the generated sample to match $y$. A specific loss is proposed: $\mathcal{L}_{\text{score}}(y, x_i, \tilde{x}_{i-1}, k) = \|z_i(k) - \sigma_i \nabla_{x_i} \log p_i(y|x_i)\|^2$. Proposition 5.1 states that if $k_i$ is chosen using this loss, as $K \rightarrow \infty$, the DDCM generative process becomes a discretization of a probability flow ODE over the posterior $p_0(x_0|y)$. This framework allows generating conditional samples whose bit-streams can be decoded without access to the original condition $y$. The image compression scheme described earlier is shown to be a special case where $y=x_0$.
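A short expansion, added here as a reading aid rather than as part of the paper's proof, suggests why this loss resembles the inner-product rule used for compression. Write $g_i = \nabla_{x_i} \log p_i(y|x_i)$ and expand the squared norm:

```latex
\begin{align*}
\|z_i(k) - \sigma_i g_i\|^2
  = \|z_i(k)\|^2 - 2\sigma_i \langle z_i(k), g_i \rangle + \sigma_i^2 \|g_i\|^2 .
\end{align*}
```

The last term does not depend on $k$. For i.i.d. standard Gaussian vectors in $d$ dimensions, $\|z_i(k)\|^2$ concentrates around $d$, so the first term is nearly the same for every codebook entry. Under that approximation, and with $\sigma_i > 0$, minimizing $\mathcal{L}_{\text{score}}$ over $k$ is approximately the same as maximizing $\langle z_i(k), g_i \rangle$. If, in addition, $g_i$ is a positive multiple of $x_0 - \hat{x}_{0|i}$ when $y = x_0$ (an assumption made here for illustration), this becomes the compression rule $\operatorname*{arg\,max}_k \langle z_i(k), x_0 - \hat{x}_{0|i} \rangle$.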

Applications:

Compressed Posterior Sampling for Image Restoration:

For linear inverse problems ($y=Ax_0$), DDCM is used for zero-shot image restoration. At each step, the index $k_i$ is chosen to minimize $\|y - A(\mu_i(x_i) + \sigma_i z_i(k))\|^2$ over $k$, where $\mu_i(x_i)$ is the deterministic part of the DDPM update. This approximates posterior sampling. Using an unconditional ImageNet $256 \times 256$ DDM with $K=4096$ (approx. $0.183$ BPP), DDCM is compared against DPS and DDNM for colorization and $4\times$ super-resolution. DDCM achieves superior perceptual quality (FID) compared to these methods and their naively compressed outputs, with competitive PSNR.
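The selection rule for linear inverse problems can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the degradation is a hypothetical $4\times$ average-pooling operator on a single-channel image, `mu_i` is a random stand-in for the DDPM mean $\mu_i(x_i)$, and indices are 0-based.

```python
import numpy as np

def select_restoration_index(y, A, mu_i, sigma_i, codebook_i):
    """Return the 0-based k minimizing ||y - A(mu_i + sigma_i * z_i(k))||^2."""
    errors = [np.sum((y - A(mu_i + sigma_i * z)) ** 2) for z in codebook_i]
    return int(np.argmin(errors))

def downsample4(x):
    """Hypothetical linear degradation: 4x average pooling of an (H, W) image."""
    H, W = x.shape
    return x.reshape(H // 4, 4, W // 4, 4).mean(axis=(1, 3))

rng = np.random.default_rng(0)
K, H, W = 32, 16, 16
codebook_i = rng.standard_normal((K, H, W))
mu_i = rng.standard_normal((H, W))        # stand-in for mu_i(x_i)
sigma_i = 0.5

# Build a measurement that codebook entry 7 explains exactly.
x_true = mu_i + sigma_i * codebook_i[7]
y = downsample4(x_true)

k = select_restoration_index(y, downsample4, mu_i, sigma_i, codebook_i)
```

In this toy setup the measurement is constructed from entry 7, so the search recovers that entry; with a real DDM no entry fits exactly and the rule simply picks the most measurement-consistent candidate.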
Compressed Real-World Face Image Restoration:

A method for blind face restoration is proposed that optimizes a no-reference image quality assessment (NR-IQA) measure at test time. At each step $i$ :
- $k_{i,D}$ is chosen to direct towards an MMSE estimate $x_{\text{MMSE}}(y)$ , promoting low distortion: $k_{i,D} = \argmax_k \langle z_i(k), x_{\text{MMSE}}(y) - \hat{x}_{0|i} \rangle$ .
- $k_{i,P}$ is chosen randomly from $\{1, \ldots, K\}$ , promoting high perceptual quality.
- The final $k_i$ is selected from $\{k_{i,D}, k_{i,P}\}$ to minimize $\text{MSE}(x_{\text{MMSE}}(y), \hat{x}_{0|i-1}^{(k)}) + \lambda Q(\hat{x}_{0|i-1}^{(k)})$, where $Q(\cdot)$ is an NR-IQA measure (e.g., NIQE, CLIP-IQA+).

Experiments using an FFHQ $512 \times 512$ DDM with $K=4096$ show that this approach effectively optimizes various NR-IQA measures (NIQE, CLIP-IQA+, TOPIQ-FACE) and generalizes well (FD$_{\text{DINOv2}}$), producing high-quality restorations with fewer artifacts than methods like PMRF, DifFace, and BFRffusion on datasets like CelebA-Test, LFW-Test, and WIDER-Test.
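The two-candidate selection described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: `predict_x0_next` is a hypothetical callable that takes the step with a given noise vector and returns $\hat{x}_{0|i-1}^{(k)}$, `quality` is a hypothetical NR-IQA stand-in for which lower is assumed to be better (a higher-is-better measure would need its sign flipped), and indices are 0-based.

```python
import numpy as np

def select_face_index(codebook_i, x_mmse, x0_hat_i, predict_x0_next, quality, lam, rng):
    """Pick k_i from {k_D, k_P} by minimizing MSE(x_mmse, x0_next) + lam * Q(x0_next)."""
    K = codebook_i.shape[0]
    residual = (x_mmse - x0_hat_i).ravel()
    # Distortion-oriented candidate: best correlated with the MMSE residual.
    k_D = int(np.argmax(codebook_i.reshape(K, -1) @ residual))
    # Perception-oriented candidate: a uniformly random entry.
    k_P = int(rng.integers(K))

    def cost(k):
        x0_next = predict_x0_next(codebook_i[k])  # x_hat_{0|i-1}^{(k)}
        return np.mean((x_mmse - x0_next) ** 2) + lam * quality(x0_next)

    return min((k_D, k_P), key=cost)

# Toy usage with stand-ins for the denoiser and the quality measure.
rng = np.random.default_rng(0)
K, shape = 16, (8, 8)
codebook_i = rng.standard_normal((K, *shape))
x_mmse = rng.standard_normal(shape)
x0_hat_i = np.zeros(shape)
predict_x0_next = lambda z: x0_hat_i + 0.1 * z
quality = lambda x: float(np.abs(np.diff(x, axis=0)).mean())
k_i = select_face_index(codebook_i, x_mmse, x0_hat_i, predict_x0_next, quality, lam=0.1, rng=rng)
```

Only two candidates are scored per step, so the NR-IQA measure is evaluated twice per step rather than $K$ times.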

Implementation Considerations and Pseudocode Sketch

DDCM Core Sampling Step:

import numpy as np

def ddcm_step(x_i, i, DDM_denoiser, codebook_i, sigma_i, selection_rule, guidance_info=None, rng=None):
    """One reverse step x_i -> x_{i-1} with the noise taken from codebook_i.

    codebook_i is an array of shape (K, *x_i.shape). compute_mu_i is an assumed
    helper, not defined here: it wraps DDM_denoiser (epsilon or x_0 prediction)
    and returns the deterministic part mu_i(x_i) of the update, as in Eq. 3.
    Returns (x_prev, k_i), with k_i 1-based as in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_i_val = compute_mu_i(x_i, i, DDM_denoiser)
    K = codebook_i.shape[0]

    if selection_rule == "random":
        # Plain generation: pick a codebook entry uniformly at random.
        k_i = int(rng.integers(1, K + 1))
    elif selection_rule == "compression":
        # guidance_info holds the target image x_0 and the prediction x_hat_{0|i} (Eq. 4).
        residual = guidance_info["target_image"] - guidance_info["predicted_x_0_hat"]
        # Inner product of every codebook entry with the residual; keep the largest.
        scores = codebook_i.reshape(K, -1) @ residual.ravel()
        k_i = int(np.argmax(scores)) + 1
    else:
        # Other selection rules (conditional generation, restoration, ...) go here.
        raise ValueError(f"unknown selection rule: {selection_rule}")

    x_prev = mu_i_val + sigma_i * codebook_i[k_i - 1]
    return x_prev, k_i

Overall DDCM Generation/Compression:

import numpy as np

def ddcm_generate_or_compress(T, codebooks, DDM_denoiser, noise_schedule, selection_rule, guidance_info_func=None, rng=None):
    """Run the DDCM reverse process; return (x_0, [k_{T+1}, k_T, ..., k_2]).

    codebooks maps i in {2, ..., T+1} to the array C_i of shape (K_i, *image_shape),
    and noise_schedule.sigma[i] is sigma_i. compute_mu_i and DDM_denoiser.predict_x0
    are assumed wrappers around the pre-trained model, not defined here.
    Indices are 1-based, as in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Initialization (Sec. 3): x_T = z_{T+1}(k_{T+1}).
    first_codebook = codebooks[T + 1]
    if selection_rule == "random":
        k_first = int(rng.integers(1, first_codebook.shape[0] + 1))
    else:
        # Compression (Sec. 4) uses a first codebook of size K = 1,
        # so the first index is always 1 and costs no bits.
        k_first = 1
    current_x = first_codebook[k_first - 1]
    indices = [k_first]

    # Steps i = T, ..., 2 (Eq. 6): x_{i-1} = mu_i(x_i) + sigma_i * z_i(k_i), z_i(k_i) from C_i.
    for i in range(T, 1, -1):
        guid_info = None
        if guidance_info_func is not None:
            # x_hat_{0|i}: the denoiser's estimate of the clean image given x_i.
            x0_pred = DDM_denoiser.predict_x0(current_x, i)
            # For compression, guidance_info_func should return
            # {"target_image": x_0, "predicted_x_0_hat": x0_pred}.
            guid_info = guidance_info_func(current_x, i, x0_pred)

        current_x, k_i = ddcm_step(current_x, i, DDM_denoiser, codebooks[i],
                                   noise_schedule.sigma[i], selection_rule, guid_info)
        indices.append(k_i)

    # Final step i = 1 adds no noise: x_0 = mu_1(x_1).
    final_x0 = compute_mu_i(current_x, 1, DDM_denoiser)
    return final_x0, indices

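Note: The pseudocode above is a sketch. Helpers such as compute_mu_i and the denoiser's $x_0$ prediction stand in for calls into a pre-trained DDM, and the timestep and codebook indexing must be mapped carefully to match the paper's formulation.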
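To illustrate that the stored indices are sufficient for decoding, the following self-contained toy replaces the pre-trained denoiser with simple linear stand-ins (`toy_mu` and `toy_x0_pred`, both hypothetical). It says nothing about reconstruction quality; it only shows that a decoder replaying the indices reproduces the encoder's output exactly. Indices are 0-based here, and $k_{T+1}$ is omitted because the first codebook has a single entry.

```python
import numpy as np

def toy_mu(x_i, i):
    """Stand-in for mu_i(x_i); a real DDCM would call the pre-trained denoiser."""
    return 0.9 * x_i

def toy_x0_pred(x_i, i):
    """Stand-in for the denoiser's clean-image estimate x_hat_{0|i}."""
    return 0.5 * x_i

def encode(x0, codebooks, sigmas, T):
    """Greedy compression: choose k_i maximizing <z_i(k), x0 - x_hat_{0|i}>."""
    x = codebooks[T + 1][0]                  # single-entry first codebook: x_T is fixed
    indices = []
    for i in range(T, 1, -1):
        C = codebooks[i]
        residual = (x0 - toy_x0_pred(x, i)).ravel()
        k = int(np.argmax(C.reshape(len(C), -1) @ residual))
        x = toy_mu(x, i) + sigmas[i] * C[k]
        indices.append(k)
    return toy_mu(x, 1), indices             # final step adds no noise

def decode(indices, codebooks, sigmas, T):
    """Decompression: replay the stored indices; x0 itself is never needed."""
    x = codebooks[T + 1][0]
    for i, k in zip(range(T, 1, -1), indices):
        x = toy_mu(x, i) + sigmas[i] * codebooks[i][k]
    return toy_mu(x, 1)

T, K, shape = 20, 16, (4, 4)
rng = np.random.default_rng(0)
codebooks = {i: rng.standard_normal((K, *shape)) for i in range(2, T + 1)}
codebooks[T + 1] = rng.standard_normal((1, *shape))
sigmas = {i: 0.1 for i in range(2, T + 1)}

x0 = rng.standard_normal(shape)
x_enc, indices = encode(x0, codebooks, sigmas, T)
x_dec = decode(indices, codebooks, sigmas, T)
```

The encoder and decoder perform the same arithmetic in the same order, so `x_enc` and `x_dec` agree bit for bit, and the payload is the $T-1$ stored indices.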

Computational Requirements:

Storage for codebooks: $T \times K \times \text{noise\_vector\_dim} \times \text{precision}$.
Generation/Decompression: Same as standard DDM sampling ($T$ denoiser passes) plus $T$ codebook lookups, since indices are either drawn at random (generation) or read from the bit-stream (decompression). Conditional generation with a search-based selection rule (e.g., argmax) additionally evaluates the rule $K$ times per step, about $T \times K$ evaluations in total.
Compression: $T$ denoiser passes, plus $K$ evaluations of the inner product (or other selection criterion) at each step. With MP, the search is repeated for each of the $M$ selected elements, so the cost grows with $M$.
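For a sense of scale, the storage formula can be evaluated for one setting. The numbers below ($T=1000$, $K=64$, pixel-space noise of size $3 \times 256 \times 256$, 4-byte floats) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def codebook_storage_bytes(T, K, noise_vector_dim, bytes_per_value):
    """T codebooks x K vectors x vector dimension x bytes per value."""
    return T * K * noise_vector_dim * bytes_per_value

# Assumed setting: T = 1000, K = 64, 3x256x256 noise vectors, float32.
n_bytes = codebook_storage_bytes(1000, 64, 3 * 256 * 256, 4)
print(n_bytes / 1e9)  # 50.331648 (GB)
```

Materializing every codebook is therefore expensive even for small $K$. One common workaround, stated here as an assumption rather than something this summary attributes to the paper, is to regenerate each codebook on the fly from a seed shared by encoder and decoder.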

Limitations:

Theoretical understanding of why small codebooks work well is lacking.
Performance of latent space DDCM is bounded by the VAE, especially at high bit rates.
The matching pursuit for higher bit rates might not be optimal.
Entropy coding of indices could further improve compression.
Codebooks are fixed; optimizing them (e.g., via dictionary learning) could yield improvements.

The paper concludes that DDCM is a promising direction, achieving strong empirical results across various tasks and opening avenues for future theoretical and practical enhancements.

Authors (4)
