CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation (2310.01407v2)

Published 2 Oct 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally, a conditional consistency loss enforces consistent predictions across diffusion steps, effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods, achieving a new state-of-the-art in producing high-quality images with very few steps (e.g., 1-4) across multiple tasks, including super-resolution, text-guided image editing, and depth-to-image generation.

Conditional Diffusion Distillation for Enhanced Image Generation

The paper "CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation" addresses a critical limitation in the deployment of large generative diffusion models, particularly in real-time applications. These models exhibit promise for diverse tasks such as image enhancement, editing, and restoration. However, the computational expense associated with their iterative refinement processes severely restricts their practicality. The authors propose a novel method named CoDi, which introduces conditional inputs into a pre-trained diffusion model, significantly reducing the requisite sampling steps for high-quality image generation.

Key Contributions

  1. Conditional Diffusion Model Distillation: The work adapts a pre-trained latent diffusion model into a conditional framework capable of handling various image conditioning inputs. The methodology integrates architectures such as ControlNet to incorporate these inputs without eroding the model's pre-trained knowledge.
  2. Sampling Efficiency: By employing a conditional consistency loss across diffusion steps, CoDi generates high-quality images in substantially fewer steps (1-4), as opposed to the 20-200 steps typically required by preceding models. This marks a notable advance in the speed and practicality of diffusion models for real-time applications.
  3. Parameter-Efficient Distillation: The authors introduce a parameter-efficient distillation paradigm. By adding only a small number of learnable parameters, the framework supports conditional generation while preserving the original model's priors (a sketch of this adapter pattern follows this list).
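
To make the parameter-efficient, prior-preserving idea concrete, here is a minimal PyTorch sketch of a ControlNet-style conditioning branch whose output projection is zero-initialized, so the frozen backbone's behavior is unchanged at the start of training. The module names, shapes, and wiring here are illustrative assumptions, not the paper's actual architecture.

```python
import copy

import torch.nn as nn

class ZeroConv(nn.Module):
    """1x1 convolution initialized to zero, so the conditioning branch
    contributes nothing before training begins (ControlNet-style)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

class ConditionalAdapter(nn.Module):
    """Trainable copy of one encoder block plus a zero-initialized
    projection (hypothetical layout). At initialization the residual is
    exactly zero, so the frozen backbone's pre-trained behavior is kept."""
    def __init__(self, encoder_block, channels):
        super().__init__()
        self.trainable_copy = copy.deepcopy(encoder_block)  # learnable
        self.zero_proj = ZeroConv(channels)

    def forward(self, h, cond_feat):
        # h: hidden state from the frozen backbone
        # cond_feat: features extracted from the conditioning image
        return h + self.zero_proj(self.trainable_copy(h + cond_feat))
```

Because the zero-initialized projection outputs exactly zero before training, the adapted model reproduces the pre-trained model at initialization, which is the property the zero-initialization strategy described below relies on.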

Methodological Insights

The CoDi framework revolves around distilling an unconditional diffusion model into one that integrates conditional inputs. The distillation process employs a conditional consistency loss that enforces coherent predictions across diffusion steps, ensuring that the model adapts to the new conditions with minimal iterative updates. This approach involves the following notable features:

  • Encoder Adaptation: Using a zero-initialization strategy, CoDi adapts encoder layers from the unconditional framework to incorporate conditional data, blending the pre-trained model with task-specific inputs.
  • Loss Formulation: The loss function balances self-consistency in the noise prediction space against guidance from the conditional inputs, harmonizing the pre-training priors with adaptation to new tasks (a training-step sketch follows this list).
  • Accelerated Training and Inference: By distilling the model into fewer steps with optimized prediction rules for the latent variables, CoDi improves training efficiency and accelerates inference without significantly compromising image quality.
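
The following is a minimal sketch of what one conditional consistency training step could look like, assuming an EMA copy of the student as the consistency target and a diffusers-style scheduler exposing `num_steps` and an `add_noise` method; the names and exact weighting are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def consistency_distillation_step(student, ema_teacher, x0, cond, scheduler):
    """One training step for a conditional consistency objective (sketch).

    Predictions made from adjacent noise levels should agree on the same
    clean latent; `scheduler.add_noise(x0, noise, t)` is an assumed,
    diffusers-style API.
    """
    b = x0.shape[0]
    # Sample a random timestep >= 1 so that t - 1 is valid.
    t = torch.randint(1, scheduler.num_steps, (b,), device=x0.device)
    noise = torch.randn_like(x0)

    # The same noise realization injected at two adjacent noise levels.
    x_t = scheduler.add_noise(x0, noise, t)
    x_tm1 = scheduler.add_noise(x0, noise, t - 1)

    # Student denoises from the noisier level; the EMA teacher provides
    # the consistency target from the slightly cleaner level.
    pred = student(x_t, t, cond)
    with torch.no_grad():
        target = ema_teacher(x_tm1, t - 1, cond)

    # Conditional consistency loss: both predictions should land on the
    # same clean latent, enabling high-quality generation in few steps.
    return F.mse_loss(pred, target)
```

The key design choice is that both predictions share the same noise realization and differ only in the noise level, so minimizing their disagreement pushes the model toward a mapping that can jump to the clean latent from any step.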

Experimental Validation

The evaluation of CoDi spans multiple tasks, showcasing its versatility and performance. Results on super-resolution, inpainting, and depth-to-image generation substantiate its robust performance across different conditional generation scenarios. The model not only outperformed prior distillation methods but also achieved results comparable to state-of-the-art models while using a fraction of the typical sampling steps.
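
For intuition about how few-step inference proceeds, here is a hedged sketch of a 4-step sampler in the style of multi-step consistency sampling: predict the clean latent, re-noise it to the next lower level, and repeat. The step schedule, latent shape, and scheduler API are placeholder assumptions rather than the paper's settings.

```python
import torch

@torch.no_grad()
def few_step_sample(model, cond, scheduler, steps=(999, 666, 333, 0),
                    shape=(1, 4, 64, 64)):
    """Illustrative 4-step sampler: predict the clean latent, then re-noise
    it at the next (lower) timestep. All names here are hypothetical."""
    device = cond.device
    x = torch.randn(shape, device=device)
    for i, t in enumerate(steps):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        x0_pred = model(x, t_batch, cond)  # model predicts the clean latent
        if i + 1 < len(steps):
            # Jump back down: add fresh noise at the next, smaller timestep.
            next_t = torch.full((shape[0],), steps[i + 1],
                                device=device, dtype=torch.long)
            x = scheduler.add_noise(x0_pred, torch.randn_like(x0_pred), next_t)
        else:
            x = x0_pred
    return x
```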

Implications and Future Work

Practically, CoDi has profound implications for the real-time application of diffusion models in diverse fields, improving accessibility and operational efficiency. Theoretically, it challenges existing paradigms on model pre-training and task adaptation, providing compelling evidence for more streamlined, resource-efficient learning processes.

Future research prospects could explore broader applications of conditional diffusion distillation across different domains such as video generation or 3D content synthesis. Additionally, investigating lightweight model architectures in conjunction with this distillation approach could further reduce computational demands, enhancing the scalability and integration of these models into consumer-grade devices.

In sum, CoDi marks a significant stride in the evolution of diffusion models, bolstering their adaptability and efficiency, and clearing a key obstacle to the widespread deployment of conditional generative models.

Authors (6)
  1. Kangfu Mei (21 papers)
  2. Mauricio Delbracio (36 papers)
  3. Hossein Talebi (24 papers)
  4. Zhengzhong Tu (71 papers)
  5. Vishal M. Patel (230 papers)
  6. Peyman Milanfar (64 papers)
Citations (4)