eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
The paper "eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers" presents a significant advancement in text-conditioned image synthesis, utilizing large-scale diffusion-based generative models. The primary innovation in this work is the introduction of an ensemble of diffusion models, each specialized for different stages of the image synthesis process. This approach, termed eDiff-I, addresses a critical observation that the synthesis behavior of diffusion models shifts over time. Early sampling phases are heavily influenced by text prompts, while later stages focus on refining visual fidelity, often disregarding text conditioning.
Methodology and Model Architecture
To preserve training efficiency, the authors first train a single model and then progressively split it into specialized experts, each fine-tuned for a specific stage of the synthesis process. This departs from the common practice of using one model for all synthesis stages, which may not optimally serve the differing requirements that arise over the course of generation.
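The snippet below is a hedged illustration of this progressive branching, assuming a simple binary split of the timestep range into halves; the helper names and splitting schedule are invented for illustration and do not reproduce the authors' training code.

```python
# Illustrative sketch of progressive branching: children are initialized from
# their parent's weights and would then be fine-tuned only on noise levels
# inside their assigned interval. The binary split schedule is an assumption.
import copy


def branch(parent, interval):
    """Clone the parent model and assign it a (lo, hi) timestep interval."""
    return copy.deepcopy(parent), interval


def progressive_branching(base_model, full_interval=(0, 1000), levels=2):
    """Start from one model trained on all noise levels, then repeatedly
    split each expert's interval in half, yielding 2**levels experts."""
    experts = [(base_model, full_interval)]
    for _ in range(levels):
        children = []
        for model, (lo, hi) in experts:
            mid = (lo + hi) // 2
            children.append(branch(model, (lo, mid)))
            children.append(branch(model, (mid, hi)))
        experts = children
        # In a real training loop, each child is fine-tuned on its interval
        # here before the next round of splitting.
    return experts
```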
The eDiff-I pipeline comprises a base diffusion model that generates images at 64×64 resolution, followed by two super-resolution diffusion models that upsample to 256×256 and then 1024×1024. By conditioning on both T5 and CLIP text embeddings, eDiff-I achieves improved text alignment while retaining high visual quality, and the expert ensemble adds capacity without increasing inference cost.
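A rough sketch of such a cascaded pipeline is shown below. The encoder, sampler, and stage interfaces (encode_prompt, generate, sr_256, sr_1024) are hypothetical placeholders meant only to convey the data flow: 64×64 base generation followed by two super-resolution stages, all reusing the same T5 and CLIP text embeddings.

```python
# Hedged sketch of the cascaded pipeline described above. The encoder,
# sampler, and model interfaces are placeholders, not the authors' API.
import torch


def encode_prompt(prompt, t5_encoder, clip_text_encoder):
    # Text embeddings are computed once and shared by every stage.
    return {
        "t5": t5_encoder(prompt),           # token-level text embeddings
        "clip": clip_text_encoder(prompt),  # pooled text embedding
    }


@torch.no_grad()
def generate(prompt, base_ensemble, sr_256, sr_1024,
             t5_encoder, clip_text_encoder, sampler, batch_size=1):
    cond = encode_prompt(prompt, t5_encoder, clip_text_encoder)

    # Stage 1: ensemble of expert denoisers generates a 64x64 image.
    x_64 = sampler(base_ensemble, shape=(batch_size, 3, 64, 64), cond=cond)

    # Stage 2: diffusion super-resolution to 256x256, conditioned on x_64.
    x_256 = sampler(sr_256, shape=(batch_size, 3, 256, 256),
                    cond=cond, low_res=x_64)

    # Stage 3: diffusion super-resolution to 1024x1024, conditioned on x_256.
    return sampler(sr_1024, shape=(batch_size, 3, 1024, 1024),
                   cond=cond, low_res=x_256)
```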
Numerical Results and Impact
The paper demonstrates eDiff-I's advantage over prior models, including GLIDE, DALL-E 2, and Stable Diffusion, reporting lower zero-shot FID on the COCO dataset. The improvement is attributed to the ensemble's ability to devote dedicated expert denoisers to the distinct demands of different sampling stages.
Theoretical and Practical Implications
The multi-expert framework scales model capacity without adding inference time, a crucial concern for large-scale models: only one expert is evaluated at each sampling step, so per-step compute matches that of a single model. This design may prompt future research into similar ensemble strategies for other generative and conditional models beyond image synthesis, such as language or video generation.
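A toy calculation makes this tradeoff explicit; the function name and parameter counts below are invented solely for illustration.

```python
# Toy calculation of the capacity/compute tradeoff: with N experts of P
# parameters each, total capacity is N * P, but only one expert is evaluated
# per sampling step, so per-step compute stays roughly at P.
def ensemble_footprint(num_experts, params_per_expert):
    total_params = num_experts * params_per_expert  # capacity grows with N
    active_params_per_step = params_per_expert      # per-step cost does not
    return total_params, active_params_per_step


total, active = ensemble_footprint(num_experts=3,
                                   params_per_expert=1_000_000_000)
print(f"total capacity: {total:,} params; active per step: {active:,} params")
```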
Future Developments in AI
eDiff-I's incorporation of multiple embeddings points to a future where generative models are conditioned on diverse and rich sources of information. This versatility could enable even more sophisticated and user-specific content generation applications, enhancing fields such as digital art, content creation, and entertainment.
In conclusion, eDiff-I represents a notable contribution to text-to-image generation, offering insights into efficient use of model capacity and multi-stage specialization for handling complex synthesis tasks. This work not only sets a new benchmark in image generation but also opens new avenues for exploring ensemble models across various AI domains.