- The paper introduces an image-to-image translation method that leverages pretrained diffusion models to simplify downstream adaptation.
- The methodology employs a two-stage training strategy with task-specific encoder adaptation and hierarchical generation to enhance image quality.
- Empirical results demonstrate significantly improved FID scores, achieving superior realism and fidelity in complex translation tasks.
Overview
This paper introduces a new approach to image-to-image translation that adapts a pretrained diffusion model, in contrast to conventional techniques that design task-specific architectures and train each model from scratch. By treating image-to-image translation as a downstream task, the method leverages a generative prior learned from massive image datasets. The authors posit that a strong pretrained synthesis network simplifies downstream training, since all that remains is to map the user input to the latent representation the pretrained model understands.
Architecture and Training Strategy
The model at the heart of this work builds on diffusion models, which have demonstrated exceptional capabilities in synthesizing a wide variety of images. Specifically, the paper uses GLIDE, a text-conditioned diffusion model pretrained on a large and varied dataset to generate high-quality images. To adapt GLIDE to different image-to-image translation tasks, a task-specific encoder is trained to project translation inputs, such as segmentation masks, into the latent space of the pretrained model. Training follows a two-stage strategy: first, only the encoder is updated while the pretrained decoder stays frozen; then the entire network is finetuned end to end, as sketched below.
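The following PyTorch sketch illustrates what such a two-stage schedule could look like. All names here (`TaskEncoder`, the stand-in `decoder`, the learning rates) are illustrative assumptions rather than the paper's actual modules or hyperparameters; in the real system the decoder would be the pretrained GLIDE network.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Toy encoder mapping a task input (e.g. a segmentation mask)
    to the conditioning vector expected by the pretrained decoder."""
    def __init__(self, in_channels: int, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

encoder = TaskEncoder(in_channels=20)      # e.g. 20 semantic classes
decoder = nn.Linear(512, 3 * 64 * 64)      # stand-in for the pretrained diffusion decoder

# Stage 1: adapt only the encoder; the pretrained decoder stays frozen.
for p in decoder.parameters():
    p.requires_grad_(False)
opt_stage1 = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# Stage 2: unfreeze everything and finetune end to end, typically at a lower rate.
for p in decoder.parameters():
    p.requires_grad_(True)
opt_stage2 = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5
)
```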
Improving Generation Quality
To enhance generation quality, the authors introduce several techniques. First, they employ a hierarchical generation strategy with two stages: coarse image generation followed by super-resolution. Second, they tackle oversmoothing by applying adversarial training during the denoising step. Finally, to preserve detailed textures they introduce normalized classifier-free guidance, which corrects the mean and variance shifts that guidance can introduce and that would otherwise degrade image quality.
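As a rough illustration of the normalized guidance idea, the snippet below applies standard classifier-free guidance and then rescales the guided prediction so its per-sample mean and standard deviation match those of the conditional prediction. The exact normalization in the paper (for example, per-channel statistics) may differ; treat this as a sketch of the concept rather than the authors' implementation.

```python
import torch

def normalized_cfg(eps_cond: torch.Tensor,
                   eps_uncond: torch.Tensor,
                   guidance_scale: float) -> torch.Tensor:
    # Standard classifier-free guidance.
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Large guidance scales shift the mean and inflate the variance of `eps`.
    # Rescale per sample so the guided prediction keeps the statistics of the
    # conditional prediction (an assumed, simplified form of normalized guidance).
    dims = tuple(range(1, eps.dim()))
    mu = eps.mean(dim=dims, keepdim=True)
    sigma = eps.std(dim=dims, keepdim=True)
    mu_c = eps_cond.mean(dim=dims, keepdim=True)
    sigma_c = eps_cond.std(dim=dims, keepdim=True)
    return (eps - mu) / (sigma + 1e-8) * sigma_c + mu_c
```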
Empirical Results
Extensive experiments across various datasets demonstrate that this approach delivers images with strong realism and faithfulness to the input. The proposed method, abbreviated PITI, significantly outperforms prior techniques and also shows promise in few-shot settings. The paper validates these findings with qualitative visual results as well as quantitative measures such as the Fréchet Inception Distance (FID), where it consistently achieves lower (better) scores than existing methods and than the same model trained without pretraining.
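For reference, FID compares Inception feature statistics of generated and real images. One common way to compute it is with the `torchmetrics` implementation, shown in the minimal sketch below; the random tensors are placeholders, not the paper's data.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Placeholder batches of real and generated images in [0, 1]; in practice these
# would come from the test set and the trained model.
real_images = torch.rand(16, 3, 256, 256)
fake_images = torch.rand(16, 3, 256, 256)

fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute():.2f}")  # lower is better
```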
Conclusion
The paper concludes that pretraining a diffusion model on a large, diverse dataset provides a powerful foundation for image-to-image translation. This pretrained knowledge yields high-quality synthesis, especially for scenes with complex structures and diverse object interactions, and the approach is general enough to handle various input modalities without task-specific customization. Notably, the paper acknowledges that the method may struggle to align generated images precisely with intricate input details, a challenge left for future work.