- The paper introduces an image-to-image translation method that leverages pretrained diffusion models to simplify downstream adaptation.
- The methodology employs a two-stage training strategy with task-specific encoder adaptation and hierarchical generation to enhance image quality.
- Empirical results demonstrate significantly improved FID scores, achieving superior realism and fidelity in complex translation tasks.
Overview
This paper introduces a new approach to image-to-image translation that adapts a pretrained diffusion model, in contrast to conventional techniques that design task-specific architectures and train each model from scratch. By treating image-to-image translation as a downstream task, the method leverages a generative prior learned from massive image datasets. The authors posit that a strong pretrained synthesis network simplifies downstream training, since all that remains is to map the user input to the latent representation the pretrained model understands.
Architecture and Training Strategy
The model at the heart of this work builds on diffusion models, which have demonstrated exceptional capabilities in synthesizing a wide variety of images. Specifically, the paper uses GLIDE, a text-conditioned diffusion model pretrained on a large and varied dataset to generate high-quality images. To adapt GLIDE to different image-to-image translation tasks, a task-specific encoder is trained to project translation inputs, such as segmentation masks, into the latent space of the pretrained model. Training follows a two-stage strategy: first, only the encoder is updated while the pretrained decoder stays frozen; then the entire network is finetuned end to end, as sketched below.
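The following PyTorch sketch illustrates what such a two-stage schedule could look like. All names here (`TaskEncoder`, the stand-in `decoder`, the learning rates) are illustrative assumptions rather than the paper's actual modules or hyperparameters; in the real system the decoder would be the pretrained GLIDE network.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Toy encoder mapping a task input (e.g. a segmentation mask)
    to the conditioning vector expected by the pretrained decoder."""
    def __init__(self, in_channels: int, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

encoder = TaskEncoder(in_channels=20)      # e.g. 20 semantic classes
decoder = nn.Linear(512, 3 * 64 * 64)      # stand-in for the pretrained diffusion decoder

# Stage 1: adapt only the encoder; the pretrained decoder stays frozen.
for p in decoder.parameters():
    p.requires_grad_(False)
opt_stage1 = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# Stage 2: unfreeze everything and finetune end to end, typically at a lower rate.
for p in decoder.parameters():
    p.requires_grad_(True)
opt_stage2 = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5
)
```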
Improving Generation Quality
To enhance generation quality, the authors introduce several techniques. First, they employ a hierarchical generation strategy with two stages: coarse image generation followed by super-resolution. Second, they tackle oversmoothing by applying adversarial training during the denoising step. Finally, to preserve detailed textures they introduce normalized classifier-free guidance, which corrects the mean and variance shifts that guidance can introduce and that would otherwise degrade image quality.
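As a rough illustration of the normalized guidance idea, the snippet below applies standard classifier-free guidance and then rescales the guided prediction so its per-sample mean and standard deviation match those of the conditional prediction. The exact normalization in the paper (for example, per-channel statistics) may differ; treat this as a sketch of the concept rather than the authors' implementation.

```python
import torch

def normalized_cfg(eps_cond: torch.Tensor,
                   eps_uncond: torch.Tensor,
                   guidance_scale: float) -> torch.Tensor:
    # Standard classifier-free guidance.
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Large guidance scales shift the mean and inflate the variance of `eps`.
    # Rescale per sample so the guided prediction keeps the statistics of the
    # conditional prediction (an assumed, simplified form of normalized guidance).
    dims = tuple(range(1, eps.dim()))
    mu = eps.mean(dim=dims, keepdim=True)
    sigma = eps.std(dim=dims, keepdim=True)
    mu_c = eps_cond.mean(dim=dims, keepdim=True)
    sigma_c = eps_cond.std(dim=dims, keepdim=True)
    return (eps - mu) / (sigma + 1e-8) * sigma_c + mu_c
```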
Empirical Results
Extensive experiments across various datasets demonstrate that this approach delivers images with strong realism and faithfulness to the input. The proposed method, abbreviated PITI, significantly outperforms prior techniques and also shows promise in few-shot settings. The paper validates these findings with qualitative visual results as well as quantitative measures such as the Fréchet Inception Distance (FID), where it consistently achieves lower (better) scores than existing methods and than the same model trained without pretraining.
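For reference, FID compares Inception feature statistics of generated and real images. One common way to compute it is with the `torchmetrics` implementation, shown in the minimal sketch below; the random tensors are placeholders, not the paper's data.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Placeholder batches of real and generated images in [0, 1]; in practice these
# would come from the test set and the trained model.
real_images = torch.rand(16, 3, 256, 256)
fake_images = torch.rand(16, 3, 256, 256)

fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute():.2f}")  # lower is better
```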
Conclusion
The paper concludes that pretraining a diffusion model on a large, diverse dataset provides a powerful foundation for image-to-image translation. This pretrained knowledge yields high-quality synthesis, especially for scenes with complex structures and diverse object interactions, and the approach is general enough to handle various input modalities without task-specific customization. Notably, the paper acknowledges that the method may struggle to align generated images precisely with intricate input details, a challenge left for future work.