Toward Multimodal Image-to-Image Translation (1711.11586v4)

Published 30 Nov 2017 in cs.CV, cs.GR, and stat.ML

Abstract: Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a *distribution* of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.

Authors (7)
  1. Jun-Yan Zhu (80 papers)
  2. Richard Zhang (61 papers)
  3. Deepak Pathak (91 papers)
  4. Trevor Darrell (324 papers)
  5. Oliver Wang (55 papers)
  6. Eli Shechtman (102 papers)
  7. Alexei A. Efros (100 papers)
Citations (1,318)

Summary

  • The paper proposes the BicycleGAN framework, which addresses mode collapse in order to generate diverse yet realistic images from a single input.
  • It combines the cVAE-GAN and cLR-GAN objectives to enforce a bijective mapping between the latent code and the output modes, preserving fidelity to the input.
  • Experiments show higher LPIPS diversity scores and higher human-judged realism across tasks such as edges to photos and maps to satellite imagery.

Toward Multimodal Image-to-Image Translation: An Overview

In the paper "Toward Multimodal Image-to-Image Translation," the authors, Jun-Yan Zhu et al., tackle the problem of generating diverse and realistic images from a single input image within a conditional generative modeling framework. This research focuses on modeling a distribution of possible outputs rather than producing a single deterministic result.

Problem Statement and Objectives

Traditional image-to-image translation techniques have made significant advancements in various applications such as inpainting, colorization, and generating photorealistic images from sketches. However, most existing techniques focus on generating a single output for each input. This is often not reflective of the natural variability present in many image-to-image translation tasks. The authors address this limitation by proposing a method that generates multiple plausible outputs from a single input image, thereby modeling the underlying distribution of possible outcomes. The main objectives of this work are to produce outputs that are perceptually realistic and diverse while remaining faithful to the input image.
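
As a minimal sketch of this idea (assuming a trained generator `G` with a pix2pix-style interface taking an input image and a latent code; the names and shapes here are illustrative, not the authors' code), test-time diversity comes from resampling the latent code while holding the input fixed:

```python
import torch

def sample_outputs(G, A, num_samples=5, z_dim=8):
    """Draw several plausible outputs for one input by resampling z ~ N(0, I).

    G: trained generator mapping (A, z) -> output image (hypothetical interface).
    A: input image batch, NCHW. The latent code is low-dimensional (e.g. 8-D).
    """
    outputs = []
    with torch.no_grad():
        for _ in range(num_samples):
            z = torch.randn(A.size(0), z_dim)  # fresh latent code per sample
            outputs.append(G(A, z))            # same input, different output mode
    return outputs
```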

Methodology

The paper builds upon the pix2pix framework and proposes a novel solution to the problem of mode collapse, wherein the generator produces a limited variety of outputs. The authors explore multiple methods to enforce a bijective mapping between a latent code and the output, ensuring diverse and realistic results. The primary techniques investigated are:

  1. cVAE-GAN (Conditional Variational Autoencoder GAN): This method introduces variational inference into the conditional GAN framework. The ground-truth output image is encoded into the latent space, giving the generator a "peek" at the desired output during training. The encoded latent distribution is regularized with a KL-divergence term toward a standard normal distribution, so that the latent code can be sampled stochastically at inference time.
  2. cLR-GAN (Conditional Latent Regressor GAN): In contrast to cVAE-GAN, this technique starts from a latent vector sampled independently of the output image. The generator produces a realistic-looking output, and an encoder then attempts to recover the latent vector from that output, in the spirit of a latent regressor model.
  3. BicycleGAN: This hybrid model combines the cVAE-GAN and cLR-GAN objectives, forming cycles in both directions (from the ground-truth output to the latent space and back, and from a sampled latent code to the output and back, hence the name). Jointly optimizing both objectives yields results that are both diverse and visually convincing; the combined objective and a minimal training sketch are given below.
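
Schematically, the objectives of the two cycles and their combination can be written as follows, where A is the input image, B the ground-truth output, G the generator, E the encoder, D the discriminator, and λ, λ_latent, λ_KL are weighting hyperparameters:

```latex
% cVAE-GAN cycle (B -> z -> \hat{B}):
\mathcal{L}_{\text{cVAE-GAN}} =
    \mathcal{L}^{\text{VAE}}_{\text{GAN}}(G, D, E)
    + \lambda \, \mathbb{E}\big[\lVert B - G(A, E(B)) \rVert_1\big]
    + \lambda_{\text{KL}} \, \mathbb{E}\big[\mathcal{D}_{\text{KL}}(E(B) \,\Vert\, \mathcal{N}(0, I))\big]

% cLR-GAN cycle (z -> \hat{B} -> \hat{z}):
\mathcal{L}_{\text{cLR-GAN}} =
    \mathcal{L}_{\text{GAN}}(G, D)
    + \lambda_{\text{latent}} \, \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\lVert z - E(G(A, z)) \rVert_1\big]

% BicycleGAN: jointly optimize both cycles
G^{*}, E^{*} = \arg\min_{G, E} \max_{D} \;
    \mathcal{L}_{\text{cVAE-GAN}} + \mathcal{L}_{\text{cLR-GAN}}
```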
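The following is a minimal PyTorch sketch of one generator/encoder update combining both cycles. It assumes user-supplied networks `G` (generator), `E` (encoder returning a mean and log-variance), and `D` (discriminator scoring input/output pairs); the loss weights match the paper's reported defaults, but the adversarial term and gradient routing are simplified relative to the authors' implementation (which uses an LSGAN objective and restricts which networks receive which gradients):

```python
import torch
import torch.nn.functional as F

def bicyclegan_step(G, E, D, A, B, z_dim=8, lam=10.0, lam_latent=0.5, lam_kl=0.01):
    """One combined cVAE-GAN + cLR-GAN loss for the generator and encoder.

    A: input images, B: ground-truth outputs (NCHW tensors in [-1, 1]).
    """
    # --- cVAE-GAN cycle: B -> z -> B_hat ---
    mu, logvar = E(B)                                         # encode the real output
    z_enc = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
    B_vae = G(A, z_enc)
    loss_l1 = F.l1_loss(B_vae, B)                             # reconstruct the real output
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss_gan_vae = -D(A, B_vae).mean()                        # simplified fooling term

    # --- cLR-GAN cycle: z -> B_hat -> z_hat ---
    z = torch.randn(A.size(0), z_dim, device=A.device)        # sample from the prior
    B_lr = G(A, z)
    mu_hat, _ = E(B_lr)
    loss_latent = F.l1_loss(mu_hat, z)                        # recover the sampled code
    loss_gan_lr = -D(A, B_lr).mean()

    # --- combined BicycleGAN objective ---
    return (loss_gan_vae + loss_gan_lr
            + lam * loss_l1
            + lam_latent * loss_latent
            + lam_kl * loss_kl)
```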

Experimental Setup

The authors evaluated their approach on multiple image-to-image translation tasks, including edges to photos, night to day, and maps to satellite imagery. Diversity was measured with LPIPS (Learned Perceptual Image Patch Similarity) distances between generated samples, and perceptual quality with a human-judgment realism study. The BicycleGAN model outperformed baselines such as pix2pix+noise, cAE-GAN, and the individual cVAE-GAN and cLR-GAN approaches.
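
As an illustrative sketch of the diversity measurement (using the open-source `lpips` package; the all-pairs scheme below is an assumption, not necessarily the paper's exact protocol), diversity can be scored as the average LPIPS distance between outputs generated for the same input:

```python
import itertools
import torch
import lpips  # pip install lpips

def diversity_score(samples):
    """Average pairwise LPIPS distance over generated outputs for one input.

    samples: list of NCHW image tensors scaled to [-1, 1].
    """
    metric = lpips.LPIPS(net='alex')  # AlexNet-backed perceptual distance
    with torch.no_grad():
        dists = [metric(x, y).mean().item()
                 for x, y in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)    # higher = more diverse outputs
```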

Results and Implications

The empirical results highlight several key findings:

  • The BicycleGAN model consistently produced more diverse results compared to baseline methods.
  • It maintained a higher realism score, as measured by human evaluators, indicating that the generated images were perceptually convincing.
  • The combination of objectives in BicycleGAN effectively mitigated the mode collapse issue, providing a better approximation of the true conditional distribution.

Future Directions

The implications of this research extend to applications in computer vision where generating multiple plausible results for a given input is essential. Future work could explore more sophisticated latent space manipulation, enabling controllable and interpretable transformations between input and output images, as well as alternative network architectures and training paradigms to further improve the quality and diversity of the generated images.

Conclusion

The paper "Toward Multimodal Image-to-Image Translation" presents a significant contribution to the field of image synthesis by addressing the critical issue of generating diverse outputs. Through the innovative BicycleGAN framework, the authors demonstrate a robust method for producing varied and realistic images, thereby setting the stage for future advancements in multimodal image generation tasks.
