Multimodal Unsupervised Image-to-Image Translation (1804.04732v2)

Published 12 Apr 2018 in cs.CV, cs.LG, and stat.ML

Abstract: Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT

Citations (2,393)

Summary

  • The paper introduces a novel framework that separates image content and style, enabling multimodal translation outputs.
  • It employs GANs with style-augmented cycle consistency to ensure both high-quality and diverse image generation.
  • Experimental results validate that MUNIT outperforms previous methods, achieving superior realism and diversity in translations.

Multimodal Unsupervised Image-to-Image Translation

The paper presents a novel framework for Multimodal Unsupervised Image-to-Image Translation (MUNIT). It addresses the limitation in existing methods that often generate deterministic one-to-one mappings, thus failing to capture the multimodal nature of the conditional distributions required for effective cross-domain image translations. The researchers propose a sophisticated model that decomposes images into domain-invariant content codes and domain-specific style codes, allowing for the generation of diverse outputs through the recombination of content and random style codes from the target domain.

Key Contributions and Methodology

The core innovation of this work lies in its ability to effectively separate content and style in images. The framework consists of two main components for each domain: an encoder and a decoder. The encoder splits an image into a content code and a style code, while the decoder generates images by recombining a content code with a style code. Specifically:

  1. Content Encoder: Contains several strided convolutional layers followed by residual blocks.
  2. Style Encoder: Uses strided convolutional layers, global average pooling, and fully connected layers.
  3. Decoder: Utilizes Adaptive Instance Normalization (AdaIN) to modulate the content code with affine parameters dynamically generated from the style code, followed by upsampling and convolutional layers that produce the output image (a schematic sketch of all three components follows this list).
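
To make the component descriptions concrete, here is a minimal PyTorch sketch of the three components. The channel widths, layer counts, 8-dimensional style code, and the single AdaIN application are illustrative assumptions; the released implementation applies AdaIN inside the residual blocks of the decoder rather than once as shown here.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Plain residual block used inside the content encoder and decoder."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, 1, 1))

    def forward(self, x):
        return x + self.block(x)


class ContentEncoder(nn.Module):
    """Strided convolutions followed by residual blocks -> spatial content code."""
    def __init__(self, in_ch=3, dim=64, n_down=2, n_res=4):
        super().__init__()
        layers = [nn.Conv2d(in_ch, dim, 7, 1, 3), nn.ReLU(inplace=True)]
        for _ in range(n_down):                      # strided downsampling convs
            layers += [nn.Conv2d(dim, dim * 2, 4, 2, 1), nn.ReLU(inplace=True)]
            dim *= 2
        layers += [ResBlock(dim) for _ in range(n_res)]
        self.model, self.out_dim = nn.Sequential(*layers), dim

    def forward(self, x):
        return self.model(x)


class StyleEncoder(nn.Module):
    """Strided convs, global average pooling, and an FC layer -> style vector."""
    def __init__(self, in_ch=3, dim=64, n_down=4, style_dim=8):
        super().__init__()
        layers = [nn.Conv2d(in_ch, dim, 7, 1, 3), nn.ReLU(inplace=True)]
        for _ in range(n_down):
            layers += [nn.Conv2d(dim, dim * 2, 4, 2, 1), nn.ReLU(inplace=True)]
            dim *= 2
        layers += [nn.AdaptiveAvgPool2d(1)]          # global average pooling
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Linear(dim, style_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))      # (B, style_dim)


class Decoder(nn.Module):
    """AdaIN-modulated content code, then upsampling convs back to an image."""
    def __init__(self, dim=256, style_dim=8, out_ch=3):
        super().__init__()
        self.mlp = nn.Linear(style_dim, 2 * dim)     # style -> AdaIN gamma, beta
        self.res = ResBlock(dim)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, dim // 2, 5, 1, 2),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(dim // 2, out_ch, 7, 1, 3),
            nn.Tanh())

    @staticmethod
    def adain(c, gamma, beta):
        # Normalize the content features, then apply style-derived affine params.
        mean = c.mean(dim=(2, 3), keepdim=True)
        std = c.std(dim=(2, 3), keepdim=True) + 1e-5
        return gamma[..., None, None] * (c - mean) / std + beta[..., None, None]

    def forward(self, content, style):
        gamma, beta = self.mlp(style).chunk(2, dim=1)
        return self.up(self.adain(self.res(content), gamma, beta))
```

Instantiating one (ContentEncoder, StyleEncoder, Decoder) triple per domain and feeding the decoder a style code sampled from a standard Gaussian reproduces the translation path described above.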

To ensure the translation outputs are indistinguishable from real images in the target domain, the framework employs Generative Adversarial Networks (GANs) alongside a bidirectional reconstruction loss which includes image reconstruction and latent reconstruction (for both content and style codes). This effectively trains the encoders and decoders to be inverses of each other while preserving the multimodal nature of image translations.
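
A hedged sketch of this objective for one translation direction (domain 1 to domain 2) is shown below; the full training objective symmetrizes it over both directions and adds the discriminator updates. The encoder/decoder/discriminator handles, the non-saturating GAN term, and the loss weights are illustrative placeholders rather than the authors' exact settings.

```python
import torch
import torch.nn.functional as F


def munit_losses(x1, enc_c1, enc_s1, enc_c2, enc_s2, dec1, dec2, disc2,
                 style_dim=8, lam_img=10.0, lam_c=1.0, lam_s=1.0):
    # Encode the domain-1 image into its content and style codes.
    c1, s1 = enc_c1(x1), enc_s1(x1)

    # Within-domain image reconstruction: decode with the image's own codes.
    loss_img = F.l1_loss(dec1(c1, s1), x1)

    # Translate to domain 2 using a style code drawn from the Gaussian prior.
    s2 = torch.randn(x1.size(0), style_dim, device=x1.device)
    x12 = dec2(c1, s2)

    # Latent reconstruction: re-encode the translation and recover both codes.
    loss_content = F.l1_loss(enc_c2(x12), c1)
    loss_style = F.l1_loss(enc_s2(x12), s2)

    # Adversarial term pushing x12 toward real domain-2 images (a standard
    # non-saturating GAN loss stands in for the paper's adversarial objective).
    logits = disc2(x12)
    loss_gan = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    return loss_gan + lam_img * loss_img + lam_c * loss_content + lam_s * loss_style
```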

Theoretical Insights

Several theoretical properties of the MUNIT framework are established:

  1. Latent Distribution Matching: The framework ensures that the encoded style distributions match predefined Gaussian priors and that the content space becomes domain-invariant.
  2. Joint Distribution Matching: At optimality, the two translation directions induce the same joint distribution over corresponding image pairs from the two domains.
  3. Style-Augmented Cycle Consistency: A weaker form of cycle consistency, termed style-augmented cycle consistency, holds: an image translated to the target domain and then translated back using its original style code should reconstruct the original image (see the formula after this list).
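
In a transcription of the paper's notation (so treat the exact symbols as approximate), with $E_i^c, E_i^s$ the content and style encoders and $G_i$ the decoders, style-augmented cycle consistency can be written as

$$
x_1 \approx G_1\!\left( E_2^c\big( G_2(E_1^c(x_1),\, s_2) \big),\; E_1^s(x_1) \right), \qquad s_2 \sim \mathcal{N}(0, I),
$$

i.e., translating $x_1$ to domain 2 with a randomly sampled style and translating the result back with the original style code should recover the input, without imposing the stronger deterministic cycle consistency of prior unsupervised methods.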

These propositions are formally established in the paper and provide theoretical justification for the design of the bidirectional reconstruction objective.

Experimental Results

The experimental validation encompasses various datasets, including:

  • Edges to Shoes/Handbags: Demonstrates the generation of varied outputs capturing different styles for a given edge map.
  • Animal Image Translation: Translations between house cats, big cats, and dogs, showcasing the ability to generate different animal species from an input image.
  • Street Scene Images: Translations between synthetic and real-world street scenes and between summer and winter landscapes.
  • High-Resolution Yosemite Dataset: High-definition translations between summer and winter scenes in Yosemite.

Quantitative evaluations using human preference scores, LPIPS (Learned Perceptual Image Patch Similarity) distance, the Conditional Inception Score (CIS), and the Inception Score (IS) show that MUNIT outperforms existing unsupervised baselines in both realism and diversity. Notably, the quality and diversity of its outputs are comparable to those of the supervised BicycleGAN.
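
For concreteness, the LPIPS-based diversity score is typically computed as the mean pairwise LPIPS distance between several translations of the same input generated with different random style codes. The sketch below uses the `lpips` Python package and a hypothetical `translate(x, s)` wrapper around a trained content encoder and target-domain decoder; neither the sample count nor the backbone choice is taken from the paper.

```python
import itertools
import torch
import lpips

# LPIPS expects image tensors of shape (N, 3, H, W) scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net='alex')


def lpips_diversity(x, translate, style_dim=8, n_samples=10):
    """Mean pairwise LPIPS distance over translations of one input image."""
    with torch.no_grad():
        styles = torch.randn(n_samples, style_dim, device=x.device)
        outputs = [translate(x, s.unsqueeze(0)) for s in styles]
        dists = [lpips_fn(a, b).item()
                 for a, b in itertools.combinations(outputs, 2)]
    return sum(dists) / len(dists)
```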

Practical and Theoretical Implications

The MUNIT framework notably enhances the capability of unsupervised image-to-image translation models by incorporating multimodality, thus enabling the generation of diverse and realistic images. This holds significant implications for various applications in computer vision, from creative industries to data augmentation for machine learning models. The separation of content and style also opens avenues for user-guided image translation, where users can control the style of the output via example images.

Future developments in AI could leverage this foundational work to create even more sophisticated models, potentially extending to applications in video-to-video translation and beyond. The theoretical insights provided also contribute to the broader understanding of GANs and their application in generative tasks, promoting further innovations in this rapidly evolving field.

In summary, this paper makes a substantial contribution to the domain of unsupervised image-to-image translation by addressing a critical limitation and providing a robust, theoretically grounded framework capable of generating high-quality, diverse outputs.
