Diverse Image-to-Image Translation via Disentangled Representations (1808.00948v1)

Published 2 Aug 2018 in cs.CV

Abstract: Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.

Authors (5)
  1. Hsin-Ying Lee (60 papers)
  2. Hung-Yu Tseng (31 papers)
  3. Jia-Bin Huang (106 papers)
  4. Maneesh Kumar Singh (5 papers)
  5. Ming-Hsuan Yang (377 papers)
Citations (870)

Summary

Diverse Image-to-Image Translation via Disentangled Representations

The paper "Diverse Image-to-Image Translation via Disentangled Representations" by Hsin-Ying Lee et al. addresses two primary challenges in the field of image-to-image (I2I) translation. These challenges are the lack of aligned training pairs and the generation of multiple possible outputs from a single input image. To tackle these challenges, the paper proposes a method for producing diverse outputs without requiring paired training images. This is achieved through a technique involving disentangled representations, which separates the latent space into a domain-invariant content space and a domain-specific attribute space.

Methodology

The approach detailed in the paper involves embedding images into two distinct spaces:

  1. Domain-Invariant Content Space: Captures the shared information across visual domains.
  2. Domain-Specific Attribute Space: Models variations within each domain.

The generator in the proposed model combines a content feature encoded from an input image with an attribute vector sampled from the attribute space to produce diverse outputs at test time. Diversity is enforced through a novel cross-cycle consistency loss that operates on these disentangled representations.
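
The following is a minimal PyTorch sketch of this encoder-generator structure: a shared content encoder, a per-domain attribute encoder, and a generator that decodes a content map conditioned on a sampled attribute vector. The specific layers, channel widths, and attribute dimensionality are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the disentangled encoding/decoding idea (assumed architecture).
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps an image to a domain-invariant content feature map."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 7, 1, 3), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.InstanceNorm2d(ch * 4), nn.ReLU(inplace=True),
        )
    def forward(self, x):            # (B, 3, H, W) -> content map (B, 256, H/4, W/4)
        return self.net(x)

class AttributeEncoder(nn.Module):
    """Maps an image to a low-dimensional, domain-specific attribute vector."""
    def __init__(self, in_ch=3, attr_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, attr_dim),
        )
    def forward(self, x):            # (B, 3, H, W) -> (B, attr_dim)
        return self.net(x)

class Generator(nn.Module):
    """Decodes a content map conditioned on an attribute vector."""
    def __init__(self, content_ch=256, attr_dim=8, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(content_ch + attr_dim, 128, 4, 2, 1), nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, 7, 1, 3), nn.Tanh(),
        )
    def forward(self, content, attr):
        # Broadcast the attribute vector spatially and concatenate with the content map.
        a = attr[:, :, None, None].expand(-1, -1, content.size(2), content.size(3))
        return self.net(torch.cat([content, a], dim=1))

# At test time, diverse outputs come from re-sampling the attribute vector.
E_c, G_b = ContentEncoder(), Generator()
x_a = torch.randn(1, 3, 64, 64)                                # placeholder domain-A image
content = E_c(x_a)
outputs = [G_b(content, torch.randn(1, 8)) for _ in range(5)]  # five diverse translations
```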

In conjunction with the cross-cycle consistency loss, the method also employs a content discriminator to ensure that the content features are unbiased toward any specific domain. This helps maintain the domain-invariant nature of the content space and thus effectively aids in disentangling content from the domain-specific attributes.
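
A hedged sketch of the cross-cycle consistency loss and the content-adversarial term follows, reusing the assumed modules from the previous sketch. The exact loss formulation, weighting, and discriminator architecture are simplifications, not the paper's full training objective.

```python
# Sketch of cross-cycle consistency and the content-adversarial term (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_cycle_loss(x_a, x_b, Ec, Ea_a, Ea_b, G_a, G_b):
    """Swap attributes across domains, translate, then swap back and reconstruct."""
    c_a, c_b = Ec(x_a), Ec(x_b)          # shared content encoder
    z_a, z_b = Ea_a(x_a), Ea_b(x_b)      # per-domain attribute vectors
    u = G_a(c_b, z_a)                    # B-content + A-attribute -> fake A
    v = G_b(c_a, z_b)                    # A-content + B-attribute -> fake B
    # The second swap should recover the original inputs.
    x_a_rec = G_a(Ec(v), Ea_a(u))
    x_b_rec = G_b(Ec(u), Ea_b(v))
    return F.l1_loss(x_a_rec, x_a) + F.l1_loss(x_b_rec, x_b)

class ContentDiscriminator(nn.Module):
    """Predicts which domain a content map came from; the content encoder is
    trained to fool it, pushing content features toward domain invariance."""
    def __init__(self, content_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(content_ch, 256, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 256, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 1),
        )
    def forward(self, c):
        return self.net(c)               # logit: "domain A" vs "domain B"

def content_adversarial_losses(c_a, c_b, D_c):
    ones = torch.ones(c_a.size(0), 1)
    zeros = torch.zeros(c_b.size(0), 1)
    # Discriminator learns to tell the two domains apart from content features alone.
    d_loss = F.binary_cross_entropy_with_logits(D_c(c_a.detach()), ones) + \
             F.binary_cross_entropy_with_logits(D_c(c_b.detach()), zeros)
    # The encoder tries to make both domains equally likely under the discriminator.
    half = torch.full((c_a.size(0), 1), 0.5)
    enc_loss = F.binary_cross_entropy_with_logits(D_c(c_a), half) + \
               F.binary_cross_entropy_with_logits(D_c(c_b), half)
    return d_loss, enc_loss
```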

Experimental Results

The authors evaluate the proposed approach through extensive qualitative and quantitative experiments. They validate the model's ability to generate diverse and realistic images across different I2I tasks, demonstrating translations such as:

  • Converting images from a summer to a winter scene
  • Transforming photographs into artistic styles like van Gogh and Monet
  • Enhancing photographs to higher resolutions

The paper also presents comparisons with existing methods such as CycleGAN, UNIT, and BicycleGAN, and the proposed method demonstrates superior diversity and realism in the generated images. In particular, the introduced content discriminator is shown to significantly improve the quality of the disentangled representations, as evidenced by comparative metrics.
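
The summary mentions a perceptual distance metric for diversity; one common instantiation is the average pairwise LPIPS distance between several translations of a single input, as in the sketch below. Treating LPIPS with an AlexNet backbone as the paper's exact protocol is an assumption.

```python
# Sketch of a diversity score: average pairwise perceptual distance between
# multiple translations of one input, using the `lpips` package (pip install lpips).
import itertools
import torch
import lpips

def diversity_score(samples):
    """samples: list of image tensors in [-1, 1], each of shape (1, 3, H, W)."""
    metric = lpips.LPIPS(net='alex')
    dists = [metric(a, b).item() for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)

# Usage with the assumed modules above: translate one input with several sampled
# attribute vectors, then measure how far apart the outputs are.
# score = diversity_score([G_b(content, torch.randn(1, 8)) for _ in range(10)])
```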

Practical Implications and Future Developments

The practical implications of this research are substantial. By enabling diverse I2I translations without the necessity for paired training data, the method finds potential applications in various domains:

  • Artistic Style Transfer: Assisting artists by automatically transforming digital photos into various artistic styles.
  • Seasonal Adjustments: Facilitating applications in digital content creation and gaming by altering scenes to reflect different seasons or times of day.
  • Data Augmentation: Enhancing domain adaptation tasks by generating diverse versions of existing datasets (a sketch follows this list).
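
As a rough illustration of the data-augmentation use case, the sketch below translates a labeled source-domain batch into target-domain style with several sampled attribute vectors and trains a placeholder classifier on both the original and translated images. The classifier, optimizer, and data are stand-ins, not the paper's MNIST-M or LineMod adaptation pipeline; the encoder and generator are the assumed modules from the earlier sketches.

```python
# Sketch: translated images as augmentation for a downstream task (assumed setup).
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # placeholder task model
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

x_src = torch.randn(8, 3, 64, 64)            # placeholder source-domain batch
y_src = torch.randint(0, 10, (8,))           # labels carry over to translated images

with torch.no_grad():                        # translations are fixed augmentation data
    content = E_c(x_src)                     # assumed shared content encoder from above
    translated = [G_b(content, torch.randn(8, 8)) for _ in range(3)]

for x in [x_src] + translated:               # train on original and translated batches
    optimizer.zero_grad()
    loss = criterion(classifier(x), y_src)
    loss.backward()
    optimizer.step()
```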

Further developments could extend this methodology to more complex multi-domain translations and incorporate additional factors such as temporal consistency for video I2I translations. The exploration of alternative architectures for the content and attribute encoders may also yield improvements in speed and performance.

Conclusion

In essence, the paper by Hsin-Ying Lee et al. introduces a novel framework for achieving diverse image-to-image translation without the need for paired training datasets. By disentangling the content and attribute spaces and introducing a cross-cycle consistency loss, the method presents a significant step forward in generating diverse and realistic images. The reported numerical evaluations and qualitative visualizations underscore the model's robust performance, paving the way for further advances in unsupervised image translation and domain adaptation.

This research opens new avenues for generating diverse visual content, which could transform fields ranging from digital art to practical applications in synthetic training data generation.