- The paper introduces a novel I2I framework that disentangles content and attribute spaces for unpaired image translation.
- It employs cross-cycle consistency along with adversarial and mode-seeking losses to enhance output diversity and realism.
- Experimental results on multiple datasets demonstrate superior FID and LPIPS scores compared to state-of-the-art methods.
DRIT++: Diverse Image-to-Image Translation via Disentangled Representations
Introduction
The paper "DRIT++: Diverse Image-to-Image Translation via Disentangled Representations" proposes an advanced framework for image-to-image (I2I) translation. Traditional I2I methods face two prominent challenges: the lack of aligned training pairs between visual domains and the inherent multimodality wherein a single input can map to multiple plausible outputs. To address these issues, this work leverages disentangled representations, dividing input images into a domain-invariant content space and domain-specific attribute space. This framework allows for unpaired image translation with diverse outputs, distinguishing itself from previous models that require paired datasets or fail to produce multimodal outputs effectively.
Methodology
The proposed method employs several key components:
- Disentangled Representations: Images are embedded into a content space ($\mathcal{C}$) capturing domain-invariant features and an attribute space ($\mathcal{A}$) encapsulating domain-specific characteristics.
- Content Adversarial Loss: By applying a content adversarial loss, the model ensures that content encoders from different domains map inputs to a shared, consistent content space.
- Cross-Cycle Consistency Loss: To handle unpaired datasets, a cross-cycle consistency loss is introduced. It enforces that after two stages of translation (crossing domains with swapped attribute codes and then translating back), the reconstructed images closely match the original inputs.
- Additional Loss Functions: The model further integrates a domain adversarial loss, a self-reconstruction loss, a latent regression loss, and mode-seeking regularization to stabilize training and encourage output diversity; a minimal sketch of the cross-cycle and mode-seeking terms is given after this list.
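To make the cross-cycle consistency and mode-seeking ideas concrete, the sketch below implements them with toy PyTorch modules. The `ContentEncoder`, `AttributeEncoder`, and `Generator` classes, their layer sizes, and all hyperparameters are illustrative stand-ins rather than the authors' architecture; only the structure of the two-stage translation and the distance-ratio regularizer follows the paper's description.

```python
# Illustrative sketch of DRIT-style disentangled translation with a
# cross-cycle consistency loss and mode-seeking regularization.
# All module names, layer sizes, and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEncoder(nn.Module):
    """E^c: image -> domain-invariant content feature map."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(ch, ch, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class AttributeEncoder(nn.Module):
    """E^a: image -> low-dimensional domain-specific attribute vector."""
    def __init__(self, dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, dim)
    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Generator(nn.Module):
    """G: (content, attribute) -> image; the attribute modulates the content map."""
    def __init__(self, ch=64, dim=8):
        super().__init__()
        self.fc = nn.Linear(dim, ch)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh())
    def forward(self, c, a):
        return self.net(c + self.fc(a)[:, :, None, None])

def cross_cycle_loss(x, y, Ec_x, Ec_y, Ea_x, Ea_y, Gx, Gy):
    """Two-stage translation with swapped attribute codes; the second swap
    should recover the original unpaired inputs x (domain X) and y (domain Y)."""
    cx, cy = Ec_x(x), Ec_y(y)        # content codes in the shared space
    ax, ay = Ea_x(x), Ea_y(y)        # domain-specific attribute codes
    u = Gx(cy, ax)                   # stage 1: y's content rendered in domain X
    v = Gy(cx, ay)                   # stage 1: x's content rendered in domain Y
    x_rec = Gx(Ec_y(v), Ea_x(u))     # stage 2: swap the codes back
    y_rec = Gy(Ec_x(u), Ea_y(v))
    return F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)

def mode_seeking_reg(G, c, a1, a2, eps=1e-5):
    """Penalize mode collapse: two different attribute codes should yield
    proportionally different images (large image/latent distance ratio)."""
    d_img = torch.mean(torch.abs(G(c, a1) - G(c, a2)))
    d_lat = torch.mean(torch.abs(a1 - a2))
    return 1.0 / (d_img / (d_lat + eps) + eps)

if __name__ == "__main__":
    x, y = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
    Ec_x, Ec_y = ContentEncoder(), ContentEncoder()
    Ea_x, Ea_y = AttributeEncoder(), AttributeEncoder()
    Gx, Gy = Generator(), Generator()
    print(cross_cycle_loss(x, y, Ec_x, Ec_y, Ea_x, Ea_y, Gx, Gy).item())
    a1, a2 = torch.randn(2, 8), torch.randn(2, 8)
    print(mode_seeking_reg(Gy, Ec_x(x), a1, a2).item())
```

In the full model, these two terms are combined with the content adversarial, domain adversarial, self-reconstruction, and latent regression losses listed above.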
Experimental Results
The model was evaluated extensively on several datasets, including the Yosemite (summer and winter scenes), pets (cats and dogs), and various artwork datasets. The evaluation metrics include:
- Fréchet Inception Distance (FID): Measures the realism of generated images.
- Learned Perceptual Image Patch Similarity (LPIPS): Assesses the diversity among outputs generated from the same input (see the sketch after this list).
- Jensen-Shannon Divergence (JSD) and Number of Statistically-Different Bins (NDB): Quantify the similarity between the distributions of generated and real images.
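As a concrete illustration of how an LPIPS diversity score is typically computed, the snippet below averages pairwise perceptual distances over several translations of the same input. It assumes the `lpips` PyPI package, and the generator producing the outputs is left abstract (the `translate` call in the usage comment is a hypothetical placeholder).

```python
# Sketch of an LPIPS-based diversity score: translate one input with several
# randomly sampled attribute codes and average pairwise perceptual distances.
# Assumes the `lpips` package (pip install lpips); images must be in [-1, 1].
import itertools
import torch
import lpips

def lpips_diversity(outputs):
    """outputs: list of generated images, each a (1, 3, H, W) tensor in [-1, 1]."""
    metric = lpips.LPIPS(net="alex")          # perceptual distance in AlexNet feature space
    with torch.no_grad():
        dists = [metric(a, b).item()
                 for a, b in itertools.combinations(outputs, 2)]
    return sum(dists) / len(dists)            # higher = more diverse translations

# Hypothetical usage with a trained generator:
# outputs = [translate(x, torch.randn(1, 8)) for _ in range(10)]
# print(lpips_diversity(outputs))
```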
Qualitative Evaluation
The qualitative results highlight the capacity of the model to generate diverse and realistic images across various translation tasks. The introduction of mode-seeking regularization significantly enhances the diversity of generated images while maintaining their visual quality. The attribute transfer experiments demonstrate that the disentangled content and attribute representations allow for both inter-domain and intra-domain translations, offering flexibility and robustness.
Quantitative Evaluation
The quantitative evaluations reveal that DRIT++ surpasses previous methods on both realism and diversity metrics. For instance, on the Yosemite and pets datasets, DRIT++ achieved superior FID and LPIPS scores compared to other state-of-the-art I2I models. A user study also corroborated the enhanced realism of the images generated by DRIT++.
Practical Implications and Future Work
The successful implementation of DRIT++ has significant implications for various applications in computer vision and graphics, including but not limited to artistic style transfer, domain adaptation, and photorealistic image synthesis. The disentangled representation framework shows promise in improving the quality and diversity of generated images without relying on paired training data, thus broadening the applicability of I2I translation.
Future developments could focus on optimizing the model for higher-resolution images, addressing memory limitations in training, and further exploring the potential of multi-domain translation. Additionally, adapting the model to handle vastly different domain characteristics more effectively is another promising research direction.
Conclusion
"DRIT++: Diverse Image-to-Image Translation via Disentangled Representations" advances the field of I2I translation by introducing a robust framework that handles unpaired training data and produces diverse outputs. Its methodological innovations, supported by qualitative and quantitative results, confirm its efficacy and potential for various practical applications. This work lays a strong foundation for further exploration and enhancement in the domain of diverse image-to-image translation.