
Image-to-image translation for cross-domain disentanglement (1805.09730v3)

Published 24 May 2018 in cs.CV

Abstract: Deep image translation methods have recently shown excellent results, outputting high-quality images covering multiple modes of the data distribution. There has also been increased interest in disentangling the internal representations learned by deep methods to further improve their performance and achieve a finer control. In this paper, we bridge these two objectives and introduce the concept of cross-domain disentanglement. We aim to separate the internal representation into three parts. The shared part contains information for both domains. The exclusive parts, on the other hand, contain only factors of variation that are particular to each domain. We achieve this through bidirectional image translation based on Generative Adversarial Networks and cross-domain autoencoders, a novel network component. Our model offers multiple advantages. We can output diverse samples covering multiple modes of the distributions of both domains, perform domain-specific image transfer and interpolation, and cross-domain retrieval without the need of labeled data, only paired images. We compare our model to the state-of-the-art in multi-modal image translation and achieve better results for translation on challenging datasets as well as for cross-domain retrieval on realistic datasets.

Authors (3)
  1. Abel Gonzalez-Garcia (18 papers)
  2. Joost van de Weijer (133 papers)
  3. Yoshua Bengio (601 papers)
Citations (233)

Summary

Analysis of Cross-Domain Disentanglement in Image-to-Image Translation

The paper "Image-to-image translation for cross-domain disentanglement" explores the advanced integration of two key objectives in computer vision: the quality of deep image translation and the disentanglement of internal representations. The authors introduce a novel concept they term as "cross-domain disentanglement," where the internal representation in deep networks is partitioned into three distinct parts. Through this method, the shared part of the representation contains information common to both image domains, while the exclusive parts are responsible for domain-specific variations.

The authors achieve this disentanglement through bidirectional image translation built on Generative Adversarial Networks (GANs) and a novel network component, the cross-domain autoencoder. The resulting model produces diverse samples covering multiple modes of the data distribution in each domain and supports domain-specific image transfer, interpolation, and cross-domain retrieval without labeled data, requiring only paired images.

Key Methodological Contributions

The model consists of bidirectional image translation modules and cross-domain autoencoders. The translation mechanism enforces a disentangled structure on the learned representation: it is split into shared and exclusive components, where the shared component captures features common to both domains and each exclusive component captures factors of variation specific to one domain. The shared component is what enables translation and other cross-domain applications.
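To make the split concrete, below is a minimal PyTorch sketch of an encoder whose output is partitioned into a shared code and an exclusive code. The architecture, layer sizes, and code dimensions are illustrative assumptions, not the paper's exact design:

```python
import torch.nn as nn

class SplitEncoder(nn.Module):
    """Encodes an image into a latent code split into a shared part
    (common to both domains) and an exclusive part (domain-specific).
    All layer sizes here are illustrative, not the paper's architecture."""
    def __init__(self, shared_dim=128, exclusive_dim=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_shared = nn.Linear(128, shared_dim)
        self.to_exclusive = nn.Linear(128, exclusive_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.to_shared(h), self.to_exclusive(h)
```

During translation, only the shared code is passed to the other domain's decoder; the exclusive code is dropped and replaced with sampled noise, which is what makes the output multi-modal.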

  • Image Translation Modules: GAN-based encoder-decoder pairs translate in both directions. During translation, the domain-exclusive part of the representation is discarded and replaced with random noise, forcing the decoder to synthesize the missing domain-specific factors.
  • Cross-Domain Autoencoders: These components align the latent distributions and further enforce disentanglement. A Gradient Reversal Layer (GRL) adversarially enforces the separation between the exclusive and shared parts of the representation (see the GRL sketch after this list).
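The GRL itself is a standard construction from domain-adversarial training: identity in the forward pass, gradient negation in the backward pass. A minimal PyTorch sketch follows; the `lam` weighting factor and function names are assumptions for illustration, not code from the paper:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies incoming gradients by
    -lam on the backward pass (standard gradient reversal layer)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negated gradient for x; no gradient for the scalar lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: features passed through grad_reverse feed an auxiliary
# head trained normally; the reversed gradients push the encoder to
# strip out whatever information that head can exploit.
```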

Computational and Experimental Validation

The paper provides extensive experimental validation of the proposed approach:

  • Disentanglement on MNIST Variations: Experiments on variations of the MNIST dataset show that the disentangled representation enables diverse image generation while retaining control over shared and exclusive factors. Given input noise, the model generates varied samples that remain consistent with the input digit, demonstrating effective multi-modal image translation.
  • Cross-Domain Retrieval: The shared representation enables retrieval across a merged database of images from both domains, and nearest-neighbor search in the shared latent space quantitatively outperforms retrieval based on pixel distances (see the retrieval sketch after this list).
  • Many-to-Many Image Translation: On datasets of rendered 3D cars and chairs, the model achieves state-of-the-art results in preserving and translating domain-exclusive content across diverse orientations, outperforming baselines such as pix2pix and BicycleGAN on learned perceptual similarity metrics.
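As an illustration of the retrieval setup, the sketch below performs nearest-neighbor search in the shared latent space, assuming codes produced by an encoder like the `SplitEncoder` above. Cosine similarity is an assumption for illustration; the paper's exact distance measure may differ:

```python
import torch
import torch.nn.functional as F

def cross_domain_retrieve(query_shared, db_shared, k=5):
    """Return indices of the k database images whose shared codes are
    closest (by cosine similarity) to the query's shared code.
    query_shared: (D,) tensor; db_shared: (N, D) tensor."""
    q = F.normalize(query_shared.unsqueeze(0), dim=1)  # (1, D)
    db = F.normalize(db_shared, dim=1)                 # (N, D)
    sims = (db @ q.t()).squeeze(1)                     # (N,)
    return torch.topk(sims, k).indices
```

Because the shared code excludes domain-specific factors, a query from one domain can retrieve semantically matching images from either domain in the merged database.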

Theoretical and Practical Implications

The model's ability to separate and exploit shared versus exclusive domain factors gives finer control over image translation, with promising results in maintaining cross-domain consistency. In practice, this enables applications such as identity preservation in image translation, semantic attribute transfer, and content-specific image manipulation.

Theoretically, the work deepens our understanding of how disentangled representations provide control and flexibility in deep models that learn multimodal data distributions. It also paves the way for future exploration of how these architectures scale to more complex or varied datasets, domain-specific applications, and continuous domain adaptation.

Ultimately, this work bridges high-quality image translation and domain-specific information separation. Future research may extend these principles to more complex, real-world datasets and pursue further improvements in architectural efficiency and scalability.