- The paper introduces a unified GAN model that integrates controllable structural guidance for flexible image-to-image translation across multiple tasks.
- It employs three novel loss functions (a color loss, a controllable-structure-guided cycle-consistency loss, and a self-content preserving loss) to improve image quality and reduce artifacts.
- The work also proposes the Fréchet ResNet Distance (FRD), a quantitative metric for generated images that aligns more closely with human visual perception.
An Overview of "Unified Generative Adversarial Networks for Controllable Image-to-Image Translation"
The paper by Tang, Liu, and Sebe introduces a unified framework for controllable image-to-image translation using Generative Adversarial Networks (GANs). The framework addresses several challenges of existing models, including scalability, efficiency, and adaptability across domains and tasks. At its core, the proposed method conditions generation on an input image together with a controllable structure, offering a flexible and powerful approach to image synthesis under varied structural constraints.
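To make the input arrangement concrete, the sketch below shows one plausible way a conditional image and a controllable structure (such as a keypoint or skeleton map) could be fused and fed to a single generator. The `Generator` class, channel counts, and concatenation scheme are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Hypothetical encoder-decoder generator; the real architecture differs."""
    def __init__(self, img_channels=3, struct_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + struct_channels, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, image, structure):
        # Appearance (conditional image) and structure guidance (e.g., a
        # keypoint, skeleton, or semantic map) are fused by channel-wise
        # concatenation before entering the network.
        return self.net(torch.cat([image, structure], dim=1))

# Example: translate a 3-channel image under a 1-channel skeleton map.
g = Generator()
fake = g(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
```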
Key Contributions
- Unified GAN Model: The authors present a single adversarial model that handles multiple tasks within one architecture, unlike traditional methods that typically require task-specific tuning and a separate network per task. The generator and discriminator both draw on appearance information from the input image and structure guidance from controllable structures such as object keypoints, human skeletons, and semantic maps.
- Innovative Loss Functions: The paper introduces three novel losses (a minimal sketch of all three follows this list):
- Color Loss: Designed to combat the 'channel pollution' issue, it computes the loss on each color channel independently, yielding cleaner outputs with fewer artifacts.
- Controllable Structure Guided Cycle-Consistency Loss: This enforces a bidirectional mapping between domains, preserving structural integrity across translations.
- Controllable Structure Guided Self-Content Preserving Loss: This maintains the overall content fidelity between the input and generated images, focusing on preserving color and layout.
- Fréchet ResNet Distance (FRD): Proposed as a new metric for evaluating the quality of generated images, FRD aligns more closely with human visual assessment by measuring semantic distance at the feature level; a sketch of the computation also appears below.
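The following is a minimal sketch of how the three losses could be computed, assuming L1 distances and a generator `G(image, structure)`; the per-channel split for the color loss is one plausible reading of the paper's description rather than a verified reimplementation.

```python
import torch
import torch.nn.functional as F

def color_loss(fake, real):
    # Compute the loss on each color channel independently to limit
    # "channel pollution" (a direct per-channel reading of the paper's idea).
    return sum(F.l1_loss(fake[:, c], real[:, c]) for c in range(real.size(1)))

def cycle_loss(G, x, s_x, s_y):
    # Controllable-structure-guided cycle consistency: translate x to the
    # target structure s_y, then back to its own structure s_x.
    reconstructed = G(G(x, s_y), s_x)
    return F.l1_loss(reconstructed, x)

def self_content_loss(G, x, s_x):
    # Self-content preservation: regenerating x under its own structure
    # should leave color and layout intact.
    return F.l1_loss(G(x, s_x), x)
```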
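The overview does not spell FRD out, but if it follows the standard Fréchet-distance recipe with ResNet features substituted for Inception features, the computation would look roughly like this; the choice of feature extractor (e.g., pooled activations from a pretrained ResNet-50) is an assumption.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_* are (N, D) arrays of features, e.g., pooled activations from a
    pretrained ResNet (the specific network and layer are assumptions here).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```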
Empirical Results
The efficacy of the proposed GAN model is demonstrated on challenging tasks, including hand gesture-to-gesture translation and cross-view image translation. The experiments show significant improvements over state-of-the-art techniques, measured by various metrics including PSNR, IS, FID, and the newly introduced FRD. The model proves capable of generating high-quality images with complex transformations, handling arbitrary poses, sizes, and configurations.
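For reference, PSNR is the one fully standard metric in that list; the snippet below computes it for images scaled to [0, 1].

```python
import numpy as np

def psnr(fake, real, max_val=1.0):
    # Peak signal-to-noise ratio in decibels; higher means closer to the
    # target. Assumes the two images differ (mse > 0).
    mse = np.mean((fake.astype(np.float64) - real.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```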
Practical and Theoretical Implications
The practical value of this research lies in its ability to generalize across applications with a single model framework. This versatility is particularly advantageous for tasks like facial expression synthesis or person image generation, where variation is immense and building a bespoke model for each transformation would be infeasible. Theoretically, the framework opens a discussion on unifying adversarial training and conditional structures into one cohesive architecture, suggesting pathways for further integration of contextual and structural data in deep learning models.
Future Directions
Future research may further enhance the model's scalability by integrating additional data types and forms of structural guidance beyond the current scope. Another direction is refining the FRD metric to capture more nuanced, application-specific quality assessments. Extending the model to operate effectively with fewer data dependencies could also yield contributions to unsupervised and semi-supervised learning.
In conclusion, this work offers a comprehensive approach to enhancing the flexibility and robustness of GANs for controllable image translation tasks, contributing significantly to the fields of computer vision and deep learning.