
StarGAN v2: Diverse Image Synthesis for Multiple Domains (1912.01865v2)

Published 4 Dec 2019 in cs.CV and cs.LG

Abstract: A good image-to-image translation model should learn a mapping between different visual domains while satisfying the following properties: 1) diversity of generated images and 2) scalability over multiple domains. Existing methods address either of the issues, having limited diversity or multiple models for all domains. We propose StarGAN v2, a single framework that tackles both and shows significantly improved results over the baselines. Experiments on CelebA-HQ and a new animal faces dataset (AFHQ) validate our superiority in terms of visual quality, diversity, and scalability. To better assess image-to-image translation models, we release AFHQ, high-quality animal faces with large inter- and intra-domain differences. The code, pretrained models, and dataset can be found at https://github.com/clovaai/stargan-v2.

Citations (1,607)

Summary

  • The paper introduces a dual-module framework that uses a mapping network and style encoder to generate domain-specific style codes for diverse image outputs.
  • It integrates AdaIN in the generator and a multi-task discriminator to enhance image quality and fidelity, as validated by FID and LPIPS metrics.
  • Comprehensive ablation studies and comparisons with methods like MUNIT and DRIT demonstrate the model's scalability and practical impact in digital artistry and facial recognition.

StarGAN v2: Diverse Image Synthesis for Multiple Domains

The paper presents StarGAN v2, which addresses the two central challenges in image-to-image translation: generating diverse images for a single input and supporting multiple target domains within one model. The authors propose a framework that integrates several components to improve both the scalability and the diversity of synthesized images.

Methodology

The core innovation in StarGAN v2 lies in two newly introduced modules: the mapping network and the style encoder. These modules generate and manage domain-specific style codes that guide the image generation process. The framework is composed of four main components (a minimal sketch of the style-code machinery follows the list):

  1. Generator: Translates an input image into an output image that reflects a given style code. The generator employs adaptive instance normalization (AdaIN) to inject the style code into its intermediate feature maps.
  2. Mapping Network: Generates style codes from a latent code for each domain. It consists of an MLP with multiple output branches, each corresponding to a particular domain, facilitating the efficient learning of domain-specific style representations.
  3. Style Encoder: Extracts style codes from reference images. Similar to the mapping network, it has multiple branches corresponding to different domains and assists the generator in synthesizing images that incorporate the style from reference images.
  4. Discriminator: A multi-task discriminator with several output branches, each classifying images into real or fake for a specific domain.
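
Below is a minimal PyTorch sketch of how this style-code machinery could be wired together: a mapping network with one output branch per domain, and an AdaIN layer that injects a style code into the generator's feature maps. All layer sizes and names (`latent_dim`, `style_dim`, `num_domains`) are illustrative assumptions, not the paper's exact architecture; the style encoder is analogous to the mapping network but runs a convolutional backbone over a reference image instead of an MLP over a latent code.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a latent code z to one style code per domain; the branch is
    selected by the target domain label y. Sizes are illustrative."""
    def __init__(self, latent_dim=16, style_dim=64, num_domains=3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # One output branch per domain.
        self.branches = nn.ModuleList(
            [nn.Linear(512, style_dim) for _ in range(num_domains)]
        )

    def forward(self, z, y):
        h = self.shared(z)                                   # (B, 512)
        out = torch.stack([b(h) for b in self.branches], 1)  # (B, D, style)
        return out[torch.arange(z.size(0)), y]               # (B, style)

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize content features, then
    rescale/shift them with parameters predicted from the style code."""
    def __init__(self, style_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.fc = nn.Linear(style_dim, num_features * 2)

    def forward(self, x, s):
        gamma, beta = self.fc(s).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

# Example: draw style codes for per-sample target domains and inject
# them into a batch of generator feature maps.
f = MappingNetwork()
s = f(torch.randn(4, 16), torch.tensor([1, 1, 0, 2]))
adain = AdaIN(style_dim=64, num_features=256)
styled = adain(torch.randn(4, 256, 32, 32), s)
```

Selecting the output branch by domain label is what keeps a single network scalable to many domains, which is the property StarGAN v2 targets.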

Training Objectives

The training of StarGAN v2 combines several key objectives, written out after the list below:

  • Adversarial Loss (Eq. 2): Ensures that the generated images are indistinguishable from real images within the target domain.
  • Style Reconstruction Loss (Eq. 3): Encourages the generator to utilize the style code during image generation, fostering diverse outputs for each domain.
  • Diversity Regularization: Explicitly fosters the generation of diverse images by penalizing the network for producing similar outputs given different style codes.
  • Cycle Consistency Loss: Ensures that the generated images preserve domain-invariant characteristics (e.g., pose) of the input images.
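
Reconstructed from the descriptions above, these objectives take roughly the following form, where $G$ is the generator, $D_y$ the discriminator branch for domain $y$, $F$ the mapping network, $E$ the style encoder, and $\tilde{s} = F_{\tilde{y}}(z)$ a target style code (the notation is a reconstruction and may differ in detail from the paper's exact formulation):

$$\mathcal{L}_{adv} = \mathbb{E}_{x,y}\big[\log D_y(x)\big] + \mathbb{E}_{x,\tilde{y},z}\big[\log\big(1 - D_{\tilde{y}}(G(x,\tilde{s}))\big)\big]$$

$$\mathcal{L}_{sty} = \mathbb{E}_{x,\tilde{y},z}\big[\lVert \tilde{s} - E_{\tilde{y}}(G(x,\tilde{s})) \rVert_1\big]$$

$$\mathcal{L}_{ds} = \mathbb{E}_{x,\tilde{y},z_1,z_2}\big[\lVert G(x,\tilde{s}_1) - G(x,\tilde{s}_2) \rVert_1\big]$$

$$\mathcal{L}_{cyc} = \mathbb{E}_{x,y,\tilde{y},z}\big[\lVert x - G(G(x,\tilde{s}),\hat{s}) \rVert_1\big], \qquad \hat{s} = E_y(x)$$

$$\min_{G,F,E}\,\max_{D}\;\; \mathcal{L}_{adv} + \lambda_{sty}\,\mathcal{L}_{sty} - \lambda_{ds}\,\mathcal{L}_{ds} + \lambda_{cyc}\,\mathcal{L}_{cyc}$$

The diversity term $\mathcal{L}_{ds}$ enters with a negative sign because the generator maximizes it: pushing outputs produced from two different style codes $\tilde{s}_1, \tilde{s}_2$ apart is what forces multimodal outputs.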

Experimental Setup and Results

Extensive experiments were conducted on the CelebA-HQ and the newly introduced AFHQ datasets. The paper evaluates visual quality and diversity through the Fréchet Inception Distance (FID) and Learned Perceptual Image Patch Similarity (LPIPS) metrics.
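
As a rough sketch of how these two metrics are typically computed, the snippet below uses the public `lpips` and `torchmetrics` packages as stand-ins; the random tensors, batch sizes, and pairwise loop are illustrative, not the paper's evaluation protocol (which uses thousands of images per domain):

```python
import itertools
import torch
import lpips                                                  # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance   # pip install "torchmetrics[image]"

# Illustrative tensors standing in for real and generated batches,
# float in [0, 1], shape (N, 3, H, W). Tiny N keeps the sketch fast;
# real evaluations need far more samples for a stable FID.
real = torch.rand(16, 3, 256, 256)
fake = torch.rand(16, 3, 256, 256)

# FID: lower is better (distribution-level realism of the outputs).
fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# LPIPS diversity: mean pairwise perceptual distance between outputs
# generated for the same input; higher means more diverse styles.
# lpips expects inputs scaled to [-1, 1].
loss_fn = lpips.LPIPS(net='alex')
outs = fake * 2 - 1
dists = [loss_fn(outs[i:i+1], outs[j:j+1]).item()
         for i, j in itertools.combinations(range(4), 2)]
print("LPIPS diversity:", sum(dists) / len(dists))
```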

Component Analysis

Each component's contribution was quantified through ablation studies. Replacing the ACGAN discriminator with a multi-task discriminator and incorporating AdaIN notably improved the generator's ability to transform global structures. The adoption of domain-specific style codes via the mapping network and style encoder enhanced diversity significantly. The finalized StarGAN v2 configuration outperformed previous models in terms of FID and LPIPS.

Comparative Evaluation

StarGAN v2 achieved superior outcomes over several leading methods (e.g., MUNIT, DRIT, MSGAN) in both latent-guided and reference-guided synthesis scenarios. The method excelled in rendering high-quality images while maintaining diverse styles, evidenced by the lowest FID scores and highest LPIPS values across both datasets. Human evaluation on Amazon Mechanical Turk supported these quantitative findings, showing a strong preference for StarGAN v2 in terms of visual quality and accurate style reflection.

Theoretical and Practical Implications

StarGAN v2 makes significant strides in image-to-image translation by combining multimodal output with scalability across domains. Disentangling style from content through domain-specific style codes is what enables it to generate diverse yet coherent images. Practically, the framework's ability to manipulate images based on style codes or reference images has potential applications ranging from digital artistry to facial recognition systems.

Future Directions

Potential future research avenues include further improving scalability and diversity, exploring additional domains and higher resolutions, and investigating more complex and varied datasets. Integrating human feedback could also help steer style generation toward user preferences.

In conclusion, StarGAN v2 represents a substantial advancement in the field of image synthesis, offering effective means to tackle the intricate problem of high-quality, diverse image generation across multiple domains.
