
Improving Shape Deformation in Unsupervised Image-to-Image Translation (1808.04325v2)

Published 13 Aug 2018 in cs.CV

Abstract: Unsupervised image-to-image translation techniques are able to map local texture between two domains, but they are typically unsuccessful when the domains require larger shape change. Inspired by semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to train a more context-aware generator. This is coupled with a multi-scale perceptual loss that is better able to represent error in the underlying shape of objects. We demonstrate that this design is more capable of representing shape deformation in a challenging toy dataset, plus in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.

Citations (75)

Summary

  • The paper introduces GANimorph, which enhances shape deformation in unsupervised image-to-image translation by leveraging dilated discriminators and a multi-scale perceptual loss.
  • It employs an augmented generator with residual blocks and skip connections to capture both detailed textures and holistic structural changes.
  • Experimental results on tasks like cat-to-dog and face-to-anime conversions demonstrate its superior performance over traditional methods.

Improving Shape Deformation in Unsupervised Image-to-Image Translation

The paper "Improving Shape Deformation in Unsupervised Image-to-Image Translation" addresses a prevalent limitation in existing unsupervised image-to-image translation models, such as DiscoGAN and CycleGAN. These models are adept at texture transfer between domains but falter when it comes to tasks requiring substantial shape deformation. This work introduces an enhanced framework, termed GANimorph, which focuses on overcoming these shortcomings through architectural and training advancements.

Problem Outline and Motivation

Unsupervised image-to-image translation provides a framework for transferring features between image domains without explicit pairings or labels. While previous works like DiscoGAN and CycleGAN have demonstrated success in transferring local textures, they struggle with tasks that require global shape transformations, such as converting an image of a cat into a dog. This limitation stems from the restricted spatial awareness of these models' discriminators, which operate primarily on local image patches and therefore give the generator little feedback about global, context-dependent structure.

Proposed Method: GANimorph

GANimorph introduces several key innovations to address these challenges:

  1. Dilated Discriminator Architecture: By employing dilated convolutions, the framework reframes the discriminator's task from patch-level judgments of image authenticity to a more global task akin to semantic segmentation. Each discriminative decision can then draw on information from across the entire image, providing the generator with richer, contextually informed feedback (see the first sketch after this list).
  2. Generator with Enhanced Capacity: The generator's architecture is augmented with residual blocks at multiple layers and skip connections. This design boosts the network's ability to manage both high- and low-frequency features, which are crucial for capturing detailed spatial variation across scales (second sketch below).
  3. Multi-scale Perceptual Loss: To better navigate the trade-off in cyclic reconstruction, where preserving too much pixel-level information can preclude necessary shape changes, the authors incorporate a multi-scale structural similarity (MS-SSIM) loss. This perceptual loss emphasizes area-based rather than pixel-based similarity, permitting global shape change while still penalizing broken structure (third sketch below).
  4. Improved Training Stability: The paper also introduces a feature-matching loss to promote stability and quality in both the generator and discriminator, alongside a novel method of scheduled loss normalization (SLN) that dynamically rescales loss terms during training (fourth sketch below).
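
To make the discriminator design concrete, here is a minimal PyTorch sketch of a dilated, segmentation-style discriminator. The layer counts, channel widths, and dilation rates are illustrative assumptions rather than the paper's exact configuration; the point is that the network emits a spatial map of real/fake decisions, each backed by a receptive field that the dilations expand to cover most of the image.

```python
import torch
import torch.nn as nn

class DilatedDiscriminator(nn.Module):
    """Segmentation-style discriminator: outputs a per-location
    real/fake map instead of a single scalar. Dilated convolutions
    grow the receptive field so each decision sees global context.
    Hyperparameters are illustrative, not the paper's exact ones."""

    def __init__(self, in_channels: int = 3, base: int = 64):
        super().__init__()
        layers = [
            # Two strided layers first to reduce resolution cheaply.
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        ]
        ch = base * 2
        # Dilated 3x3 convolutions: resolution stays fixed while the
        # receptive field grows rapidly with the dilation rate.
        for dilation in (1, 2, 4, 8):
            layers += [nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
                       nn.LeakyReLU(0.2, inplace=True)]
        # 1x1 head: one logit per spatial location ("is this region real?").
        layers += [nn.Conv2d(ch, 1, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # shape (N, 1, H/4, W/4): a map, not a scalar
```

Because the output is a map rather than a scalar, the generator receives a gradient at every spatial location, which is what makes the feedback context-aware rather than patch-local.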
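The generator can be sketched in the same spirit: an encoder-decoder with residual blocks at multiple depths and a U-Net-style skip connection. All depths and widths below are assumptions for illustration, not the paper's reported architecture; the skip connection carries high-frequency texture around the bottleneck while the residual bottleneck handles low-frequency, shape-level change.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Standard residual block: the skip path preserves the input
    while the body learns a correction."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class SkipResGenerator(nn.Module):
    """Encoder-decoder with residual blocks at several depths and one
    U-Net-style skip connection. Depths/widths are illustrative."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 4, 2, 1),
                                  nn.ReLU(inplace=True), ResBlock(ch))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 4, 2, 1),
                                  nn.ReLU(inplace=True), ResBlock(ch * 2))
        self.mid = nn.Sequential(*[ResBlock(ch * 2) for _ in range(3)])
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1),
                                  nn.ReLU(inplace=True))
        # Skip connection: concatenate encoder features so
        # high-frequency detail survives the bottleneck.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1),
                                  nn.ReLU(inplace=True))
        self.out = nn.Sequential(nn.Conv2d(ch, 3, 3, 1, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d2 = self.dec2(self.mid(e2))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return self.out(d1)
```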
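For the multi-scale perceptual loss, one common way to realize an MS-SSIM reconstruction term is to blend it with an L1 term, as in the sketch below. It relies on the third-party pytorch_msssim package, and the 0.84 mixing weight follows Zhao et al.'s loss-mixing recipe for image restoration; both are assumptions here, not settings taken from this paper.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim  # pip install pytorch-msssim

def cycle_loss(x: torch.Tensor, x_rec: torch.Tensor,
               alpha: float = 0.84) -> torch.Tensor:
    """Multi-scale perceptual cycle-reconstruction loss (sketch).

    MS-SSIM compares luminance, contrast, and structure statistics
    over several scales, so it tolerates pixel-level drift while
    still penalizing broken global structure; the L1 term anchors
    color. Inputs are expected in [0, 1]. The alpha = 0.84 weight is
    an assumption, not the paper's reported setting.
    """
    ssim_term = 1.0 - ms_ssim(x, x_rec, data_range=1.0)
    l1_term = F.l1_loss(x, x_rec)
    return alpha * ssim_term + (1.0 - alpha) * l1_term
```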
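Finally, scheduled loss normalization can be read as rescaling each loss term by a magnitude statistic that is refreshed only on a schedule, so effective weights adapt over training without oscillating at every step. The class below is one plausible interpretation of that idea, not the paper's exact formulation.

```python
class ScheduledLossNormalizer:
    """Sketch of scheduled loss normalization (SLN): keep a decayed
    running estimate of a loss term's magnitude, but only refresh the
    divisor on a schedule so the objective the optimizer sees stays
    stable between updates. An interpretation, not the paper's exact
    formulation."""

    def __init__(self, period: int = 100, decay: float = 0.99):
        self.period = period  # steps between normalizer refreshes
        self.decay = decay    # EMA decay for the magnitude estimate
        self.ema = None       # running estimate of |loss|
        self.scale = 1.0      # frozen divisor used between refreshes
        self.step = 0

    def __call__(self, loss):
        mag = float(loss.detach().abs())
        self.ema = mag if self.ema is None else (
            self.decay * self.ema + (1.0 - self.decay) * mag)
        if self.step % self.period == 0:
            self.scale = max(self.ema, 1e-8)  # refresh on schedule
        self.step += 1
        return loss / self.scale  # rescales the gradient magnitude only
```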

Results

The effectiveness of GANimorph is demonstrated through experiments on both synthetic and real datasets. Quantitatively, the model outperforms existing techniques on tasks involving non-trivial shape deformation. For example, on a novel toy dataset in which translation requires deforming 2D dots and polygons, GANimorph exploits spatial information far more effectively than DiscoGAN and CycleGAN.

On real-world image-to-image translation tasks, such as transforming human faces to anime or doll faces, and translating between cats and dogs, GANimorph outshines baseline models. The framework achieves superior visual coherence in shape transformations while preserving pertinent details like color and pose.

Implications and Future Directions

This work opens several avenues for future inquiry and application. Practically, it can dramatically enhance applications in virtual character design, automated art, and even contribute to augmented reality transformations. On a theoretical level, the introduction of dilated discriminators and multi-scale perception suggests potential improvements across other domains of GAN architecture, such as in high-resolution image synthesis or context-aware video generation.

Future exploration could extend these concepts to multi-domain translation, hybridizing with techniques like StarGAN, or further optimizing trade-offs in perceptual detail versus shape alteration. Moreover, evaluating the scalability and adaptability of GANimorph on richer datasets or in real-time applications could provide deeper insights into its robustness and generalizability.

In conclusion, the paper presents a structured advance in unsupervised image-to-image translation: by strengthening both the discriminative and generative operations, GANimorph captures complex shape deformations that traditional models, focused predominantly on texture detail, could not.
