Fader Networks: Manipulating Images by Sliding Attributes (1706.00409v2)

Published 1 Jun 2017 in cs.CV

Abstract: This paper introduces a new encoder-decoder architecture that is trained to reconstruct images by disentangling the salient information of the image and the values of attributes directly in the latent space. As a result, after training, our model can generate different realistic versions of an input image by varying the attribute values. By using continuous attribute values, we can choose how much a specific attribute is perceivable in the generated image. This property could allow for applications where users can modify an image using sliding knobs, like faders on a mixing console, to change the facial expression of a portrait, or to update the color of some objects. Compared to the state-of-the-art which mostly relies on training adversarial networks in pixel space by altering attribute values at train time, our approach results in much simpler training schemes and nicely scales to multiple attributes. We present evidence that our model can significantly change the perceived value of the attributes while preserving the naturalness of images.

Citations (536)

Summary

  • The paper introduces Fader Networks, a novel framework that disentangles image features from attribute values to enable precise image manipulation.
  • It employs latent space adversarial training to enforce attribute invariance, reducing complexity compared to traditional pixel-space methods.
  • Experimental results on CelebA and Oxford-102 demonstrate superior image realism and controlled attribute modification without compromising identity.

Fader Networks: Manipulating Images by Sliding Attributes

The paper presents a novel encoder-decoder architecture intended for image manipulation through attribute variation, an area of substantial interest within the domain of conditional generative models and automatic image editing. This approach, titled Fader Networks, utilizes an innovative process of disentangling salient image features from attribute values in the latent space, thereby facilitating controlled manipulation of image attributes while maintaining image realism.

Architecture Overview

The core of Fader Networks lies in an encoder-decoder framework. Given an input image x alongside its attributes y, the encoder transforms x into a latent representation z, while the decoder reconstructs the image from (z, y). During inference, users adjust the attributes, now treated as continuous variables, to modulate their presence in the output image.
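To make the conditioning concrete, here is a minimal PyTorch sketch of the two halves. It is an illustration, not the paper's exact configuration: the layer widths (`dim`), the two-block depth, and the `n_attrs` parameter are placeholder assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input image x to a latent representation z."""
    def __init__(self, n_channels=3, dim=16):
        super().__init__()
        self.net = nn.Sequential(
            # Each block halves spatial resolution: Conv-BatchNorm-ReLU.
            nn.Conv2d(n_channels, dim, 4, stride=2, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(dim * 2), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs the image from (z, y): the attribute vector y is
    broadcast over the spatial grid and concatenated to z channel-wise."""
    def __init__(self, n_attrs, n_channels=3, dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(dim * 2 + n_attrs, dim, 4, stride=2, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(dim, n_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, z, y):
        # y: (batch, n_attrs) -> (batch, n_attrs, H, W) so every spatial
        # position of the latent map sees the requested attribute values.
        y_map = y.view(y.size(0), -1, 1, 1).expand(-1, -1, z.size(2), z.size(3))
        return self.net(torch.cat([z, y_map], dim=1))
```

Sliding an attribute at inference time then amounts to decoding the same z with a different y, e.g. `dec(z, y_new)`.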

To achieve this, the architecture enforces attribute invariance in the latent space, akin to domain-adversarial training paradigms. A classifier is trained to predict attributes based on z while the encoder-decoder strives to confuse the classifier, thereby ensuring z becomes attribute-agnostic. This motivates the decoder to rely on explicit attribute values for reconstruction purposes, enhancing control over the attribute-modulated output.
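In code, this two-player setup reduces to two loss terms over the same latent code. The sketch below follows that description but is an illustration rather than the paper's exact implementation: `LatentDiscriminator` is a hypothetical module, and pushing the discriminator toward the flipped attributes `1 - y` is one common way to realize the "confusion" objective for binary attributes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDiscriminator(nn.Module):
    """Predicts attribute logits from the latent code z."""
    def __init__(self, latent_channels, n_attrs):
        super().__init__()
        self.conv = nn.Conv2d(latent_channels, latent_channels,
                              4, stride=2, padding=1)
        self.dropout = nn.Dropout(0.3)  # dropout stabilizes adversarial training
        self.fc = nn.LazyLinear(n_attrs)

    def forward(self, z):
        h = self.dropout(F.relu(self.conv(z)))
        return self.fc(h.flatten(1))    # one logit per attribute

def discriminator_loss(disc, z, y):
    """Train the discriminator to recover the true attributes y from z.
    z is detached so this loss never updates the encoder."""
    return F.binary_cross_entropy_with_logits(disc(z.detach()), y)

def adversarial_loss(disc, z, y):
    """Train the encoder to fool the discriminator: push its predictions
    toward the flipped attributes 1 - y, making z attribute-agnostic."""
    return F.binary_cross_entropy_with_logits(disc(z), 1.0 - y)
```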

Training Dynamics and Implementation

The architecture diverges from established methods that generally rely on pixel-space adversarial networks by embedding adversarial training in the latent space. This reduces training complexity and scales efficiently across multiple attributes, as demonstrated by the experiments on the CelebA dataset. The model effectively alters perceived attributes without compromising the naturalness of the images.
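Concretely, a single training step can alternate a discriminator update with an encoder-decoder update whose loss combines reconstruction and the adversarial term. The schematic sketch below reuses the hypothetical modules above; the optimizer wiring and the MSE reconstruction loss are assumptions for illustration.

```python
import torch.nn.functional as F

def train_step(x, y, enc, dec, disc, opt_ae, opt_disc, lambda_adv):
    # 1) Update the latent discriminator on the current latent codes.
    z = enc(x)
    opt_disc.zero_grad()
    d_loss = discriminator_loss(disc, z, y)
    d_loss.backward()
    opt_disc.step()

    # 2) Update the encoder-decoder: reconstruct x from (z, y) while
    #    making z uninformative about y for the (non-updated) discriminator.
    opt_ae.zero_grad()
    z = enc(x)
    recon = F.mse_loss(dec(z, y), x)
    adv = adversarial_loss(disc, z, y)
    (recon + lambda_adv * adv).backward()
    opt_ae.step()
    return recon.item(), d_loss.item()
```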

For implementation, the encoder uses a series of Convolution-BatchNorm-ReLU layers, while the decoder benefits from symmetric deconvolutional layers augmented with attribute codes. Dropout is strategically applied to the discriminator to stabilize training, and a gradual schedule is used for the adversarial objective's weight.
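The gradual schedule can be as simple as a linear ramp of the adversarial weight, so that reconstruction dominates early training before invariance is enforced. The target value and ramp length below are illustrative placeholders, not the paper's reported constants.

```python
def adv_weight(step, max_weight=1e-4, ramp_steps=500_000):
    """Linearly ramp the adversarial coefficient from 0 to max_weight."""
    return max_weight * min(step / ramp_steps, 1.0)

# Used inside the training loop, e.g.:
#   lambda_adv = adv_weight(global_step)
```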

Experimental Evaluation

The empirical evaluation of Fader Networks was carried out on the CelebA and Oxford-102 datasets. The method showcases high-quality image manipulations, altering attributes like age, gender, and the presence of accessories, all while preserving the identity and natural appearance of individuals. Comparisons with existing models such as IcGAN highlight Fader Networks' superior performance, both in reconstruction fidelity and in human perceptual evaluation of attribute swaps such as opening the mouth or adding glasses.

Implications and Future Directions

Fader Networks introduce substantial improvements in the domain of conditional image generation, particularly in applications requiring subtle yet controlled attribute manipulation. The attribute-invariant latent space establishes a robust framework for scalable and versatile image editing applications, reducing reliance on adversarial training at the pixel level, which often intensifies computational demands and complicates training.

The practical implications extend beyond image domains, suggesting potential adaptability to speech and text applications where differentiability is more challenging. While this paper focuses primarily on visual transformations, the approach could inspire future research into cross-domain generative tasks, aiming to incorporate similar disentanglement and control for various data types.

Ultimately, Fader Networks stand as a promising addition to the repertoire of generative modeling, offering clear paths toward scalable and controllable manipulation of complex data structures.