
Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing (2104.14754v2)

Published 30 Apr 2021 in cs.CV and cs.LG

Abstract: Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. Although manipulating the latent vectors controls the synthesized outputs, editing real images with GANs suffers from i) time-consuming optimization for projecting real images to the latent vectors, ii) or inaccurate embedding through an encoder. We propose StyleMapGAN: the intermediate latent space has spatial dimensions, and a spatially variant modulation replaces AdaIN. It makes the embedding through an encoder more accurate than existing optimization-based methods while maintaining the properties of GANs. Experimental results demonstrate that our method significantly outperforms state-of-the-art models in various image manipulation tasks such as local editing and image interpolation. Last but not least, conventional editing methods on GANs are still valid on our StyleMapGAN. Source code is available at https://github.com/naver-ai/StyleMapGAN.

Authors (5)
  1. Hyunsu Kim (27 papers)
  2. Yunjey Choi (15 papers)
  3. Junho Kim (57 papers)
  4. Sungjoo Yoo (25 papers)
  5. Youngjung Uh (32 papers)
Citations (140)

Summary

Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing

The paper introduces StyleMapGAN, a novel approach leveraging spatial dimensions in GAN latent spaces for real-time image editing. Unlike traditional GANs that struggle with real-time image projection due to optimization constraints or inaccurate embeddings, this work proposes an innovative style representation called "stylemap." This spatially structured stylemap allows precise, fast, and spatially aware image manipulations.

Methodology

The authors replace the vector-based intermediate latent representation of traditional GANs with a tensor that has explicit spatial dimensions. This modification lets the latent space encode local semantics and improves the fidelity of encoder-based inversion of real images. Convolutional layers resize the stylemap to match each spatial resolution of the synthesis network, permitting fine adjustments to the style. A spatially varying modulation then derives per-location affine parameters from the resized stylemap and applies them to the feature maps during synthesis, replacing AdaIN.
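The snippet below is a minimal sketch of such a spatially variant modulation layer in PyTorch. The layer and parameter names (SpatialModulation, to_gamma, to_beta) and the exact resizing scheme are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of spatially variant modulation: per-pixel affine parameters
# derived from a stylemap replace the spatially uniform AdaIN modulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialModulation(nn.Module):
    """Modulates feature maps with per-pixel scale/shift computed from a stylemap."""
    def __init__(self, stylemap_ch, feature_ch):
        super().__init__()
        # Transform the resized stylemap with a small conv before predicting parameters.
        self.resize = nn.Conv2d(stylemap_ch, stylemap_ch, kernel_size=3, padding=1)
        # 1x1 convs predict spatially varying scale (gamma) and shift (beta).
        self.to_gamma = nn.Conv2d(stylemap_ch, feature_ch, kernel_size=1)
        self.to_beta = nn.Conv2d(stylemap_ch, feature_ch, kernel_size=1)

    def forward(self, feat, stylemap):
        # Bring the stylemap to the spatial size of the current feature map.
        style = F.interpolate(stylemap, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
        style = self.resize(style)
        gamma = self.to_gamma(style)
        beta = self.to_beta(style)
        # Normalize features, then apply the per-pixel affine transform.
        feat = F.instance_norm(feat)
        return gamma * feat + beta

# Example: an 8x8 stylemap with 64 channels modulating a 32x32 feature map.
mod = SpatialModulation(stylemap_ch=64, feature_ch=256)
feat = torch.randn(1, 256, 32, 32)
stylemap = torch.randn(1, 64, 8, 8)
out = mod(feat, stylemap)  # shape: (1, 256, 32, 32)
```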

Training and Loss Functions

The training scheme combines multiple losses, including adversarial, domain-guided reconstruction, and perceptual losses, so that generated images remain realistic and semantically consistent with real images. The encoder and generator are trained jointly rather than sequentially, which the authors show yields better reconstruction and generation quality.
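A hedged sketch of how such a joint objective might be assembled is shown below. The module handles (mapping, synthesis, encoder, discriminator, lpips_fn) and the loss weights are placeholders, not the paper's exact formulation or hyperparameters.

```python
# Illustrative combination of adversarial, reconstruction, and perceptual losses
# for joint encoder/generator training (weights and handles are assumptions).
import torch
import torch.nn.functional as F

def training_losses(mapping, synthesis, encoder, discriminator, lpips_fn,
                    real_images, z, w_rec=1.0, w_percep=1.0):
    # Adversarial term: generate from a random latent via the mapping network
    # (z -> stylemap) and the synthesis network (stylemap -> image).
    fake = synthesis(mapping(z))
    adv_loss = F.softplus(-discriminator(fake)).mean()

    # Reconstruction terms: encode a real image to a stylemap, regenerate it,
    # and penalize pixel-wise and perceptual (LPIPS-style) differences.
    recon = synthesis(encoder(real_images))
    rec_loss = F.mse_loss(recon, real_images)
    percep_loss = lpips_fn(recon, real_images).mean()

    # Encoder and generator are optimized jointly on the combined objective.
    return adv_loss + w_rec * rec_loss + w_percep * percep_loss
```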

Experimental Results

Resolution Impact: Increasing the stylemap resolution improves reconstruction accuracy. An 8×8 resolution emerges as the best trade-off: it preserves identity while blending edited regions seamlessly, whereas higher resolutions make locally edited regions easier to detect.

Comparison with Baselines: StyleMapGAN achieves superior reconstruction accuracy among real-time (encoder-based) methods. MSE and LPIPS metrics confirm the high fidelity of projections, and low scores on linearly interpolated (lerp) images indicate robust interpolation. Runtime is over 100 times faster than optimization-based projection methods.

Local Editing: The stylemap's spatial dimensions enable transplanting regions even between unaligned images. The proposed method consistently produces locally edited images that are harder to detect as manipulated and closer to the intended source and reference content, as quantified by AP, MSE_src, and MSE_ref. A sketch of how such mask-based editing can work in stylemap space follows.
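As an illustration of why a spatial stylemap makes local editing straightforward, here is a hedged sketch of mask-based blending in stylemap space. The function signature and tensor shapes are assumptions, not the API of the released code.

```python
# Illustrative mask-based local editing: blend two encoded stylemaps spatially,
# then regenerate the image from the mixed stylemap.
import torch
import torch.nn.functional as F

def local_edit(generator, encoder, original_img, reference_img, mask):
    """Replace the masked region of the original with content from the reference.

    mask: binary tensor of shape (1, 1, H, W) marking the region to replace.
    """
    w_orig = encoder(original_img)    # stylemap, e.g. shape (1, C, 8, 8)
    w_ref = encoder(reference_img)

    # Downsample the pixel-space mask to the stylemap resolution.
    m = F.interpolate(mask.float(), size=w_orig.shape[-2:], mode="nearest")

    # Spatially blend the two stylemaps: reference inside the mask, original outside.
    w_mixed = m * w_ref + (1.0 - m) * w_orig
    return generator(w_mixed)
```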

Implications

The explicit spatial dimensions in the latent space significantly enhance GAN-based image manipulation, marking a step forward in real-time editing. By giving the latent space the same spatial structure as image semantics, the approach offers practical solutions for applications requiring interactive and localized image editing.

Future Directions

The paper suggests applying the spatial latent representation to conditional GANs or VAEs, which could broaden the applicability of StyleMapGAN in scenarios demanding more flexible semantic adjustments. Further work could address current limitations, such as handling large pose differences and mismatched sizes of target semantic regions between source and reference images.

In conclusion, StyleMapGAN introduces a meaningful advancement in GANs for real-time image editing, providing a viable pathway for improved semantic control and operational efficiency. The integration of spatial awareness in latent spaces could set the stage for future developments in creative and practical digital content generation.
