MaskGAN: Towards Diverse and Interactive Facial Image Manipulation (1907.11922v2)

Published 27 Jul 2019 in cs.CV, cs.GR, and cs.LG

Abstract: Facial image manipulation has achieved great progress in recent years. However, previous methods either operate on a predefined set of face attributes or leave users little freedom to interactively manipulate images. To overcome these drawbacks, we propose a novel framework termed MaskGAN, enabling diverse and interactive face manipulation. Our key insight is that semantic masks serve as a suitable intermediate representation for flexible face manipulation with fidelity preservation. MaskGAN has two main components: 1) Dense Mapping Network (DMN) and 2) Editing Behavior Simulated Training (EBST). Specifically, DMN learns style mapping between a free-form user modified mask and a target image, enabling diverse generation results. EBST models the user editing behavior on the source mask, making the overall framework more robust to various manipulated inputs. Specifically, it introduces dual-editing consistency as the auxiliary supervision signal. To facilitate extensive studies, we construct a large-scale high-resolution face dataset with fine-grained mask annotations named CelebAMask-HQ. MaskGAN is comprehensively evaluated on two challenging tasks: attribute transfer and style copy, demonstrating superior performance over other state-of-the-art methods. The code, models, and dataset are available at https://github.com/switchablenorms/CelebAMask-HQ.


Summary

  • The paper presents MaskGAN, a framework that integrates a Dense Mapping Network (DMN) and an Editing Behavior Simulated Training (EBST) module to enable diverse and interactive facial manipulation.
  • It pairs a Spatial-Aware Style Encoder with AdaIN-based generation and dual-editing consistency to maintain high image fidelity even on user-modified masks.
  • Extensive experiments on the CelebAMask-HQ dataset demonstrate MaskGAN's superior performance over state-of-the-art methods in attribute transfer and style copy tasks.

MaskGAN: Towards Diverse and Interactive Facial Image Manipulation

The paper "MaskGAN: Towards Diverse and Interactive Facial Image Manipulation" by Cheng-Han Lee et al. presents a novel framework for manipulating facial images using semantic masks as intermediate representations. This work aims to address the limitations of prior methods that either focus on predefined attributes or offer limited user interaction. MaskGAN, composed of a Dense Mapping Network (DMN) and an Editing Behavior Simulated Training (EBST) module, provides an interactive and diverse means of facial manipulation while preserving image fidelity.

Dense Mapping Network

The Dense Mapping Network (DMN) couples a Spatial-Aware Style Encoder with an Image Generation Backbone. The encoder captures style information from a target image together with the spatial layout of its semantic mask, employing Spatial Feature Transform (SFT) layers to modulate features per pixel. Adaptive instance normalization (AdaIN) layers then inject the encoded style into the Image Generation Backbone, yielding high-quality, context-aware images. Notably, DMN synthesizes faces by mapping a user's modified mask to the style of a target image, enabling realistic and coherent facial transformations.
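
To make the two modulation mechanisms concrete, here is a minimal PyTorch sketch of an SFT layer and an AdaIN layer as the paper describes them; the module and argument names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: re-styles content features with a
    per-channel scale/bias predicted from a style code."""
    def __init__(self, style_dim: int, num_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.Linear(style_dim, num_channels * 2)  # predicts scale and bias

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.affine(style).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (N, C) -> (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(content) + beta

class SFTLayer(nn.Module):
    """Spatial feature transform: predicts *spatial* scale/shift maps from
    mask features, so modulation varies per pixel rather than per channel."""
    def __init__(self, mask_channels: int, num_channels: int):
        super().__init__()
        self.scale = nn.Conv2d(mask_channels, num_channels, kernel_size=3, padding=1)
        self.shift = nn.Conv2d(mask_channels, num_channels, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor, mask_feat: torch.Tensor) -> torch.Tensor:
        return features * (1 + self.scale(mask_feat)) + self.shift(mask_feat)
```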

Editing Behavior Simulated Training

The EBST module simulates realistic user editing on the source mask and introduces dual-editing consistency as an auxiliary supervision signal. It leverages a pre-trained MaskVAE to model the manifold of plausible mask geometries and incorporates an alpha-blending sub-network to maintain appearance consistency. As a result, EBST makes the DMN markedly more robust to the varied user-modified masks encountered at inference time.
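
The paper's exact objectives are not reproduced here, but the following sketch illustrates the spirit of the approach: simulate an edit by interpolating masks in MaskVAE latent space, then penalize inconsistency between generations. The generator/MaskVAE interfaces and the loss form are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def editing_consistency_loss(generator, mask_vae, src_mask, ref_mask, style_img, t=0.5):
    # Simulate a plausible user edit: interpolate between the source and a
    # reference mask in MaskVAE latent space, then decode back to mask space.
    z_src = mask_vae.encode(src_mask)
    z_ref = mask_vae.encode(ref_mask)
    edited_mask = mask_vae.decode((1 - t) * z_src + t * z_ref)

    out_src = generator(src_mask, style_img)
    out_edit = generator(edited_mask, style_img)

    # Penalize disagreement only where the two masks agree, so the generator
    # stays stable in regions the simulated edit did not touch.
    unchanged = (src_mask.argmax(dim=1, keepdim=True)
                 == edited_mask.argmax(dim=1, keepdim=True)).float()
    return F.l1_loss(out_src * unchanged, out_edit * unchanged)
```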

CelebAMask-HQ Dataset

The research introduces CelebAMask-HQ, a large-scale face dataset with fine-grained mask annotations. It comprises 30,000 face images, each paired with a 512x512 semantic mask delineating 19 facial-component categories (skin, eyes, brows, lips, hair, accessories, and so on). CelebAMask-HQ is pivotal for extensive studies and evaluations of facial manipulation techniques.
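
As a usage illustration, the sketch below one-hot encodes a CelebAMask-HQ label map for a mask-conditioned generator. It assumes the per-component masks have already been merged into a single-channel label map (the dataset repository ships preprocessing scripts for this); the helper name and file layout are hypothetical, and the component names follow the dataset repository.

```python
import numpy as np
from PIL import Image

COMPONENTS = [
    "background", "skin", "nose", "eye_g", "l_eye", "r_eye", "l_brow",
    "r_brow", "l_ear", "r_ear", "mouth", "u_lip", "l_lip", "hair", "hat",
    "ear_r", "neck_l", "neck", "cloth",
]  # 19 classes in total

def load_one_hot_mask(path: str) -> np.ndarray:
    """Read a single-channel label map (values 0..18) and one-hot encode it."""
    labels = np.array(Image.open(path))               # shape: (H, W)
    one_hot = np.eye(len(COMPONENTS), dtype=np.float32)[labels]
    return one_hot.transpose(2, 0, 1)                 # shape: (19, H, W)
```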

Evaluation and Results

MaskGAN is evaluated on two primary tasks: attribute transfer and style copy. The results demonstrate that MaskGAN surpasses existing state-of-the-art methods such as Pix2PixHD, SPADE, and StarGAN in several metrics. Specifically, MaskGAN achieves higher classification accuracy for transferred attributes, better segmentation accuracy, and competitive Fréchet Inception Distance (FID) scores. The inclusion of EBST further improves these metrics, especially in maintaining manipulation consistency and preserving fine details.
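
For readers reproducing this kind of comparison, FID can be computed with the torchmetrics package (the authors' exact evaluation pipeline may differ); the tensors below are dummy stand-ins for real CelebAMask-HQ images and generator outputs.

```python
# Requires: pip install "torchmetrics[image]"
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Dummy uint8 image batches of shape (N, 3, H, W); in practice, load real
# test images and the corresponding MaskGAN outputs.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower is better
```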

Implications and Future Work

MaskGAN opens the door to real-time, user-interactive facial manipulation in fields such as virtual reality, digital entertainment, and forensic reconstruction. Its interactive editing capabilities and fidelity preservation also make it a natural fit for commercial image-editing software.

Future work could explore integrating MaskGAN with image completion techniques to enhance detail preservation, particularly in non-edited regions. Additionally, leveraging advancements in 3D facial modeling and incorporating temporal consistency in video sequences could broaden the application spectrum of this framework.

Conclusion

MaskGAN represents a significant advance in facial image manipulation, providing diverse and interactive editing capabilities while maintaining high fidelity. Its combination of semantic-mask conditioning and editing-aware training underscores its potential for practical applications and sets a strong baseline for future research in this domain. The CelebAMask-HQ dataset notably enriches the resources available for such studies, paving the way for further work on realistic and user-friendly facial manipulation.