Disentangled Person Image Generation (1712.02621v4)

Published 7 Dec 2017 in cs.CV

Abstract: Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.

Citations (428)

Summary

  • The paper presents a two-stage method that disentangles person images into distinct components: foreground, background, and pose.
  • It employs a multi-branch architecture and adversarial learning to effectively reconstruct images and map embedding features.
  • Experimental results on Market-1501 and DeepFashion demonstrate improved person re-identification and potential for real-time applications.

Disentangled Person Image Generation: A Methodological and Experimental Analysis

The paper "Disentangled Person Image Generation" introduces an innovative approach for generating realistic images of human figures by disentangling various contributing factors of an image. The core idea is to separate an image into foreground, background, and pose elements, allowing for more granular control in the generation process. This approach utilizes a two-stage pipeline, operating across disentangled image reconstruction and embedding feature mapping, with outcomes examined on datasets like Market-1501 and DeepFashion.

Key Methodological Contributions

The framework consists of multiple partially independent modules, each addressing the complexity of manipulating one aspect of an image independently:

  1. Disentangled Image Reconstruction: The initial stage uses a multi-branched architecture to disentangle the image into three factors and encode each one separately. The foreground branch encodes regional features of key body parts, the background is handled by a dedicated encoder, and the pose is captured as heatmaps of body keypoints. The three embeddings are then combined to reconstruct the input image.
  2. Embedding Feature Mapping: This stage uses adversarial learning to map Gaussian noise onto each learned embedding feature space, enabling novel image synthesis. The adversarial objective matches the distributions of real and mapped embeddings, so that features sampled from noise decode into images consistent with the real data distribution.
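The two stages above can be sketched as follows. This is a minimal, hypothetical PyTorch outline, not the paper's actual architecture: the real encoders and decoder are convolutional networks, the embedding dimensions and image sizes here are placeholder values, and the adversarial training loop for the mappers is omitted.

```python
import torch
import torch.nn as nn

# Placeholder embedding sizes; the paper's actual dimensions differ.
FG_DIM, BG_DIM, POSE_DIM = 128, 128, 64

class DisentangledAutoencoder(nn.Module):
    """Stage 1 sketch: three branches encode foreground, background,
    and pose; a decoder re-composes the image from all three."""
    def __init__(self):
        super().__init__()
        # Linear stand-ins for the paper's convolutional encoders.
        self.enc_fg = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, FG_DIM))
        self.enc_bg = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, BG_DIM))
        self.enc_pose = nn.Sequential(nn.Flatten(), nn.Linear(18 * 64 * 64, POSE_DIM))
        self.dec = nn.Sequential(
            nn.Linear(FG_DIM + BG_DIM + POSE_DIM, 3 * 64 * 64), nn.Tanh())

    def forward(self, fg, bg, pose_heatmaps):
        # Concatenate the three factor embeddings and decode to an image.
        z = torch.cat([self.enc_fg(fg), self.enc_bg(bg),
                       self.enc_pose(pose_heatmaps)], dim=1)
        return self.dec(z).view(-1, 3, 64, 64)

class Mapper(nn.Module):
    """Stage 2 sketch: maps Gaussian noise to one factor's embedding
    space; in the paper this is trained adversarially against real
    embeddings (discriminator not shown)."""
    def __init__(self, z_dim, emb_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, noise):
        return self.net(noise)

model = DisentangledAutoencoder()
fg = torch.randn(2, 3, 64, 64)
bg = torch.randn(2, 3, 64, 64)
pose = torch.randn(2, 18, 64, 64)       # 18 keypoint heatmap channels
recon = model(fg, bg, pose)             # reconstructed images, (2, 3, 64, 64)

fg_mapper = Mapper(z_dim=32, emb_dim=FG_DIM)
sampled_fg = fg_mapper(torch.randn(2, 32))  # novel foreground embeddings
```

Because each factor has its own mapper, one can swap a sampled embedding for a real one (e.g. a new foreground over a real background and pose) before decoding, which is the source of the targeted manipulations described above.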

Experimental Findings and Implications

The authors present empirical validations on Market-1501 and DeepFashion datasets, demonstrating the effectiveness of the method. Noteworthy outcomes include:

  • The model not only generates new images with altered foregrounds, backgrounds, or poses but can also interpolate between embedding states to produce intermediate images, suggesting applications in animation and predictive modeling.
  • The approach is particularly useful for person re-identification: generated image pairs are used to expand datasets artificially, yielding a significant rank-1 and mAP increase for models trained with the generated data compared to models trained only on traditionally labeled datasets.
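The interpolation result above amounts to walking linearly through a factor's embedding space and decoding each point. A minimal NumPy sketch, with a hypothetical embedding size of 128 (the decoder that would turn each vector into an image is omitted):

```python
import numpy as np

def interpolate_embeddings(e_a, e_b, steps=5):
    """Linearly interpolate between two factor embeddings.
    Decoding each intermediate vector yields the in-between images."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * e_a + a * e_b for a in alphas]

# Toy embeddings standing in for two images' encoded foregrounds.
e_a = np.zeros(128)
e_b = np.ones(128)
path = interpolate_embeddings(e_a, e_b, steps=5)
# path[0] equals e_a, path[-1] equals e_b, path[2] is the midpoint
```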

Future Directions

The paper opens pathways for further exploration:

  • Enhanced Detail and Diversity: While current disentangled components focus on macro elements like foreground and background, future research could delve into finer components like texture detail and complex clothing patterns.
  • Integration with Larger Scale Models: The current framework could see additional benefit from integration with larger models, such as transformers or larger convolutional neural networks, which could potentially handle more sophisticated disentanglement.
  • Real-Time Generation: The efficiency of the proposed model suggests possible real-time applications in virtual environments and augmented reality setups, areas that would benefit greatly from quick and editable image generation processes.

In conclusion, this paper's methodological innovation lays a solid groundwork for disentangled person image generation, allowing for enhanced control over image creation that could be pivotal across numerous AI-driven applications.