
Pose Guided Person Image Generation (1705.09368v6)

Published 25 May 2017 in cs.CV

Abstract: This paper proposes the novel Pose Guided Person Generation Network (PG$^2$) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose. Our generation framework PG$^2$ utilizes the pose information explicitly and consists of two key stages: pose integration and image refinement. In the first stage the condition image and the target pose are fed into a U-Net-like network to generate an initial but coarse image of the person with the target pose. The second stage then refines the initial and blurry result by training a U-Net-like generator in an adversarial way. Extensive experimental results on both 128$\times$64 re-identification images and 256$\times$256 fashion photos show that our model generates high-quality person images with convincing details.

Citations (790)

Summary

  • The paper introduces PG², a novel framework that generates person images from a reference image and target pose via a two-stage process.
  • It employs a U-Net-like network with pose heatmaps and a pose mask loss for effective pose integration and initial image synthesis.
  • The second stage applies a DCGAN variant to refine details, yielding high SSIM and favorable Inception Scores on benchmark datasets.

Pose Guided Person Image Generation (PG²)

The paper "Pose Guided Person Image Generation" by Liqian Ma et al. introduces an innovative generative framework named the Pose Guided Person Generation Network (PG2^2). This network allows for the synthesis of person images in novel poses based on a reference image and a target pose. The framework elegantly divides the problem into two primary stages: pose integration and image refinement, enhancing the generation of high-quality person images with realistic details.

Framework and Methods

The PG² framework begins with pose integration. Here, the reference image and target pose are fed into a U-Net-like network to generate an initial, albeit blurry, image. The target pose is embedded as pose heatmaps, which directly encode the keypoints produced by the pose estimator; this representation avoids the learning complexities of mapping keypoint coordinates to image positions. Image generation at this stage is supervised with a novel pose mask loss, which focuses on the human body in the target pose and diminishes the influence of the background.
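
As a concrete illustration, the pose mask loss amounts to an L1 reconstruction term whose per-pixel weight is one plus the body mask, so pixels on the person count roughly twice as much as background pixels. The PyTorch sketch below is our own minimal rendering of that idea; the function name and tensor shapes are illustrative rather than taken from the authors' code.

```python
import torch

def pose_mask_l1_loss(generated: torch.Tensor,
                      target: torch.Tensor,
                      pose_mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 reconstruction loss (illustrative sketch).

    generated, target: (B, 3, H, W) images in the same value range.
    pose_mask: (B, 1, H, W), 1 on the body in the target pose, 0 elsewhere,
    so body pixels receive weight 2 and background pixels weight 1.
    """
    weight = 1.0 + pose_mask
    return torch.mean(torch.abs(generated - target) * weight)
```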

The second stage involves image refinement where the coarse output from the first stage is refined using a variant of a Deep Convolutional Generative Adversarial Network (DCGAN). This stage aims to enhance the high-frequency details by training the network to generate a difference map between the initial output and the target image. This strategy significantly accelerates convergence and improves the generation quality by concentrating on the missing details rather than generating the image from scratch.
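
The residual formulation can be pictured as a thin wrapper around any U-Net-like generator: the network predicts a bounded difference map that is added back onto the coarse stage-1 output. The class below is a minimal sketch under that assumption; the names and the tanh bound are ours, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DifferenceMapRefiner(nn.Module):
    """Illustrative stage-2 wrapper: predict a residual and add it to the coarse image."""

    def __init__(self, generator: nn.Module):
        super().__init__()
        self.generator = generator  # any image-to-image network (e.g. U-Net-like)

    def forward(self, condition_image: torch.Tensor,
                coarse_output: torch.Tensor) -> torch.Tensor:
        x = torch.cat([condition_image, coarse_output], dim=1)  # stack along channels
        diff_map = torch.tanh(self.generator(x))                # residual bounded to [-1, 1]
        return coarse_output + diff_map                         # refined image
```

Because the generator only has to model the residual rather than the whole image, the refined output stays close to the coarse result early in training, which matches the faster convergence the paper reports.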

Experimental Evaluation

The proposed PG² network was evaluated extensively on two datasets: Market-1501 and DeepFashion. The former is a re-identification dataset that presents substantial challenges due to variations in poses, viewpoints, illumination, and backgrounds, while the latter consists of high-resolution fashion images. The experimental results indicate that PG² performs favorably in generating realistic person images across diverse poses.

The evaluation incorporated several quantitative metrics, including the Structural Similarity Index (SSIM) and the Inception Score (IS). Additionally, a user study on Amazon Mechanical Turk (AMT) provided insights into the perceived realism of the generated images. Notably, PG²'s two-stage approach outperformed alternative methods and pose embeddings in generating photorealistic images that adhered closely to the specified poses.
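
As a point of reference for the SSIM metric, the snippet below shows a toy computation with scikit-image on random arrays standing in for a generated image and its ground-truth target at the 128×64 Market-1501 resolution; it assumes a recent scikit-image version that accepts the channel_axis argument.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Toy stand-ins for a generated image and its ground-truth target (values in [0, 1]).
generated = np.random.rand(128, 64, 3)
target = np.random.rand(128, 64, 3)

score = ssim(generated, target, channel_axis=-1, data_range=1.0)
print(f"SSIM: {score:.3f}")  # 1.0 would mean structurally identical images
```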

Contributions and Future Directions

The contributions of this work are threefold. Firstly, it introduces a novel task of conditioning the generation on a reference image and a target pose, enabling explicit control over the generated image. Secondly, the paper explores multiple methods for pose embedding and introduces a pose mask loss to emphasize the person over the background during the generation process. Lastly, the two-stage generation framework effectively captures global structures at the first stage and refines details at the second stage, improving both the training stability and output quality.

The implications of this research are significant both practically and theoretically. Practical applications include movie-making, virtual try-on systems, and the generation of synthetic training data for rare human poses, which could enhance pose estimation models. Theoretically, the modular approach of the PG² framework provides a foundation for future research on controlled and conditional image generation.

Future work could expand on this foundation by integrating additional control signals, such as attributes or environmental contexts, to generate more diverse and controllable person images. Exploring these extensions could further bridge the gap between desired controllability and the realism of generated images, pushing the boundaries of generative models in computer vision.

In conclusion, the Pose Guided Person Generation Network (PG²) presents an advanced framework for synthesizing person images in novel poses. By leveraging a divide-and-conquer strategy and focusing on explicit pose conditioning, PG² demonstrates superior performance in generating high-quality and realistic person images, thereby contributing useful insights and methodologies to the field of generative models.