
Pose-Guided Person Image Synthesis

Updated 1 January 2026
  • The model unites dense attention matching with flow-based warping to preserve textures and ensure precise pose alignment.
  • It uses a dual-module architecture with deformation and synthesis components, leveraging specialized losses for photorealism and identity preservation.
  • Benchmark studies show superior scores on metrics such as SSIM, FID, and user-study fooling rate, underscoring its efficacy in challenging pose transfer scenarios.

Pose-Guided Person Image Synthesis (PGPIS) refers to the class of models that synthesize photorealistic human images by taking a reference person image and re-rendering it under a specified target pose. The goal is to preserve both the identity and appearance details (clothing, face, textures) of the source image while coherently adapting the structural cues from the target pose. This task combines challenges from geometric transformation, texture mapping, and high-fidelity photorealism. Recent approaches have focused on combining spatial transformation blocks, such as flow-based warping, with deep attention mechanisms to achieve more accurate pose alignment and sharper texture detail. Hybrid pipelines have been developed for person image synthesis, portrait editing, and related human-centric generative tasks (Ren et al., 2021).

1. Core Principles and Model Architecture

PGPIS models seek to transform a reference image $I_r$ into a new image matching a target pose $S_t$, using both the appearance information from the source and structural information from the intended pose. One influential architecture organizes the process into two main modules:

  • Deformation Estimation Module:
    • Attention Correlation Estimator: Extracts key features from $I_r$ and query features from $S_t$, then computes a scaled softmax correlation matrix $C$, with $C^{i,j} = \exp(\alpha\, k^i q^j) / \sum_{i'} \exp(\alpha\, k^{i'} q^j)$ and $\alpha = 100$, enabling dense, non-local matching for accurate spatial remapping (see the sketch after this list).
    • Flow Field Estimator: Predicts a 2D flow field $w$ using a U-Net, enabling traditional spatial warping via bilinear sampling.
    • Combination Map Generator: Outputs a soft mask $m \in [0,1]^{H\times W\times 1}$ distinguishing, pixel-wise, whether to rely on attention- or flow-based deformation.
  • Image Synthesis Module:
    • Extracts a “neural texture” feature $f_r = E_x'(I_r)$ and a skeleton feature $f_t = E_s'(S_t)$.
    • Applies attention-warping ($o_a = W_a(f_r, C)$) and flow-warping ($o_f = W_f(f_r, w)$), then fuses them using the mask: $f_a = m \cdot o_a + (1-m) \cdot o_f$.
    • Integrates structural (pose) and appearance features ($f_{out} = f_t + f_a$), then decodes to the output image ($\hat{y} = G_{dec}(f_{out})$) (Ren et al., 2021).
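
To make the attention branch concrete, here is a minimal PyTorch sketch of a scaled softmax correlation matrix and the attention warping it drives. This is not the authors' released code; the tensor shapes and the helper name attention_warp are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention_warp(key_feat, query_feat, source_feat, alpha=100.0):
    """Dense attention matching and warping (minimal sketch).

    key_feat:    (B, C, H, W) features from the reference image I_r
    query_feat:  (B, C, H, W) features from the target pose S_t
    source_feat: (B, C2, H, W) appearance features to be warped
    """
    B, C, H, W = key_feat.shape
    k = key_feat.flatten(2)                     # (B, C, HW) keys, one per source pixel
    q = query_feat.flatten(2)                   # (B, C, HW) queries, one per target pixel
    # Scaled dot-product correlation: corr[b, i, j] relates source pixel i to target pixel j.
    corr = torch.einsum('bci,bcj->bij', k, q)
    corr = F.softmax(alpha * corr, dim=1)       # normalise over source positions i
    # Each target pixel becomes a correlation-weighted sum of source appearance features.
    v = source_feat.flatten(2)                  # (B, C2, HW)
    o_a = torch.einsum('bij,bci->bcj', corr, v)
    return o_a.view(B, source_feat.shape[1], H, W)
```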

This joint approach leverages attention's capability for correct global pose alignment while utilizing flow fields for local texture preservation, dynamically balanced at each pixel.
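
The per-pixel balance described above can be sketched as follows, assuming the flow branch warps features by bilinear sampling over pixel offsets; flow_warp and fuse are hypothetical helper names rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Bilinearly sample feat (B, C, H, W) with a dense 2D flow field (B, 2, H, W).

    The flow is assumed to hold pixel offsets; it is converted to the
    normalised [-1, 1] grid expected by F.grid_sample.
    """
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing='ij')
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear', align_corners=True)

def fuse(o_attention, o_flow, mask):
    """Soft per-pixel selection between the two branches: f_a = m*o_a + (1-m)*o_f.

    mask has shape (B, 1, H, W) with values in [0, 1].
    """
    return mask * o_attention + (1 - mask) * o_flow
```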

2. Mathematical Formulations and Training Objective

PGPIS requires both rigorous pose transformation and appearance fidelity. The architecture employs two sets of losses:

  • Deformation Losses:
    • Attention reconstruction loss: $L_{attn} = \|W_a(I_r, C) - I_t\|_1$
    • Flow sampling correctness: cosine similarity–based, promoting accurate texture warping via flow (Ren et al., 2021).
    • Flow regularization: imposes a local affine prior per spatial patch to keep deformations smooth and plausible.
  • Synthesis Losses:
    • Perceptual loss: VGG feature matching (see the sketch after this list).
    • Face reconstruction loss: focused VGG matching for the face region.
    • Style loss: Gram matrix matching for style consistency.
    • Adversarial (hinge GAN) loss: supports global photorealism and prevents over-smooth outputs.
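
As a concrete illustration of the perceptual and style terms, the sketch below assumes VGG feature maps for the generated and target images have already been extracted; the layer selection and normalization are assumptions, not the published configuration.

```python
import torch

def gram_matrix(feat):
    """Channel-wise Gram matrix of a (B, C, H, W) feature map, normalised by C*H*W."""
    B, C, H, W = feat.shape
    f = feat.flatten(2)                               # (B, C, HW)
    return torch.bmm(f, f.transpose(1, 2)) / (C * H * W)

def perceptual_loss(feats_pred, feats_target):
    """L1 matching of VGG feature maps (one entry per chosen layer)."""
    return sum(torch.abs(p - t).mean() for p, t in zip(feats_pred, feats_target))

def style_loss(feats_pred, feats_target):
    """L1 distance between Gram matrices of the same VGG feature maps."""
    return sum(torch.abs(gram_matrix(p) - gram_matrix(t)).mean()
               for p, t in zip(feats_pred, feats_target))
```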

The full objective is a weighted sum: $L_{total} = \lambda_{attn}L_{attn} + \lambda_{flow}L_{flow} + \lambda_{regu}L_{regu} + \lambda_{perc}L_{perc} + \lambda_{face}L_{face} + \lambda_{style}L_{style} + \lambda_{adv}L_{adv}$ (Ren et al., 2021).
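
In code, the combination reduces to a weighted sum over named loss terms; the weights below are placeholders for illustration, not the published values.

```python
# Hypothetical weights for illustration only; the published values may differ.
LAMBDAS = {"attn": 1.0, "flow": 1.0, "regu": 0.01,
           "perc": 1.0, "face": 1.0, "style": 100.0, "adv": 1.0}

def total_loss(losses, lambdas=LAMBDAS):
    """losses: dict mapping term name -> scalar tensor, e.g. {"attn": ..., "adv": ...}."""
    return sum(lambdas[name] * value for name, value in losses.items())
```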

3. Comparative Benchmarks and Ablations

PGPIS models are benchmarked on standard datasets like DeepFashion In-Shop Clothes Retrieval (pose transfer split: 52,712 images, 101,966 train pairs, 8,570 test pairs), using structural similarity (SSIM), perceptual similarity (LPIPS), realism (FID), and MTurk-based fooling rates.

Method      SSIM↑    LPIPS↓   FID↓      Fool Rate (%)↑
VU-Net      0.6738   0.2637   23.669     4.12
Def-GAN     0.6836   0.2330   18.460    14.40
Pose-Attn   0.6714   0.2533   20.728     9.56
Intr-Flow   0.6968   0.1875   13.014    16.80
ADGAN       0.6736   0.2250   14.546    29.08
GFLA        0.7074   0.1962    9.9125   18.88
Ours        0.7113   0.1813    9.4502   30.00
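
For reference, FID, one of the metrics reported above, can be computed from Inception-v3 activations as in the minimal sketch below; activation extraction is omitted and the function name is illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real, act_fake):
    """FID between two sets of Inception activations, each of shape (N, D).

    Assumes activations were extracted with the standard Inception-v3 pooling
    layer; only the Frechet distance between the two Gaussian fits is shown.
    """
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real            # drop tiny imaginary parts from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```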

Ablation studies isolate effects:

  • Attention-only: best structure, blurs fine texture.
  • Flow-only: details recovered but misaligned under large pose gaps.
  • Full model: achieves both accurate global pose and sharp local detail.
  • w/o face loss: lower realism in facial regions, more artifacts.

Mask $m$ usually selects flow for textured regions (clothes, hair) and attention for smooth areas (background, limbs) (Ren et al., 2021).

4. Extensions and Applications

The same methodology applies to portrait editing, where semantic inputs (68 facial landmarks from 3DMM) guide pose and expression changes. Results demonstrate accurate head orientation, consistent skin/hair textures, and faithful manipulation of facial features (lip, eyebrow movement) with minimal artifacts, highlighting adaptability to face-oriented synthesis tasks (Ren et al., 2021).

5. Strengths, Limitations, and Practical Implications

By integrating pixel-wise selection between attention-based structure mapping and flow-based local warping, PGPIS models synthesize images with stronger pose alignment and sharper texture preservation. The hybrid design generalizes beyond pose transfer to semantic face editing and potentially other human-centered generative tasks.

Principal limitations include:

  • Dependency on accurate pose and heatmap extraction; poor estimates degrade synthesis.
  • Extreme poses or occlusions present challenges for both spatial correspondence and texture inpainting.
  • Texture artifacts may occur when neither branch suffices (e.g., rare patterns, unseen body regions).

Qualitative and quantitative results consistently demonstrate state-of-the-art fidelity, particularly under challenging poses, validating the combined attention-flow approach as fundamental for high-quality pose-guided person synthesis (Ren et al., 2021).

6. Research Context and Future Directions

The architecture represents a transition from models that rely solely on spatial warping (flow) or on global matching (attention) to adaptive, hybrid models that partition responsibility between spatial deformation and local appearance. The methodology provides a template for future PGPIS work:

  • Possible integration with 3D pose/mesh priors.
  • End-to-end extension to video, multi-person, or broader scene synthesis.
  • Application to robust human portrait editing where pose, expression, and identity all require precise control.

The explicit joint use of dense attention and local flow fields, with dynamic selection, sets a new standard, influencing subsequent models and benchmarks in PGPIS research (Ren et al., 2021).
