
GarmentGAN: Photo-realistic Garment Transfer

Updated 10 March 2026
  • GarmentGAN is a generative adversarial network designed for photo-realistic garment transfer in virtual try-on, clearly separating shape and appearance synthesis.
  • It utilizes a two-stage pipeline with a U-Net based shape transfer network and an appearance transfer network employing TPS warping and SPADE normalization for enhanced realism.
  • Innovative techniques like hand-aware masking and integrated geometric warping improve occlusion handling and boundary preservation in challenging poses.

GarmentGAN refers to a generative adversarial network framework tailored for photo-realistic garment transfer in the virtual try-on domain. Its primary goal is to synthesize images of a person wearing arbitrary garments, given only a source image of the person and a product image of the garment. GarmentGAN specifically addresses challenges posed by occlusions, pose variation, and the need to preserve fine garment and identity details, employing a two-stage approach that separates garment shape/layout prediction from RGB appearance synthesis (Raffiee et al., 2020).

1. Problem Formulation and Two-Stage Framework

GarmentGAN decomposes the garment transfer task into two sequential modules: (i) a Shape Transfer Network and (ii) an Appearance Transfer Network. The pipeline takes as input a target garment image $I_c$ and an image of the person $I_{person}$. The system outputs a synthetic image $\hat I_{person}$ in which the target person is rendered wearing the new garment in their original pose and background.

  • Shape Transfer Network: Predicts a semantic segmentation map $\hat I_{seg}$ for the person, indicating the placement of torso, limbs, top clothes, etc., adjusted to accommodate the new garment and possible self-occlusion.
  • Appearance Transfer Network: Synthesizes the full RGB output $\hat I_{person}$ by warping the garment and compositing it into the person image under guidance from the predicted segmentation.

This staged design decouples coarse spatial layout and semantic parsing from the complexities of photorealistic rendering, yielding sharper boundaries and robust handling of body occlusions (e.g., arms crossing the torso).
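The staged data flow can be sketched as follows. The modules `shape_net`, `tps_warp`, and `appearance_net` below are placeholder stand-ins (not the paper's trained networks): each one only reproduces the expected tensor shapes, so the sketch illustrates how data moves through the pipeline rather than real synthesis.

```python
import numpy as np

# Placeholder stand-ins for the trained modules: each one only mimics the
# input/output shapes, so this sketch shows data flow, not real synthesis.
def shape_net(masked_seg, pose, garment):
    h, w = masked_seg.shape[:2]
    seg_hat = np.zeros((h, w, 10), dtype=np.float32)  # 10-class layout
    seg_hat[..., 0] = 1.0                             # dummy one-hot map
    return seg_hat

def tps_warp(pose, garment):
    return garment  # identity warp standing in for the TPS module

def appearance_net(person, garment, warped, pose, seg_hat):
    return person  # placeholder renderer returning an H x W x 3 image

def garment_transfer(person, garment, masked_seg, pose):
    seg_hat = shape_net(masked_seg, pose, garment)   # stage 1: semantic layout
    warped = tps_warp(pose, garment)                 # geometric alignment
    out = appearance_net(person, garment, warped, pose, seg_hat)  # stage 2: RGB
    return out, seg_hat
```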

2. Architectural Details

2.1 Shape Transfer Network

  • Inputs: Masked segmentation $I_{m,seg}\in\{0,1\}^{H\times W\times 10}$, pose/shape representation $P_s\in\mathbb{R}^{H\times W\times 18}$, and garment image $I_c$.
  • Generator: U-Net encoder–decoder with 5 down-sampling layers, 4 bottleneck residual blocks, and up-sampling via nearest-neighbor followed by convolutions. Instance normalization and LeakyReLU are used throughout.
  • Discriminator: PatchGAN with convolution, instance normalization, and LeakyReLU activations.

The generator forward pass is $\hat I_{seg} = G_{shape}(I_{m,seg}, P_s, I_c)$. At inference, preserved regions (hair, face, lower body) are copied from the original segmentation map to retain identity.
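A reduced sketch of such a generator is shown below, with 2 down-sampling stages and a single bottleneck residual block (the paper uses 5 and 4); all layer widths here are illustrative assumptions, while the conv/instance-norm/LeakyReLU pattern and nearest-neighbor up-sampling follow the description above.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # conv -> instance norm -> LeakyReLU, as used throughout the generator
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
        nn.InstanceNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(conv_block(c, c), conv_block(c, c))
    def forward(self, x):
        return x + self.body(x)

class ShapeGenerator(nn.Module):
    """Reduced U-Net sketch: 2 down-sampling stages, 1 bottleneck residual
    block; up-sampling is nearest-neighbor followed by a convolution."""
    def __init__(self, cout=10, base=32):
        super().__init__()
        cin = 10 + 18 + 3  # masked seg + pose/shape + garment channels
        self.down1 = conv_block(cin, base, stride=2)
        self.down2 = conv_block(base, base * 2, stride=2)
        self.bottleneck = ResBlock(base * 2)
        self.up1 = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                 conv_block(base * 2, base))
        self.up2 = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                 nn.Conv2d(base * 2, cout, 3, padding=1))
    def forward(self, masked_seg, pose, garment):
        x = torch.cat([masked_seg, pose, garment], dim=1)
        d1 = self.down1(x)
        d2 = self.down2(d1)
        u1 = self.up1(self.bottleneck(d2))
        # U-Net skip connection from the first encoder stage
        return self.up2(torch.cat([u1, d1], dim=1))
```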

2.2 Appearance Transfer Network

  • Inputs: Masked person image $I_{m,person}$, original and warped garment $(I_c, I_{warped,c})$, person representation $P_s$, predicted segmentation map $\hat I_{seg}$.
  • Geometric Alignment Module: Estimates TPS parameters $\theta$ from $(P_s, I_c)$; warps $I_c$ to $I_{warped,c}$ using a thin-plate-spline transform.
  • Generator: Encoder–decoder with spectral and instance normalization in the encoder; the decoder uses SPADE-style normalization, conditioning on $[\hat I_{seg}, I_{warped,c}]$.
  • Discriminator: Multi-scale SN-PatchGAN for high-fidelity output.

The generator forward pass is $\hat I_{person} = G_{appearance}(I_{m,person}, I_c, P_s, \hat I_{seg})$.
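A minimal SPADE-style layer along these lines might be written as follows. The 13-channel condition (10-class segmentation plus 3-channel warped garment) matches the conditioning described above; the hidden width is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Minimal SPADE layer sketch: normalize features without learned affine
    parameters, then modulate with gamma/beta maps predicted from the spatial
    condition (here: predicted segmentation concatenated with the warped
    garment, 10 + 3 = 13 channels)."""
    def __init__(self, feat_channels, cond_channels=13, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)
    def forward(self, feat, cond):
        # resize the condition to this decoder stage's feature resolution
        cond = F.interpolate(cond, size=feat.shape[2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)
```

Because the modulation parameters vary spatially with the segmentation and warped garment, the decoder can keep texture and boundary cues localized rather than averaging them across the image.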

3. Input Representations

  • Semantic Segmentation Maps: Obtained via a human parser (10 classes including torso, arms, top clothes, etc.), encoded as one-hot tensors.
  • Hand-aware Masking: During masking, hand pixels are preserved based on keypoint estimation using arm, elbow, and wrist joints (17-channel heatmaps), mitigating loss of hand details in occluded scenarios.
  • Pose/Body Shape: $P_s$ concatenates 17 keypoint heatmaps with a blurred binary mask of the body region to encode pose and body shape.
  • Masked Images: Both the person image and the segmentation map are zeroed out inside a bounding box computed over the torso, arm, and top-clothing regions, focusing synthesis on the garment area while minimizing identity loss.
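Building $P_s$ from keypoints and a body mask can be sketched as below, assuming Gaussian keypoint heatmaps and a simple box blur (the exact heatmap width and blur used in the paper are not specified here, so both are illustrative choices).

```python
import numpy as np

def keypoint_heatmaps(keypoints, h, w, sigma=6.0):
    """One Gaussian heatmap per (x, y) keypoint; NaN marks an undetected
    joint, whose channel stays all-zero. Sigma is an illustrative choice."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((h, w, len(keypoints)), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        if np.isnan(x) or np.isnan(y):
            continue
        maps[..., k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

def pose_shape_repr(keypoints, body_mask, blur=9):
    """Concatenate 17 keypoint heatmaps with a blurred binary body-region
    mask into an H x W x 18 tensor; the separable box blur here is a
    stand-in for whatever blur the paper actually applies."""
    h, w = body_mask.shape
    heat = keypoint_heatmaps(keypoints, h, w)
    kernel = np.ones(blur, dtype=np.float32) / blur
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, "same"),
                                  1, body_mask.astype(np.float32))
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, "same"),
                                  0, blurred)
    return np.concatenate([heat, blurred[..., None]], axis=-1)
```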

4. Loss Functions and Training Objectives

4.1 Shape Transfer

  • Generator Loss:

$$L_{G}^{shape} = \gamma_1 L_{parsing} + \gamma_2 L_{per\text{-}pixel} - \mathbb{E}_{\hat I_{seg}}[D(\hat I_{seg})]$$

with

$$L_{per\text{-}pixel} = \frac{1}{N}\|I_{seg} - \hat I_{seg}\|_1$$

  • Discriminator Loss:

$$L_{D}^{shape} = \mathbb{E}_{I_{seg}}[\max(0,\,1 - D(I_{seg}))] + \mathbb{E}_{\hat I_{seg}}[\max(0,\,1 + D(\hat I_{seg}))] + \gamma_3 L_{GP}$$

where $L_{GP}$ denotes the gradient penalty enforcing a 1-Lipschitz constraint on the discriminator.
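The hinge objective and gradient penalty can be sketched in PyTorch as follows; the WGAN-GP-style interpolation between real and fake samples used for $L_{GP}$ is an assumption about the penalty's exact form.

```python
import torch

def d_hinge_loss(d_real, d_fake):
    # hinge discriminator loss, matching the objective above
    return (torch.relu(1.0 - d_real).mean() +
            torch.relu(1.0 + d_fake).mean())

def gradient_penalty(discriminator, real, fake):
    # WGAN-GP-style penalty (an assumed form): sample points on the line
    # between real and fake images and push the gradient norm toward 1
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x = (eps * real + (1 - eps) * fake).requires_grad_(True)
    out = discriminator(x)
    grads = torch.autograd.grad(out.sum(), x, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```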

4.2 Appearance Transfer

  • Generator Loss:

$$L_{G}^{app} = \alpha_1 L_{TPS} + \alpha_2 L_{per\text{-}pixel} + \alpha_3 L_{percept} + \alpha_4 L_{feat} - \mathbb{E}_{\hat I_{person}}[D(\hat I_{person})]$$

where

  • $L_{TPS}$ is the $L_1$ distance between the warped garment and the ground-truth worn garment,
  • $L_{percept}$ is a VGG-based perceptual loss,
  • $L_{feat}$ is an SN-PatchGAN feature-matching loss.
  • The discriminator has a corresponding hinge loss with gradient-penalty term $\beta L_{GP}$.

Hyperparameters are set as $\gamma_1=15$, $\gamma_2=20$, $\gamma_3=10$, $\alpha_1=\alpha_2=\alpha_3=\alpha_4=10$, and $\beta=10$.
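With these weights, the appearance-generator objective can be assembled as in the following sketch; `adv_score` stands for the discriminator's mean score on generated images, which enters with a minus sign.

```python
def g_app_loss(l_tps, l_pix, l_percept, l_feat, adv_score,
               alphas=(10.0, 10.0, 10.0, 10.0)):
    """Weighted appearance-generator objective using the quoted weights
    (alpha_1 = ... = alpha_4 = 10); each argument is a scalar loss value."""
    a1, a2, a3, a4 = alphas
    return a1 * l_tps + a2 * l_pix + a3 * l_percept + a4 * l_feat - adv_score
```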

5. Novel Modules and Occlusion Handling

Key innovations for robust synthesis:

  • Hand-aware Masking: Explicit recovery and masking of hand regions prevents loss of plausible articulations when arms occlude the torso, which prior methods frequently mishandle.
  • End-to-End TPS Warper: Integrates geometric warping of the garment as a differentiable module within the appearance generator, learned jointly rather than through separate training, facilitating adaptive garment alignment.
  • SPADE Normalization: SPADE layers in the appearance decoder condition on both the predicted segmentation and the warped garment, improving localized texture realism and boundary sharpness in the output.
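The idea of a warper that stays differentiable end-to-end can be illustrated with torch's `grid_sample`; here an affine grid stands in for the full thin-plate-spline parameterization (an illustrative simplification, since TPS adds control-point offsets on top of a global transform).

```python
import torch
import torch.nn.functional as F

def differentiable_warp(garment, theta):
    """Warp the garment with a parameterized sampling grid. `theta` is a
    batch of 2x3 affine matrices standing in for the thin-plate-spline
    parameters; because grid_sample is differentiable, synthesis losses
    can backpropagate into whatever network predicts theta."""
    grid = F.affine_grid(theta, garment.shape, align_corners=False)
    return F.grid_sample(garment, grid, align_corners=False)
```

Embedding such a module inside the appearance generator is what lets the warp be learned jointly with rendering, rather than trained as a separate stage.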

6. Training Protocol and Hyperparameters

  • Optimization: Adam optimizer ($\beta_1=0.5$, $\beta_2=0.999$), learning rate $2\times 10^{-4}$.
  • Batch Size: 4–8 images typical for high-res GANs.
  • Normalization and Stability: Instance + spectral normalization; gradient penalty for discriminators.
  • Training Schedule: Alternating generator/discriminator updates. The shape network is first trained to convergence. The fully assembled network, including the TPS module, is then trained end-to-end for 50–100 epochs on VITON.
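The alternating schedule with the quoted Adam settings can be sketched as below; `G`, `D`, and the hinge losses here are simplified stand-ins for the full objectives and networks.

```python
import torch
import torch.nn as nn

def make_optimizers(G, D, lr=2e-4, betas=(0.5, 0.999)):
    # Adam with beta_1 = 0.5, beta_2 = 0.999 and lr = 2e-4, as quoted above
    return (torch.optim.Adam(G.parameters(), lr=lr, betas=betas),
            torch.optim.Adam(D.parameters(), lr=lr, betas=betas))

def train_step(G, D, opt_g, opt_d, inputs, real):
    """One alternating update: discriminator first, then generator, using
    hinge losses as placeholders for the full multi-term objectives."""
    # --- discriminator update (generator output detached) ---
    fake = G(inputs).detach()
    d_loss = (torch.relu(1 - D(real)).mean() +
              torch.relu(1 + D(fake)).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # --- generator update (adversarial term only, for brevity) ---
    g_loss = -D(G(inputs)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```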

7. Evaluation: Metrics, Results, and Limitations

7.1 Dataset and Metrics

  • Dataset: VITON (front-view women, garment images).
  • Split: 14,221 training, 2,032 validation pairs.
  • Metrics: Inception Score (IS; higher is better), Fréchet Inception Distance (FID; lower is better).

7.2 Quantitative Results

| Model | IS | FID |
|---|---|---|
| CP-VTON | 2.636 ± 0.077 | 23.085 |
| GarmentGAN w/o TPS module | 2.723 ± 0.083 | 17.408 |
| GarmentGAN (full) | 2.774 ± 0.082 | 16.578 |

GarmentGAN demonstrates improved performance relative to prior methods and its own ablations, especially in FID.

7.3 Qualitative Findings

  • Boundary and Detail: GarmentGAN produces outputs with crisp anatomical and clothing boundaries; logos and surface patterns are retained even in challenging poses.
  • Occlusion Robustness: Generated hands and arms remain realistic and garment–skin interfaces are plausible, even when limbs self-occlude.
  • Identity Preservation: Non-garment identity features (face, hair, bottom clothing, background) are preserved unchanged in the output.

7.4 Limitations

  • Demonstrated only for upper-body try-on; extension to pants/shoes is untested but architecturally feasible.
  • Performance depends on accuracy of human parsing and pose keypoint estimation (failure in these steps degrades synthesis quality).
  • Operates on single, static images; video or temporal coherence remains an open area.
  • No integration of multi-view constraints or 3D priors, which could further enhance rare pose coverage and synthesis realism (Raffiee et al., 2020).

GarmentGAN’s design contrasts with approaches such as:

  • TryOnGAN: Employs latent interpolation in a StyleGAN2 architecture with pose conditioning and segmentation heads, offering seamless garment–body blending and superior generalization but a distinct strategy from the explicit two-stage pipeline (Lewis et al., 2021).
  • Poly-GAN: Collapses the alignment, stitching, and refinement steps of virtual try-on into a multi-conditioned encoder–decoder pipeline with coarse skip connections, permitting end-to-end optimization for multi-task synthesis (alignment, inpainting, stitching) (Pandey et al., 2019).
  • Design-AttGAN: Targets garment attribute editing rather than try-on, using an encoder–decoder with attribute vector conditioning for semantically controlled garment manipulation (Yuan et al., 2020).
  • GarmentAligner: Addresses text-to-garment generation rather than image-based virtual try-on, focusing instead on retrieval-augmented latent diffusion for component-level semantic alignment (Zhang et al., 2024).

GarmentGAN—through its two-stage structure, hand-preserving masking, integrated geometric warping, and spatially modulated normalization—sets a reference point for robust photo-realistic garment transfer under challenging pose and occlusion conditions (Raffiee et al., 2020).
