
GarmentGAN: Photo-realistic Garment Transfer

Updated 10 March 2026
  • GarmentGAN is a generative adversarial network designed for photo-realistic garment transfer in virtual try-on, clearly separating shape and appearance synthesis.
  • It utilizes a two-stage pipeline with a U-Net based shape transfer network and an appearance transfer network employing TPS warping and SPADE normalization for enhanced realism.
  • Innovative techniques like hand-aware masking and integrated geometric warping improve occlusion handling and boundary preservation in challenging poses.

GarmentGAN refers to a generative adversarial network framework tailored for photo-realistic garment transfer in the virtual try-on domain. Its primary goal is to synthesize images of a person wearing arbitrary garments, given only a source image of the person and a product image of the garment. GarmentGAN specifically addresses challenges posed by occlusions, pose variation, and the need to preserve fine garment and identity details, employing a two-stage approach that separates garment shape/layout prediction from RGB appearance synthesis (Raffiee et al., 2020).

1. Problem Formulation and Two-Stage Framework

GarmentGAN decomposes the garment transfer task into two sequential modules: (i) a Shape Transfer Network and (ii) an Appearance Transfer Network. The pipeline takes as input a target garment image $I_c$ and an image of the person $I_{person}$. The system outputs a synthetic image $\hat I_{person}$ in which the target person is rendered wearing the new garment in their original pose and background.

  • Shape Transfer Network: Predicts a semantic segmentation map $\hat I_{seg}$ for the person, indicating the placement of torso, limbs, top clothes, etc., adjusted to accommodate the new garment and possible self-occlusion.
  • Appearance Transfer Network: Synthesizes the full RGB output $\hat I_{person}$ by warping the garment and compositing it into the person image under guidance from the predicted segmentation.

This staged design decouples coarse spatial layout and semantic parsing from the complexities of photorealistic rendering, yielding sharper boundaries and robust handling of body occlusions (e.g., arms crossing the torso).
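The staged data flow can be sketched as follows. The modules `shape_net`, `tps_warp`, and `appearance_net` below are placeholder stand-ins (not the paper's trained networks): each one only reproduces the expected tensor shapes, so the sketch illustrates how data moves through the pipeline rather than real synthesis.

```python
import numpy as np

# Placeholder stand-ins for the trained modules: each one only mimics the
# input/output shapes, so this sketch shows data flow, not real synthesis.
def shape_net(masked_seg, pose, garment):
    h, w = masked_seg.shape[:2]
    seg_hat = np.zeros((h, w, 10), dtype=np.float32)  # 10-class layout
    seg_hat[..., 0] = 1.0                             # dummy one-hot map
    return seg_hat

def tps_warp(pose, garment):
    return garment  # identity warp standing in for the TPS module

def appearance_net(person, garment, warped, pose, seg_hat):
    return person  # placeholder renderer returning an H x W x 3 image

def garment_transfer(person, garment, masked_seg, pose):
    seg_hat = shape_net(masked_seg, pose, garment)   # stage 1: semantic layout
    warped = tps_warp(pose, garment)                 # geometric alignment
    out = appearance_net(person, garment, warped, pose, seg_hat)  # stage 2: RGB
    return out, seg_hat
```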

2. Architectural Details

2.1 Shape Transfer Network

  • Inputs: Masked segmentation $I_{m,seg}\in\{0,1\}^{H\times W\times 10}$, pose/shape representation $P_s\in\mathbb{R}^{H\times W\times 18}$, and garment image $I_c$.
  • Generator: U-Net encoder–decoder with 5 down-sampling layers, 4 bottleneck residual blocks, and up-sampling via nearest-neighbor followed by convolutions. Instance normalization and LeakyReLU are used throughout.
  • Discriminator: PatchGAN with convolution, instance normalization, and LeakyReLU activations.

The generator forward pass is $\hat I_{seg} = G_{shape}(I_{m,seg}, P_s, I_c)$. At inference, preserved regions (hair, face, lower body) are copied from the original segmentation map to retain identity.
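A reduced sketch of such a generator is shown below, with 2 down-sampling stages and a single bottleneck residual block (the paper uses 5 and 4); all layer widths here are illustrative assumptions, while the conv/instance-norm/LeakyReLU pattern and nearest-neighbor up-sampling follow the description above.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # conv -> instance norm -> LeakyReLU, as used throughout the generator
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
        nn.InstanceNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(conv_block(c, c), conv_block(c, c))
    def forward(self, x):
        return x + self.body(x)

class ShapeGenerator(nn.Module):
    """Reduced U-Net sketch: 2 down-sampling stages, 1 bottleneck residual
    block; up-sampling is nearest-neighbor followed by a convolution."""
    def __init__(self, cout=10, base=32):
        super().__init__()
        cin = 10 + 18 + 3  # masked seg + pose/shape + garment channels
        self.down1 = conv_block(cin, base, stride=2)
        self.down2 = conv_block(base, base * 2, stride=2)
        self.bottleneck = ResBlock(base * 2)
        self.up1 = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                 conv_block(base * 2, base))
        self.up2 = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                 nn.Conv2d(base * 2, cout, 3, padding=1))
    def forward(self, masked_seg, pose, garment):
        x = torch.cat([masked_seg, pose, garment], dim=1)
        d1 = self.down1(x)
        d2 = self.down2(d1)
        u1 = self.up1(self.bottleneck(d2))
        # U-Net skip connection from the first encoder stage
        return self.up2(torch.cat([u1, d1], dim=1))
```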

2.2 Appearance Transfer Network

  • Inputs: Masked person image $I_{m,person}$, original and warped garment $(I_c, I_{warped,c})$, person representation $P_s$, predicted segmentation map $\hat I_{seg}$.
  • Geometric Alignment Module: Estimates TPS parameters $\theta$ from $(P_s, I_c)$; warps $I_c$ to $I_{warped,c}$ using a thin-plate-spline transform.
  • Generator: Encoder–decoder with spectral and instance normalization in the encoder; the decoder uses SPADE-style normalization, conditioning on $[\hat I_{seg}, I_{warped,c}]$.
  • Discriminator: Multi-scale SN-PatchGAN for high-fidelity output.

The generator forward pass is $\hat I_{person} = G_{appearance}(I_{m,person}, I_c, P_s, \hat I_{seg})$.
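A minimal SPADE-style layer along these lines might be written as follows. The 13-channel condition (10-class segmentation plus 3-channel warped garment) matches the conditioning described above; the hidden width is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Minimal SPADE layer sketch: normalize features without learned affine
    parameters, then modulate with gamma/beta maps predicted from the spatial
    condition (here: predicted segmentation concatenated with the warped
    garment, 10 + 3 = 13 channels)."""
    def __init__(self, feat_channels, cond_channels=13, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)
    def forward(self, feat, cond):
        # resize the condition to this decoder stage's feature resolution
        cond = F.interpolate(cond, size=feat.shape[2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)
```

Because the modulation parameters vary spatially with the segmentation and warped garment, the decoder can keep texture and boundary cues localized rather than averaging them across the image.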

3. Input Representations

  • Semantic Segmentation Maps: Obtained via a human parser (10 classes including torso, arms, top clothes, etc.), encoded as one-hot tensors.
  • Hand-aware Masking: During masking, hand pixels are preserved based on keypoint estimation using arm, elbow, and wrist joints (17-channel heatmaps), mitigating loss of hand details in occluded scenarios.
  • Pose/Body Shape: $P_s$ concatenates 17 keypoint heatmaps with a blurred binary mask of the body region to encode pose and body shape.
  • Masked Images: Both the person image and the segmentation map are zeroed out inside a bounding box computed over the torso, arm, and top-clothing regions, focusing synthesis on the garment area while minimizing identity loss.
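Building $P_s$ from keypoints and a body mask can be sketched as below, assuming Gaussian keypoint heatmaps and a simple box blur (the exact heatmap width and blur used in the paper are not specified here, so both are illustrative choices).

```python
import numpy as np

def keypoint_heatmaps(keypoints, h, w, sigma=6.0):
    """One Gaussian heatmap per (x, y) keypoint; NaN marks an undetected
    joint, whose channel stays all-zero. Sigma is an illustrative choice."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((h, w, len(keypoints)), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        if np.isnan(x) or np.isnan(y):
            continue
        maps[..., k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

def pose_shape_repr(keypoints, body_mask, blur=9):
    """Concatenate 17 keypoint heatmaps with a blurred binary body-region
    mask into an H x W x 18 tensor; the separable box blur here is a
    stand-in for whatever blur the paper actually applies."""
    h, w = body_mask.shape
    heat = keypoint_heatmaps(keypoints, h, w)
    kernel = np.ones(blur, dtype=np.float32) / blur
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, "same"),
                                  1, body_mask.astype(np.float32))
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, "same"),
                                  0, blurred)
    return np.concatenate([heat, blurred[..., None]], axis=-1)
```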

4. Loss Functions and Training Objectives

4.1 Shape Transfer

  • Generator Loss:

$$L_{G}^{shape} = \gamma_1 L_{parsing} + \gamma_2 L_{per\text{-}pixel} - \mathbb{E}_{\hat I_{seg}}[D(\hat I_{seg})]$$

with

$$L_{per\text{-}pixel} = \frac{1}{N}\|I_{seg} - \hat I_{seg}\|_1$$

  • Discriminator Loss:

$$L_{D}^{shape} = \mathbb{E}_{I_{seg}}[\max(0,\,1 - D(I_{seg}))] + \mathbb{E}_{\hat I_{seg}}[\max(0,\,1 + D(\hat I_{seg}))] + \gamma_3 L_{GP}$$

where $L_{GP}$ denotes the gradient penalty enforcing a 1-Lipschitz constraint on the discriminator.
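The hinge objective and gradient penalty can be sketched in PyTorch as follows; the WGAN-GP-style interpolation between real and fake samples used for $L_{GP}$ is an assumption about the penalty's exact form.

```python
import torch

def d_hinge_loss(d_real, d_fake):
    # hinge discriminator loss, matching the objective above
    return (torch.relu(1.0 - d_real).mean() +
            torch.relu(1.0 + d_fake).mean())

def gradient_penalty(discriminator, real, fake):
    # WGAN-GP-style penalty (an assumed form): sample points on the line
    # between real and fake images and push the gradient norm toward 1
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x = (eps * real + (1 - eps) * fake).requires_grad_(True)
    out = discriminator(x)
    grads = torch.autograd.grad(out.sum(), x, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```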

4.2 Appearance Transfer

  • Generator Loss:

$$L_{G}^{app} = \alpha_1 L_{TPS} + \alpha_2 L_{per\text{-}pixel} + \alpha_3 L_{percept} + \alpha_4 L_{feat} - \mathbb{E}_{\hat I_{person}}[D(\hat I_{person})]$$

where

  • $L_{TPS}$ is the $L_1$ distance between the warped garment and the ground-truth worn garment,
  • $L_{percept}$ is a VGG-based perceptual loss,
  • $L_{feat}$ is an SN-PatchGAN feature-matching loss.
  • The discriminator has a corresponding hinge loss with gradient-penalty term $\beta L_{GP}$.

Hyperparameters are set as $\gamma_1=15$, $\gamma_2=20$, $\gamma_3=10$, $\alpha_1=\alpha_2=\alpha_3=\alpha_4=10$, and $\beta=10$.
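With these weights, the appearance-generator objective can be assembled as in the following sketch; `adv_score` stands for the discriminator's mean score on generated images, which enters with a minus sign.

```python
def g_app_loss(l_tps, l_pix, l_percept, l_feat, adv_score,
               alphas=(10.0, 10.0, 10.0, 10.0)):
    """Weighted appearance-generator objective using the quoted weights
    (alpha_1 = ... = alpha_4 = 10); each argument is a scalar loss value."""
    a1, a2, a3, a4 = alphas
    return a1 * l_tps + a2 * l_pix + a3 * l_percept + a4 * l_feat - adv_score
```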

5. Novel Modules and Occlusion Handling

Key innovations for robust synthesis:

  • Hand-aware Masking: Explicit recovery and masking of hand regions prevents loss of plausible articulations when arms occlude the torso, which prior methods frequently mishandle.
  • End-to-End TPS Warper: Integrates geometric warping of the garment as a differentiable module within the appearance generator, learned jointly rather than through separate training, facilitating adaptive garment alignment.
  • SPADE Normalization: SPADE layers in the appearance decoder condition on both the predicted segmentation and the warped garment, improving localized texture realism and boundary sharpness in the output.
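The idea of a warper that stays differentiable end-to-end can be illustrated with torch's `grid_sample`; here an affine grid stands in for the full thin-plate-spline parameterization (an illustrative simplification, since TPS adds control-point offsets on top of a global transform).

```python
import torch
import torch.nn.functional as F

def differentiable_warp(garment, theta):
    """Warp the garment with a parameterized sampling grid. `theta` is a
    batch of 2x3 affine matrices standing in for the thin-plate-spline
    parameters; because grid_sample is differentiable, synthesis losses
    can backpropagate into whatever network predicts theta."""
    grid = F.affine_grid(theta, garment.shape, align_corners=False)
    return F.grid_sample(garment, grid, align_corners=False)
```

Embedding such a module inside the appearance generator is what lets the warp be learned jointly with rendering, rather than trained as a separate stage.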

6. Training Protocol and Hyperparameters

  • Optimization: Adam optimizer ($\beta_1=0.5$, $\beta_2=0.999$), learning rate $2\times 10^{-4}$.
  • Batch Size: 4–8 images typical for high-res GANs.
  • Normalization and Stability: Instance + spectral normalization; gradient penalty for discriminators.
  • Training Schedule: Alternating generator/discriminator updates. The shape network is first trained to convergence. The fully assembled network, including the TPS module, is then trained end-to-end for 50–100 epochs on VITON.
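The alternating schedule with the quoted Adam settings can be sketched as below; `G`, `D`, and the hinge losses here are simplified stand-ins for the full objectives and networks.

```python
import torch
import torch.nn as nn

def make_optimizers(G, D, lr=2e-4, betas=(0.5, 0.999)):
    # Adam with beta_1 = 0.5, beta_2 = 0.999 and lr = 2e-4, as quoted above
    return (torch.optim.Adam(G.parameters(), lr=lr, betas=betas),
            torch.optim.Adam(D.parameters(), lr=lr, betas=betas))

def train_step(G, D, opt_g, opt_d, inputs, real):
    """One alternating update: discriminator first, then generator, using
    hinge losses as placeholders for the full multi-term objectives."""
    # --- discriminator update (generator output detached) ---
    fake = G(inputs).detach()
    d_loss = (torch.relu(1 - D(real)).mean() +
              torch.relu(1 + D(fake)).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # --- generator update (adversarial term only, for brevity) ---
    g_loss = -D(G(inputs)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```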

7. Evaluation: Metrics, Results, and Limitations

7.1 Dataset and Metrics

  • Dataset: VITON (front-view women, garment images).
  • Split: 14,221 training, 2,032 validation pairs.
  • Metrics: Inception Score (IS; higher is better), Fréchet Inception Distance (FID; lower is better).

7.2 Quantitative Results

| Model | IS | FID |
|---|---|---|
| CP-VTON | 2.636 ± 0.077 | 23.085 |
| GarmentGAN w/o TPS module | 2.723 ± 0.083 | 17.408 |
| GarmentGAN (full) | 2.774 ± 0.082 | 16.578 |

GarmentGAN demonstrates improved performance relative to prior methods and its own ablations, especially in FID.

7.3 Qualitative Findings

  • Boundary and Detail: GarmentGAN produces outputs with crisp anatomical and clothing boundaries; logos and surface patterns are retained even in challenging poses.
  • Occlusion Robustness: Generated hands and arms remain realistic and garment–skin interfaces are plausible, even when limbs self-occlude.
  • Identity Preservation: Non-garment identity features (face, hair, bottom clothing, background) are preserved unchanged in the output.

7.4 Limitations

  • Demonstrated only for upper-body try-on; extension to pants/shoes is untested but architecturally feasible.
  • Performance depends on accuracy of human parsing and pose keypoint estimation (failure in these steps degrades synthesis quality).
  • Operates on single, static images; video or temporal coherence remains an open area.
  • No integration of multi-view constraints or 3D priors, which could further enhance rare pose coverage and synthesis realism (Raffiee et al., 2020).

GarmentGAN’s design contrasts with approaches such as:

  • TryOnGAN: Employs latent interpolation in a StyleGAN2 architecture with pose conditioning and segmentation heads, offering seamless garment–body blending and superior generalization but a distinct strategy from the explicit two-stage pipeline (Lewis et al., 2021).
  • Poly-GAN: Collapses the alignment, stitching, and refinement steps of virtual try-on into a multi-conditioned encoder–decoder pipeline with coarse skip connections, permitting end-to-end optimization for multi-task synthesis (alignment, inpainting, stitching) (Pandey et al., 2019).
  • Design-AttGAN: Targets garment attribute editing rather than try-on, using an encoder–decoder with attribute vector conditioning for semantically controlled garment manipulation (Yuan et al., 2020).
  • GarmentAligner: Addresses text-to-garment generation rather than image-based virtual try-on, focusing instead on retrieval-augmented latent diffusion for component-level semantic alignment (Zhang et al., 2024).

GarmentGAN—through its two-stage structure, hand-preserving masking, integrated geometric warping, and spatially modulated normalization—sets a reference point for robust photo-realistic garment transfer under challenging pose and occlusion conditions (Raffiee et al., 2020).
