M2E-Try-On Net: Realistic Virtual Try-On
- The paper introduces a three-stage network (PAN, TRN, FTN) that synthesizes realistic try-on images from unstructured model images without needing catalog shots.
- It employs self-supervised paired and unpaired GAN training with pose-conditional and perceptual losses to ensure accurate pose alignment and detail preservation.
- The study notes limitations in complex pattern handling and suggests improvements in semantic parsing and extending the method to full-body apparel.
M2E-Try-On Net is a virtual try-on system designed to synthesize realistic images of a target person wearing garments taken from a different source "model" image, without requiring a clean, catalog-style product photograph of the clothing. Unlike prior paradigms that rely heavily on segmented or well-aligned product shots, M2E-Try-On Net operates directly on in-the-wild model images, addressing key challenges in pose alignment, texture preservation, and seamless compositing between heterogeneous identities. The method is structured as a sequential network pipeline, employs self-supervised learning from multi-pose datasets, and introduces adversarial and perceptual constraints tailored to the unpaired nature of fashion data (Wu et al., 2018).
1. Architectural Design and Subnetwork Workflow
M2E-Try-On Net decomposes the model-to-everyone try-on problem into three main stages:
- Pose Alignment Network (PAN): Aligns the model garment to the desired pose of the target person using DensePose-derived per-pixel UV correspondences. The model image is first warped to the target pose via barycentric interpolation across a body-part mesh, producing a coarse pose-warped image. This, together with the original model image and the DensePose fields, is input to a conditional generator performing further feature encoding, residual transformation, and upsampling to yield a pose-aligned model image. Training is adversarial, guided by a pose-conditional GAN loss of the standard conditional form

  $$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}\big[\log D(I, P)\big] + \mathbb{E}\big[\log\big(1 - D(\hat{I}, P)\big)\big],$$

  where $D$ is the discriminator receiving both RGB cues ($I$ real, $\hat{I}$ generated) and pose cues $P$.
- Texture Refinement Network (TRN): Directly warping and aligning the model image can blur garment textures. TRN addresses this by fusing the coarse pose-warped garment and the synthesized pose-aligned output using an automatically generated garment mask. The initial composite is passed to a lightweight U-Net that refines garment boundaries, restores high-frequency details, and outputs the blended image.
- Fitting Network (FTN): To synthesize the final try-on image while preserving the person's face, background, and identity, FTN composites the refined, re-posed garment onto the target person image. A region-of-interest mask covering the upper body and arms is derived by fusing semantic segmentations. The masked regions are concatenated and processed through an encoder–residual–decoder architecture to yield the final try-on result. The FTN is also trained adversarially and, in paired settings, with additional reconstruction and perceptual/style losses.
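The mask-based blending at the heart of TRN and FTN can be sketched in a few lines. The fragment below is an illustrative NumPy sketch of the compositing step only (the function name, shapes, and toy values are assumptions; the paper's networks learn the mask and refine the composite further):

```python
import numpy as np

def composite(warped_garment, base_image, mask):
    """Blend a warped garment into a base image using a soft garment mask.

    warped_garment, base_image: float arrays of shape (H, W, 3) in [0, 1].
    mask: float array of shape (H, W, 1) in [0, 1]; 1 selects garment pixels.
    """
    return mask * warped_garment + (1.0 - mask) * base_image

# Toy example: a 2x2 image whose left column comes from the garment.
garment = np.ones((2, 2, 3))   # all-white garment
person = np.zeros((2, 2, 3))   # all-black person image
mask = np.zeros((2, 2, 1))
mask[:, 0, :] = 1.0            # garment occupies the left column
out = composite(garment, person, mask)
```

Because the mask is soft (values in [0, 1]), the same expression also handles feathered garment boundaries, which is what the subsequent U-Net refinement benefits from.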
2. Training Paradigms: Self-supervised and Paired Strategies
Because direct supervision is unavailable (i.e., there are no triplets showing the same person imaged in both the original and the new garment), the method exploits two key training modalities:
- Unpaired GAN Training: Model/person pairs are randomly sampled, and only the pose-conditional GAN loss is imposed, fostering pose correctness and realism without access to ground-truth outputs.
- Self-supervised Paired Training: Many fashion datasets feature individuals photographed in the same outfit but with different poses. Such pairs are treated as pseudo ground-truth, enabling supervised losses:
- A pixel-wise reconstruction loss (e.g., $L_1$) between the output and the pseudo ground truth,
- A perceptual loss, derived from differences in features extracted from a pre-trained VGG-16,
- A style loss, based on Gram matrices computed from the same VGG feature layers.
- The total joint loss is a weighted combination of all components. The relative weights are set to emphasize adversarial and perceptual contributions, mitigating over-smoothing in the output.
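As a concrete sketch of how such a weighted joint objective combines its terms, the fragment below computes toy versions of the $L_1$, Gram-matrix style, and generator-side GAN terms in NumPy. The weights, shapes, and the non-saturating GAN form are illustrative assumptions, not values from the paper:

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute pixel/feature difference."""
    return np.abs(pred - target).mean()

def gram(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by its size."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)

def style_loss(pred_feat, target_feat):
    """Squared Frobenius distance between Gram matrices."""
    return ((gram(pred_feat) - gram(target_feat)) ** 2).sum()

def gan_loss(fake_scores):
    """Generator-side non-saturating GAN loss on discriminator probabilities."""
    return -np.log(fake_scores + 1e-8).mean()

rng = np.random.default_rng(0)
pred = rng.random((3, 8, 8))     # stand-in for one VGG feature layer
target = rng.random((3, 8, 8))
d_fake = rng.random(4)           # discriminator outputs for generated samples

# Illustrative weights emphasizing the adversarial and style terms.
total = 1.0 * gan_loss(d_fake) + 0.5 * l1_loss(pred, target) \
        + 10.0 * style_loss(pred, target)
```

In practice each loss is summed over several VGG layers; the single-layer version above just makes the Gram-matrix bookkeeping explicit.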
3. Implementation Specifications
Image preprocessing: All inputs are resized to a fixed resolution.
Optimization: The Adam optimizer with a fixed learning rate is used. The training schedule involves 80 epochs for PAN and 50 epochs each for TRN and FTN.
Subnetwork architectures:
- PAN and FTN employ a standard encoder (three stride-2 convolutional layers), 6–8 residual blocks, and a decoder (two deconvolutions, one convolutional layer, and a tanh output).
- TRN is a compact U-Net with four downsampling/upsampling stages and skip connections.
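The spatial bookkeeping in the TRN U-Net is easy to verify: each downsampling stage halves the resolution and each upsampling stage doubles it, with skips pairing levels of equal size. A minimal sketch, assuming a 256×256 input (the actual training resolution is not restated here) and exact halving/doubling per stage:

```python
def unet_shapes(h, w, depth=4):
    """Spatial sizes through a symmetric U-Net with `depth` down/up stages,
    assuming each stage exactly halves (down) or doubles (up) H and W."""
    down = [(h, w)]
    for _ in range(depth):          # encoder path
        h, w = h // 2, w // 2
        down.append((h, w))
    up = []
    for _ in range(depth):          # decoder path
        h, w = h * 2, w * 2
        up.append((h, w))
    return down, up

down, up = unet_shapes(256, 256)
# Skip connections concatenate encoder level i with the decoder level
# of matching spatial size, which is what preserves garment boundaries.
```

With four stages, the bottleneck sits at 1/16 of the input resolution, which is why the network stays lightweight while still seeing enough context to refine garment edges.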
- RoI mask production: The RoI for compositing is computed by fusing LIP_SSF human parsing and DensePose outputs, and refined by an additional supervised CNN.
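One simple way to fuse two binary region estimates into a single RoI is a pixel-wise union; the sketch below assumes this OR-style fusion for illustration (the paper's actual fusion is followed by a learned refinement CNN, which is not shown):

```python
import numpy as np

def fuse_roi(parsing_mask, densepose_mask):
    """Fuse a semantic-parsing mask with a DensePose body mask into one
    region-of-interest mask via pixel-wise union (logical OR)."""
    return np.logical_or(parsing_mask, densepose_mask).astype(np.float32)

# Toy 2x2 masks: parsing finds the torso pixel, DensePose finds an arm pixel.
parsing = np.array([[1, 0], [0, 0]], dtype=bool)
dense = np.array([[0, 1], [0, 0]], dtype=bool)
roi = fuse_roi(parsing, dense)
```

The union is a conservative choice: it keeps any pixel claimed by either estimator, leaving the downstream refinement network to trim spurious regions.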
4. Experimental Evaluation
Experiments are conducted on the following datasets:
- DeepFashion Women Tops (mini): 3,256 unpaired, 7,764 paired, 4,064 test images.
- MVC Women Tops (mini): 2,606 unpaired, 4,000 paired, 1,498 test images.
- MVC Women Pants: 3,715 unpaired, 5,423 paired, 650 test images.
Standard quantitative metrics such as SSIM or Inception Score are not reported. Instead, performance is evaluated via a user study on randomly sampled test images from the mini-DeepFashion dataset, with preference rates compared against several baseline methods:
| Method | User Preference (%) |
|---|---|
| GAN-VT | 0.8 |
| VITON | 8.5 |
| CP-VTON | 7.0 |
| M2E-Try-On | 83.7 |
Qualitative results demonstrate superior garment alignment to novel poses, better recovery of logo and texture detail, and more natural blending to the individual’s torso compared to these baselines (Wu et al., 2018).
5. Analysis of Capabilities, Limitations, and Prospective Enhancements
Strengths:
- Uniquely, M2E-Try-On Net synthesizes try-on images directly from photoshoot-style or in-situ model images, obviating the requirement for isolated product shots.
- The joint self-supervised training technique capitalizes on existing multi-pose captures in fashion datasets, enhancing generalizability and reducing annotation overhead.
- Pose-conditional adversarial objectives enforce both output realism and accurate pose mapping.
- The two-stage strategy for texture refinement coupled with perceptual and style constraints maintains garment textural fidelity.
Limitations:
- The pipeline can fail when complex garment graphics (such as face prints) are misclassified as body regions by parsing or pose estimation modules.
- Empirical validation is concentrated on upper-body clothing; lower garments and accessories are relatively unexplored.
- Reliance on user studies, rather than objective SSIM or LPIPS metrics, limits cross-comparison with other approaches.
Paths for Improvement:
- Incorporation of more robust semantic parsing that can distinguish intricate garment patterns from skin or other body regions would increase reliability.
- Extension to lower-body attire by introducing additional masking and compositing logic for hips, legs, and full-body clothing.
- Enabling video-clip processing with temporal consistency mechanisms such as optical flow.
- Interactive refinement based on sparse user feedback, enabling error correction in extreme or unusual poses.
6. Relevance and Positioning in Virtual Try-On Research
M2E-Try-On Net advances the paradigm of virtual try-on by circumventing product image constraints and leveraging self-supervision in a three-stage network involving dense pose alignment, high-frequency texture compensation, and robust fitting. Its architecture represents a convergence of geometric warping, generative adversarial settings, and multi-task learning driven by both paired and unpaired supervision. These features position M2E-Try-On Net as a reference solution for generating photo-realistic personalized try-on imagery from unstructured model data, with implications for e-commerce, visual search, and personalized recommendation systems (Wu et al., 2018).