IDM-VTON: Diffusion Virtual Try-On
- The paper introduces a dual-encoding diffusion framework that explicitly incorporates high-level semantic and low-level pixel features for improved virtual try-on synthesis.
- Key methodology uses cross-attention and self-attention conditioning to fuse visual and textual cues, ensuring detailed garment fidelity and robust compositional accuracy.
- Empirical evaluations on benchmarks like VITON-HD and DressCode demonstrate state-of-the-art performance with superior LPIPS, SSIM, FID, and CLIP-I metrics.
IDM-VTON is a diffusion-based virtual try-on framework that enhances garment fidelity and synthesis authenticity in image-based try-on. It addresses major limitations of prior inpainting-driven and GAN-based approaches by explicitly incorporating high-level semantic and low-level pixel features of garments, together with advanced prompt engineering and lightweight adaptation strategies. The architecture is grounded in the SDXL latent diffusion model, with innovations in cross- and self-attention conditioning that combine visual and textual cues for rendering person–garment composites in diverse scenes, including in-the-wild datasets (Choi et al., 8 Mar 2024).
1. System Architecture and Conditioning Pathways
IDM-VTON is built atop a latent inpainting diffusion UNet based on the SDXL model. Incoming images, including the person ($x_p$), garment ($x_g$), and relevant masks, are first mapped into a shared latent space via a pretrained autoencoder $\mathcal{E}$. Diffusion is performed in this latent space, followed by decoding to pixel space using the decoder $\mathcal{D}$. The "TryonNet" backbone receives, at each timestep $t$, a channel-wise concatenation of (i) the noised person latent $z_t$ derived from $\mathcal{E}(x_p)$, (ii) a binary garment-region mask $m$ resized to the latent resolution, (iii) the masked-out person latent $\mathcal{E}((1-m)\odot x_p)$, and (iv) the DensePose latent $\mathcal{E}(x_{\text{pose}})$.
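As a concrete illustration, the following minimal PyTorch sketch (not the authors' code) shows how these four conditioning inputs could be assembled into the UNet's channel-wise input, assuming a diffusers-style `AutoencoderKL` as the latent encoder $\mathcal{E}$ and a mask that is 1 over the garment region; all tensor and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def build_tryonnet_input(z_t, mask, person_img, pose_img, vae):
    """Assemble the channel-wise concatenated input for the TryonNet UNet (sketch).

    z_t        : noised person latent, (B, 4, h, w)
    mask       : binary mask, 1 over the garment region, (B, 1, H, W)
    person_img : person image x_p in [-1, 1], (B, 3, H, W)
    pose_img   : DensePose rendering of x_p, (B, 3, H, W)
    vae        : pretrained (frozen) SDXL autoencoder, diffusers-style API assumed
    """
    # Masked-out person: zero the garment region before encoding.
    masked_person = person_img * (1.0 - mask)
    with torch.no_grad():
        z_masked = vae.encode(masked_person).latent_dist.sample() * vae.config.scaling_factor
        z_pose = vae.encode(pose_img).latent_dist.sample() * vae.config.scaling_factor
    # Resize the mask to the latent resolution.
    m = F.interpolate(mask, size=z_t.shape[-2:], mode="nearest")
    # With 4-channel latents: 4 + 1 + 4 + 4 = 13 input channels for the UNet's conv_in.
    return torch.cat([z_t, m, z_masked, z_pose], dim=1)
```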
Significantly, IDM-VTON augments the standard UNet with two garment encoding pathways:
- High-level semantics encoded using an IP-Adapter (CLIP-ViT-H/14), fed via additional cross-attention over the TryonNet layers.
- Low-level spatial features introduced through a parallel UNet encoder ("GarmentNet") fed with the garment latent $\mathcal{E}(x_g)$, whose intermediate activations are concatenated into TryonNet's self-attention blocks.
The design enables both holistic garment identity retention (semantics, shape, category) and crisp local detail transfer (texture, logos, patterns).
2. Garment Encoding Modules
The architecture employs two complementary garment encoding modules, each propagating garment information through a distinct attention mechanism:
- IP-Adapter for High-level Features: A frozen CLIP-ViT-H/14 image encoder produces a global embedding $c_g$ for $x_g$. Fine-tuned projection heads map $c_g$ to keys $K_g$ and values $V_g$ for each cross-attention block, yielding image-conditioned attention outputs:
$$\mathrm{Attn}_{\text{img}}(Q) = \mathrm{softmax}\!\left(\frac{Q K_g^{\top}}{\sqrt{d}}\right) V_g.$$
Here, $Q$ represents the per-location queries computed from the UNet hidden states.
- GarmentNet for Low-level Features: The frozen SDXL UNet encoder processes $\mathcal{E}(x_g)$, outputting per-layer features $g^{(l)}$ at each spatial resolution. In TryonNet, these are concatenated with the block activations $h^{(l)}$ and processed jointly through self-attention (see the sketch below):
$$\big[\hat{h}^{(l)} \,\|\, \hat{g}^{(l)}\big] = \mathrm{SelfAttn}\big(\big[h^{(l)} \,\|\, g^{(l)}\big]\big).$$
After the update, only the TryonNet half $\hat{h}^{(l)}$ is propagated to the next block.
This compositional encoding strategy permits granular control over visual fidelity at both semantic and pixel levels.
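The self-attention fusion with GarmentNet features can be sketched as follows. This is an illustrative approximation using a plain `nn.MultiheadAttention` in place of SDXL's actual attention blocks; module and tensor names are assumptions.

```python
import torch
import torch.nn as nn

class FusedSelfAttention(nn.Module):
    """Joint self-attention over TryonNet and GarmentNet tokens (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h_tryon: torch.Tensor, h_garment: torch.Tensor) -> torch.Tensor:
        # h_tryon   : TryonNet hidden states,   (B, N, C)
        # h_garment : GarmentNet hidden states, (B, N, C), same layer/resolution
        # Concatenate along the token (spatial) dimension so TryonNet queries
        # can attend to GarmentNet's low-level garment features.
        h = torch.cat([h_tryon, h_garment], dim=1)   # (B, 2N, C)
        h, _ = self.attn(h, h, h)                    # joint self-attention
        # Only the TryonNet half is propagated to the next block.
        return h[:, : h_tryon.shape[1], :]
```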
3. Attention Fusion and Prompt Conditioning
Attention fusion in IDM-VTON operates at two levels:
- Cross-attention Fusion: For each block, cross-attention is performed with both the text ($c_{\text{text}}$) and image ($c_{\text{img}}$) modalities, and their outputs are summed before being applied to the UNet hidden states:
$$\mathrm{Attn}_{\text{fused}}(Q) = \mathrm{Attn}(Q, K_{\text{text}}, V_{\text{text}}) + \mathrm{Attn}(Q, K_{\text{img}}, V_{\text{img}}),$$
with $K_{\text{text}}, V_{\text{text}}$ projected from the text tokens and $K_{\text{img}}, V_{\text{img}}$ from the IP-Adapter image tokens.
- Self-attention Injection: High-resolution garment features are concatenated with the current UNet activations within self-attention layers, enhancing transfer of spatially localized details.
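A minimal sketch of the summed dual cross-attention, again using plain `nn.MultiheadAttention` modules in place of the SDXL attention processors; names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Summed text + image (IP-Adapter-style) cross-attention (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden, text_ctx, image_ctx):
        # hidden    : UNet hidden states (queries),              (B, N, C)
        # text_ctx  : text-encoder tokens,                       (B, T, C)
        # image_ctx : projected CLIP image tokens (IP-Adapter),  (B, M, C)
        out_text, _ = self.text_attn(hidden, text_ctx, text_ctx)
        out_image, _ = self.image_attn(hidden, image_ctx, image_ctx)
        # Outputs are summed before being applied to the hidden states.
        return out_text + out_image
```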
Textual prompt engineering plays a critical role. Fine-grained garment captions, generated using a fashion attribute annotator (e.g., "short sleeve round neck t-shirt with graphic print"), are fed both as “a photo of [description]” to GarmentNet and “Model is wearing [description]” to TryonNet. This leverages the SDXL prior's text-image alignment, refining garment-person compositional plausibility and reducing ambiguity regarding silhouette and appearance.
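A tiny helper illustrating the two prompt templates quoted above; the exact phrasing and casing used in the released code may differ.

```python
def build_prompts(garment_caption: str) -> tuple[str, str]:
    """Construct the paired prompts from a fine-grained garment caption."""
    garmentnet_prompt = f"A photo of {garment_caption}"
    tryonnet_prompt = f"Model is wearing {garment_caption}"
    return garmentnet_prompt, tryonnet_prompt

# e.g. build_prompts("short sleeve round neck t-shirt with graphic print")
```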
4. Customization and Adaptive Fine-tuning
IDM-VTON supports practical, in-the-wild customization by fine-tuning the decoder (up-block) attention layers of TryonNet on a person–garment pair $(x_p, x_g)$. This personalizes the model to generate target composites matching a specific instance, with the optimization objective remaining the standard diffusion denoising loss:
$$\mathcal{L}_{\text{custom}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\Big[\big\|\epsilon - \epsilon_\theta(z_t, t, c(x_g))\big\|_2^2\Big].$$
This adaptation enables high-fidelity reconstructions of real-world pairings with only few-shot examples, retaining generalization and avoiding overfitting through targeted updates.
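A hedged sketch of how the trainable subset might be selected for this customization, assuming a diffusers-style `UNet2DConditionModel` parameter naming scheme; the precise layer subset in the paper's implementation may differ.

```python
import torch

def select_customization_params(tryonnet_unet):
    """Freeze all of TryonNet except the attention layers in its decoder (up) blocks.

    Assumes diffusers-style parameter names such as
    "up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q.weight".
    """
    trainable = []
    for name, param in tryonnet_unet.named_parameters():
        is_up_block_attn = name.startswith("up_blocks") and (".attn1." in name or ".attn2." in name)
        param.requires_grad_(is_up_block_attn)
        if is_up_block_attn:
            trainable.append(param)
    return trainable

# Example usage (illustrative learning rate):
# optimizer = torch.optim.AdamW(select_customization_params(unet), lr=1e-5)
```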
5. Training Procedure and Loss Formulation
The principal training objective is the standard $\epsilon$-prediction loss used in diffusion models. For each input latent $z_0$, noise $\epsilon \sim \mathcal{N}(0, I)$, and timestep $t$, the noised latent is constructed as
$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon.$$
The network $\epsilon_\theta$ is optimized to minimize the mean-squared difference to the true noise,
$$\mathcal{L} = \mathbb{E}_{z_0,\,\epsilon,\,t}\Big[\big\|\epsilon - \epsilon_\theta(z_t, t, c)\big\|_2^2\Big],$$
where $c$ collects the garment, mask, pose, and text conditions. No adversarial, perceptual, or explicit reconstruction losses are used. During inference, classifier-free guidance combines conditional and unconditional predictions:
$$\hat{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, \varnothing) + s\,\big(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing)\big),$$
with user-tunable guidance scale $s$.
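The loss and guidance rule can be sketched as follows, assuming diffusers-style scheduler and UNet interfaces and folding all conditioning tensors (mask, pose, garment features, prompts) into a single `cond` dict for brevity; this is an illustrative sketch, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def denoising_loss(unet, scheduler, z0, cond):
    """Standard epsilon-prediction loss (conditioning inputs folded into `cond`)."""
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (z0.shape[0],), device=z0.device)
    z_t = scheduler.add_noise(z0, noise, t)      # sqrt(a_bar_t)*z0 + sqrt(1 - a_bar_t)*noise
    eps_pred = unet(z_t, t, **cond).sample       # TryonNet epsilon prediction
    return F.mse_loss(eps_pred, noise)

def guided_noise(unet, z_t, t, cond, uncond, s=2.0):
    """Classifier-free guidance: shift the unconditional prediction toward the conditional one."""
    eps_uncond = unet(z_t, t, **uncond).sample
    eps_cond = unet(z_t, t, **cond).sample
    return eps_uncond + s * (eps_cond - eps_uncond)
```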
6. Empirical Evaluation and Benchmarking
IDM-VTON demonstrates state-of-the-art performance for image-based virtual try-on across standard benchmarks such as VITON-HD and DressCode:
| Dataset | Model | LPIPS ↓ | SSIM ↑ | FID ↓ | CLIP-I ↑ |
|---|---|---|---|---|---|
| VITON-HD | IDM-VTON | 0.102 | 0.870 | 6.29 | 0.883 |
| VITON-HD | StableVITON | 0.133 | 0.885 | 6.52 | 0.871 |
| DressCode | IDM-VTON | 0.062 | 0.920 | 8.64 | 0.904 |
| DressCode | StableVITON | 0.107 | 0.910 | 14.37 | 0.866 |
IDM-VTON achieves the lowest LPIPS and FID and the highest CLIP-I similarity on both benchmarks. In-the-wild robustness is validated via improvements in LPIPS and CLIP-I relative to prior models, and personalization via customization yields further qualitative gains, e.g., exact recovery of localized text, graphics, and silhouette details. Ablations indicate that GarmentNet enhances detail transfer and that elaborate captions further improve compositional integrity.
7. Contributions, Strengths, and Open Problems
IDM-VTON introduces several key advances: a dual-module garment information conditioning strategy (IP-Adapter for high-level semantics and GarmentNet for low-level detail), fine-grained prompt engineering leveraging SDXL priors, and an effective, decoder-only adaptation protocol for real-world customization. The approach yields state-of-the-art garment fidelity and compositional authenticity, with strong performance on standard and in-the-wild datasets and efficient adaptation requiring minimal parameter tuning.
Limitations include imperfect preservation of fine-grained human attributes (e.g., tattoos) within masked regions, and the current inability to perform fully text-controlled garment editing (e.g., prompt-based pattern or color changes). Adaptation to full-body outfits and guarantees of multi-view consistency also remain open challenges.
IDM-VTON establishes a rigorous foundation for authentic virtual try-on via diffusion, with architecture and conditioning innovations applicable to broader image compositing and conditional synthesis tasks (Choi et al., 8 Mar 2024).