
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On (2305.13501v3)

Published 22 May 2023 in cs.CV, cs.AI, and cs.MM

Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task. Source code and trained models are publicly available at: https://github.com/miccunifi/ladi-vton.

Citations (70)

Summary

  • The paper introduces a novel method that uses latent diffusion models and textual inversion to enhance virtual try-on image synthesis.
  • It employs an enhanced autoencoder with Mask-Aware Skip Connections to reduce reconstruction errors and preserve detailed garment features.
  • The model achieves state-of-the-art performance with FID scores of 4.14 (paired) and 6.48 (unpaired) on the Dress Code dataset, demonstrating significant realism improvements.

Overview of LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

The paper introduces LaDI-VTON, a novel approach leveraging Latent Diffusion Models (LDMs) for the virtual try-on task. This method capitalizes on recent advances in diffusion models to offer significant improvements in image synthesis quality over traditional Generative Adversarial Networks (GANs). The core contributions of the paper include a textual inversion technique, an enhanced autoencoder with skip connections, and a refined warping process, collectively setting new benchmarks in virtual try-on performance.

Technical Highlights

LaDI-VTON is underpinned by several innovative components:

  1. Textual Inversion: The method introduces a forward-only textual inversion module that maps visual features of the in-shop garment to pseudo-word token embeddings in the CLIP token embedding space, strengthening the model's ability to retain fine garment details and textures during generation (a sketch of such a mapper follows this list).
  2. Enhanced Autoencoder: The authors propose Enhanced Mask-Aware Skip Connection (EMASC) modules to reduce the reconstruction errors introduced by the autoencoder of latent diffusion models. These modules preserve high-frequency image details, improving quality and realism in challenging areas such as hands and faces (see the second sketch below).
  3. Data Conditioning: LaDI-VTON conditions its latent diffusion model on essential information, including the warped garment and human pose data, so that the generated output respects the physical characteristics and pose of the target model (see the third sketch below).
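
To make the first component concrete, the following is a minimal PyTorch sketch of a forward-only inversion mapper: a frozen CLIP vision backbone (the Hugging Face transformers checkpoint named here is an illustrative choice) feeds an MLP whose output is reshaped into pseudo-word token embeddings. The GarmentInversionMapper name, the MLP shape, and num_tokens=16 are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
from transformers import CLIPVisionModel

class GarmentInversionMapper(nn.Module):
    """Forward-only textual inversion: one network pass, no per-image optimization."""

    def __init__(self, clip_dim=1024, token_dim=768, num_tokens=16):
        super().__init__()
        # Frozen CLIP vision backbone (checkpoint choice is illustrative)
        self.vision_encoder = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-large-patch14")
        self.vision_encoder.requires_grad_(False)
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        # MLP that projects garment features into the CLIP token-embedding space
        self.mapper = nn.Sequential(
            nn.Linear(clip_dim, token_dim * num_tokens),
            nn.GELU(),
            nn.Linear(token_dim * num_tokens, token_dim * num_tokens),
        )

    def forward(self, garment_pixels):
        # Pooled visual features of the in-shop garment image
        feats = self.vision_encoder(pixel_values=garment_pixels).pooler_output
        # Pseudo-word token embeddings that condition the generation process
        tokens = self.mapper(feats)
        return tokens.view(-1, self.num_tokens, self.token_dim)
```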
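
The EMASC idea can be sketched similarly as a mask-aware skip connection between the autoencoder's encoder and decoder. The module below is a hedged approximation: the convolutional form, the activation, and the convention that the mask equals 1 inside the try-on region are assumptions, not the paper's exact design.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskAwareSkipConnection(nn.Module):
    """Skip connection that forwards encoder detail outside the try-on region."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Learnable non-linear transform applied to the skipped feature
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, enc_feat, dec_feat, inpaint_mask):
        # Resize the binary try-on mask (assumed 1 inside the region to repaint)
        mask = F.interpolate(inpaint_mask, size=enc_feat.shape[-2:], mode="nearest")
        # Forward high-frequency details (hands, faces) only where the image
        # should stay untouched, then merge them into the decoder stream
        skip = self.conv(enc_feat) * (1.0 - mask)
        return dec_feat + skip
```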
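
Finally, a plausible way such conditioning could be assembled for the denoising network is sketched below; the channel layout, tensor shapes, and resizing choices are assumptions made for illustration and do not reflect the paper's exact specification.

```python
import torch
import torch.nn.functional as F

def build_unet_input(noisy_latent, masked_person_latent,
                     warped_garment_latent, inpaint_mask, pose_map):
    """Concatenate spatial conditioning signals along the channel axis.

    noisy_latent:          (B, 4, H, W) diffusion latent at the current timestep
    masked_person_latent:  (B, 4, H, W) VAE latent of the masked model image
    warped_garment_latent: (B, 4, H, W) VAE latent of the warped garment
    inpaint_mask:          (B, 1, h, w) binary try-on region mask
    pose_map:              (B, C, h, w) keypoint/pose representation
    """
    size = noisy_latent.shape[-2:]
    # Bring the pixel-space inputs down to the latent resolution
    mask = F.interpolate(inpaint_mask, size=size, mode="nearest")
    pose = F.interpolate(pose_map, size=size, mode="bilinear",
                         align_corners=False)
    return torch.cat([noisy_latent, mask, masked_person_latent,
                      pose, warped_garment_latent], dim=1)
```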

Results and Performance

The experimental evaluations conducted on the Dress Code and VITON-HD datasets demonstrate the superior performance of LaDI-VTON over existing state-of-the-art methods. In particular, it achieves notable improvements in realism metrics such as Fréchet Inception Distance (FID) and Kernel Inception Distance (KID), indicating a significant advancement in generating lifelike images.

Specifically, LaDI-VTON outperformed competitors in both paired and unpaired settings, achieving FID scores of 4.14 (paired) and 6.48 (unpaired) on the Dress Code dataset. Qualitative assessments further showed that it generates images closely matching the target garment's appearance while maintaining structural coherence with the human model.
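
For readers who want to run comparable realism measurements, the snippet below sketches FID and KID computation with the torchmetrics library; `batches` is a placeholder for an iterable of (real, generated) image batches, and this is not the paper's evaluation pipeline.

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)

# `batches` is a placeholder: an iterable of (real, generated) uint8
# image tensors of shape (B, 3, H, W) with values in [0, 255]
for real, fake in batches:
    fid.update(real, real=True)
    fid.update(fake, real=False)
    kid.update(real, real=True)
    kid.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()  # KID reports a mean and std over subsets
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```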

Implications and Future Directions

The implications of LaDI-VTON for virtual try-on and related fields are significant. Its use of LDMs represents a paradigm shift towards more robust and realistic image synthesis techniques, and its textual inversion and enhanced autoencoder methodologies set a new standard for incorporating detailed conditioning data into diffusion models.

This work opens potential pathways for further research. Future developments may explore refining textual inversion techniques to broaden their applicability across diverse generative tasks. Additionally, improvements in high-frequency detail preservation could bolster the efficacy of LDMs in other domains such as video synthesis and 3D modeling.

In summary, LaDI-VTON showcases a compelling application of diffusion models to virtual try-on scenarios, marking a substantive leap in both theoretical understanding and practical capability within the field of generative modeling.