
CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models (2407.15886v1)

Published 21 Jul 2024 in cs.CV and cs.AI

Abstract: Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or use additional image encoders to process condition inputs, leading to high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and innovate the interaction between garment and person by proposing CatVTON, a simple and efficient virtual try-on diffusion model. CatVTON facilitates the seamless transfer of in-shop or worn garments of any category to target persons by simply concatenating them in spatial dimensions as inputs. The efficiency of our model is demonstrated in three aspects: (1) Lightweight network: Only the original diffusion modules are used, without additional network modules. The text encoder and cross-attentions for text injection in the backbone are removed, reducing the parameters by 167.02M. (2) Parameter-efficient training: We identified the try-on relevant modules through experiments and achieved high-quality try-on effects by training only 49.57M parameters, approximately 5.51 percent of the backbone network's parameters. (3) Simplified inference: CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only a garment reference, target person image, and mask for the virtual try-on process. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON shows good generalization in in-the-wild scenarios despite using open-source datasets with only 73K samples.

Summary

  • The paper introduces CatVTON, a virtual try-on diffusion model that processes garment and person images by concatenating them spatially as input to a single backbone, removing the text encoder and cross-attention to cut parameters by 167.02M.
  • The paper employs a parameter-efficient training strategy by fine-tuning only critical self-attention layers, significantly lowering computational costs.
  • The paper validates CatVTON on datasets like VITON-HD, achieving superior metrics (FID 5.43, LPIPS 0.0565) and producing high-quality virtual try-on results.

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Introduction

Virtual Try-On (VTON) is an essential technology for e-commerce, allowing users to see how garments would look on them before making a purchase. Traditional VTON methods often involve a two-stage process, including warping garments and blending them onto a target image, which can result in unsatisfactory quality, especially for complex poses. Recent advancements have been made using conditional generation methods based on diffusion models; however, these methods typically suffer from high computational costs due to the necessity of additional network modules such as ReferenceNet and image encoders.

CatVTON: A Streamlined Approach

CatVTON introduces a significant simplification in this landscape by proposing a lightweight and efficient VTON model. The model leverages a single UNet backbone to process both garment and person images, concatenated along spatial dimensions as inputs. This novel approach obviates the need for ReferenceNet and additional image encoders, thereby drastically reducing the model's parameters and computational requirements.

Three key aspects showcase CatVTON's efficiency:

  1. Lightweight Network: By eliminating the ReferenceNet, text encoder, and cross-attentions, the model retains only the essential diffusion modules, reducing parameters by 167.02M. The total parameter count is 899.06M, of which only 49.57M are trainable, approximately 5.51% of the backbone network's parameters.
  2. Parameter-Efficient Training: CatVTON trains only the most pertinent modules, identifying the self-attention layers in transformer blocks as sufficient for high-quality try-on results (see the sketch after this list). This strategy preserves prior knowledge while fine-tuning only the necessary components, yielding substantial computational savings.
  3. Simplified Inference: CatVTON eliminates conventional preprocessing steps like pose estimation and human parsing. The inference process only requires the garment reference, target person image, and mask.

Experimental Validation

CatVTON's performance was validated through extensive experiments on publicly available datasets such as VITON-HD, DressCode, and DeepFashion. The model demonstrated superior qualitative and quantitative outcomes compared to state-of-the-art baseline methods.

Quantitative Results

In both paired and unpaired settings, CatVTON consistently outperformed contemporary methods on metrics such as SSIM, FID, KID, and LPIPS. For example, on the VITON-HD dataset, CatVTON achieved an FID of 5.43 and an LPIPS of 0.0565, surpassing IDM-VTON, which recorded an FID of 5.76 and an LPIPS of 0.0603.

Qualitative Results

CatVTON produces high-quality images with consistent details and superior handling of complex patterns and text. It maintains photo-realism across various scenarios, from simple to complex in-the-wild environments, validating its robustness.

Implications and Future Directions

CatVTON's streamlined architecture and efficient training methodology render it particularly viable for practical deployment in virtual try-on systems within the e-commerce industry. By using fewer parameters and requiring less computational power, it democratizes access to high-quality virtual try-on technology. The model’s results in robust, in-the-wild scenarios suggest promising extensions to other applications requiring detailed image synthesis and conditional generation.

Future research could explore higher-resolution generation and further optimizations for real-time execution, building on CatVTON's efficient foundation. Future work could also address potential biases inherited from the pre-trained models and expand the training data to improve diversity and inclusiveness.

Conclusion

CatVTON represents a significant advancement in virtual try-on technology by simplifying the architecture and training process while maintaining and even improving the quality of generated try-ons. With its demonstrated efficiency and efficacy, CatVTON holds considerable potential for broader application within the fashion and e-commerce industries, contributing to more personalized and immersive customer experiences.
