- The paper introduces CatVTON, a lightweight VTON model that concatenates garment and person images as inputs to a single diffusion UNet, cutting 167.02M parameters relative to ReferenceNet-based baselines.
- The paper employs a parameter-efficient training strategy by fine-tuning only critical self-attention layers, significantly lowering computational costs.
- The paper validates CatVTON on datasets like VITON-HD, achieving superior metrics (FID 5.43, LPIPS 0.0565) and producing high-quality virtual try-on results.
CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models
Introduction
Virtual Try-On (VTON) is an essential technology for e-commerce, allowing users to see how garments would look on them before making a purchase. Traditional VTON methods typically follow a two-stage pipeline that first warps the garment and then blends it onto the target person image, which often yields unsatisfactory quality, especially for complex poses. More recent methods build on conditional generation with diffusion models, but they tend to incur high computational costs because they rely on additional network modules such as ReferenceNet and dedicated image encoders.
CatVTON: A Streamlined Approach
CatVTON significantly simplifies this landscape by proposing a lightweight and efficient VTON model. It uses a single UNet backbone to process both garment and person images, concatenated along the spatial dimensions as one input. This design obviates the need for ReferenceNet and additional image encoders, drastically reducing the model's parameter count and computational requirements.
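To make the core idea concrete, here is a minimal sketch (not the authors' code) of a spatially concatenated UNet input in PyTorch; the latent shapes, the choice of concatenation axis, and the commented-out UNet call are illustrative assumptions:

```python
import torch

# Hypothetical VAE latents for the masked person image and the garment
# reference, each shaped (batch, channels, height, width).
person_latent = torch.randn(1, 4, 64, 48)
garment_latent = torch.randn(1, 4, 64, 48)

# Concatenate along a spatial axis (here, width) -> (1, 4, 64, 96).
# A single UNet sees both images at once, and its self-attention layers
# exchange information between the two halves, so no ReferenceNet or
# extra image encoder is needed.
unet_input = torch.cat([person_latent, garment_latent], dim=-1)

# noise_pred = unet(unet_input, timestep).sample  # one backbone, one pass

# After denoising, only the person half of the canvas is kept and decoded.
person_half = unet_input[..., : person_latent.shape[-1]]
print(unet_input.shape, person_half.shape)
```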
Three key aspects showcase CatVTON's efficiency:
- Lightweight Network: By eliminating the ReferenceNet, the text encoder, and cross-attention layers, the model keeps only the essential diffusion modules, cutting 167.02M parameters. The total parameter count comes to 899.06M, of which only 49.57M are trainable, approximately 5.51% of the baseline models' parameters.
- Parameter-Efficient Training: CatVTON trains only the most relevant modules, identifying the self-attention layers in transformer blocks as sufficient for high-quality try-on results. This strategy preserves the prior knowledge of the pre-trained diffusion model while fine-tuning only the necessary components, yielding substantial computational savings (see the training sketch after this list).
- Simplified Inference: CatVTON eliminates conventional preprocessing steps such as pose estimation and human parsing. Inference requires only the garment reference, the target person image, and a mask (see the inference sketch below).
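As a concrete illustration of the training strategy, the following is a hedged sketch assuming a Stable Diffusion inpainting UNet from the diffusers library, where self-attention layers in transformer blocks are conventionally named `attn1` and cross-attention `attn2`; the model id and the naming convention are assumptions, not the paper's released code:

```python
from diffusers import UNet2DConditionModel

# Load a pretrained inpainting UNet (model id is illustrative).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet"
)

# Freeze everything, then leave gradients on only for self-attention.
# In diffusers transformer blocks, "attn1" is self-attention; "attn2"
# is cross-attention, which CatVTON removes along with text conditioning.
for name, param in unet.named_parameters():
    param.requires_grad = "attn1" in name

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable: {trainable / 1e6:.2f}M of {total / 1e6:.2f}M "
      f"({100 * trainable / total:.2f}%)")
```

If the naming assumption holds, the printed trainable count should land close to the 49.57M figure the paper reports, which is what makes the training loop cheap.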
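The inference sketch below shows how little input preparation this implies: just a person image, a garment image, and a binary mask. The file names and the `catvton_try_on` wrapper are hypothetical placeholders, not the project's actual API:

```python
import numpy as np
from PIL import Image

person = Image.open("person.jpg").convert("RGB")   # hypothetical inputs
garment = Image.open("garment.jpg").convert("RGB")
mask = Image.open("mask.png").convert("L")         # white where the garment goes

# Mask out the try-on region. Notably, no pose estimation or human
# parsing step is required to build this input.
person_np = np.asarray(person, dtype=np.float32) / 255.0
mask_np = (np.asarray(mask, dtype=np.float32) / 255.0)[..., None]
masked_person = person_np * (1.0 - mask_np)

# result = catvton_try_on(masked_person, garment, mask)  # hypothetical call
```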
Experimental Validation
CatVTON's performance was validated through extensive experiments on publicly available datasets such as VITON-HD, DressCode, and DeepFashion. The model demonstrated superior qualitative and quantitative results compared with state-of-the-art baselines.
Quantitative Results
In both paired and unpaired settings, CatVTON consistently outperformed contemporary methods on metrics such as SSIM, FID, KID, and LPIPS. For example, on the VITON-HD dataset, it achieved an FID of 5.43 and an LPIPS of 0.0565, surpassing IDM-VTON's FID of 5.76 and LPIPS of 0.0603.
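For context on what these numbers measure, here is a generic evaluation sketch using torchmetrics; the metric settings are common defaults and dummy tensors stand in for real data, so this is not necessarily the paper's exact protocol:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

fid = FrechetInceptionDistance(feature=2048)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex")

# Dummy batches standing in for ground-truth and generated try-on images.
real = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)

# FID compares feature statistics of the two image sets (expects uint8).
fid.update(real, real=True)
fid.update(fake, real=False)

# LPIPS compares image pairs perceptually (expects floats in [-1, 1]).
def to_float(x: torch.Tensor) -> torch.Tensor:
    return x.float() / 127.5 - 1.0

score = lpips(to_float(fake), to_float(real))
print(f"FID: {fid.compute():.2f}, LPIPS: {score:.4f}")
```

Lower is better for both metrics, which is why CatVTON's 5.43 FID and 0.0565 LPIPS improve on IDM-VTON's numbers.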
Qualitative Results
CatVTON produces high-quality images with consistent details and superior handling of complex patterns and text. It maintains photo-realism across various scenarios, from simple settings to complex in-the-wild environments, validating its robustness.
Implications and Future Directions
CatVTON's streamlined architecture and efficient training methodology render it particularly viable for practical deployment in virtual try-on systems within the e-commerce industry. By using fewer parameters and requiring less computational power, it democratizes access to high-quality virtual try-on technology. The model’s results in robust, in-the-wild scenarios suggest promising extensions to other applications requiring detailed image synthesis and conditional generation.
Future research could explore higher-resolution generation and further optimizations for real-time execution, building on CatVTON's efficient foundation. Future work could also address potential biases inherited from the pre-trained models and expand the datasets to ensure diversity and inclusiveness.
Conclusion
CatVTON represents a significant advancement in virtual try-on technology by simplifying the architecture and training process while maintaining and even improving the quality of generated try-ons. With its demonstrated efficiency and efficacy, CatVTON holds considerable potential for broader application within the fashion and e-commerce industries, contributing to more personalized and immersive customer experiences.