- The paper introduces CatVTON, a lightweight VTON model that concatenates garment and person images as inputs to a single diffusion UNet, cutting 167.02M parameters relative to ReferenceNet-based baselines.
- The paper employs a parameter-efficient training strategy by fine-tuning only critical self-attention layers, significantly lowering computational costs.
- The paper validates CatVTON on datasets like VITON-HD, achieving superior metrics (FID 5.43, LPIPS 0.0565) and producing high-quality virtual try-on results.
CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models
Introduction
Virtual Try-On (VTON) is an essential technology for e-commerce, allowing users to see how garments would look on them before making a purchase. Traditional VTON methods typically follow a two-stage pipeline that first warps the garment and then blends it onto the target person image, which often yields unsatisfactory quality, especially for complex poses. More recent methods build on conditional generation with diffusion models, but they tend to incur high computational costs because they rely on additional network modules such as ReferenceNet and dedicated image encoders.
CatVTON: A Streamlined Approach
CatVTON significantly simplifies this landscape by proposing a lightweight and efficient VTON model. It uses a single UNet backbone to process both garment and person images, concatenated along the spatial dimensions as one input. This design obviates the need for ReferenceNet and additional image encoders, drastically reducing the model's parameter count and computational requirements.
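To make the core idea concrete, here is a minimal sketch (not the authors' code) of a spatially concatenated UNet input in PyTorch; the latent shapes, the choice of concatenation axis, and the commented-out UNet call are illustrative assumptions:

```python
import torch

# Hypothetical VAE latents for the masked person image and the garment
# reference, each shaped (batch, channels, height, width).
person_latent = torch.randn(1, 4, 64, 48)
garment_latent = torch.randn(1, 4, 64, 48)

# Concatenate along a spatial axis (here, width) -> (1, 4, 64, 96).
# A single UNet sees both images at once, and its self-attention layers
# exchange information between the two halves, so no ReferenceNet or
# extra image encoder is needed.
unet_input = torch.cat([person_latent, garment_latent], dim=-1)

# noise_pred = unet(unet_input, timestep).sample  # one backbone, one pass

# After denoising, only the person half of the canvas is kept and decoded.
person_half = unet_input[..., : person_latent.shape[-1]]
print(unet_input.shape, person_half.shape)
```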
Three key aspects showcase CatVTON's efficiency:
- Lightweight Network: By eliminating the ReferenceNet, the text encoder, and cross-attention layers, the model keeps only the essential diffusion modules, cutting 167.02M parameters. The total parameter count comes to 899.06M, of which only 49.57M are trainable, approximately 5.51% of the baseline models' parameters.
- Parameter-Efficient Training: CatVTON trains only the most relevant modules, identifying the self-attention layers in transformer blocks as sufficient for high-quality try-on results. This strategy preserves the prior knowledge of the pre-trained diffusion model while fine-tuning only the necessary components, yielding substantial computational savings (see the training sketch after this list).
- Simplified Inference: CatVTON eliminates conventional preprocessing steps such as pose estimation and human parsing. Inference requires only the garment reference, the target person image, and a mask (see the inference sketch below).
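As a concrete illustration of the training strategy, the following is a hedged sketch assuming a Stable Diffusion inpainting UNet from the diffusers library, where self-attention layers in transformer blocks are conventionally named `attn1` and cross-attention `attn2`; the model id and the naming convention are assumptions, not the paper's released code:

```python
from diffusers import UNet2DConditionModel

# Load a pretrained inpainting UNet (model id is illustrative).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet"
)

# Freeze everything, then leave gradients on only for self-attention.
# In diffusers transformer blocks, "attn1" is self-attention; "attn2"
# is cross-attention, which CatVTON removes along with text conditioning.
for name, param in unet.named_parameters():
    param.requires_grad = "attn1" in name

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable: {trainable / 1e6:.2f}M of {total / 1e6:.2f}M "
      f"({100 * trainable / total:.2f}%)")
```

If the naming assumption holds, the printed trainable count should land close to the 49.57M figure the paper reports, which is what makes the training loop cheap.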
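The inference sketch below shows how little input preparation this implies: just a person image, a garment image, and a binary mask. The file names and the `catvton_try_on` wrapper are hypothetical placeholders, not the project's actual API:

```python
import numpy as np
from PIL import Image

person = Image.open("person.jpg").convert("RGB")   # hypothetical inputs
garment = Image.open("garment.jpg").convert("RGB")
mask = Image.open("mask.png").convert("L")         # white where the garment goes

# Mask out the try-on region. Notably, no pose estimation or human
# parsing step is required to build this input.
person_np = np.asarray(person, dtype=np.float32) / 255.0
mask_np = (np.asarray(mask, dtype=np.float32) / 255.0)[..., None]
masked_person = person_np * (1.0 - mask_np)

# result = catvton_try_on(masked_person, garment, mask)  # hypothetical call
```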
Experimental Validation
CatVTON's performance was validated through extensive experiments on publicly available datasets such as VITON-HD, DressCode, and DeepFashion. The model demonstrated superior qualitative and quantitative results compared with state-of-the-art baselines.
Quantitative Results
In both paired and unpaired settings, CatVTON consistently outperformed contemporary methods on metrics such as SSIM, FID, KID, and LPIPS. For example, on the VITON-HD dataset, it achieved an FID of 5.43 and an LPIPS of 0.0565, surpassing IDM-VTON's FID of 5.76 and LPIPS of 0.0603.
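For context on what these numbers measure, here is a generic evaluation sketch using torchmetrics; the metric settings are common defaults and dummy tensors stand in for real data, so this is not necessarily the paper's exact protocol:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

fid = FrechetInceptionDistance(feature=2048)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex")

# Dummy batches standing in for ground-truth and generated try-on images.
real = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)

# FID compares feature statistics of the two image sets (expects uint8).
fid.update(real, real=True)
fid.update(fake, real=False)

# LPIPS compares image pairs perceptually (expects floats in [-1, 1]).
def to_float(x: torch.Tensor) -> torch.Tensor:
    return x.float() / 127.5 - 1.0

score = lpips(to_float(fake), to_float(real))
print(f"FID: {fid.compute():.2f}, LPIPS: {score:.4f}")
```

Lower is better for both metrics, which is why CatVTON's 5.43 FID and 0.0565 LPIPS improve on IDM-VTON's numbers.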
Qualitative Results
CatVTON produces high-quality images with consistent details and superior handling of complex patterns and text. It maintains photo-realism across various scenarios, from simple settings to complex in-the-wild environments, validating its robustness.
Implications and Future Directions
CatVTON's streamlined architecture and efficient training methodology render it particularly viable for practical deployment in virtual try-on systems within the e-commerce industry. By using fewer parameters and requiring less computational power, it democratizes access to high-quality virtual try-on technology. The model’s results in robust, in-the-wild scenarios suggest promising extensions to other applications requiring detailed image synthesis and conditional generation.
Future research could explore higher-resolution generation and further optimizations for real-time execution, building on CatVTON's efficient foundation. Future work could also address potential biases inherited from the pre-trained models and expand the datasets to ensure diversity and inclusiveness.
Conclusion
CatVTON represents a significant advancement in virtual try-on technology by simplifying the architecture and training process while maintaining and even improving the quality of generated try-ons. With its demonstrated efficiency and efficacy, CatVTON holds considerable potential for broader application within the fashion and e-commerce industries, contributing to more personalized and immersive customer experiences.