- The paper introduces a latent diffusion model that learns semantic correspondence between clothing and the body without a separate warping network.
- It refines clothing detail preservation using a novel attention mechanism combined with total variation loss, outperforming conventional methods on metrics like SSIM and FID.
- Experimental results show improved generalizability across diverse datasets, enhancing the practicality of virtual try-on systems in realistic fashion applications.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
The paper introduces "StableVITON," an approach that advances image-based virtual try-on by building on a large-scale pre-trained diffusion model. It addresses the central challenge of virtual try-on: preserving clothing details while exploiting the strong generative prior of diffusion models.
Overview
StableVITON builds on large-scale pre-trained diffusion models to generate high-fidelity images while retaining intricate clothing features. Traditional virtual try-on methods often rely on paired datasets and separate warping networks, which limit their generalizability to arbitrary person images and their ability to preserve backgrounds. StableVITON instead learns the try-on task directly in the latent space of the diffusion model, sidestepping these constraints.
Methodology
- Semantic Correspondence Learning: The core innovation of StableVITON is learning semantic correspondence between clothing and the human body inside the latent diffusion model. It introduces zero cross-attention blocks that inject the encoder's intermediate features into the U-Net of the diffusion model (sketched in the first code block after this list). This conditioning aligns the clothing precisely without an independent warping network.
- Attention Mechanism and Total Variation Loss: StableVITON improves clothing detail preservation with a novel attention total variation loss combined with data augmentation (see the second sketch after this list). The loss sharpens the attention maps, yielding high-fidelity detail in the generated images.
- Spatial Encoder Conditioning: StableVITON employs a spatial encoder that extracts clothing features to condition the generative U-Net, further improving alignment and detail fidelity.
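To make the conditioning concrete, here is a minimal PyTorch sketch of a zero-initialized cross-attention block in the spirit of the paper's description: U-Net features act as queries, spatial-encoder clothing features act as keys and values, and a zero-initialized output projection leaves the pre-trained U-Net's behavior untouched at the start of training. The class name, shapes, and residual placement are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class ZeroCrossAttentionBlock(nn.Module):
    """Illustrative sketch (not the official StableVITON code).

    Queries come from U-Net intermediate features; keys/values come from
    the spatial clothing encoder. The output projection is zero-initialized,
    so at step 0 the block acts as an identity and the pre-trained U-Net's
    generative prior is preserved.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)  # the "zero" in zero cross-attention
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, unet_feat: torch.Tensor, cloth_feat: torch.Tensor):
        # unet_feat:  (B, N, C) flattened person-side U-Net features (queries)
        # cloth_feat: (B, M, C) flattened clothing-encoder features (keys/values)
        q = self.norm_q(unet_feat)
        kv = self.norm_kv(cloth_feat)
        out, attn_weights = self.attn(q, kv, kv, need_weights=True)
        return unet_feat + self.proj_out(out), attn_weights
```

A usage example on dummy tensors:

```python
block = ZeroCrossAttentionBlock(dim=320)
person = torch.randn(2, 64 * 48, 320)  # flattened 64x48 latent feature map
cloth = torch.randn(2, 64 * 48, 320)
fused, attn = block(person, cloth)     # attn: (2, 3072, 3072)
```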
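The attention total variation loss can be sketched in the same spirit. The version below is one plausible formulation under assumptions of my own (an L1 smoothness penalty on per-query attention centers of mass); the paper's exact definition may differ. The intuition is that neighboring body locations should attend to neighboring clothing locations, which sharpens and regularizes the attention maps:

```python
import torch

def attention_tv_loss(attn: torch.Tensor, h: int, w: int,
                      kh: int, kw: int) -> torch.Tensor:
    """Hypothetical attention total-variation penalty (not the paper's exact loss).

    attn: (B, h*w, kh*kw) attention weights; queries index the person's
    latent grid, keys index the clothing latent grid. Each query's attended
    clothing coordinate (its center of mass over the keys) should vary
    smoothly across neighboring queries, discouraging scattered attention.
    """
    b = attn.shape[0]
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, kh),
        torch.linspace(0, 1, kw),
        indexing="ij",
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (kh*kw, 2)
    centers = attn @ coords.to(attn)   # (B, h*w, 2) attention centers of mass
    centers = centers.view(b, h, w, 2)
    tv_y = (centers[:, 1:] - centers[:, :-1]).abs().mean()        # vertical smoothness
    tv_x = (centers[:, :, 1:] - centers[:, :, :-1]).abs().mean()  # horizontal smoothness
    return tv_y + tv_x
```

With the dummy tensors above, `attention_tv_loss(attn, 64, 48, 64, 48)` would be added to the diffusion training objective with a small weight.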
Experimental Evaluation
The paper reports extensive experiments across datasets, evaluating StableVITON's ability to generate plausible try-on results in both single-dataset and cross-dataset settings. The results indicate superiority over existing GAN-based and diffusion-based methods, particularly for images with complex backgrounds or arbitrary person poses.
- Performance Metrics: StableVITON outperforms baseline methods on key metrics such as SSIM, LPIPS, FID, and KID, especially in cross-dataset evaluations, where models are tested on unseen datasets to assess generalization (a reproduction sketch of these metrics follows this list).
- Qualitative Analysis: Visual comparisons show StableVITON's proficiency in maintaining both the fidelity of the human figure and intricate clothing details, outperforming contemporary methods in challenging scenarios.
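All four reported metrics are available off the shelf; the sketch below uses torchmetrics (a choice of convenience, not tooling stated in the paper), with random tensors standing in for generated and ground-truth try-on images:

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# Stand-ins for generated and ground-truth images, (N, 3, H, W) in [0, 1].
# Tiny sample counts keep the demo fast; real evaluation needs many images.
fake = torch.rand(16, 3, 256, 256)
real = torch.rand(16, 3, 256, 256)

# Paired metrics: require aligned (generated, ground-truth) image pairs.
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
print("SSIM :", ssim(fake, real).item())   # higher is better
print("LPIPS:", lpips(fake, real).item())  # lower is better

# Distribution metrics: compare feature statistics, no pairing required.
fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=16)  # subset_size <= sample count
for metric in (fid, kid):
    metric.update((real * 255).to(torch.uint8), real=True)
    metric.update((fake * 255).to(torch.uint8), real=False)
print("FID  :", fid.compute().item())  # lower is better
kid_mean, kid_std = kid.compute()
print("KID  :", kid_mean.item())       # lower is better
```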
Implications and Future Work
The implications of StableVITON are significant for virtual try-on systems, offering enhanced detail preservation and increased generalizability. It opens new avenues for realistic and broadly applicable virtual fashion applications, from personal user experiences to scalable e-commerce solutions.
Future research could focus on improving accessory and occlusion management, potentially by integrating enhanced contextual understanding or external knowledge sources into the latent diffusion model.
Conclusion
StableVITON presents a compelling advance in virtual try-on technology, demonstrating state-of-the-art performance through the strategic use of pre-trained diffusion models. The paper contributes meaningfully to the evolving landscape of AI-driven fashion technology, setting the stage for future work that pushes the boundaries of realism and applicability in virtual clothing experiences.