- The paper introduces TryOffDiff, a diffusion-based model that reconstructs standardized garment images from photos of clothed people, enabling more robust person-to-person virtual try-on.
- It employs image conditioning via a frozen SigLIP encoder and a trainable adapter, using class embeddings to unify multiple garment categories.
- Experiments demonstrate that TryOffDiff outperforms baselines on metrics like DISTS and FID, confirming high-quality garment reconstruction.
This paper introduces TryOffDiff, a novel diffusion-based model for Virtual Try-Off (VTOFF), the task of generating standardized, e-commerce-style garment images from photos of people wearing those garments. VTOFF is presented as a complementary task to Virtual Try-On (VTON) and a crucial component for enabling more robust Person-to-Person Virtual Try-On (p2p-VTON), where a garment is transferred between images of two different people.
Problem: Traditional VTON requires a standardized garment image (e.g., from a catalog) and a person image. p2p-VTON instead uses an image of another person wearing the garment, which is harder because the garment is partially occluded and deformed on the wearer, so full garment information is not available. VTOFF aims to recover this standardized garment image, but existing methods are scarce, often limited to upper-body garments, and either struggle with reconstruction quality or require detailed text captions.
TryOffDiff Model:
- Architecture: Built upon a Latent Diffusion Model (LDM) framework, specifically adapting the Stable Diffusion v1.4 architecture.
- Image Conditioning: Instead of text prompts, TryOffDiff uses image conditioning. A frozen SigLIP image encoder extracts visual features from the input person image; these features, which preserve spatial information, are processed by a trainable adapter module (a linear layer followed by layer normalization) before being fed into the cross-attention layers of the denoising U-Net (see the conditioning sketch after this list). This lets the model reconstruct the garment from visual cues in the reference image alone.
- Multi-Garment Capability: To handle different garment types (upper-body, lower-body, dresses) within a single model, TryOffDiff incorporates class conditioning. Learnable embeddings corresponding to garment categories are added to the timestep embeddings within the U-Net's residual blocks, guiding the diffusion process to generate the specified garment type.
- Training: The U-Net is initialized with pretrained Stable Diffusion v1.4 weights and fine-tuned. The SigLIP encoder and VAE decoder remain frozen, while the adapter module and the class embedding layer are trained from scratch. The objective is the standard MSE loss between predicted and actual noise (see the training-step sketch after this list, which also wires in the class conditioning).
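A minimal PyTorch sketch of the image-conditioning path, using the Hugging Face `transformers` SigLIP encoder; the checkpoint name, feature dimensions, and adapter wiring are assumptions here, since the summary does not pin them down:

```python
# Sketch of SigLIP image conditioning; checkpoint and dims are assumptions.
import torch
import torch.nn as nn
from transformers import SiglipImageProcessor, SiglipVisionModel

CKPT = "google/siglip-base-patch16-224"  # assumed SigLIP variant

class ConditioningAdapter(nn.Module):
    """Trainable adapter (Linear + LayerNorm) mapping SigLIP patch features
    into the U-Net cross-attention space (768-d for SD v1.4)."""
    def __init__(self, siglip_dim: int = 768, cross_attn_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(siglip_dim, cross_attn_dim)
        self.norm = nn.LayerNorm(cross_attn_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(feats))

processor = SiglipImageProcessor.from_pretrained(CKPT)
encoder = SiglipVisionModel.from_pretrained(CKPT).eval()
encoder.requires_grad_(False)  # SigLIP stays frozen

adapter = ConditioningAdapter()  # trained from scratch

def encode_reference(person_image):
    """Patch-token features keep the spatial layout of the reference photo."""
    pixels = processor(images=person_image, return_tensors="pt").pixel_values
    with torch.no_grad():
        feats = encoder(pixel_values=pixels).last_hidden_state  # (1, N, 768)
    return adapter(feats)  # conditioning tokens for U-Net cross-attention
```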
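Likewise, a sketch of one training step under the setup above, assuming the `diffusers` Stable Diffusion v1.4 checkpoint; attaching the class embedding this way mirrors the described mechanism (a learnable embedding added to the timestep embedding) but is an illustrative choice, not the authors' code:

```python
# Sketch of one fine-tuning step; diffusers wiring details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

SD = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(SD, subfolder="vae").requires_grad_(False)
unet = UNet2DConditionModel.from_pretrained(SD, subfolder="unet")  # fine-tuned
scheduler = DDPMScheduler(num_train_timesteps=1000)

# Learnable garment-class embedding, added to the timestep embedding inside
# the residual blocks; 3 categories (upper-body, lower-body, dresses).
unet.class_embedding = nn.Embedding(3, unet.config.block_out_channels[0] * 4)

def training_step(garment_image, cond_tokens, class_ids):
    """garment_image: (B,3,H,W) target garment photo in [-1,1];
    cond_tokens: adapter output; class_ids: (B,) long tensor of categories."""
    with torch.no_grad():  # VAE stays frozen
        latents = vae.encode(garment_image).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=cond_tokens,
                class_labels=class_ids).sample
    return F.mse_loss(pred, noise)  # standard noise-prediction MSE
```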
Experiments & Results:
- Datasets: VITON-HD (upper-body) and Dress Code (upper-body, lower-body, dresses).
- Metrics: DISTS (primary), SSIM (and variants), LPIPS, FID, KID (an evaluation sketch follows the findings below).
- Baselines: Compared against TryOffAnyone and category-specific versions of TryOffDiff.
- Key Findings:
- TryOffDiff achieves state-of-the-art VTOFF results on VITON-HD, outperforming TryOffAnyone on most metrics, especially DISTS and FID.
- Using SigLIP as the image encoder yields better results than CLIP, particularly for perceptual quality.
- Fine-tuning a pretrained U-Net is significantly better than training from scratch.
- The multi-garment TryOffDiff model performs comparably to models trained specifically for each garment category on the Dress Code dataset, demonstrating its effectiveness as a unified solution.
- Integrating TryOffDiff (for VTOFF) with a VTON model (OOTDiffusion) yields a p2p-VTON pipeline that is competitive with specialized p2p-VTON models such as CatVTON (a composition sketch follows this list). Because the garment passes through a standardized intermediate image, this approach avoids transferring unwanted attributes (e.g., skin color) from the source person, although masking imperfections can leave residual garment artifacts.
- Qualitative Analysis: Visual results show TryOffDiff effectively reconstructs garment shape, texture, and patterns, even generalizing to unseen datasets. The multi-garment version shows strong performance across categories.
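For concreteness, an evaluation loop over these metrics could be assembled from off-the-shelf libraries; `piq` (for DISTS) and `torchmetrics` are assumptions here, not necessarily the authors' tooling:

```python
# Hypothetical evaluation loop; metric libraries are assumptions.
import torch
from piq import DISTS
from torchmetrics.image import (
    FrechetInceptionDistance,
    KernelInceptionDistance,
    LearnedPerceptualImagePatchSimilarity,
    StructuralSimilarityIndexMeasure,
)

dists = DISTS()  # primary metric in the paper
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)
fid = FrechetInceptionDistance(normalize=True)
kid = KernelInceptionDistance(subset_size=50, normalize=True)  # <= #samples

def update_metrics(pred, target):
    """pred/target: (B,3,H,W) float tensors in [0,1]."""
    fid.update(target, real=True); fid.update(pred, real=False)
    kid.update(target, real=True); kid.update(pred, real=False)
    return {
        "DISTS": dists(pred, target).item(),
        "SSIM": ssim(pred, target).item(),
        "LPIPS": lpips(pred, target).item(),
    }

# After iterating over the test set: fid.compute(), kid.compute()
```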
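The p2p-VTON pipeline from the findings above reduces to a two-stage composition; in this sketch, `tryoffdiff` and `ootd_tryon` are hypothetical placeholders for the two models' inference calls, not released APIs:

```python
# Hypothetical p2p-VTON composition; function names are placeholders.
def p2p_vton(source_person, target_person, garment_class):
    """Transfer the garment worn by source_person onto target_person."""
    # Step 1 (VTOFF): reconstruct a standardized, catalog-style garment image
    # from the source person photo. Conditioning only on the garment's visual
    # cues means source attributes (e.g., skin color) are not carried over.
    garment = tryoffdiff(source_person, garment_class)
    # Step 2 (VTON): dress the target person with the standardized garment
    # using an off-the-shelf try-on model such as OOTDiffusion.
    return ootd_tryon(target_person, garment, garment_class)
```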
Contributions:
- TryOffDiff: A novel diffusion model specifically designed and optimized for VTOFF, using SigLIP-based image conditioning.
- Multi-Garment VTOFF: The first model demonstrated to handle multiple garment categories (upper-body, lower-body, dresses) within a single framework using class conditioning.
- Enhanced p2p-VTON: Showcasing that combining VTOFF (TryOffDiff) with existing VTON models improves p2p-VTON by generating an intermediate standardized garment, reducing attribute transfer issues.
Conclusion: TryOffDiff advances the VTOFF task with a robust diffusion-based approach capable of multi-garment reconstruction. Its integration with VTON models provides a practical pipeline for p2p-VTON. Future work could focus on improving the preservation of fine details like logos and complex patterns.