
TryOnDiffusion: A Tale of Two UNets

Published 14 Jun 2023 in cs.CV and cs.GR | (2306.08276v1)

Abstract: Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person. A key challenge is to synthesize a photorealistic detail-preserving visualization of the garment, while warping the garment to accommodate a significant body pose and shape change across the subjects. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lack garment details. In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network. The key ideas behind Parallel-UNet include: 1) garment is warped implicitly via a cross attention mechanism, 2) garment warp and person blend happen as part of a unified process as opposed to a sequence of two separate tasks. Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively.


Summary

  • The paper introduces TryOnDiffusion, a novel diffusion model with a Parallel-UNet architecture that addresses the challenge of realistic garment warping and detail preservation in virtual try-on across varied poses and shapes.
  • TryOnDiffusion's Parallel-UNet uses implicit garment warping via cross-attention and unifies warping and blending into a single process for high-fidelity detail preservation.
  • Quantitative metrics and user studies show TryOnDiffusion significantly outperforms prior methods, demonstrating its potential for enhancing virtual fashion applications like online retail try-on.

Analysis of TryOnDiffusion: A Tale of Two UNets

The research paper "TryOnDiffusion: A Tale of Two UNets" presents a methodology for virtual apparel try-on that addresses two significant challenges: realistic garment warping and detail preservation under considerable changes in body pose and shape. The paper proposes TryOnDiffusion, a diffusion-based architecture built on a novel Parallel-UNet configuration that delivers state-of-the-art results on virtual try-on tasks.

The core challenge addressed by this work is maintaining the photorealistic details of garments while accommodating significant body shape and pose variations. Previous methodologies have either focused on detail preservation at the expense of adaptability to different poses and body shapes or allowed flexibility in poses while compromising on garment detail fidelity. In contrast, TryOnDiffusion aims to balance these two aspects using a dual UNet architecture.

The Parallel-UNet architecture achieves this balance through:

  1. Implicit Garment Warping: The system uses a cross-attention mechanism to implicitly warp garments to fit new body shapes and poses. This mechanism establishes long-range correspondences that effectively handle the occlusions and pronounced pose variations often encountered in virtual try-on tasks.
  2. Unified Warping and Blending Process: Unlike traditional methods that separate the garment warping and blending stages, TryOnDiffusion integrates them into a single process. This unified approach enables feature-level blending, which is crucial for high-fidelity detail preservation, as opposed to pixel-level post-process blending seen in some other techniques.
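The implicit-warping idea in point 1 can be illustrated with a minimal sketch: person-side features act as attention queries over garment-side features, so each person location aggregates garment information from wherever it lies in the garment image, with no explicit flow field. This is a simplified, single-head, projection-free stand-in (the names `cross_attention`, `person_feats`, and `garment_feats` are illustrative; the actual Parallel-UNet uses learned projections and multi-head attention inside the UNet blocks):

```python
import numpy as np

def cross_attention(person_feats, garment_feats, d_k):
    """Toy cross-attention: person tokens (queries) attend to garment tokens
    (keys/values). Shapes: person_feats (N_p, d), garment_feats (N_g, d).
    Omits the learned Q/K/V projections of a real attention layer."""
    # Scaled dot-product scores between every person and garment token
    scores = person_feats @ garment_feats.T / np.sqrt(d_k)   # (N_p, N_g)
    # Row-wise softmax: each person token gets a distribution over garment tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a garment-feature mixture -- an implicit "warp"
    return weights @ garment_feats                           # (N_p, d)

rng = np.random.default_rng(0)
person = rng.standard_normal((16, 8))    # 16 person-pixel tokens, dim 8
garment = rng.standard_normal((32, 8))   # 32 garment-pixel tokens, dim 8
warped = cross_attention(person, garment, d_k=8)
print(warped.shape)  # (16, 8)
```

Because the attention weights can route any garment token to any person token, correspondences are not restricted to local neighborhoods, which is what lets this formulation cope with large pose and shape changes.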

The paper reports that TryOnDiffusion is trained on a dataset of 4 million image pairs and produces high-resolution outputs at 1024x1024 pixels. The framework comprises three cascaded diffusion stages: a base diffusion model at 128x128 resolution followed by super-resolution stages at 256x256 and 1024x1024. Each stage iteratively refines the try-on image, enhancing both visual fidelity and detail accuracy.
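The three-stage cascade can be sketched as a simple pipeline. This is only a structural outline under strong simplifying assumptions: each real stage is a conditional diffusion model (the first two being Parallel-UNets), whereas here the `upsample` placeholder just performs nearest-neighbor resizing to show how resolutions chain from 128 to 256 to 1024:

```python
import numpy as np

def upsample(img, factor):
    """Nearest-neighbor resize, standing in for a learned super-resolution
    diffusion stage (a deliberate simplification)."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def run_cascade(seed=0):
    """Resolution flow of the cascade: 128x128 -> 256x256 -> 1024x1024."""
    rng = np.random.default_rng(seed)
    img = rng.random((128, 128, 3))   # stand-in for the base 128x128 output
    img = upsample(img, 2)            # stage 2: 128 -> 256
    img = upsample(img, 4)            # stage 3: 256 -> 1024
    return img

out = run_cascade()
print(out.shape)  # (1024, 1024, 3)
```

Splitting generation across resolutions like this is a common design choice for diffusion models, since it keeps the expensive base model small while later stages add detail.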

Quantitative assessments in the paper demonstrate significant improvements over previous methods such as TryOnGAN, SDAFN, and HR-VITON, with TryOnDiffusion achieving notably lower FID (Fréchet Inception Distance) and KID (Kernel Inception Distance) scores across test datasets. Complementing these results are extensive user studies involving over 2,800 samples, in which TryOnDiffusion's outputs were preferred over existing methods in over 92% of cases, further underscoring the model's effectiveness at generating realistic and detailed try-on images.

In terms of implications, TryOnDiffusion establishes a robust framework for virtual fashion applications, indicating potential enhancements in online retail experiences through improved virtual try-on systems. Theoretically, this research contributes significantly to the body of knowledge surrounding image-to-image translation tasks, specifically those requiring complex non-rigid transformations.

In the future, potential extensions of this work could include broader applications in video try-on, where the system's principles could be adapted to handle temporal consistency across frames. Additionally, exploring this approach in conjunction with dynamic backgrounds could further bolster the versatility and real-world applicability of such systems.

In summary, the research delivers substantial advancements in virtual garment try-on technology through a sophisticated UNet-based architecture. By combining garment warping and blending into a single cohesive model, it advances the state of the art in this domain.
