- The paper introduces M3D-VTON, a novel network that generates realistic 3D virtual try-on results from monocular images without relying on annotated 3D data.
- The methodology integrates a three-module approach—monocular prediction, depth refinement, and texture fusion—to accurately synthesize colored point clouds and 3D mesh reconstructions.
- The paper demonstrates superior performance with improved SSIM, FID, and depth error metrics on the MPV-3D dataset, reflecting enhanced texture realism and geometric accuracy.
Overview of M3D-VTON: A Monocular-to-3D Virtual Try-On Network
The paper introduces M3D-VTON, a novel network architecture designed to address the limitations of existing virtual try-on technologies. This method combines the strengths of both 2D and 3D approaches to achieve realistic 3D virtual try-on capabilities from monocular inputs. The proposed methodology fills a notable gap by eliminating the dependence on annotated 3D human data and garment templates, which often restrict the deployment of current systems in real-world applications.
The core of M3D-VTON consists of three modules: the Monocular Prediction Module (MPM), the Depth Refinement Module (DRM), and the Texture Fusion Module (TFM). The pipeline begins with MPM, which aligns the garment with the target person in 2D space using a novel two-stage warping procedure: a self-adaptive pre-alignment step first rescales and positions the garment appropriately, and a non-rigid transformation then deforms it, ensuring accurate texture placement and alignment. Subsequently, DRM refines the initial body depth estimate by adding geometric details such as clothing pleats and facial details, guided by a depth gradient constraint that improves the capture of high-frequency features. Finally, TFM fuses the warped clothing with the preserved non-target textures, combining this 2D information with the refined body depth to produce a colored point cloud from which the final 3D mesh is reconstructed.
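To make the self-adaptive pre-alignment in MPM more concrete, the following is a minimal sketch of how an in-shop garment could be rigidly rescaled and repositioned to overlap the target torso region before any non-rigid warping. The function names (`prealign_garment`, `mask_bbox`) and the bounding-box heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import cv2
import numpy as np

def mask_bbox(mask: np.ndarray):
    """Return the (x, y, w, h) bounding box of a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() - xs.min() + 1, ys.max() - ys.min() + 1

def prealign_garment(cloth: np.ndarray, cloth_mask: np.ndarray,
                     torso_mask: np.ndarray):
    """Rigidly rescale and translate the in-shop garment so its mask roughly
    overlaps the torso region of the target person, before a non-rigid
    (e.g. TPS) warp refines the fit."""
    cx, cy, cw, ch = mask_bbox(cloth_mask)
    tx, ty, tw, th = mask_bbox(torso_mask)
    # Heuristic: pick the scale that matches garment width to torso width.
    s = tw / float(cw)
    # Translation aligns the two bounding-box centers after scaling.
    dx = (tx + tw / 2.0) - s * (cx + cw / 2.0)
    dy = (ty + th / 2.0) - s * (cy + ch / 2.0)
    M = np.float32([[s, 0, dx], [0, s, dy]])
    h, w = torso_mask.shape
    cloth_aligned = cv2.warpAffine(cloth, M, (w, h), flags=cv2.INTER_LINEAR)
    mask_aligned = cv2.warpAffine(cloth_mask.astype(np.uint8), M, (w, h),
                                  flags=cv2.INTER_NEAREST)
    return cloth_aligned, mask_aligned
```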
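The depth gradient constraint in DRM can likewise be illustrated with a small PyTorch sketch. The paper's exact loss formulation is not reproduced here; this version simply penalizes the L1 difference between finite-difference gradients of the predicted and reference depth maps, which is one common way to encourage high-frequency detail such as pleats and facial relief.

```python
import torch
import torch.nn.functional as F

def depth_gradient_loss(pred_depth: torch.Tensor, gt_depth: torch.Tensor) -> torch.Tensor:
    """Illustrative depth-gradient constraint on (B, 1, H, W) depth maps:
    match the horizontal and vertical finite differences of the prediction
    to those of the reference depth."""
    def grads(d):
        dx = d[..., :, 1:] - d[..., :, :-1]   # horizontal gradient
        dy = d[..., 1:, :] - d[..., :-1, :]   # vertical gradient
        return dx, dy

    pdx, pdy = grads(pred_depth)
    gdx, gdy = grads(gt_depth)
    return F.l1_loss(pdx, gdx) + F.l1_loss(pdy, gdy)
```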
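Finally, the colored point cloud produced by TFM can be thought of as a back-projection of the fused texture through the estimated depth maps. The sketch below assumes an orthographic camera and hypothetical array layouts; it is not the paper's implementation.

```python
import numpy as np

def depth_to_colored_points(depth: np.ndarray, image: np.ndarray,
                            mask: np.ndarray, flip_z: bool = False):
    """Back-project an (H, W) depth map into a colored point cloud under an
    orthographic camera assumption: pixel coordinates give x/y, depth gives z.
    `image` is (H, W, 3) and supplies per-point RGB; `mask` selects the person
    region. Use flip_z=True for the back-view depth map so the front and back
    half-surfaces close into one body shell."""
    h, w = depth.shape
    ys, xs = np.nonzero(mask)
    zs = depth[ys, xs] * (-1.0 if flip_z else 1.0)
    # Normalize pixel coordinates to a roughly [-1, 1] range.
    pts = np.stack([(xs - w / 2.0) / (w / 2.0),
                    (h / 2.0 - ys) / (h / 2.0),
                    zs], axis=1)
    colors = image[ys, xs] / 255.0
    return pts, colors

# The front and back point sets can then be concatenated and meshed,
# e.g. with Poisson surface reconstruction in Open3D.
```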
The paper demonstrates the efficacy of M3D-VTON on the newly constructed MPV-3D dataset, a large collection of monocular try-on scenarios with synthesized 3D (depth) supervision. Experimentally, the network outperforms existing 3D try-on techniques in both texture and shape realism while remaining computationally efficient: it achieves better SSIM and FID scores than traditional 2D methods, and lower depth error than 3D benchmarks, reflecting improved geometric accuracy.
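For reference, the image-space and depth metrics mentioned above can be computed with standard tooling. The snippet below is a hedged sketch using scikit-image's SSIM plus a hypothetical masked mean-absolute-error for depth; FID additionally requires an Inception-based feature extractor and is omitted here.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def try_on_image_ssim(pred_rgb: np.ndarray, gt_rgb: np.ndarray) -> float:
    """SSIM between the synthesized try-on image and the reference image
    (both uint8 HxWx3); higher is better."""
    return ssim(pred_rgb, gt_rgb, channel_axis=-1, data_range=255)

def depth_mae(pred_depth: np.ndarray, gt_depth: np.ndarray,
              mask: np.ndarray) -> float:
    """Mean absolute depth error inside the person mask; lower is better."""
    diff = np.abs(pred_depth - gt_depth)
    return float(diff[mask > 0].mean())
```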
In terms of implications, M3D-VTON positions itself as a viable commercial and research solution, offering enhanced speed and precision for virtual fashion applications. The methodology’s reliance on single-view inputs and synthesized datasets potentially reduces deployment barriers, addressing the scalability issues faced by data-hungry contemporary systems.
Looking forward, the integration of M3D-VTON into broader AI systems for virtual modeling raises questions about its utility in domains beyond fashion, such as digital human creation in entertainment and real-time interactive applications. Future work might explore adaptive learning from more diverse datasets to improve generalization, or refinement of the depth estimation techniques to further improve accuracy and realism. The research sets a precedent for further exploration into AI-driven virtual try-on solutions that blend photorealistic representation with computational efficiency.