WordRobe: Text-Guided Generation of Textured 3D Garments

Published 26 Mar 2024 in cs.CV and cs.GR | (2403.17541v2)

Abstract: In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time as compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation & animation pipelines without any post-processing.

Summary

The paper presents WordRobe, a novel framework that generates textured 3D garments using a two-stage encoder-decoder and disentanglement loss.
It aligns the learned 3D garment latent space with the CLIP embedding space and uses fast, optimization-free texture synthesis; the quality of the learned geometry is supported by lower Chamfer and Point-to-Surface distances than competing methods.
The framework produces production-ready outputs that seamlessly integrate into virtual try-on and digital fashion pipelines, paving the way for future 3D garment research.

Text-Guided Generation and Editing of 3D Textured Garments with WordRobe

Introduction

With the surge in 3D content creation driven by virtual try-on, gaming, and AR/VR applications, demand for efficient 3D garment generation has intensified. Traditional techniques rely either on manual design tools or on digitizing real garments, both of which are resource-intensive and difficult to scale. Recent text-to-3D methods make garment creation more accessible, but they often fail to produce high-fidelity, open-surface 3D garments that are ready for standard graphics pipelines.

WordRobe Framework

WordRobe addresses these challenges by introducing a novel framework for the text-driven generation of textured 3D garments. The framework comprises three main components:

3D Garment Latent Space: Utilizing a two-stage encoder-decoder strategy to model 3D garments as unsigned distance fields (UDFs), WordRobe learns a rich latent space of unposed garments. It employs a novel disentanglement loss to promote better latent interpolation, facilitating effective manipulation of garment attributes.
CLIP-Guided Garment Generation: By aligning the garment latent space with the CLIP embedding space, WordRobe enables text-driven garment generation and editing. The mapping from CLIP embeddings to garment latent codes is trained with weak supervision, which removes the need for manually annotated datasets.
Texture Synthesis: Leveraging pre-trained text-to-image models, WordRobe synthesizes photorealistic textures in a single feed-forward step, significantly enhancing efficiency compared to existing state-of-the-art (SOTA) methods. By rendering depth maps in front and back views and passing these to ControlNet, WordRobe ensures view-consistent texture generation.
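
To illustrate why an unsigned distance field suits open garment surfaces, the following minimal Python sketch approximates a UDF by the distance to the nearest sample on a flat open patch. The patch, the sampling density, and the function names sample_open_patch and unsigned_distance are illustrative assumptions, not part of WordRobe, which models the field through its learned encoder-decoder rather than by nearest-neighbor lookup.

```python
import math

def sample_open_patch(n=21):
    """Sample an n x n grid of points on the open square [0, 1]^2 at z = 0.

    Hypothetical stand-in for a garment surface: an open sheet with no interior.
    """
    step = 1.0 / (n - 1)
    return [(i * step, j * step, 0.0) for i in range(n) for j in range(n)]

def unsigned_distance(query, surface_points):
    """Approximate UDF value: distance from query to the nearest surface sample."""
    return min(math.dist(query, p) for p in surface_points)

patch = sample_open_patch()
above = unsigned_distance((0.5, 0.5, 0.2), patch)    # query above the sheet
below = unsigned_distance((0.5, 0.5, -0.2), patch)   # mirrored query below it
on_surface = unsigned_distance((0.5, 0.5, 0.0), patch)
```

Because the patch has no inside or outside, the queries above and below it yield the same value, and the field is zero on the surface itself. A signed distance field would need a closed surface to define the sign.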

Performance and Contributions

Quantitative Evaluation: WordRobe outperforms current SOTA methods in learning a 3D garment latent space. Specifically, it achieves lower Point-to-Surface and Chamfer distances than the compared baselines, indicating more accurate garment geometry.
Disentanglement Loss: The introduction of a novel disentanglement loss results in a more structured latent space, conducive to better concept separation and latent interpolation.
Texture Synthesis Efficiency: Compared to Text2Tex, WordRobe’s optimization-free texture synthesis method not only provides better view consistency but also operates significantly faster, making it a practical alternative for large-scale 3D garment generation.
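
The two geometry metrics named above can be sketched as follows. This is a minimal pure-Python version over small point sets, assuming unsquared Euclidean distances averaged per point and a ground-truth surface approximated by point samples. Papers differ in these conventions, and the exact variant used in the WordRobe evaluation is not stated here; the function names are hypothetical.

```python
import math

def _mean_nearest(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    return _mean_nearest(a, b) + _mean_nearest(b, a)

def point_to_surface(pred_points, gt_surface_samples):
    """One-directional distance from predicted points to a sampled ground-truth surface."""
    return _mean_nearest(pred_points, gt_surface_samples)

# Toy data (assumed): the prediction is the ground truth shifted by 0.5 along z.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]

cd = chamfer_distance(pred, gt)
p2s = point_to_surface(pred, gt)
```

Lower values of either metric mean the predicted geometry lies closer to the reference. The brute-force nearest-neighbor search here is quadratic in the number of points, so a spatial index would be needed for dense meshes.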

Implications and Future Directions

WordRobe's efficient generation of high-quality, textured 3D garments from text prompts has practical implications for content creation in virtual environments. Because the generated garment meshes are unposed, they can be fed directly to standard cloth simulation and animation pipelines without post-processing, which streamlines workflows in digital fashion and virtual-world creation.

The framework also opens avenues for future research, including the exploration of relighting to retain true albedo under varying lighting conditions and the extension to support layered clothing and material properties.

Conclusion

WordRobe advances text-driven generation and editing of 3D garments. Its structured garment latent space and single-pass, view-consistent texture synthesis improve on the compared methods in both quality and generation time, and they provide a basis for further research in 3D content creation for virtual environments.
