Text-Guided Generation and Editing of 3D Textured Garments with WordRobe
Introduction
With the surge in 3D content creation driven by applications in virtual try-on, gaming, and AR/VR, the demand for efficient methods to generate 3D garments has intensified. Traditional techniques rely either on manual design tools or on the digitization of real garments, both of which are resource-intensive and difficult to scale. In contrast, recent advances in text-to-3D generation open avenues for user-friendly garment creation but often fall short of producing high-fidelity, open-surface 3D garments ready for integration into standard graphics pipelines.
WordRobe Framework
WordRobe addresses these challenges by introducing a novel framework for the text-driven generation of textured 3D garments. The framework comprises three main components:
- 3D Garment Latent Space: Using a two-stage encoder-decoder strategy that models 3D garments as unsigned distance fields (UDFs), WordRobe learns a rich latent space of unposed garments. A novel disentanglement loss promotes better latent interpolation, facilitating effective manipulation of garment attributes (see the decoder sketch after this list).
- CLIP-Guided Garment Generation: By aligning the garment latent space with the CLIP embedding space, WordRobe enables text-driven garment generation. A weakly-supervised training scheme for mapping CLIP embeddings to garment latent codes eliminates the need for manually annotated datasets (see the mapper sketch after this list).
- Texture Synthesis: Leveraging pre-trained text-to-image models, WordRobe synthesizes photorealistic textures in a single feed-forward pass, a significant efficiency gain over existing state-of-the-art (SOTA) methods. By rendering depth maps of the garment in front and back views and passing them to ControlNet, WordRobe keeps the generated texture consistent across views (see the pipeline sketch after this list).
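To make the first component concrete, below is a minimal sketch of a latent-conditioned UDF decoder of the kind the framework describes. The network width, depth, and activations are illustrative assumptions, not WordRobe's published architecture; the two-stage encoder and the step that extracts an open-surface mesh from the UDF are omitted.

```python
import torch
import torch.nn as nn

class GarmentUDFDecoder(nn.Module):
    """Maps a garment latent code plus a 3D query point to an unsigned
    distance. Widths and depth here are illustrative placeholders, not
    WordRobe's published architecture."""

    def __init__(self, latent_dim: int = 128, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # UDF values are non-negative
        )

    def forward(self, latent: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # latent: (B, latent_dim), points: (B, N, 3) -> (B, N) distances
        z = latent.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([z, points], dim=-1)).squeeze(-1)

decoder = GarmentUDFDecoder()
z = torch.randn(1, 128)                    # a garment latent code
pts = torch.rand(1, 4096, 3) * 2.0 - 1.0   # query points in [-1, 1]^3
udf = decoder(z, pts)                      # (1, 4096) unsigned distances
```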
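For the second component, one plausible realization of the CLIP-to-latent mapping is a small MLP that regresses a garment latent code from a CLIP text embedding. The dimensions below (a 512-d embedding, as produced by CLIP ViT-B/32, and a 128-d latent) and the architecture are assumptions for illustration; the weakly-supervised scheme used to obtain training pairs without manual labels is not reproduced here.

```python
import torch
import torch.nn as nn

class CLIPToGarmentMapper(nn.Module):
    """Regresses a garment latent code from a CLIP text embedding.
    Dimensions and depth are illustrative, not WordRobe's exact design."""

    def __init__(self, clip_dim: int = 512, latent_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.mlp(clip_embedding)

# Stand-in for a text embedding precomputed with a CLIP text encoder.
text_emb = torch.randn(1, 512)
mapper = CLIPToGarmentMapper()
garment_code = mapper(text_emb)  # can be fed to the UDF decoder above
```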
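The texture-synthesis step builds on a publicly documented recipe: depth-conditioned ControlNet with Stable Diffusion, available through the diffusers library. The checkpoints and file paths below are common public choices and hypothetical inputs, not necessarily what WordRobe uses, and the WordRobe-specific step that merges the generated views into one view-consistent UV texture is only gestured at in a comment.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Public checkpoints used here for illustration only.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")  # assumes a CUDA-capable GPU

# Depth maps rendered from the garment mesh; hypothetical file paths.
front_depth = load_image("front_depth.png")
back_depth = load_image("back_depth.png")

prompt = "a red floral summer dress, photorealistic fabric"
views = pipe(prompt=[prompt] * 2, image=[front_depth, back_depth]).images
# The two generated views are then fused into a single UV texture map
# (the fusion step is WordRobe-specific and not shown here).
```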
Performance and Contributions
- Quantitative Evaluation: WordRobe outperforms current SOTA methods in learning 3D garment latent spaces, achieving significantly lower Point-to-Surface distance and Chamfer Distance, which indicates higher-quality garment geometry (both metrics are sketched after this list).
- Disentanglement Loss: The novel disentanglement loss yields a more structured latent space, conducive to better concept separation and latent interpolation (a stand-in formulation is sketched after this list).
- Texture Synthesis Efficiency: Compared to Text2Tex, WordRobe’s optimization-free texture synthesis method not only provides better view consistency but also operates significantly faster, making it a practical alternative for large-scale 3D garment generation.
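For reference, here is a sketch of the two geometry metrics named above, in one common formulation. Exact conventions (squared versus unsquared distances, normalization, sample counts) vary between papers, so this may not match WordRobe's evaluation protocol exactly.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3),
    using mean squared nearest-neighbour distance in both directions."""
    d_ab, _ = cKDTree(b).query(a)  # nearest b-point for each a-point
    d_ba, _ = cKDTree(a).query(b)  # nearest a-point for each b-point
    return float(np.mean(d_ab ** 2) + np.mean(d_ba ** 2))

def point_to_surface(points: np.ndarray, surface_samples: np.ndarray) -> float:
    """Approximate point-to-surface distance: mean distance from each
    predicted point to a dense sampling of the ground-truth surface."""
    d, _ = cKDTree(surface_samples).query(points)
    return float(np.mean(d))

pred = np.random.rand(2048, 3)  # points sampled from a generated garment
gt = np.random.rand(4096, 3)    # points sampled from the ground truth
print(chamfer_distance(pred, gt), point_to_surface(pred, gt))
```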
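The paper's exact disentanglement loss is not reproduced here. As a stand-in, the following shows one standard decorrelation-style regularizer (in the spirit of DIP-VAE-like penalties) that pushes latent dimensions toward capturing separate factors, which is the behavior described above.

```python
import torch

def covariance_disentanglement_penalty(z: torch.Tensor) -> torch.Tensor:
    """Generic decorrelation penalty: drive the off-diagonal entries of
    the batch covariance of latent codes toward zero, so each dimension
    tends to capture a separate factor of variation. This is a standard
    recipe, not WordRobe's published loss."""
    z = z - z.mean(dim=0, keepdim=True)   # center each dimension
    cov = (z.T @ z) / (z.shape[0] - 1)    # (D, D) batch covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum()

codes = torch.randn(32, 128, requires_grad=True)  # a batch of latent codes
loss = covariance_disentanglement_penalty(codes)
loss.backward()
```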
Implications and Future Directions
WordRobe's efficient generation of high-quality, textured 3D garments from text prompts has practical implications for content creation in virtual environments. The production-ready garment meshes it outputs integrate directly with cloth simulation and animation pipelines, streamlining workflows in digital fashion and virtual-world creation.
The framework also opens avenues for future research, including relighting support that recovers true albedo under varying lighting conditions, and extensions to layered clothing and material properties.
Conclusion
WordRobe marks a significant advance in the text-driven generation and editing of 3D garments, combining strong efficiency, quality, and practicality. Its contributions to learning a structured garment latent space and to view-consistent texture synthesis set new benchmarks in the field, fueling further research and development in 3D content creation for virtual environments.