
WordRobe: Text-Guided Generation of Textured 3D Garments (2403.17541v2)

Published 26 Mar 2024 in cs.CV and cs.GR

Abstract: In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time as compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation & animation pipelines without any post-processing.

Text-Guided Generation and Editing of 3D Textured Garments with WordRobe

Introduction

With the surge in 3D content creation driven by applications in virtual try-on, gaming, and AR/VR, the demand for efficient methods to generate 3D garments has intensified. Traditional techniques rely either on manual design tools or on the digitization of real garments, both of which are resource-intensive and hard to scale. In contrast, recent advances in text-to-3D generation open avenues for user-friendly garment creation but often fall short of producing high-fidelity, open-surface 3D garments ready for integration into standard graphics pipelines.

WordRobe Framework

WordRobe addresses these challenges by introducing a novel framework for the text-driven generation of textured 3D garments. The framework comprises three main components:

  1. 3D Garment Latent Space: WordRobe models 3D garments as unsigned distance fields (UDFs) and learns a rich latent space of unposed garments via a two-stage (coarse-to-fine) encoder-decoder strategy. A novel disentanglement loss promotes better latent interpolation, enabling effective manipulation of garment attributes (a minimal decoder and loss sketch follows this list).
  2. CLIP-Guided Garment Generation: By aligning the garment latent space with the CLIP embedding space, WordRobe enables text-driven garment generation. A weakly supervised training scheme maps CLIP embeddings to garment latent codes, removing the need for manually annotated text-garment pairs (see the mapper sketch below).
  3. Texture Synthesis: Leveraging pre-trained text-to-image models, WordRobe synthesizes photorealistic textures in a single feed-forward step, markedly faster than existing state-of-the-art (SOTA) methods. It renders depth maps of the garment from front and back views and passes them to a depth-conditioned ControlNet, ensuring view-consistent texture generation (see the pipeline sketch below).
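To make the first component concrete, below is a minimal PyTorch sketch of a UDF decoder conditioned on a garment latent code, together with one plausible form of a latent-disentanglement regularizer (decorrelating latent dimensions across a batch). The architecture, dimensions, and exact loss form are illustrative assumptions; the paper's coarse-to-fine design may differ.

```python
import torch
import torch.nn as nn

class UDFDecoder(nn.Module):
    """Maps a garment latent code plus 3D query points to unsigned distances.
    A minimal sketch, not the paper's exact coarse-to-fine architecture."""
    def __init__(self, latent_dim: int = 128, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # unsigned distances are non-negative
        )

    def forward(self, z: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # z: (B, latent_dim) garment codes; xyz: (B, N, 3) query points
        z_exp = z.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        return self.net(torch.cat([z_exp, xyz], dim=-1)).squeeze(-1)  # (B, N)

def disentanglement_loss(z: torch.Tensor) -> torch.Tensor:
    """One plausible disentanglement regularizer: penalize off-diagonal batch
    covariance so latent dimensions vary independently (assumed form)."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / max(z.shape[0] - 1, 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return off_diag.pow(2).sum()
```

A decoder of this shape can be queried at arbitrary points, and the resulting field can be meshed with a UDF-aware mesher such as MeshUDF (reference 42), which preserves open surfaces.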
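For the second component, the sketch below maps a CLIP text embedding to a garment latent code with a small MLP. The checkpoint ID, embedding size, and mapper shape are assumptions for illustration; the paper trains such a mapping weakly supervised rather than from annotated text-garment pairs.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

class CLIPToLatentMapper(nn.Module):
    """Small MLP from the CLIP text-embedding space to the garment latent
    space (hypothetical dimensions; illustrative only)."""
    def __init__(self, clip_dim: int = 512, latent_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
mapper = CLIPToLatentMapper()

inputs = processor(text=["a sleeveless red summer dress"],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = clip.get_text_features(**inputs)  # (1, 512) CLIP text embedding
garment_latent = mapper(text_emb)                # (1, 128), fed to the UDF decoder
```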
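The third component can be approximated with an off-the-shelf depth-conditioned ControlNet via the diffusers library. The checkpoint IDs and file paths below are common public placeholders, not necessarily those used by the authors, and only a single view is shown; the paper processes front and back depth renders together to keep the two views consistent.

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Depth-conditioned ControlNet guiding Stable Diffusion (public checkpoints).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Rendered depth map of the garment's front view (placeholder path).
front_depth = load_image("garment_front_depth.png")

texture_view = pipe(
    "a floral summer dress, photorealistic fabric, studio lighting",
    image=front_depth,
    num_inference_steps=30,
).images[0]
texture_view.save("garment_front_texture.png")  # later baked into a UV texture map
```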

Performance and Contributions

  • Quantitative Evaluation: WordRobe outperforms current SOTA methods in learning 3D garment latent spaces, achieving significantly lower point-to-surface (P2S) and Chamfer distances, which indicates higher-quality garment geometry (both metrics are sketched after this list).
  • Disentanglement Loss: The proposed disentanglement loss yields a more structured latent space, conducive to better concept separation and latent interpolation.
  • Texture Synthesis Efficiency: Compared to Text2Tex, WordRobe's optimization-free texture synthesis provides better view consistency and runs significantly faster, making it a practical alternative for large-scale 3D garment generation.
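For reference, both geometry metrics reduce to nearest-neighbor distances between sampled point sets. Below is a minimal brute-force PyTorch sketch, suitable for modest point counts (the paper's exact sampling protocol is not specified here):

```python
import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds p1 (N, 3) and p2 (M, 3)."""
    d = torch.cdist(p1, p2)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def point_to_surface(points: torch.Tensor, surface_samples: torch.Tensor) -> torch.Tensor:
    """Approximate P2S distance: mean distance from each predicted point to a
    densely sampled ground-truth surface (sampling density is an assumption)."""
    return torch.cdist(points, surface_samples).min(dim=1).values.mean()
```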

Implications and Future Directions

WordRobe's efficient generation of high-quality, textured 3D garments from text prompts has practical implications for content creation in virtual environments. The unposed, production-ready garment meshes it produces feed directly into standard cloth simulation and animation pipelines, streamlining workflows in digital fashion and virtual-world creation.

The framework also opens avenues for future research, including the exploration of relighting to retain true albedo under varying lighting conditions and the extension to support layered clothing and material properties.

Conclusion

WordRobe marks a significant advance in the text-driven generation and editing of 3D garments, combining efficiency, quality, and practicality. Its contributions to learning a structured garment latent space and to view-consistent texture synthesis set new benchmarks in the field and open the door to further research in 3D content creation for virtual environments.

References (57)
  1. DreamBooth3D: Subject-driven text-to-3D generation. ICCV, 2023.
  2. Text2NeRF: Text-driven 3D scene generation with neural radiance fields. arXiv preprint arXiv:2305.11588, 2023.
  3. DreamHuman: Animatable 3D avatars from text. 2023.
  4. CLO. URL https://www.clo3d.com/en/.
  5. Artec3D. URL https://www.artec3d.com/portable-3d-scanners.
  6. BCNet: Learning body and cloth shape from a single image. In European Conference on Computer Vision. Springer, 2020.
  7. SMPLicit: Topology-aware generative model for clothed people, 2021.
  8. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015.
  9. DeepCloth: Neural garment representation for shape and style editing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1581–1593, 2023. doi: 10.1109/TPAMI.2022.3168569.
  10. Registering explicit to implicit: Towards high-fidelity garment mesh reconstruction from single images, 2022.
  11. xCloth: Extracting template-free textured 3D clothes from a monocular image. In Proceedings of the 30th ACM International Conference on Multimedia, 2022.
  12. DrapeNet: Garment generation and self-supervised draping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  13. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
  14. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023.
  15. DreamCraft3D: Hierarchical 3D generation with bootstrapped diffusion prior, 2023.
  16. Texture generation on 3D meshes with point-UV diffusion, 2023.
  17. TEXTure: Text-guided texturing of 3D shapes, 2023.
  18. Text2Tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396, 2023.
  19. Adding conditional control to text-to-image diffusion models, 2023.
  20. Multi-Garment Net: Learning to dress 3D people from images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019. doi: 10.1109/ICCV.2019.00552.
  21. Deep Fashion3D: A dataset and benchmark for 3D garment reconstruction from single images. arXiv preprint arXiv:2003.12753, 2020.
  22. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019. doi: 10.1109/ICCV.2019.00239.
  23. PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. doi: 10.1109/cvpr42600.2020.00016.
  24. PaMIR: Parametric model-conditioned implicit representation for image-based human reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. doi: 10.1109/TPAMI.2021.3050505.
  25. ECON: Explicit Clothed humans Optimized via Normal integration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
  26. CVIT. 3DHumans: A rich 3D dataset of scanned humans, 2021. URL http://cvit.iiit.ac.in/research/projects/cvit-projects/sharp-3dhumans-a-rich-3d-dataset-of-scanned-humans.
  27. Computational pattern making from 3D garment models. ACM Trans. Graph., 41(4), July 2022. ISSN 0730-0301. doi: 10.1145/3528223.3530145. URL https://doi.org/10.1145/3528223.3530145.
  28. Generating datasets of 3D garments with sewing patterns. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/013d407166ec4fa56eb1e1f8cbe183b9-Paper-round1.pdf.
  29. DreamFusion: Text-to-3D using 2D diffusion, 2022.
  30. TextDeformer: Geometry manipulation using text guidance, 2023.
  31. Deep Marching Tetrahedra: A hybrid representation for high-resolution 3D shape synthesis, 2021.
  32. High-resolution image synthesis with latent diffusion models, 2022.
  33. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  34. Denoising diffusion probabilistic models, 2020.
  35. Latent-NeRF for shape-guided generation of 3D shapes and textures. arXiv preprint arXiv:2211.07600, 2022.
  36. iNVS: Repurposing diffusion inpainters for novel view synthesis, 2023.
  37. EfficientDreamer: High-fidelity and robust 3D creation via orthogonal-view diffusion prior, 2023.
  38. MVDream: Multi-view diffusion for 3D generation, 2023.
  39. Objaverse: A universe of annotated 3D objects, 2022.
  40. Dynamic Graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 2019.
  41. Occupancy networks: Learning 3D reconstruction in function space, 2019.
  42. MeshUDF: Fast and differentiable meshing of unsigned distance field networks. In European Conference on Computer Vision, 2022.
  43. StyleCLIP: Text-driven manipulation of StyleGAN imagery, 2021.
  44. CLOTH3D: Clothed 3D humans. In European Conference on Computer Vision, pages 344–359. Springer, 2020.
  45. ULNeF: Untangled layered neural fields for mix-and-match virtual try-on. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  46. Multisource point clouds, point simplification and surface reconstruction. Remote Sensing, 11(22), 2019. ISSN 2072-4292. doi: 10.3390/rs11222659. URL https://www.mdpi.com/2072-4292/11/22/2659.
  47. A near-linear time algorithm for the Chamfer distance, 2023.
  48. ClipFace: Text-guided editing of textured 3D morphable models. arXiv preprint arXiv:2212.01406, 2022.
  49. Shap-E: Generating conditional 3D implicit functions, 2023.
  50. Neural representation of open surfaces. Computer Graphics Forum, 2023. ISSN 1467-8659. doi: 10.1111/cgf.14916.
  51. Modulating early visual processing by language, 2017.
  52. Decoupled weight decay regularization, 2019.
  53. OptCuts: Joint optimization of surface cuts and parameterization. ACM Transactions on Graphics, 37(6), 2018. doi: 10.1145/3272127.3275042.
  54. SNUG: Self-supervised neural dynamic garments, 2022.
  55. Neural cloth simulation. ACM Transactions on Graphics, 41(6):1–14, November 2022. ISSN 1557-7368. doi: 10.1145/3550454.3555491. URL http://dx.doi.org/10.1145/3550454.3555491.
  56. HOOD: Hierarchical graphs for generalized modelling of clothing dynamics. 2023.
  57. TripoSR: Fast 3D object reconstruction from a single image, 2024.
Authors (5)
  1. Astitva Srivastava
  2. Pranav Manu
  3. Amit Raj
  4. Varun Jampani
  5. Avinash Sharma