GALA: Generating Animatable Layered Assets from a Single Scan (2401.12979v1)

Published 23 Jan 2024 in cs.CV

Abstract: We present GALA, a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars in any pose. Existing reconstruction approaches often treat clothed humans as a single layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is a challenging task because it requires the synthesis of plausible geometry and texture for the severely occluded regions. Moreover, even after successful decomposition, meshes are not normalized in terms of poses and body shapes, preventing coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as a geometry and appearance prior for humans and other assets. We first separate the input mesh using a 3D surface segmentation extracted from multi-view 2D segmentations. Then we synthesize the missing geometry of the different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once the high-fidelity 3D geometry is inpainted, we apply the same SDS loss to its texture to obtain the complete appearance, including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space, normalized in terms of poses and body shapes, hence supporting effortless composition with novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach for decomposition, canonicalization, and composition tasks compared to existing solutions.

Authors (4)
  1. Taeksoo Kim (6 papers)
  2. Byungjun Kim (10 papers)
  3. Shunsuke Saito (56 papers)
  4. Hanbyul Joo (37 papers)
Citations (5)

Summary

Introduction

In online environments, virtual try-on and avatar customization are areas of significant interest. 3D models of people can now be captured with relative ease, but the resulting meshes are often static and limited in terms of animation and customizability. Traditionally, creating animatable and layerable 3D assets has been a manual, time-consuming endeavor. Addressing this challenge, the paper introduces GALA (Generating Animatable Layered Assets from a Single Scan), a framework that automatically transforms single-layer 3D human scans into animatable, multi-layered 3D assets.

Approach

GALA deconstructs a single-layer mesh, typically derived from a clothed human 3D scan, into versatile layered assets through a multi-stage process. Its key advantage lies in generating fully decomposed assets, including the occluded regions, which are critical for seamless recomposition and animation. The framework relies on a pretrained 2D diffusion model, drawing on the vast image corpus it was trained on to complete the missing geometry and textures. GALA first separates the input mesh via a 3D surface segmentation extracted from multi-view 2D segmentations, then reconstructs high-fidelity geometry and texture in both posed and canonical spaces with a pose-guided Score Distillation Sampling (SDS) loss.
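To make the role of the pose-guided SDS objective more concrete, below is a minimal PyTorch sketch of one SDS step on a differentiably rendered view. The `diffusion_prior` interface (`add_noise`, `predict_noise`) and the text/pose conditioning arguments are illustrative assumptions, not the paper's actual API, and the timestep weighting term is omitted for brevity.

```python
import torch

def sds_step(diffusion_prior, rendered_view, text_emb, pose_cond, t_min=0.02, t_max=0.98):
    """One pose-guided SDS step (illustrative sketch, hypothetical prior interface)."""
    # Sample a diffusion timestep and corrupt the differentiable render with noise.
    t = torch.empty(1, device=rendered_view.device).uniform_(t_min, t_max)
    noise = torch.randn_like(rendered_view)
    noisy_view = diffusion_prior.add_noise(rendered_view, noise, t)  # assumed method

    # Query the frozen 2D prior, conditioned on the text prompt and a pose map.
    with torch.no_grad():
        pred_noise = diffusion_prior.predict_noise(noisy_view, t, text_emb, pose_cond)  # assumed method

    # SDS gradient: difference between predicted and injected noise, pushed back
    # through the renderer into the geometry/texture parameters.
    grad = (pred_noise - noise).detach()
    loss = (grad * rendered_view).sum()  # d(loss)/d(rendered_view) equals grad
    return loss
```

Backpropagating this loss updates the underlying 3D representation so that its renders move toward images the 2D prior considers plausible for the given prompt and pose.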

Method

GALA builds on a pretrained diffusion model and a geometric representation known as Deep Marching Tetrahedra (DMTet). It begins with geometric decomposition, relying on the 3D surface segmentation and the pose-guided SDS loss to model the separated layers in a canonical space that supports reanimation. GALA then generates textures, again applying the pose-guided SDS loss in canonical space to synthesize appearance for the occluded regions. Finally, composition includes a refinement step that optimizes vertex positions to minimize penetration and misalignment between layers. These steps yield 3D assets that can be rigged, posed, and layered effectively.
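As a rough illustration of the inter-layer refinement, the sketch below penalizes garment vertices that sink inside the underlying body layer. It assumes the body is available as a signed distance function `body_sdf` (negative inside the surface); this interface and the margin value are hypothetical stand-ins, not the paper's exact formulation.

```python
import torch

def penetration_penalty(garment_vertices, body_sdf, eps=2e-3):
    """Penalize garment vertices lying inside (or too close to) the body layer.

    garment_vertices: (N, 3) tensor of layer vertex positions being optimized.
    body_sdf: callable returning signed distances to the body surface,
              negative inside; this interface is an assumption for illustration.
    """
    d = body_sdf(garment_vertices)      # (N,) signed distances
    violation = torch.relu(eps - d)     # > 0 where a vertex is inside or within the margin
    return violation.pow(2).mean()      # smooth penalty pushing vertices outward
```

Minimizing this term alongside the other objectives nudges each outer layer to wrap the body without intersection, which is what makes mixing layers across identities feasible.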

Evaluation

The framework is evaluated through a series of experiments and benchmarks against existing solutions, showing stronger performance on decomposition, canonicalization, and composition tasks. With its refined optimization strategy, GALA maintains geometry and texture integrity even in heavily occluded regions. The resulting assets can be composed with different identities and animated across a range of poses while retaining a high level of realism.

Conclusion

GALA advances how avatars and digital apparel are generated, marking a step toward automated, high-fidelity 3D asset creation. The flexibility and quality of the output assets hold promise for a variety of applications, and the paper points to future work on pose-dependent deformations and reducing reliance on 2D segmentations. The benchmark and planned code release make GALA a useful resource for research on virtual human representations.