GALA: Generating Animatable Layered Assets from a Single Scan (2401.12979v1)
Abstract: We present GALA, a framework that takes a single-layer clothed 3D human mesh as input and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars in any pose. Existing reconstruction approaches often treat clothed humans as a single layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is challenging because it requires synthesizing plausible geometry and texture for severely occluded regions. Moreover, even after a successful decomposition, the meshes are not normalized in terms of pose and body shape, which prevents coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as a geometry and appearance prior for humans and other assets. We first separate the input mesh using a 3D surface segmentation extracted from multi-view 2D segmentations. We then synthesize the missing geometry of the different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once the high-fidelity 3D geometry has been inpainted, we apply the same SDS loss to its texture to obtain the complete appearance, including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of pose and human shape, which supports effortless composition with novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach on decomposition, canonicalization, and composition tasks compared to existing solutions.
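The abstract states only that the 3D surface segmentation is extracted from multi-view 2D segmentations; it does not say how the views are fused. As a minimal sketch of one plausible way to do this, the snippet below assigns each mesh vertex the label it receives most often across the views in which it is visible. The majority-vote rule, the tie-breaking choice, and the names `vote_vertex_labels`, `per_view_labels`, and `unknown` are all illustrative assumptions, not the authors' method.

```python
from collections import Counter
from typing import Dict, List


def vote_vertex_labels(
    per_view_labels: List[Dict[int, int]],
    num_vertices: int,
    unknown: int = -1,
) -> List[int]:
    """Fuse per-view 2D segmentation labels into one label per vertex.

    per_view_labels[v] maps the index of a vertex visible in view v to the
    2D segmentation label found at that vertex's projection (hypothetical
    input format). Vertices never seen in any view get `unknown`.
    """
    votes = [Counter() for _ in range(num_vertices)]
    for view in per_view_labels:
        for vertex, label in view.items():
            votes[vertex][label] += 1

    fused = []
    for counter in votes:
        if not counter:
            fused.append(unknown)
            continue
        # Highest vote count wins; ties go to the smallest label id
        # (an arbitrary but deterministic assumption).
        best_label, _ = max(counter.items(), key=lambda kv: (kv[1], -kv[0]))
        fused.append(best_label)
    return fused


if __name__ == "__main__":
    # Three views of a 5-vertex mesh; label 1 = garment, 0 = body (illustrative).
    views = [
        {0: 1, 1: 1, 2: 0},
        {0: 1, 2: 0, 3: 0},
        {0: 0, 1: 1},
    ]
    print(vote_vertex_labels(views, num_vertices=5))  # [1, 1, 0, 0, -1]
```

Vertex 4 is occluded in every view and stays unlabeled, which mirrors the occlusion problem the abstract describes: regions never observed cannot be segmented directly and must be synthesized in a later step.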
Authors: Taeksoo Kim, Byungjun Kim, Shunsuke Saito, Hanbyul Joo