DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models (2304.00916v3)
Abstract: We present DreamAvatar, a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses. While recent methods have reported encouraging results on text-guided generation of common 3D objects, generating high-quality human avatars remains an open challenge due to the complexity of the human body's shape, pose, and appearance. To tackle this challenge, DreamAvatar uses a trainable NeRF to predict density and color for 3D points and pretrained text-to-image diffusion models to provide 2D self-supervision. Specifically, we leverage the SMPL model to provide shape and pose guidance for the generation. We introduce a dual-observation-space design that jointly optimizes a canonical space and a posed space related by a learnable deformation field. This facilitates the generation of more complete textures and of geometry faithful to the target pose. We also jointly optimize the losses computed from the full body and from the zoomed-in 3D head to alleviate the common multi-face "Janus" problem and improve facial details in the generated avatars. Extensive evaluations demonstrate that DreamAvatar significantly outperforms existing methods, establishing a new state of the art for text-and-shape guided 3D human avatar generation.
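To make "predicting density and color for 3D points" concrete, the following is a minimal sketch of how per-sample densities and colors along one camera ray are typically composited into a pixel color. It uses the standard NeRF volume-rendering quadrature and is not taken from DreamAvatar's code; the function name `composite_ray` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along a single ray (standard NeRF quadrature).

    densities: (N,) non-negative volume densities at the N samples
    colors:    (N, 3) RGB colors at the same samples
    deltas:    (N,) distances between consecutive samples
    Returns the rendered RGB value and the per-sample weights.
    """
    densities = np.asarray(densities, dtype=float)
    colors = np.asarray(colors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)

    # Opacity contributed by each sample segment.
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Hypothetical ray: an opaque red sample in front of a green one.
rgb, weights = composite_ray(
    densities=[1e6, 1.0],
    colors=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    deltas=[0.1, 0.1],
)
```

In a text-guided pipeline of the kind the abstract describes, images rendered this way would be the inputs scored by the pretrained diffusion model, and gradients would flow back through the weights to the NeRF's density and color predictions.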
Authors: Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong