UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures (2401.11078v2)
Abstract: Recent advances in 3D avatar generation have attracted significant attention. These breakthroughs aim to produce more realistic animatable avatars, narrowing the gap between virtual and real-world experiences. Most existing works employ a Score Distillation Sampling (SDS) loss, combined with a differentiable renderer and text conditioning, to guide a diffusion model in generating 3D avatars. However, SDS often produces oversmoothed results with few facial details, and therefore lacks the diversity of ancestral sampling. Other works generate a 3D avatar from a single image, where unwanted lighting effects, perspective views, and inferior image quality make it difficult to reliably reconstruct 3D face meshes with aligned, complete textures. In this paper, we propose a novel 3D avatar generation approach, termed UltrAvatar, with enhanced geometric fidelity and superior-quality physically based rendering (PBR) textures free of unwanted lighting. To this end, the proposed approach presents a diffuse color extraction model and an authenticity guided texture diffusion model. The former removes unwanted lighting effects to reveal the true diffuse colors, so that the generated avatars can be rendered under various lighting conditions. The latter follows two gradient-based guidances for generating PBR textures, rendering diverse face-identity features and details that better align with the 3D mesh geometry. We demonstrate the effectiveness and robustness of the proposed method, outperforming state-of-the-art methods by a large margin in our experiments.
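The abstract's "gradient-based guidances" refer to the general mechanism of steering a diffusion sampler with the gradient of an auxiliary loss at each denoising step. The toy sketch below illustrates that mechanism only; the denoiser, the quadratic guidance loss, and all parameter names are hypothetical stand-ins, not the authors' actual model or losses.

```python
import numpy as np

def toy_denoiser(x_t, t):
    """Stand-in for a trained noise predictor eps_theta(x_t, t)."""
    return 0.1 * x_t  # a pretend noise estimate, for illustration only

def guidance_gradient(x_t, target):
    """Gradient of a placeholder quadratic guidance loss ||x_t - target||^2 / 2,
    standing in for e.g. an identity- or detail-alignment loss."""
    return x_t - target

def guided_step(x_t, t, target, alpha=0.9, scale=0.05):
    """One simplified DDIM-style denoising step with the guidance gradient
    added to the noise estimate to steer the sample toward the target."""
    eps = toy_denoiser(x_t, t)
    eps = eps + scale * guidance_gradient(x_t, target)  # gradient-based guidance
    # Predict the clean sample x0 from the guided noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - alpha) * eps) / np.sqrt(alpha)
    return x0_pred

x = np.full(4, 2.0)          # a toy noisy latent
target = np.zeros(4)         # a toy guidance target
out = guided_step(x, t=10, target=target)
```

With `scale=0.0` the step reduces to plain unguided denoising, so the guidance term's effect can be isolated by comparing the two outputs.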
Authors: Mingyuan Zhou, Rakib Hyder, Ziwei Xuan, Guojun Qi