
UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures (2401.11078v2)

Published 20 Jan 2024 in cs.CV

Abstract: Recent advances in 3D avatar generation have gained significant attention. These breakthroughs aim to produce more realistic animatable avatars, narrowing the gap between virtual and real-world experiences. Most existing works employ a Score Distillation Sampling (SDS) loss, combined with a differentiable renderer and text condition, to guide a diffusion model in generating 3D avatars. However, SDS often generates oversmoothed results with few facial details, thereby lacking the diversity of ancestral sampling. Other works generate a 3D avatar from a single image, where unwanted lighting effects, perspective views, and inferior image quality make it difficult to reliably reconstruct the 3D face meshes with aligned, complete textures. In this paper, we propose a novel 3D avatar generation approach, termed UltrAvatar, with enhanced fidelity of geometry and superior quality of physically based rendering (PBR) textures without unwanted lighting. To this end, the proposed approach presents a diffuse color extraction model and an authenticity guided texture diffusion model. The former removes the unwanted lighting effects to reveal true diffuse colors so that the generated avatars can be rendered under various lighting conditions. The latter follows two gradient-based guidances for generating PBR textures to render diverse face-identity features and details that better align with the 3D mesh geometry. We demonstrate the effectiveness and robustness of the proposed method, which outperforms state-of-the-art methods by a large margin in our experiments.

Authors (4)
  1. Mingyuan Zhou (161 papers)
  2. Rakib Hyder (10 papers)
  3. Ziwei Xuan (5 papers)
  4. Guojun Qi (15 papers)
Citations (3)

Summary

Introduction

The field of 3D avatar generation sits at the intersection of computer vision and computer graphics and has evolved rapidly with deep learning. Accurate, realistic generation of 3D avatars from single images or text prompts remains a complex endeavor. Difficulties include removing unwanted lighting effects and preserving facial details across different viewpoints. Existing image-to-avatar and text-to-avatar methods either rely on extensive datasets and complex pre-processing or struggle with occlusion and uncontrolled lighting conditions. Moreover, approaches that use Score Distillation Sampling (SDS) during training often produce avatars that lack diversity in texture details.
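
For reference, the SDS objective alluded to here is usually written in the standard DreamFusion-style notation, with $\theta$ the avatar or scene parameters, $x = g(\theta)$ the rendered image, $y$ the text condition, and $\epsilon_\phi$ the diffusion model's noise prediction:

$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\big[\, w(t)\,\big(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\big)\, \tfrac{\partial x}{\partial \theta} \,\big]$, where $x_t = \alpha_t x + \sigma_t \epsilon$.

Because this gradient averages over noise samples, optimization tends to settle on a mode of the conditional distribution, which is one common explanation for the oversmoothed, low-diversity textures noted above.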

Previous Work

Among current strategies, image-to-avatar methods frequently hinge on physical capture setups for detailed scanning, which restricts scalability. They also span a range of 3D representations, from parametric models to neural implicit functions. Generative Adversarial Networks (GANs) have been employed to embed 3D features into generative models, while more recent works take text prompts as input for 3D generation and rely on an SDS loss for visual consistency; yet SDS compromises diversity. Separately, guided diffusion models that can be steered via post-training guidance have leveraged intermediate features for tasks such as image editing, suggesting that attention features can be used to extract diffuse colors from a single image and that added guidance can preserve identity and details, as sketched below.
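
To illustrate the gradient-based guidance idea mentioned above, a single reverse-diffusion step with an extra guidance term might look like the following sketch. It is a generic, classifier-guidance-style update written against diffusers-style UNet/scheduler calls, not the paper's exact algorithm; `guidance_energy` and `scale` are hypothetical placeholders.

```python
import torch

def guided_denoise_step(unet, scheduler, x_t, t, cond, guidance_energy, scale=1.0):
    """One reverse-diffusion step with an additional gradient-based guidance term.

    guidance_energy(x0_pred) is any differentiable scalar energy (e.g. a
    photometric or edge-matching loss); lower values mean a better match.
    """
    x_t = x_t.detach().requires_grad_(True)

    # Predict the noise and the corresponding clean-image estimate x0.
    eps = unet(x_t, t, encoder_hidden_states=cond).sample
    alpha_bar = scheduler.alphas_cumprod[t]
    x0_pred = (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()

    # Differentiate the guidance energy with respect to the noisy sample.
    energy = guidance_energy(x0_pred)
    grad = torch.autograd.grad(energy, x_t)[0]

    # Standard scheduler update, then nudge the sample down the energy gradient.
    x_prev = scheduler.step(eps.detach(), t, x_t.detach()).prev_sample
    return x_prev - scale * grad
```

Running this step inside an ordinary sampling loop leaves the base model untouched while biasing each sample toward images that score well under the chosen energy.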

Methodology

The UltrAvatar framework begins by generating a face image from a textual prompt or taking an existing image as input. A novel Diffuse Color Extraction (DCE) model then removes lighting effects to reveal the true diffuse colors; the self-attention features within the diffusion model are key to isolating and eliminating these effects, a pivotal observation that enables relightable 3D avatars. The process yields clean diffuse textures, which are essential for rendering under varying lighting conditions. Next, a 3D face mesh is fitted with a morphable face model, and the authenticity guided texture diffusion model (AGT-DM) generates complete Physically Based Rendering (PBR) textures. The AGT-DM applies photometric guidance and edge guidance during its sampling process, ensuring higher diversity and fidelity in the generated avatars.
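
A high-level sketch of this pipeline, as described in the summary, is given below. All component and function names (`text_to_image`, `dce`, `mesh_fitter`, `agt_dm`) are hypothetical placeholders for the paper's modules, not a released API.

```python
def ultravatar_pipeline(prompt_or_image, text_to_image, dce, mesh_fitter, agt_dm):
    """Hedged sketch of the UltrAvatar flow described above.

    Components (all placeholders):
      text_to_image - text-to-image diffusion model (used only for text prompts)
      dce           - Diffuse Color Extraction: strips lighting to recover diffuse colors
      mesh_fitter   - fits a morphable face model to the de-lit face image
      agt_dm        - Authenticity Guided Texture Diffusion Model producing PBR maps
    """
    # 1. Start from a generated or provided face image.
    if isinstance(prompt_or_image, str):
        face_img = text_to_image(prompt_or_image)
    else:
        face_img = prompt_or_image

    # 2. Remove lighting effects to reveal the true diffuse colors.
    diffuse_img = dce(face_img)

    # 3. Reconstruct the 3D face mesh with a morphable face model,
    #    obtaining a partial UV texture from the visible regions.
    mesh, uv_partial = mesh_fitter(diffuse_img)

    # 4. Complete the PBR textures, with photometric and edge guidance
    #    applied during the diffusion sampling process.
    pbr_textures = agt_dm(uv_partial, mesh, guidance=("photometric", "edge"))

    return mesh, pbr_textures
```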

Contributions and Results

The paper's contributions are threefold. First, it elucidates the relationship between self-attention features and lighting effects, leading to a robust DCE model that can separate diffuse colors from lighting. Second, it introduces an authenticity guided diffusion model capable of generating high-quality PBR textures. Finally, through extensive experiments, it establishes the superiority of the UltrAvatar framework over state-of-the-art techniques in rendering high-quality, diverse 3D avatars with sharp, true-to-life details in both observed and unobserved views.

In practice, UltrAvatar handles a wide range of prompts, delivering high-quality avatars that remain faithful to the textual prompts while exhibiting improved realism and diversity. The framework supports generation under various lighting conditions and viewing angles, with strong results in both fidelity and texture detail. Ablation studies underscore the significance of each proposed component, with the combination of photometric and edge guidance contributing markedly to nuanced facial features. Furthermore, the framework can generate out-of-domain characters, affirming its adaptability.

In conclusion, UltrAvatar marks a significant step forward, suggesting that the generation of lifelike, animatable 3D avatars from simple inputs is within reach. As such avatars become increasingly indistinguishable from real humans and responsive to varied conditions, they have the potential to transform not just gaming and virtual reality but also broader domains where digital human presence is pivotal.