Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior (2404.10394v1)
Abstract: Existing neural rendering-based text-to-3D-portrait generation methods typically rely on a human geometry prior and diffusion models for guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a neural rendering-based framework with a novel joint geometry-appearance prior for text-to-3D-portrait generation that overcomes these issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator can produce 360° canonical 3D portraits, which serve as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the latent space of the pre-trained 3DPortraitGAN-Pyramid. The resulting latent code is then used to synthesize a pyramid tri-grid. Starting from this pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into it. We then use the diffusion model to refine the rendered images of the 3D portrait and use these refined images as training data to further optimize the pyramid tri-grid, which removes unrealistic colors and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
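The score distillation sampling step mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes an identity "renderer" (the optimized parameters are the image itself), and it replaces the diffusion model with a hypothetical `toy_noise_predictor` that is exact only when the data distribution is a single target image. All function names and hyperparameters here are illustrative assumptions. The point is only to show the shape of the update, where the difference between predicted and injected noise acts as the gradient on the rendered image.

```python
import numpy as np

def toy_noise_predictor(x_t, alpha_bar, target):
    # Hypothetical stand-in for a diffusion model: the exact noise estimate
    # when the "data distribution" is the single image `target`.
    return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

def sds_gradient(x, alpha_bar, target, rng, weight=1.0):
    # Noise the current image to level alpha_bar, then compare the predicted
    # noise with the injected noise (identity renderer, so dx/dtheta = I).
    eps = rng.standard_normal(x.shape)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = toy_noise_predictor(x_t, alpha_bar, target)
    return weight * (eps_pred - eps)

def run_toy_sds(steps=500, lr=0.1, seed=0):
    # Plain gradient descent on the image using the SDS-style gradient.
    rng = np.random.default_rng(seed)
    target = rng.uniform(0.0, 1.0, size=(8, 8))  # assumed "prompt-aligned" image
    x = np.zeros_like(target)                    # assumed initialization
    for _ in range(steps):
        alpha_bar = rng.uniform(0.1, 0.9)        # random noise level per step
        x = x - lr * sds_gradient(x, alpha_bar, target, rng)
    return x, target
```

Calling `run_toy_sds()` returns the optimized image and the target; in this toy setting the injected noise cancels exactly, so the image is pulled toward the target. With a real diffusion model and a differentiable renderer, the same gradient would instead be backpropagated through the renderer into the 3D representation.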
Authors:
- Yiqian Wu
- Hao Xu
- Xiangjun Tang
- Xien Chen
- Siyu Tang
- Zhebin Zhang
- Chen Li
- Xiaogang Jin