
VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment (2312.04651v1)

Published 7 Dec 2023 in cs.CV

Abstract: We present a 3D-aware one-shot head reenactment method based on a fully volumetric neural disentanglement framework for source appearance and driver expressions. Our method is real-time and produces high-fidelity and view-consistent output, suitable for 3D teleconferencing systems based on holographic displays. Existing cutting-edge 3D-aware reenactment methods often use neural radiance fields or 3D meshes to produce a view-consistent appearance encoding, but, at the same time, they rely on linear face models, such as 3DMM, to disentangle it from facial expressions. As a result, their reenactment results often exhibit identity leakage from the driver or have unnatural expressions. To address these problems, we propose a neural self-supervised disentanglement approach that lifts both the source image and driver video frame into a shared 3D volumetric representation based on tri-planes. This representation can then be freely manipulated with expression tri-planes extracted from the driving images and rendered from an arbitrary view using neural radiance fields. We achieve this disentanglement via self-supervised learning on a large in-the-wild video dataset. We further introduce a highly effective fine-tuning approach to improve the generalizability of the 3D lifting using the same real-world data. We demonstrate state-of-the-art performance on a wide range of datasets, and also showcase high-quality 3D-aware head reenactment on highly challenging and diverse subjects, including non-frontal head poses and complex expressions for both source and driver.


Summary

  • The paper introduces a novel volumetric neural disentanglement approach that separates source appearance from driver expressions for one-shot 3D head reenactment.
  • It employs a tri-plane representation with fine-tuning on real-world video data to enhance rendering fidelity and overcome the limits of linear face models.
  • The method demonstrates superior performance on key metrics such as PSNR, SSIM, LPIPS, and FID, paving the way for advances in AR/VR and holographic telepresence.

Volumetric Disentanglement and Real-Time 3D Head Reenactment: A Critical Analysis of VOODOO 3D

The paper "VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment" introduces a methodology for 3D-aware head reenactment that leverages a volumetric neural disentanglement framework to separate source appearance from driver expressions. The method runs in real time and produces high-fidelity, view-consistent output, making it well suited to 3D teleconferencing systems, especially those built on holographic displays.

Methodological Contributions

The authors address the limitations of existing 3D-aware techniques, which often suffer from identity leakage or unnatural expressions due to their reliance on linear face models such as 3DMM. Instead, the paper proposes a neural self-supervised disentanglement approach that lifts both the source image and the driver frames into a shared 3D volumetric representation built on tri-planes. This representation can then be manipulated with expression tri-planes extracted from the driving images and rendered from arbitrary viewpoints using neural radiance fields.
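The tri-plane idea can be made concrete with a small sketch: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three features are aggregated before being decoded by a radiance-field MLP. This is a minimal NumPy illustration of the sampling step only, not the paper's implementation; the plane names and the sum aggregation are assumptions.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (H, W, C) feature plane at coords (u, v) in [0, 1]."""
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0]
            + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0]
            + wx * wy * plane[y1, x1])

def triplane_feature(planes, point):
    """Aggregate the feature of a 3D point in [0,1]^3 from three axis-aligned
    planes ('xy', 'xz', 'yz'), summing the three bilinear samples."""
    x, y, z = point
    return (bilinear_sample(planes['xy'], x, y)
            + bilinear_sample(planes['xz'], x, z)
            + bilinear_sample(planes['yz'], y, z))
```

In the full pipeline the aggregated feature would then be decoded into color and density by an MLP and volume-rendered along camera rays.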

A notable component of the methodology is the fine-tuning of the 3D lifting model: by training on real-world video data rather than synthetic sources, the model generalizes better to diverse subjects and complex expressions, overcoming the limitations of training on synthetic datasets alone.
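The self-supervised objective behind this kind of fine-tuning can be illustrated with a deliberately tiny stand-in: two frames from the same video act as source and driver, and the reenactment of the source, driven by the second frame's expression, is trained to reconstruct that second frame. The linear "lifting" matrix and additive expression offset below are toy assumptions standing in for the paper's volumetric lifter; only the shape of the objective is faithful.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical stand-in for the 3D lifting model: a linear map W.
W = 0.1 * rng.normal(size=(dim, dim))

def reenact(W, source_feat, driver_expr):
    # Toy disentanglement: lifted source appearance plus driver expression offset.
    return W @ source_feat + driver_expr

def self_supervised_step(W, frame_a, frame_b, expr_b, lr=0.05):
    """One fine-tuning step. frame_a and frame_b come from the SAME video, so
    reenacting frame_a with frame_b's expression should reconstruct frame_b."""
    err = reenact(W, frame_a, expr_b) - frame_b
    loss = float(np.mean(err ** 2))
    grad = 2.0 * np.outer(err, frame_a) / err.size  # d(MSE)/dW
    return W - lr * grad, loss
```

No ground-truth 3D supervision is needed: the video itself provides paired appearance and expression, which is what makes in-the-wild fine-tuning possible.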

Results and Performance

The numerical experiments demonstrate the method's superiority over current state-of-the-art techniques across various datasets, showing robustness to difficult and varied head poses and expressions. Quantitative measures such as PSNR, SSIM, LPIPS, and FID establish its efficacy in preserving identity likeness and expression accuracy. Notably, the paper provides evidence of the technique's ability to handle non-frontal views, a significant challenge in volumetric representation and synthesis.
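Two of the reported metrics have simple closed forms; the sketch below implements standard PSNR and a simplified single-window SSIM. Real SSIM averages the same statistic over local windows, and LPIPS and FID additionally require pretrained networks, so those are omitted here.

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images valued in [0, max_val]."""
    mse = float(np.mean((a - b) ** 2))
    if mse == 0.0:
        return float('inf')
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(a, b, max_val=1.0):
    """SSIM computed over one global window; practical SSIM averages local windows."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

For example, a uniform error of 0.5 on a [0, 1] image gives an MSE of 0.25 and hence a PSNR of about 6.02 dB, while identical images give infinite PSNR and an SSIM of 1.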

Practical and Theoretical Implications

Practically, VOODOO 3D, with its view-consistent neural fields and fine-scale expression synthesis, presents a transformative approach for applications in AR/VR and holographic telepresence, extending what is possible when creating realistic 3D avatars from minimal input. Theoretically, the work demonstrates the efficacy of fully volumetric disentanglement models, arguing for their utility over traditional linear approaches, and highlights the potential of facial tri-plane representations for transferring dynamic expressions onto a static source image with high fidelity.

Future AI Directions

The paper opens avenues for future research in the refinement of volumetric representations and their application in full-body dynamics, potentially impacting virtual reality experiences significantly. Furthermore, integrating this technology with advanced generative models could enhance the photo-realism and expressiveness of avatars.

Conclusion

VOODOO 3D contributes significantly to the domain of neural head reenactment by providing a nuanced approach to identity and expression disentanglement, effectively circumventing the shortcomings of traditional methods. Its implications for 3D telepresence systems are substantial, encouraging deeper exploration of volumetric neural networks and their real-time applications. This research could serve as a foundation for future advances in realistic avatar creation for immersive technologies.
