NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior (2405.05749v2)

Published 9 May 2024 in cs.CV

Abstract: Audio-driven talking head generation is advancing from 2D to 3D content. Notably, Neural Radiance Fields (NeRF) are in the spotlight as a means of synthesizing high-quality 3D talking head outputs. Unfortunately, this NeRF-based approach typically requires a large amount of paired audio-visual data for each identity, limiting the scalability of the method. Although there have been attempts to generate audio-driven 3D talking head animations from a single image, the results are often unsatisfactory due to insufficient information about occluded regions in the image. In this paper, we focus on the overlooked aspect of 3D consistency in the one-shot, audio-driven domain, where facial animations are synthesized primarily from front-facing perspectives. We propose a novel method, NeRFFaceSpeech, which produces high-quality 3D-aware talking heads. Using the prior knowledge of generative models combined with NeRF, our method can craft a 3D-consistent facial feature space corresponding to a single image. Our spatial synchronization method employs audio-correlated vertex dynamics of a parametric face model to transform static image features into dynamic visuals through ray deformation, ensuring realistic 3D facial motion. Moreover, we introduce LipaintNet, which replenishes the missing inner-mouth information that cannot be obtained from a single image; the network is trained in a self-supervised manner using the model's generative capabilities, without additional data. Comprehensive experiments demonstrate the superiority of our method in generating audio-driven talking heads from a single image with enhanced 3D consistency compared to previous approaches. In addition, we introduce, for the first time, a quantitative way of measuring a model's robustness against pose changes, which previously could only be assessed qualitatively.
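To make the ray-deformation idea from the abstract concrete, the sketch below illustrates one plausible reading: NeRF sample points along each camera ray are offset by an inverse-distance blend of the displacements of their nearest parametric face-model (3DMM) vertices, where the displacements come from audio-predicted expression parameters, so a static canonical feature space appears to move with the audio. This is a minimal, illustrative sketch, not the paper's actual implementation; all names (`deform_ray_samples`, `generative_nerf`, the k-nearest-vertex weighting, and the backward-warp direction) are assumptions.

```python
import numpy as np

def deform_ray_samples(sample_points, base_vertices, driven_vertices, k=4, eps=1e-8):
    """Offset each NeRF sample point by a distance-weighted average of the
    displacements of its k nearest 3DMM vertices, so that static canonical-space
    features follow the audio-driven face motion (illustrative sketch only).

    sample_points   : (N, 3) points sampled along camera rays
    base_vertices   : (V, 3) face-model vertices for the neutral/canonical frame
    driven_vertices : (V, 3) vertices after applying audio-predicted expression params
    """
    displacements = driven_vertices - base_vertices            # (V, 3) per-vertex motion
    # pairwise distances from every sample point to every vertex (fine for modest V)
    d = np.linalg.norm(sample_points[:, None, :] - base_vertices[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                         # k nearest vertices per sample
    nearest_d = np.take_along_axis(d, idx, axis=1)             # (N, k)
    w = 1.0 / (nearest_d + eps)
    w /= w.sum(axis=1, keepdims=True)                          # inverse-distance weights
    nearest_disp = displacements[idx]                          # (N, k, 3)
    offset = (w[..., None] * nearest_disp).sum(axis=1)         # (N, 3) blended displacement
    # back-warp observation-space samples to canonical space before querying the
    # frozen generative NeRF (warp direction is an assumption of this sketch)
    return sample_points - offset

# usage (hypothetical names):
#   canonical_points = deform_ray_samples(pts, neutral_verts, audio_driven_verts)
#   features = generative_nerf(canonical_points)   # query the frozen generative prior
```

The design choice sketched here, deforming the sampling rays rather than the radiance field itself, is what lets a frozen generative prior supply appearance while the parametric face model supplies audio-correlated motion.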
