GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting (2404.14037v3)

Published 22 Apr 2024 in cs.CV and cs.MM

Abstract: Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.


Summary

  • The paper introduces a novel speaker-specific motion translator that integrates a universal audio encoder with a customized motion decoder to accurately predict FLAME parameters.
  • It employs dynamic Gaussian splatting with real-time deformation and speaker-specific blend shapes to enhance lip synchronization and reduce rendering artifacts.
  • The paper demonstrates superior performance with higher PSNR and SSIM scores and real-time rendering speeds up to 130 FPS, paving the way for advanced multimedia applications.

GaussianTalker: Advancing Talking Head Synthesis with 3D Gaussian Splatting and FLAME Integration

Introduction

GaussianTalker is an approach to audio-driven talking head synthesis aimed at dynamic, realistic rendering of human head videos. The model builds on 3D Gaussian Splatting, binding the Gaussians to the FLAME (Faces Learned with an Articulated Model and Expressions) parametric head model, to address limitations of existing methods based on Neural Radiance Fields (NeRF). By coupling the explicit Gaussian representation with parametric 3D face modeling, GaussianTalker achieves more accurate lip synchronization, fewer artifacts, and substantially higher rendering speeds.

Core Methodologies

Speaker-specific Motion Translator

This component produces lip movements that are both accurate and specific to the target speaker. It does so in two stages (a schematic sketch follows the list):

  1. Universal Audio Encoder: Uses adversarial learning designed to exclude speaker identity information from the audio features, focusing purely on content.
  2. Customized Motion Decoder: Integrates identity-specific embeddings with universal audio features to accurately predict FLAME parameters representing dynamic facial expressions.
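
To make the two stages concrete, here is a minimal PyTorch-style sketch. The module names, feature sizes, and the gradient-reversal trick used for adversarial identity removal are illustrative assumptions, not the paper's prescribed architecture.

```python
# Minimal sketch of the two-stage motion translator described above.
# Module names, dimensions, and the gradient-reversal trick are assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Reverses gradients so the encoder learns to *remove* speaker identity."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

class UniversalAudioEncoder(nn.Module):
    def __init__(self, audio_dim=80, feat_dim=256, num_speakers=100):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, feat_dim, batch_first=True)
        # Adversarial head: tries to recover speaker identity from the features;
        # the reversed gradient pushes the encoder toward identity-free content.
        self.speaker_head = nn.Linear(feat_dim, num_speakers)

    def forward(self, mel):                      # mel: (B, T, audio_dim)
        content, _ = self.encoder(mel)           # (B, T, feat_dim)
        speaker_logits = self.speaker_head(GradReverse.apply(content.mean(dim=1)))
        return content, speaker_logits

class CustomizedMotionDecoder(nn.Module):
    def __init__(self, feat_dim=256, id_dim=64, n_flame=53):
        super().__init__()
        # n_flame = 53 assumes 50 expression coefficients + 3 jaw-pose angles,
        # one common FLAME split; the paper's exact parameterization may differ.
        self.id_embed = nn.Embedding(1, id_dim)  # single embedding for the target speaker
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + id_dim, 256), nn.ReLU(), nn.Linear(256, n_flame))

    def forward(self, content):                  # content: (B, T, feat_dim)
        B, T, _ = content.shape
        idv = self.id_embed.weight.expand(B, T, -1)
        return self.mlp(torch.cat([content, idv], dim=-1))  # (B, T, n_flame)
```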

Dynamic Gaussian Renderer

Rendering proceeds through two mechanisms (a deformation sketch follows the list):

  1. Dynamic Deformation: Gaussians attached to the FLAME mesh deform in real time, following the facial motion dictated by the predicted FLAME parameters.
  2. Speaker-specific BlendShapes: Augment FLAME with person-specific shape corrections via a latent pose, improving detail in regions such as the teeth and wrinkles.
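
As a rough illustration of this binding, the numpy sketch below re-anchors each Gaussian to its FLAME triangle every frame using barycentric coordinates and a normal offset, then adds optional speaker-specific blendshape displacements. The function name, argument layout, and exact deformation rule are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: move Gaussian centers with the FLAME mesh each frame.
import numpy as np

def deform_gaussians(verts, faces, face_idx, bary, normal_offsets,
                     blend_basis=None, blend_weights=None):
    """Re-anchor each Gaussian to the current FLAME mesh pose.

    verts:          (V, 3) FLAME vertices for the current frame
    faces:          (F, 3) triangle vertex indices
    face_idx:       (N,)   triangle each Gaussian is bound to
    bary:           (N, 3) barycentric coordinates inside that triangle
    normal_offsets: (N,)   signed distance along the triangle normal
    blend_basis:    (K, N, 3) optional speaker-specific blendshape directions
    blend_weights:  (K,)      per-frame activations for those blendshapes
    """
    tri = verts[faces[face_idx]]                      # (N, 3, 3) triangle corners
    base = np.einsum('nk,nkd->nd', bary, tri)         # barycentric anchor points
    n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    centers = base + normal_offsets[:, None] * n      # lift Gaussians off the surface
    if blend_basis is not None:
        # Speaker-specific blendshapes add fine detail (teeth, wrinkles)
        # on top of the FLAME-driven motion.
        centers = centers + np.einsum('k,knd->nd', blend_weights, blend_basis)
    return centers
```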

Experimental Outcomes

Quantitative assessments indicate a substantial improvement over existing state-of-the-art approaches, with GaussianTalker achieving:

  • Higher PSNR and SSIM scores, indicating better pixel-level reconstruction quality (a PSNR sketch follows this list).
  • Lower LPIPS and FID scores, suggesting greater perceptual likeness to real video.
  • Real-time performance, with rendering speeds of up to 130 FPS on an NVIDIA RTX4090 GPU, and practical deployment demonstrated on other hardware platforms such as Apple's M1 chip.
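
For reference, PSNR, the first of the reported metrics, reduces to a simple formula over the mean squared error between rendered and ground-truth frames; SSIM, LPIPS, and FID are typically computed with dedicated libraries. A minimal numpy sketch:

```python
# PSNR between a rendered frame and its ground-truth counterpart.
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')            # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```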

Theoretical and Practical Implications

The integration of 3D Gaussian Splatting with the FLAME model brings several advancements to the field of talking head synthesis.

  • Enhanced Realism: Resolves issues such as unnatural lip synchronization and the visual jitter typically seen in previous methods.
  • Improved Performance: Significantly faster rendering capabilities make it suitable for real-time applications.
  • Cross-modal and Speaker-specific Adaptations: The methodology not only synchronizes audio and visual data but also adapts these to the nuances of individual speakers.

Future Perspectives

Looking forward, the principles demonstrated by GaussianTalker can be extended to other areas of generative modeling where dynamic, realistic rendering of human-like characters is required. Considering further advancements in hardware and optimization techniques, the potential applications of Gaussian splatting could expand into more interactive and immersive realms like augmented and virtual reality, further enhancing user experiences in digital human interaction.

Conclusion

GaussianTalker stands out in the landscape of talking head synthesis by addressing the critical challenges of synchronization, realism, and efficiency. Its innovative use of 3D Gaussian Splatting combined with the FLAME framework sets a new standard in the field, promising exciting avenues for future research and application in multimedia, communications, and entertainment technologies.
