GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting (2404.14037v3)
Abstract: Recent work on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) has achieved impressive results. However, the limited pose and expression control afforded by NeRF's implicit representation leaves these methods with shortcomings such as unsynchronized or unnatural lip movements, visual jitter, and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. Exploiting the explicit representation of 3D Gaussians, we achieve intuitive control of facial motion by binding the Gaussians to 3D facial models. GaussianTalker consists of two modules: a Speaker-specific Motion Translator and a Dynamic Gaussian Renderer. The Speaker-specific Motion Translator produces accurate lip movements tailored to the target speaker through universalized audio feature extraction and customized lip motion generation. The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes that enhance facial detail via a latent pose, delivering stable and realistic rendered video. Extensive experiments show that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method renders at 130 FPS on an NVIDIA RTX 4090 GPU, far exceeding the threshold for real-time performance, and can potentially be deployed on other hardware platforms.
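The core idea of binding Gaussians to a 3D facial model can be illustrated with a minimal sketch. This is not the paper's implementation: the toy mesh, the single blendshape, and all names here are illustrative assumptions. Each Gaussian is attached to a mesh triangle via fixed barycentric coordinates, so when blendshape coefficients deform the mesh, the Gaussian centers follow the surface motion explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "face mesh": 4 vertices, 2 triangles, one blendshape (assumed shapes).
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0]])
triangles = np.array([[0, 1, 2], [1, 3, 2]])
blendshape_delta = np.array([[0.0, 0.0, 0.1]] * 4)  # moves every vertex in z

# Bind each Gaussian to a triangle with fixed barycentric coordinates.
n_gaussians = 8
tri_idx = rng.integers(0, len(triangles), size=n_gaussians)
bary = rng.dirichlet(np.ones(3), size=n_gaussians)  # (n, 3), rows sum to 1

def gaussian_centers(expr_coeff: float) -> np.ndarray:
    """Deform the mesh by a blendshape coefficient, then re-project the
    bound Gaussian centers through their barycentric coordinates."""
    deformed = vertices + expr_coeff * blendshape_delta
    tri_verts = deformed[triangles[tri_idx]]          # (n, 3, 3)
    return np.einsum('nk,nkd->nd', bary, tri_verts)   # (n, 3)

neutral = gaussian_centers(0.0)
expressed = gaussian_centers(1.0)
# Centers follow the surface: the offset equals the blendshape delta.
print(np.allclose(expressed - neutral, [0.0, 0.0, 0.1]))  # True
```

Because barycentric weights sum to one, a uniform mesh displacement transfers exactly to every bound Gaussian, which is what makes this explicit binding an intuitive handle on facial motion compared with an implicit NeRF deformation field.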