GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting (2404.19040v1)
Abstract: We present GSTalker, a 3D audio-driven talking face generation model based on Gaussian Splatting that achieves both fast training (40 minutes) and real-time rendering (125 FPS) from only a 3$\sim$5 minute training video, whereas previous 2D and 3D NeRF-based modeling frameworks require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Gaussian deformation field that translates and transforms 3D Gaussians in synchronization with the audio, incorporating a multi-resolution hash-grid-based tri-plane and a temporal smoothing module to learn accurate deformations for fine-grained facial details. In addition, a pose-conditioned deformation field is designed to model the stabilized torso. To enable efficient optimization of the conditioned Gaussian deformation field, we initialize the 3D Gaussians by learning a coarse static Gaussian representation. Extensive experiments on person-specific videos with audio tracks validate that GSTalker generates high-fidelity, lip-synchronized results with fast training and real-time rendering speed.
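To make the core idea concrete, below is a minimal PyTorch sketch of an audio-conditioned Gaussian deformation field of the kind the abstract describes: per-frame audio features and canonical Gaussian centers go in, per-Gaussian deltas for position, rotation, and scale come out. Everything here is an illustrative assumption, not the authors' implementation: `AudioDeformField`, the feature sizes, and the plain MLP standing in for the paper's hash-grid tri-plane encoder are hypothetical, and the temporal smoothing module and pose-conditioned torso field are omitted.

```python
# Hypothetical sketch of an audio-driven Gaussian deformation field.
# Not the GSTalker implementation; a plain MLP stands in for the
# multi-resolution hash-grid tri-plane encoder described in the paper.
import torch
import torch.nn as nn

class AudioDeformField(nn.Module):
    def __init__(self, pos_feat_dim=32, audio_dim=64, hidden=128):
        super().__init__()
        # Stand-in spatial encoder: maps a 3D Gaussian center to a
        # feature vector (the paper uses a hash-grid-based tri-plane).
        self.pos_encoder = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, pos_feat_dim)
        )
        # Shared trunk conditioned on spatial + audio features.
        self.trunk = nn.Sequential(
            nn.Linear(pos_feat_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Heads predicting per-Gaussian deltas: translation (3),
        # rotation quaternion (4), and scale (3).
        self.d_xyz = nn.Linear(hidden, 3)
        self.d_rot = nn.Linear(hidden, 4)
        self.d_scale = nn.Linear(hidden, 3)

    def forward(self, xyz, audio_feat):
        # xyz: (N, 3) canonical Gaussian centers from the static init.
        # audio_feat: (audio_dim,) per-frame audio embedding, e.g. from
        # a pretrained speech encoder; broadcast to all N Gaussians.
        f = self.pos_encoder(xyz)
        a = audio_feat.expand(xyz.shape[0], -1)
        h = self.trunk(torch.cat([f, a], dim=-1))
        return self.d_xyz(h), self.d_rot(h), self.d_scale(h)

# Usage: deform the canonical Gaussians frame by frame.
field = AudioDeformField()
xyz = torch.randn(10_000, 3)     # canonical Gaussian centers
audio = torch.randn(64)          # one frame's audio feature
d_xyz, d_rot, d_scale = field(xyz, audio)
deformed_xyz = xyz + d_xyz       # then rasterize with a 3DGS renderer
```

Predicting residual deltas over a coarse static Gaussian initialization, rather than absolute attributes, matches the two-stage optimization the abstract describes and keeps the deformed Gaussians close to a plausible face at the start of training.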