
GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting (2404.19040v1)

Published 29 Apr 2024 in cs.CV

Abstract: We present GSTalker, a 3D audio-driven talking face generation model based on Gaussian Splatting that achieves both fast training (40 minutes) and real-time rendering (125 FPS) using only a 3 to 5 minute video as training material, whereas previous 2D and 3D NeRF-based modeling frameworks require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Gaussian deformation field that translates and transforms 3D Gaussians to synchronize with the audio, incorporating a multi-resolution hash-grid-based tri-plane and a temporal smoothing module to learn accurate deformations for fine-grained facial details. In addition, a pose-conditioned deformation field is designed to model the stabilized torso. To enable efficient optimization of the conditioned Gaussian deformation field, we initialize the 3D Gaussians by learning a coarse static Gaussian representation. Extensive experiments on person-specific videos with audio tracks validate that GSTalker generates high-fidelity, audio-lip-synchronized results with fast training and real-time rendering speed.
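
The abstract outlines the core mechanism: per-frame audio features condition a deformation field that offsets each canonical 3D Gaussian before rasterization. The following minimal PyTorch sketch illustrates one plausible wiring of such a field; the class name, layer sizes, and the frequency positional encoding (a simple stand-in for the paper's multi-resolution hash-grid tri-plane) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an audio-conditioned Gaussian
# deformation field: canonical Gaussian centers plus a per-frame audio
# embedding go in; per-Gaussian translation and rotation offsets come out.
import torch
import torch.nn as nn

class GaussianDeformationField(nn.Module):
    def __init__(self, audio_dim=64, n_freqs=6, hidden=128):
        super().__init__()
        self.n_freqs = n_freqs
        pos_dim = 3 * 2 * n_freqs  # sin/cos per frequency per axis
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4),  # translation (3) + quaternion delta (4)
        )

    def encode(self, xyz):
        # Frequency encoding as a stand-in for the paper's hash-grid tri-plane.
        freqs = 2.0 ** torch.arange(self.n_freqs, device=xyz.device)
        angles = xyz[..., None] * freqs          # (N, 3, n_freqs)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

    def forward(self, xyz, audio_feat):
        # xyz: (N, 3) canonical Gaussian centers; audio_feat: (audio_dim,) per frame.
        a = audio_feat.expand(xyz.shape[0], -1)  # broadcast audio to all Gaussians
        out = self.mlp(torch.cat([self.encode(xyz), a], dim=-1))
        delta_xyz, delta_rot = out[:, :3], out[:, 3:]
        return xyz + delta_xyz, delta_rot        # deformed centers, rotation offsets

# Illustrative usage with random data:
field = GaussianDeformationField()
xyz = torch.randn(1000, 3)   # canonical Gaussian centers
audio = torch.randn(64)      # one frame's audio embedding
deformed_xyz, rot_delta = field(xyz, audio)
```

In a full pipeline, the per-frame audio embedding would come from a pretrained speech encoder, and the deformed Gaussians would then be rendered with a standard 3D Gaussian Splatting rasterizer; the pose-conditioned torso field mentioned in the abstract would be a separate module of similar shape.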
