
Audio-Driven 3D Facial Animation from In-the-Wild Videos (2306.11541v1)

Published 20 Jun 2023 in cs.CV

Abstract: Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head. Existing methods typically rely on training their models using limited public 3D datasets that contain a restricted number of audio-3D scan pairs. Consequently, their generalization capability remains limited. In this paper, we propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model. The abundance of easily accessible 2D talking-head videos equips our model with a robust generalization capability. By combining these videos with existing 3D face reconstruction methods, our model excels in generating consistent and high-fidelity lip synchronization. Additionally, our model proficiently captures the speaking styles of different individuals, allowing it to generate 3D talking-heads with distinct personal styles. Extensive qualitative and quantitative experimental results demonstrate the superiority of our method.
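The abstract's recipe (reconstruct per-frame 3D face parameters from abundant 2D talking-head videos, then train an audio-to-parameter model on that pseudo ground truth) can be illustrated with a short sketch. The snippet below is a hypothetical PyTorch outline, not the authors' released code: it assumes FLAME-style parameters (e.g., 50 expression plus 3 jaw coefficients) already reconstructed from video, wav2vec 2.0 audio features resampled to the video frame rate, and a learnable per-speaker style embedding standing in for the paper's personal-style modeling; the plain MSE loss is a placeholder for the actual training objectives.

```python
# Minimal training sketch (hypothetical, under the assumptions stated above).
import torch
import torch.nn as nn

class AudioToFace(nn.Module):
    """Maps a sequence of audio features to per-frame face parameters."""
    def __init__(self, audio_dim=768, param_dim=53, hidden=256,
                 style_dim=64, num_speakers=100):
        super().__init__()
        # One learnable embedding per training identity, so the model can
        # capture distinct personal speaking styles (as the abstract claims).
        self.style = nn.Embedding(num_speakers, style_dim)
        self.proj = nn.Linear(audio_dim + style_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, param_dim)  # e.g., 50 expression + 3 jaw

    def forward(self, audio_feats, speaker_id):
        # audio_feats: (B, T, audio_dim), one feature vector per video frame.
        B, T, _ = audio_feats.shape
        style = self.style(speaker_id).unsqueeze(1).expand(B, T, -1)
        x = self.proj(torch.cat([audio_feats, style], dim=-1))
        x = self.temporal(x)
        return self.head(x)  # (B, T, param_dim)

model = AudioToFace()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One toy step with random tensors standing in for real data.
audio_feats = torch.randn(2, 100, 768)  # wav2vec 2.0 features at 25 fps
pseudo_gt = torch.randn(2, 100, 53)     # parameters reconstructed from 2D video
speaker_id = torch.tensor([0, 1])

pred = model(audio_feats, speaker_id)
loss = nn.functional.mse_loss(pred, pseudo_gt)  # stand-in for the paper's losses
opt.zero_grad()
loss.backward()
opt.step()
```

The essential point mirrored here is the source of supervision: the targets come from 3D face reconstruction applied to in-the-wild 2D video, not from scarce audio-3D scan pairs, which is what gives the approach its generalization capability.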

Authors (5)
  1. Liying Lu (7 papers)
  2. Tianke Zhang (13 papers)
  3. Yunfei Liu (40 papers)
  4. Xuangeng Chu (7 papers)
  5. Yu Li (378 papers)
Citations (2)
