
EmoVOCA: Speech-Driven Emotional 3D Talking Heads (2403.12886v2)

Published 19 Mar 2024 in cs.CV

Abstract: The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field lies in blending speech-related motions with expression dynamics, a difficulty primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. While prior works have attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective and propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads with a set of expressive 3D sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator demonstrate a superior ability to synthesize convincing animations compared with the best performing methods in the literature. Our code and pre-trained model will be made available.
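To make the abstract's description more concrete, the sketch below illustrates two ideas it mentions: (i) one plausible way inexpressive talking-head sequences and expressive sequences could be superposed into synthetic expressive talking heads, and (ii) the input/output interface of the described generator (3D face, audio, emotion label, intensity in; animated vertices out). This is a minimal PyTorch sketch under stated assumptions, not the authors' released implementation: the function and class names, the additive displacement combination, the wav2vec-style 768-dimensional audio features, the GRU decoder, and the FLAME-like vertex count are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def combine_displacements(template, talking_seq, expressive_seq):
    """Illustrative EmoVOCA-style combination (an assumption, not necessarily
    the paper's exact procedure): superpose per-frame speech displacements and
    expression displacements on a shared neutral template. Assumes both
    sequences are registered to `template` (V, 3) and time-aligned, with
    shapes (T, V, 3)."""
    talk_off = talking_seq - template        # speech-related vertex offsets
    expr_off = expressive_seq - template     # expression-related vertex offsets
    return template + talk_off + expr_off    # combined expressive talking head


class EmotionalTalkingHeadGenerator(nn.Module):
    """Hypothetical generator matching the interface stated in the abstract:
    (3D face, audio features, emotion label, intensity) -> vertex animation."""

    def __init__(self, n_vertices=5023, audio_dim=768, n_emotions=8, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)     # e.g. wav2vec 2.0 frame features
        self.emo_embed = nn.Embedding(n_emotions, hidden)  # one style vector per emotion
        self.face_enc = nn.Linear(n_vertices * 3, hidden)  # encodes the neutral template
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_vertices * 3)      # per-frame vertex offsets

    def forward(self, template, audio_feats, emotion, intensity):
        # template: (B, n_vertices*3), audio_feats: (B, T, audio_dim),
        # emotion: (B,) integer labels, intensity: (B,) scalars in [0, 1]
        style = self.emo_embed(emotion) * intensity.unsqueeze(-1)  # scale emotion by intensity
        cond = self.audio_proj(audio_feats) + (style + self.face_enc(template)).unsqueeze(1)
        hidden_seq, _ = self.decoder(cond)
        offsets = self.head(hidden_seq)          # (B, T, n_vertices*3)
        return template.unsqueeze(1) + offsets   # audio-synchronized, emotion-modulated animation
```

Scaling the emotion embedding by the intensity scalar is one simple way the stated intensity input could modulate expressiveness; the actual conditioning mechanism in the paper may differ.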

Authors (3)
  1. Federico Nocentini (5 papers)
  2. Claudio Ferrari (30 papers)
  3. Stefano Berretti (28 papers)
Citations (1)
