
EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face Animation (2408.11518v3)

Published 21 Aug 2024 in cs.CV

Abstract: The creation of increasingly vivid 3D talking faces has become a hot topic in recent years. Currently, most speech-driven works focus on lip synchronisation but neglect to effectively capture the correlations between emotions and facial motions. To address this problem, we propose a two-stream network called EmoFace, which consists of an emotion branch and a content branch. EmoFace employs a novel Mesh Attention mechanism to analyse and fuse the emotion features and content features. In particular, a newly designed spatio-temporal graph-based convolution, SpiralConv3D, is used in Mesh Attention to learn latent temporal and spatial feature dependencies between mesh vertices. In addition, to the best of our knowledge, we are the first to introduce a self-growing training scheme with intermediate supervision that dynamically adjusts the ratio of ground truth adopted in the 3D face animation task. Comprehensive quantitative and qualitative evaluations on our high-quality 3D emotional facial animation dataset, 3D-RAVDESS ($4.8863\times 10^{-5}$ mm for LVE and $0.9509\times 10^{-5}$ mm for EVE), together with the public dataset VOCASET ($2.8669\times 10^{-5}$ mm for LVE and $0.4664\times 10^{-5}$ mm for EVE), demonstrate that our approach achieves state-of-the-art performance.
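The abstract names three mechanisms: a spatio-temporal spiral convolution (SpiralConv3D), an attention-based fusion of emotion and content features (Mesh Attention), and a schedule that varies how much ground truth is fed back during training. The sketch below illustrates how such pieces could fit together in PyTorch. All names (`SpiralConv3DSketch`, `MeshAttentionFusionSketch`, `gt_ratio`) and design details (depthwise temporal convolution, standard cross-attention, linear decay) are assumptions made for illustration, not the authors' implementation; the spatial step follows the SpiralNet++ pattern of gathering each vertex's precomputed spiral neighbourhood and mixing it with a linear layer.

```python
# Hedged sketch of the mechanisms named in the abstract; NOT the authors' code.
import torch
import torch.nn as nn


class SpiralConv3DSketch(nn.Module):
    """Simplified spatio-temporal spiral convolution (illustrative stand-in).

    Spatial part: gather each vertex's precomputed spiral neighbourhood
    (SpiralNet++ style) and mix it with a linear layer. Temporal part:
    a depthwise 1D convolution over frames. The paper's SpiralConv3D
    may differ in both respects.
    """

    def __init__(self, in_ch, out_ch, spiral_indices, kernel_t=3):
        super().__init__()
        self.register_buffer("spiral", spiral_indices)          # [V, K]
        k = spiral_indices.shape[1]
        self.spatial = nn.Linear(in_ch * k, out_ch)
        self.temporal = nn.Conv1d(out_ch, out_ch, kernel_t,
                                  padding=kernel_t // 2, groups=out_ch)

    def forward(self, x):                                       # x: [B, T, V, C]
        b, t, v, _ = x.shape
        idx = self.spiral.reshape(-1)                           # [V*K]
        nbr = x[:, :, idx, :].reshape(b, t, v, -1)              # [B, T, V, K*C]
        y = self.spatial(nbr)                                   # [B, T, V, out]
        y = y.permute(0, 2, 3, 1).reshape(b * v, -1, t)         # [B*V, out, T]
        y = self.temporal(y).reshape(b, v, -1, t).permute(0, 3, 1, 2)
        return torch.relu(y)                                    # [B, T, V, out]


class MeshAttentionFusionSketch(nn.Module):
    """Fuses content and emotion features with standard cross-attention
    over mesh vertices (a stand-in for the paper's Mesh Attention)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, content, emotion):                        # [B*T, V, dim]
        fused, _ = self.attn(query=content, key=emotion, value=emotion)
        return fused + content                                  # residual fusion


def gt_ratio(epoch, total_epochs):
    """Hypothetical 'self-growing' schedule: start mostly teacher-forced
    on ground truth, decay linearly toward the model's own predictions."""
    return max(0.0, 1.0 - epoch / total_epochs)


if __name__ == "__main__":
    V, K = 50, 9                                                # tiny toy mesh
    spirals = torch.randint(0, V, (V, K))                       # fake spiral indices
    conv = SpiralConv3DSketch(16, 32, spirals)
    fuse = MeshAttentionFusionSketch(32)
    feats = torch.randn(2, 8, V, 16)                            # [B, T, V, C]
    h = conv(feats)                                             # [2, 8, V, 32]
    fused = fuse(h.reshape(-1, V, 32), h.reshape(-1, V, 32))
    print(fused.shape)                                          # torch.Size([16, 50, 32])
```

The `gt_ratio` schedule mirrors the general idea of scheduled sampling: early epochs lean on ground-truth frames, later epochs on the model's own predictions; the paper's actual self-growing rule with intermediate supervision may differ.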

