
FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion (2309.11306v1)

Published 20 Sep 2023 in cs.CV, cs.AI, and cs.GR

Abstract: Speech-driven 3D facial animation synthesis has been a challenging task in both industry and research. Recent methods mostly rely on deterministic deep learning models, meaning that, given a speech input, the output is always the same. In reality, however, the non-verbal facial cues that reside throughout the face are non-deterministic in nature. In addition, the majority of approaches focus on 3D vertex-based datasets, and methods compatible with existing facial animation pipelines that use rigged characters are scarce. To address these issues, we present FaceDiffuser, a non-deterministic deep learning model for speech-driven facial animation synthesis that is trained on both 3D vertex-based and blendshape-based datasets. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. To the best of our knowledge, ours is the first work to employ diffusion for speech-driven 3D facial animation synthesis. Extensive objective and subjective analyses show that our approach achieves better or comparable results relative to state-of-the-art methods. We also introduce a new in-house dataset based on a blendshape-based rigged character. We recommend watching the accompanying supplementary video. The code and the dataset will be made publicly available.
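The pipeline described in the abstract combines two components: a pre-trained HuBERT encoder that turns raw speech into frame-level features, and a denoising diffusion model (DDPM-style) that learns to recover clean animation frames, whether vertex offsets or blendshape weights, from noise, conditioned on those features. The sketch below illustrates that combination under stated assumptions; it is not the authors' released code, and names such as `FacialDenoiser`, the 52-dimensional blendshape output, the GRU backbone, and the linear noise schedule are illustrative choices, not details confirmed by the paper.

```python
# Minimal sketch of a speech-conditioned diffusion model for facial animation.
# Hypothetical names and hyperparameters; only the overall structure follows
# the abstract (HuBERT audio encoding + DDPM-style denoising).
import torch
import torch.nn as nn
from transformers import HubertModel

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # standard linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class FacialDenoiser(nn.Module):
    """Predicts the noise added to a window of animation frames, given the
    noisy frames, the diffusion timestep, and HuBERT speech features."""
    def __init__(self, anim_dim=52, audio_dim=768, hidden=256):
        super().__init__()
        # HuBERT base outputs 768-dim features from 16 kHz waveforms.
        self.audio_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.anim_proj = nn.Linear(anim_dim, hidden)
        self.time_embed = nn.Embedding(T, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, anim_dim)

    def forward(self, noisy_anim, t, waveform):
        # noisy_anim: (B, frames, anim_dim); t: (B,); waveform: (B, samples)
        audio = self.audio_proj(self.audio_encoder(waveform).last_hidden_state)
        # Naively resample audio features to the animation frame rate.
        audio = nn.functional.interpolate(
            audio.transpose(1, 2), size=noisy_anim.shape[1]).transpose(1, 2)
        h = self.anim_proj(noisy_anim) + audio + self.time_embed(t)[:, None, :]
        h, _ = self.backbone(h)
        return self.out(h)  # predicted noise, same shape as noisy_anim

def training_step(model, anim, waveform):
    """One DDPM training step (Ho et al., 2020): corrupt clean frames with
    Gaussian noise and train the network to predict that noise."""
    t = torch.randint(0, T, (anim.shape[0],))
    noise = torch.randn_like(anim)
    a = alphas_cumprod[t][:, None, None]
    noisy = a.sqrt() * anim + (1 - a).sqrt() * noise
    pred = model(noisy, t, waveform)
    return nn.functional.mse_loss(pred, noise)

# Usage (shapes only): anim = torch.randn(2, 100, 52); wav = torch.randn(2, 64000)
# loss = training_step(FacialDenoiser(), anim, wav); loss.backward()
```

At inference, one would start from pure noise and iterate the standard DDPM reverse updates for t = T-1 down to 0, reusing the same model and audio conditioning. Because the starting noise differs per sample, the synthesized animation differs across runs for the same speech input, which is the non-deterministic behavior the abstract emphasizes.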

Authors (3)
  1. Stefan Stan (1 paper)
  2. Kazi Injamamul Haque (3 papers)
  3. Zerrin Yumak (3 papers)
Citations (40)
