DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser (2311.16565v2)

Published 28 Nov 2023 in cs.CV, cs.SD, and eess.AS

Abstract: Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches have begun to account for the non-deterministic nature of speech-driven 3D face animation and employ diffusion models for the task. However, personalizing facial animation and accelerating animation generation remain two major limitations of existing diffusion-based methods. To address these limitations, we propose DiffusionTalker, a diffusion-based method that uses contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity that aggregates knowledge from audio sequences. The proposed identity embeddings capture customized facial cues across different people through contrastive learning. During inference, users can obtain personalized facial animation from input audio that reflects a specific talking style. For acceleration, we distill a trained diffusion model with hundreds of denoising steps into a lightweight model with 8 steps. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released.
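
The abstract does not spell out the contrastive objective, but a common way to learn such identity embeddings is an InfoNCE-style loss that pulls each audio clip's pooled features toward its own speaker's embedding and pushes them away from other speakers'. The sketch below illustrates that idea only; `IdentityBank`, the pooled `audio_feats`, and the temperature value are hypothetical stand-ins, not the paper's actual components.

```python
# Minimal sketch of contrastive identity-embedding learning (illustrative only;
# module and loss names are hypothetical, not from the DiffusionTalker paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityBank(nn.Module):
    """One learnable embedding per talking identity (hypothetical helper)."""
    def __init__(self, num_identities: int, dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(num_identities, dim)

    def forward(self, identity_ids: torch.Tensor) -> torch.Tensor:
        return self.embeddings(identity_ids)

def contrastive_identity_loss(audio_feats: torch.Tensor,
                              identity_embs: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each audio clip's features should match its own
    speaker's identity embedding and repel the other speakers' embeddings."""
    a = F.normalize(audio_feats, dim=-1)    # (B, D) pooled audio features
    e = F.normalize(identity_embs, dim=-1)  # (B, D) matching identity embeddings
    logits = a @ e.t() / temperature        # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy over rows and columns, as in CLIP-style training.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```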

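Likewise, the distillation for acceleration can be pictured as training an 8-step student denoiser to reproduce the output of a frozen many-step teacher given the same noise and audio conditioning. The following minimal sketch rests on those assumptions; `run_denoiser`, the 1000-step teacher budget, and the plain MSE objective are illustrative choices, not the paper's actual distillation recipe.

```python
# Hedged sketch of step-count distillation for a diffusion-based talker
# (illustrative; the real method's sampler, schedule, and loss may differ).
import torch
import torch.nn.functional as F

def run_denoiser(model, x, audio_cond, num_steps):
    """Toy deterministic refinement loop: each call predicts a cleaner x."""
    for t in reversed(range(num_steps)):
        t_batch = torch.full((x.size(0),), t, dtype=torch.long, device=x.device)
        x = model(x, t_batch, audio_cond)
    return x

def distillation_step(student, teacher, audio_cond, shape, optimizer):
    noise = torch.randn(shape)
    with torch.no_grad():                        # frozen teacher, no gradients
        target = run_denoiser(teacher, noise, audio_cond, num_steps=1000)
    pred = run_denoiser(student, noise.clone(), audio_cond, num_steps=8)
    loss = F.mse_loss(pred, target)              # match the teacher's animation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
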
Authors (7)
  1. Peng Chen (324 papers)
  2. Xiaobao Wei (28 papers)
  3. Ming Lu (157 papers)
  4. Yitong Zhu (4 papers)
  5. Naiming Yao (2 papers)
  6. Xingyu Xiao (6 papers)
  7. Hui Chen (298 papers)
Citations (7)
