AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation (2310.07236v3)

Published 11 Oct 2023 in cs.CV and cs.MM

Abstract: Speech-driven 3D facial animation, which generates facial movements synchronized with driving speech, has been widely explored in recent years. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works attempt to capture these personal styles by fine-tuning modules, but the limited adaptation data leads to a lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach that learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter that builds a discrete pose prior and retrieves the appropriate style embedding with a semantic-aware pose style matrix, without any fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style of the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
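The abstract describes MoLoRA only at a high level: a mixture of low-rank adaptation branches used to fine-tune the expression adapter on a short reference clip. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation: it wraps a frozen linear layer with several LoRA branches of different ranks and mixes them with a learnable softmax gate. The class name, the rank set, and the gating scheme are all assumptions made here for clarity.

```python
# Hypothetical MoLoRA-style layer: a frozen linear layer plus a learnable
# mixture of LoRA branches at several ranks. This is an illustrative sketch,
# not the AdaMesh implementation.
import torch
import torch.nn as nn

class MoLoRALinear(nn.Module):
    """Frozen linear layer augmented with a mixture of low-rank updates.

    Each branch i contributes (alpha / r_i) * B_i A_i x, weighted by a
    learnable softmax gate; only the branches and the gate are trained
    during style adaptation, so the pre-trained weights stay intact.
    """

    def __init__(self, in_features, out_features, ranks=(1, 2, 4, 8), alpha=8.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weights frozen
        self.base.bias.requires_grad_(False)
        self.downs = nn.ModuleList(nn.Linear(in_features, r, bias=False) for r in ranks)
        self.ups = nn.ModuleList(nn.Linear(r, out_features, bias=False) for r in ranks)
        for up in self.ups:
            nn.init.zeros_(up.weight)  # zero-init so adaptation starts at identity
        self.scales = [alpha / r for r in ranks]
        self.gate = nn.Parameter(torch.zeros(len(ranks)))  # learnable mixture weights

    def forward(self, x):
        out = self.base(x)
        weights = torch.softmax(self.gate, dim=0)
        for w, down, up, s in zip(weights, self.downs, self.ups, self.scales):
            out = out + w * s * up(down(x))
        return out

# Usage: wrap the adapter's linear layers, then fine-tune on the ~10-second
# reference clip; only the LoRA branches and the gate receive gradients.
layer = MoLoRALinear(256, 256)
y = layer(torch.randn(4, 256))
```

Mixing several ranks is one plausible way to capture style cues at different granularities with few trainable parameters; the paper's exact branch structure and gating may differ.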
