
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time (2404.10667v2)

Published 16 Apr 2024 in cs.CV

Abstract: We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only generating lip movements that are exquisitely synchronized with the audio, but also producing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.

Authors (9)
  1. Sicheng Xu
  2. Guojun Chen
  3. Yu-Xiao Guo
  4. Jiaolong Yang
  5. Chong Li
  6. Zhenyu Zang
  7. Yizhong Zhang
  8. Xin Tong
  9. Baining Guo

Summary

Enhanced Realism in Audio-Driven Talking Faces: Introducing the VASA-1 Framework

Introduction to VASA-1

The VASA-1 framework is a significant contribution to AI-driven multimedia communication: it generates lifelike talking-face videos from a single static image and a speech audio clip. The method produces lip movements precisely synchronized with the audio, enriched by a broad range of facial expressions and natural head movements that heighten the realism of the digital persona.

Core Innovations

VASA-1 introduces several technical advancements:

  • A diffusion-based model that generates holistic facial dynamics and head movements, operating within a specially constructed face latent space.
  • An expressive and disentangled face latent space learned from large-scale video data, enabling nuanced control over the generated facial attributes and motions (a toy sketch of this factorization follows the list).
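
The disentanglement can be pictured as an encoder that factors a face image into separate identity, head-pose, and facial-dynamics codes, plus a decoder that recombines them. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the module names, dimensions, and architecture are assumptions made for clarity, not the authors' implementation, and appearance detail is folded into the identity code for brevity.

```python
# Minimal, hypothetical sketch of a disentangled face latent space.
# Module names and dimensions are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Factor a face image into identity, head-pose, and dynamics latents."""
    def __init__(self, id_dim=256, pose_dim=6, dyn_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 512), nn.ReLU(),
        )
        self.id_head = nn.Linear(512, id_dim)      # who the person is
        self.pose_head = nn.Linear(512, pose_dim)  # 3D rotation + translation
        self.dyn_head = nn.Linear(512, dyn_dim)    # expression / facial dynamics

    def forward(self, img):                        # img: (B, 3, H, W)
        h = self.backbone(img)
        return self.id_head(h), self.pose_head(h), self.dyn_head(h)

class FaceDecoder(nn.Module):
    """Recombine the latents into an image; swapping pose/dynamics while
    keeping identity fixed is what enables reanimation from one photo."""
    def __init__(self, id_dim=256, pose_dim=6, dyn_dim=64):
        super().__init__()
        self.fc = nn.Linear(id_dim + pose_dim + dyn_dim, 64 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, id_z, pose_z, dyn_z):
        h = self.fc(torch.cat([id_z, pose_z, dyn_z], dim=-1)).view(-1, 64, 8, 8)
        return self.up(h)                          # (B, 3, 32, 32) toy output
```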

Methodological Framework

The framework generates all facial dynamics holistically in a single latent space, covering not just lip motion but also eye gaze, blinks, and other nuanced expressions. This differs considerably from past methods, which treated the various facial components separately. Generation is driven by a Diffusion Transformer trained on a large corpus of talking-face videos, and the system supports online synthesis of 512x512 video at up to 40 FPS with negligible starting latency.
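
To make the generation step concrete, the following is a hedged sketch of a transformer-based diffusion model over windows of motion latents, conditioned on per-frame audio features (for example from a wav2vec-style encoder) and a diffusion timestep. The layer sizes, conditioning scheme, and noise schedule are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of a diffusion transformer over motion-latent windows,
# conditioned on audio features. Sizes and schedule are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionDiffusionTransformer(nn.Module):
    def __init__(self, motion_dim=70, audio_dim=768, d_model=256, n_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.t_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        # noisy_motion: (B, T, motion_dim); audio_feats: (B, T, audio_dim); t: (B,)
        cond = self.audio_in(audio_feats) + self.t_embed(t[:, None, None].float())
        x = self.motion_in(noisy_motion) + cond   # per-frame conditioning
        return self.out(self.encoder(x))          # predict the injected noise

def training_step(model, motion, audio_feats, num_steps=1000):
    """Standard DDPM-style noise-prediction loss on a window of motion latents."""
    b = motion.size(0)
    t = torch.randint(0, num_steps, (b,))
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2) ** 2  # toy schedule
    noise = torch.randn_like(motion)
    noisy = alpha_bar.sqrt()[:, None, None] * motion \
          + (1 - alpha_bar).sqrt()[:, None, None] * noise
    return F.mse_loss(model(noisy, audio_feats, t), noise)
```

At inference time, one plausible way to organize streaming generation is to denoise overlapping motion-latent windows sequentially, conditioning each window on the tail of the previous one, which is broadly how a low-latency, real-time pipeline can be arranged.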

Theoretical and Practical Implications

From a theoretical standpoint, VASA-1's synchronization of audio with a latent representation of facial movements demonstrates a tight integration of audio-visual data and pushes the boundaries of what generative models can achieve in multimedia. Practically, this technology sets the stage for more emotionally resonant and engaging AI avatars, with potential impact on remote education, virtual assistance, and telehealth through a more human-like interaction model.

Evaluation and Results

VASA-1 demonstrates superior performance across various metrics compared to existing methods, with outstanding video quality and a highly realistic portrayal of facial and head dynamics. Not only does VASA-1 generate high-fidelity videos that align closely with the given audio, but the model also supports dynamic adjustments based on optional signals such as gaze direction, head distance, and emotional tone.
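
One common way such optional control signals can be honored at sampling time is classifier-free guidance, where the denoiser is queried with and without each condition and the predictions are extrapolated. The function below is a hedged sketch of that idea; the denoiser interface with optional keyword conditions and the guidance weights are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of classifier-free guidance over optional control signals
# (gaze direction, head distance, emotion offset). The denoiser interface
# and guidance weights are illustrative assumptions.
def guided_noise_prediction(denoiser, noisy_motion, t, audio,
                            gaze=None, dist=None, emotion=None,
                            w_audio=2.0, w_ctrl=1.5):
    """Extrapolate from the unconditional prediction toward the audio-
    conditioned and fully-conditioned predictions."""
    # `denoiser(noisy_motion, t, audio=..., gaze=..., dist=..., emotion=...)`
    # is assumed to return a noise estimate; None means "condition dropped".
    eps_uncond = denoiser(noisy_motion, t, audio=None)
    eps_audio = denoiser(noisy_motion, t, audio=audio)
    eps_full = denoiser(noisy_motion, t, audio=audio,
                        gaze=gaze, dist=dist, emotion=emotion)
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_ctrl * (eps_full - eps_audio))
```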

Future Research Directions

While VASA-1 marks a substantial step forward, future work could extend the approach to full-body dynamics, enabling more comprehensive interaction simulations. Handling more diverse environmental contexts and a wider range of emotional responses could further broaden the technology's applications.

Conclusion

VASA-1 pairs innovations in generative modeling with a practical system that produces high-resolution, real-time talking-face video. The result raises the bar for digital communication and offers a glimpse of AI systems that interact with people in a way that more closely mirrors natural human conversation.
