StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation (2208.10922v2)

Published 23 Aug 2022 in cs.CV, cs.LG, eess.AS, and eess.IV

Abstract: We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image, with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible by several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization; 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from lip movements, so that motions and lip movements can be manipulated independently while preserving identity; and 3) an autoregressive prior augmented with a normalizing flow to model the complex, multi-modal audio-to-motion latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way, when another motion source video is given, but also in a completely audio-driven manner, by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model synthesizes talking head videos of impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
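
The contrastive lip-sync discriminator scores how well a window of audio matches the corresponding lip motion, pulling matched audio-lip pairs together in an embedding space and pushing mismatched pairs apart. Below is a minimal sketch of such an objective as a symmetric InfoNCE loss; the embedding size, temperature, and in-batch negative scheme are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a contrastive lip-sync objective (InfoNCE-style).
# Assumes precomputed (B, D) embeddings from an audio encoder and a
# lip-region encoder; matched pairs share a batch index.
import torch
import torch.nn.functional as F

def contrastive_sync_loss(audio_emb, lip_emb, temperature=0.07):
    audio_emb = F.normalize(audio_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)
    logits = audio_emb @ lip_emb.t() / temperature       # (B, B) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric InfoNCE: audio->lip and lip->audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under this scheme, negatives come for free by pairing each audio window with every other clip's lips in the batch, so no explicit negative mining is needed.

The autoregressive prior augmented with a normalizing flow models the distribution of the next motion latent given past motion and the current audio, which is what lets the model infer diverse but plausible head motions from audio alone. The sketch below scores a motion-latent sequence with a single conditional affine flow step over a Gaussian base distribution; the GRU context, the dimensions, and the one-step flow are assumptions for illustration (practical systems stack several flow layers).

```python
# Hedged sketch of an autoregressive, flow-augmented prior over motion
# latents. Dimensions and the single affine step are illustrative.
import math
import torch
import torch.nn as nn

class FlowPrior(nn.Module):
    def __init__(self, z_dim=32, audio_dim=80, hidden=256):
        super().__init__()
        # Context over past motion latents and current audio features.
        self.rnn = nn.GRU(z_dim + audio_dim, hidden, batch_first=True)
        # One conditional affine flow step; real models stack several.
        self.affine = nn.Linear(hidden, 2 * z_dim)

    def log_prob(self, z, audio):
        """z: (B, T, z_dim) motion latents; audio: (B, T, audio_dim).
        Scores z_t given z_{<t} and audio_t via the change of variables
        eps_t = (z_t - mu_t) * exp(-log_s_t), with eps_t ~ N(0, I)."""
        z_prev = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        h, _ = self.rnn(torch.cat([z_prev, audio], dim=-1))
        mu, log_s = self.affine(h).chunk(2, dim=-1)
        eps = (z - mu) * torch.exp(-log_s)
        base = -0.5 * (eps.pow(2).sum(-1) + z.size(-1) * math.log(2 * math.pi))
        # log p(z) = log N(eps; 0, I) + log|d eps / d z| = base - sum(log_s)
        return (base - log_s.sum(-1)).sum(-1)            # per-sequence log-likelihood
```

Training would maximize log_prob over encoded motion sequences; sampling inverts the affine step, z_t = mu_t + exp(log_s_t) * eps_t, one timestep at a time.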

Authors (4)
  1. Dongchan Min (8 papers)
  2. Minyoung Song (2 papers)
  3. Eunji Ko (3 papers)
  4. Sung Ju Hwang (178 papers)
Citations (11)
