FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization (2403.06375v3)

Published 11 Mar 2024 in cs.CV

Abstract: Generating emotional talking faces is a practical yet challenging endeavor. To create a lifelike avatar, we draw on two critical insights from a human perspective: 1) the mapping between audio and non-deterministic facial dynamics (expressions, blinks, and poses) should be both synchronous and one-to-many; 2) vibrant expressions are often accompanied by emotion-aware, high-definition (HD) textures and finely detailed teeth. Both aspects are frequently overlooked by existing methods. To this end, this paper proposes normalizing-Flow and Vector-Quantization modeling to produce emotional talking faces that satisfy both insights concurrently (FlowVQTalker). Specifically, we develop a flow-based coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space represented as a mixture distribution. Generation begins with random sampling from the modeled distribution, guided by the accompanying audio, enabling both lip synchronization and the generation of uncertain nonverbal facial cues. Furthermore, our vector-quantization image generator treats the creation of expressive facial images as a code-query task, using a learned codebook to provide rich, high-quality textures that enhance the emotional perception of the results. Extensive experiments showcase the effectiveness of our approach.
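
The abstract names two mechanisms, a mixture-distribution latent space for emotion dynamics and a codebook-based code query for textures, but the page carries no code. The sketch below is a minimal PyTorch illustration of those two ideas under loud assumptions: `EmotionMixtureLatent`, `VQCodebook`, all shapes, and the straight-through estimator are hypothetical stand-ins (straight-through quantization is standard VQ-VAE practice, not confirmed as this paper's exact design), not the authors' implementation.

```python
import torch
import torch.nn as nn


class EmotionMixtureLatent(nn.Module):
    """One Gaussian component per emotion class, forming a mixture latent space."""

    def __init__(self, num_emotions: int, latent_dim: int):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_emotions, latent_dim))
        self.log_stds = nn.Parameter(torch.zeros(num_emotions, latent_dim))

    def sample(self, emotion_id: torch.Tensor) -> torch.Tensor:
        # Draw z from the component of the requested emotion class; in the
        # paper's framing, an audio-conditioned flow (not shown) would then
        # map z to motion coefficients.
        mean = self.means[emotion_id]
        std = self.log_stds[emotion_id].exp()
        return mean + std * torch.randn_like(std)


class VQCodebook(nn.Module):
    """Nearest-neighbour code query over a learned codebook (VQ-VAE style)."""

    def __init__(self, num_codes: int, code_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, code_dim) continuous encoder features.
        flat = feats.reshape(-1, feats.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)  # (B*N, num_codes)
        idx = dists.argmin(dim=-1)                       # nearest code index
        quantized = self.codebook(idx).view_as(feats)
        # Straight-through estimator: copy gradients past the argmin.
        return feats + (quantized - feats).detach()


# Hypothetical shapes, for illustration only.
latent = EmotionMixtureLatent(num_emotions=8, latent_dim=64)
vq = VQCodebook(num_codes=1024, code_dim=256)
z = latent.sample(torch.tensor([3]))    # one sample from emotion class 3
feats = torch.randn(1, 16, 256)         # stand-in image-encoder features
textures = vq(feats)                    # codebook-backed quantized features
```

In the full method, the sampled latent would pass through an audio-conditioned normalizing flow to yield facial-motion coefficients, and the quantized features would drive the image decoder; both of those stages are omitted from this sketch.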
