Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Published 16 May 2024 in cs.CV, cs.AI, cs.SD, eess.AS, and eess.IV | arXiv:2405.10272v1

Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which efficiently generates high-quality motion codes. Moreover, we propose a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.
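
The paper does not include code here, but the conditional flow-matching (CFM) objective it builds its motion sampler on (Lipman et al., 2023; also used in Matcha-TTS) is simple to illustrate. Below is a minimal, hypothetical PyTorch sketch: a network regresses the straight-line velocity that transports Gaussian noise to a motion code, conditioned on, say, an identity embedding. All module names, dimensions, and the conditioning choice are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a conditional flow-matching (CFM) motion sampler.
# MOTION_DIM, COND_DIM, and the MLP architecture are illustrative assumptions.
import torch
import torch.nn as nn

MOTION_DIM, COND_DIM, HIDDEN = 128, 256, 512

class VelocityNet(nn.Module):
    """Predicts the flow field v_theta(x_t, t, cond)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MOTION_DIM + 1 + COND_DIM, HIDDEN),
            nn.SiLU(),
            nn.Linear(HIDDEN, HIDDEN),
            nn.SiLU(),
            nn.Linear(HIDDEN, MOTION_DIM),
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy sample, time step, and conditioning vector.
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def cfm_loss(model, x1, cond, sigma_min=1e-4):
    """Optimal-transport CFM objective (Lipman et al., 2023): regress the
    constant velocity that transports noise x0 to a data sample x1."""
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.size(0), 1, device=x1.device) # uniform time in [0, 1]
    # Straight-line interpolation between noise and data.
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1 - sigma_min) * x0              # target velocity
    pred = model(x_t, t, cond)
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def sample(model, cond, steps=10):
    """Integrate dx/dt = v_theta from t=0 (noise) to t=1 (motion code)
    with a few Euler steps."""
    x = torch.randn(cond.size(0), MOTION_DIM, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i * dt, device=cond.device)
        x = x + dt * model(x, t, cond)
    return x

# Usage: one training step, then sampling for a batch of 4 conditions.
model = VelocityNet()
cond = torch.randn(4, COND_DIM)
loss = cfm_loss(model, torch.randn(4, MOTION_DIM), cond)
motion_codes = sample(model, cond)  # shape (4, MOTION_DIM)
```

Because the learned flow is approximately straight, a handful of Euler steps suffices at inference time, which is presumably the efficiency the abstract alludes to.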
