Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which generates high-quality motion codes efficiently. Moreover, we propose a novel conditioning method for the TTS system that utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that generalises to unseen identities.
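The abstract names conditional flow matching as the motion sampler but gives no implementation detail, so the following is a minimal, hypothetical sketch of how such a sampler is typically trained. The `MotionVectorField` network, the tensor dimensions, and the conditioning embedding are illustrative assumptions rather than the authors' code; only the optimal-transport flow matching objective itself is standard (Lipman et al., 2023).

```python
import torch
import torch.nn as nn

class MotionVectorField(nn.Module):
    """Hypothetical velocity-field network v_theta(x_t, t, cond).

    The dimensions (motion-code dim 128, condition dim 256) are
    illustrative assumptions, not taken from the paper.
    """
    def __init__(self, motion_dim=128, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # The time t enters as one extra scalar feature per sample.
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1, cond, sigma_min=1e-4):
    """Optimal-transport conditional flow matching objective:
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1, with target
    velocity u_t = x1 - (1 - sigma_min) * x0."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)  # uniform time in [0, 1)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1    # point on the OT path
    u_t = x1 - (1 - sigma_min) * x0                  # target velocity
    v = model(x_t, t, cond)
    return ((v - u_t) ** 2).mean()

# Usage: one optimisation step on a batch of target motion codes.
model = MotionVectorField()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x1 = torch.randn(32, 128)    # stand-in batch of ground-truth motion codes
cond = torch.randn(32, 256)  # stand-in conditioning embeddings
loss = cfm_loss(model, x1, cond)
opt.zero_grad()
loss.backward()
opt.step()
```

At inference time, a new motion code is obtained by integrating the learned velocity field from Gaussian noise over a handful of ODE solver steps (e.g. Euler), which is what makes flow matching samplers efficient relative to many-step diffusion.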