
LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading (2306.03258v2)

Published 5 Jun 2023 in eess.AS and cs.SD

Abstract: Lip-to-speech involves generating natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high intelligibility on challenging, realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and incorporate the extracted text through a classifier-guidance mechanism in which a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the intelligibility of the produced speech: the improvement is readily perceptible when listening and is reflected empirically in a substantial reduction of the WER metric. We demonstrate the effectiveness of LipVoicer through human evaluation, which shows that it produces more natural and synchronized speech than competing methods. Finally, we created a demo showcasing LipVoicer's superiority in producing natural, synchronized, and intelligible speech. Project page and code: https://github.com/yochaiye/LipVoicer
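
The key mechanism the abstract describes is classifier guidance: during diffusion sampling, the reverse-step mean is shifted by the gradient of an ASR's log-likelihood of the lip-read transcript with respect to the current noisy sample. The sketch below illustrates that idea with toy PyTorch stand-ins; the module architectures, dimensions, noise schedule, and guidance scale are all illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch

class VideoConditionedDenoiser(torch.nn.Module):
    """Toy epsilon-prediction network conditioned on a lip-video embedding."""
    def __init__(self, mel_dim=80, video_dim=512):
        super().__init__()
        self.net = torch.nn.Linear(mel_dim + video_dim + 1, mel_dim)

    def forward(self, x_t, t, video_emb):
        # Encode the timestep as a single scalar feature (a simplification).
        t_feat = torch.full((x_t.shape[0], 1), float(t), device=x_t.device)
        return self.net(torch.cat([x_t, video_emb, t_feat], dim=-1))

class ToyASRClassifier(torch.nn.Module):
    """Toy ASR head scoring log p(transcript tokens | noisy mel sample)."""
    def __init__(self, mel_dim=80, vocab_size=1000):
        super().__init__()
        self.head = torch.nn.Linear(mel_dim, vocab_size)

    def log_prob(self, x_t, token_ids):
        logp = torch.log_softmax(self.head(x_t), dim=-1)  # (B, vocab)
        # Sum of log-likelihoods of the lip-read transcript tokens.
        return logp.gather(-1, token_ids).sum()

@torch.no_grad()
def guided_sample(denoiser, asr, video_emb, token_ids,
                  steps=50, mel_dim=80, guidance_scale=2.0):
    """DDPM-style ancestral sampling with ASR classifier guidance:
    the reverse-step mean is nudged by s * grad_x log p(text | x_t)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, mel_dim)  # start from pure noise
    for t in reversed(range(steps)):
        # Video-conditioned noise estimate and the standard DDPM mean.
        eps = denoiser(x, t, video_emb)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])

        # Classifier guidance: gradient of the ASR log-likelihood of the
        # lip-read transcript with respect to the current noisy sample.
        with torch.enable_grad():
            x_g = x.detach().requires_grad_(True)
            score = asr.log_prob(x_g, token_ids)
            grad = torch.autograd.grad(score, x_g)[0]
        mean = mean + guidance_scale * betas[t] * grad

        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Usage with random placeholder inputs:
denoiser, asr = VideoConditionedDenoiser(), ToyASRClassifier()
video_emb = torch.randn(1, 512)             # placeholder lip-video embedding
token_ids = torch.randint(0, 1000, (1, 8))  # placeholder lip-read transcript
mel = guided_sample(denoiser, asr, video_emb, token_ids)
print(mel.shape)  # torch.Size([1, 80])
```

In this reading, `guidance_scale` trades fidelity to the video-conditioned prior against adherence to the lip-read text, which is consistent with the abstract's observation that the text modality is what drives intelligibility (and the WER reduction).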
