Towards Accurate Lip-to-Speech Synthesis in-the-Wild (2403.01087v1)

Published 2 Mar 2024 in cs.MM, cs.CV, cs.SD, and eess.AS

Abstract: In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements. The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone, resulting in unsatisfactory outcomes. To overcome this issue, we propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model. The noisy text is generated using a pre-trained lip-to-text model, enabling our approach to work without text annotations during inference. We design a visual text-to-speech network that utilizes the visual stream to generate accurate speech, which is in-sync with the silent input video. We perform extensive experiments and ablation studies, demonstrating our approach's superiority over the current state-of-the-art methods on various benchmark datasets. Further, we demonstrate an essential practical application of our method in assistive technology by generating speech for an ALS patient who has lost their voice but can still make mouth movements. Our demo video, code, and additional details can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw.

Toward More Accurate Lip-to-Speech Synthesis for In-the-Wild Scenarios

Introduction

Synthesizing speech from silent videos based solely on lip movements defines the task of lip-to-speech (L2S) generation, which is distinct from the more widely explored area of lip-to-text (L2T) conversion. While L2T focuses on generating textual transcriptions from silent videos, L2S aims to produce intelligible, natural speech that aligns closely with the visible lip movements of speakers in diverse settings. This paper presents a novel approach to L2S that outperforms existing methods by incorporating text supervision from a pre-trained L2T model, thereby infusing the model with essential language information.

Key Contributions

The research introduces several significant contributions to the field of lip-to-speech synthesis:

  • Challenging Current Lip-to-Speech Approaches: It addresses the limitations of existing L2S models, which struggle to learn language attributes from speech supervision alone, by conditioning on noisy text predictions from a pre-trained L2T model (a minimal training sketch follows this list).
  • Visual Text-to-Speech Model: The proposal includes a novel visual text-to-speech (TTS) network that synthesizes speech to match silent video inputs, significantly outperforming current methods in both qualitative and quantitative evaluations.
  • Empowering ALS Patients: Demonstrating a critical practical application, the method was used to generate speech for a patient with Amyotrophic Lateral Sclerosis (ALS), showcasing its potential in assistive technologies.
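
A minimal sketch of the noisy-text supervision idea from the first bullet, under stated assumptions: the pre-trained lip-to-text (L2T) network is kept frozen, its greedy transcript conditions the visual TTS, and the TTS is trained with a plain L1 loss against the ground-truth mel spectrogram. The module names, feature sizes, and loss weighting are illustrative placeholders, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

vocab, feat_dim, n_mels = 40, 512, 80

# Frozen, pre-trained L2T stub: lip features -> per-frame character logits.
l2t = nn.Linear(feat_dim, vocab).eval()
for p in l2t.parameters():
    p.requires_grad_(False)

# Trainable visual-TTS stubs: lip features + noisy-text embedding -> mel frames.
text_emb = nn.Embedding(vocab, feat_dim)
tts_head = nn.Linear(feat_dim, n_mels)

lip_feats = torch.randn(4, 75, feat_dim)     # batch of silent-video features
gt_mel = torch.randn(4, 75, n_mels)          # paired ground-truth speech mels

with torch.no_grad():
    noisy_text = l2t(lip_feats).argmax(-1)   # pseudo-transcript, no text annotations

pred_mel = tts_head(lip_feats + text_emb(noisy_text))
loss = nn.functional.l1_loss(pred_mel, gt_mel)   # speech-reconstruction objective
loss.backward()                                  # gradients flow only into the TTS stubs
```

The key point is that the text branch is supervised indirectly: the frozen L2T model injects language structure, while the only explicit target remains the speech signal itself.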

Methodological Innovations and Experimental Findings

Approach Overview

The paper’s approach integrates noisy text predictions, derived from a state-of-the-art L2T model, with visual features from silent videos to accurately generate speech that is synchronized with the lip movements. This method addresses the speech synthesis challenge from two angles: understanding the content to be spoken (through L2T) and determining the appropriate speaking style (through visual-TTS models conditioned on lip movements and text).
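
The two-stage pipeline described above can be summarized with a small sketch, shown here under assumptions: the class names, dimensions, cross-attention fusion, and greedy decoding are illustrative stand-ins rather than the authors' architecture. What it illustrates is that the output mel spectrogram has one frame per video frame, which keeps the generated speech in sync with the silent input video.

```python
import torch
import torch.nn as nn


class TinyLipToText(nn.Module):
    """Stand-in L2T model: lip-frame features -> per-frame character logits."""
    def __init__(self, feat_dim=512, vocab_size=40):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 256, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, vocab_size)

    def forward(self, lip_feats):                     # (B, T_video, feat_dim)
        hidden, _ = self.encoder(lip_feats)
        return self.head(hidden)                      # (B, T_video, vocab_size)


class TinyVisualTTS(nn.Module):
    """Stand-in visual TTS: noisy text + lip features -> mel spectrogram.

    Video frames attend to the text tokens, so the output length always
    matches the video length and the speech stays lip-synced.
    """
    def __init__(self, feat_dim=512, vocab_size=40, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.mel_head = nn.Linear(feat_dim, n_mels)

    def forward(self, lip_feats, token_ids):
        text = self.text_emb(token_ids)               # (B, T_text, feat_dim)
        fused, _ = self.attn(lip_feats, text, text)   # video queries attend to text
        return self.mel_head(fused + lip_feats)       # (B, T_video, n_mels)


# Inference: silent-video features -> noisy transcript -> video-synced mel.
lip_feats = torch.randn(1, 75, 512)                   # e.g. 3 s of video at 25 fps
l2t, vtts = TinyLipToText().eval(), TinyVisualTTS().eval()
with torch.no_grad():
    noisy_tokens = l2t(lip_feats).argmax(dim=-1)      # greedy "noisy text"
    mel = vtts(lip_feats, noisy_tokens)               # (1, 75, 80)
# A neural vocoder would then convert `mel` into a waveform.
```

In practice the text would be decoded into sub-word tokens and the fusion would be considerably richer, but the division of labour is the same: the L2T branch supplies content, while the visual branch supplies timing and speaking style.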

Superior Performance on Benchmarks

Extensive experiments across various datasets revealed that the proposed approach significantly improves upon the existing state-of-the-art methods in L2S. Especially notable is its performance in "in-the-wild" scenarios, which involve diverse speakers, lighting conditions, and backgrounds.

Theoretical and Practical Implications

The findings have broad implications, both theoretically and practically. Theoretically, this work elucidates the importance of incorporating language information via noisy text predictions for enhancing L2S systems' accuracy. Practically, it demonstrates the feasibility of providing a voice to individuals unable to speak due to medical conditions, thereby significantly impacting assistive technology fields.

Future Directions

The paper speculates on future research directions, emphasizing the extension to multiple languages and further refinement of the visual-TTS model for even greater accuracy and naturalness of generated speech. The potential to minimize the reliance on text annotations, perhaps through advancements in self-supervised learning, also offers a promising area for continued exploration.

Conclusion

This research sets a new benchmark for lip-to-speech synthesis, especially in unconstrained, multi-speaker scenarios. By leveraging a pre-trained lip-to-text model for language supervision alongside visual features from the silent video, the proposed method generates speech with markedly higher accuracy and naturalness than prior approaches. The demonstrated application for an ALS patient attests to the method's practical utility and its potential to benefit individuals with speech impairments. This work not only advances the state of L2S research but also opens avenues for its application in various user-centric and assistive technologies.

Authors (4)
  1. Sindhu Hegde
  2. Rudrabha Mukhopadhyay
  3. C. V. Jawahar
  4. Vinay Namboodiri