Towards Accurate Lip-to-Speech Synthesis in-the-Wild (2403.01087v1)
Abstract: In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements. The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone, resulting in unsatisfactory outcomes. To overcome this issue, we propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model. The noisy text is generated using a pre-trained lip-to-text model, enabling our approach to work without text annotations during inference. We design a visual text-to-speech network that utilizes the visual stream to generate accurate speech that is in sync with the silent input video. We perform extensive experiments and ablation studies, demonstrating our approach's superiority over the current state-of-the-art methods on various benchmark datasets. Further, we demonstrate an essential practical application of our method in assistive technology by generating speech for an ALS patient who has lost their voice but can still make mouth movements. Our demo video, code, and additional details can be found at \url{http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw}.
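To make the two-stage pipeline in the abstract concrete, the sketch below shows one plausible way to wire it up: a pre-trained lip-to-text model first produces a noisy transcript from the silent video, and a visual TTS module then fuses that transcript with the visual stream to predict a mel-spectrogram that a neural vocoder converts to a waveform. This is a minimal PyTorch-style illustration; all class names, dimensions, and module choices here are assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class VisualTTS(nn.Module):
    """Illustrative text-to-speech decoder conditioned on the visual stream,
    so the predicted mel-spectrogram stays time-aligned with the input video."""

    def __init__(self, vocab_size=1000, text_dim=256, video_dim=512, mel_dim=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab_size, text_dim)  # hypothetical vocab
        self.video_proj = nn.Linear(video_dim, text_dim)
        # Joint self-attention over concatenated text and video tokens,
        # letting speech timing follow the lip movements.
        self.fuser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mel_head = nn.Linear(text_dim, mel_dim)

    def forward(self, token_ids, video_features):
        text = self.text_encoder(token_ids)                  # (B, T_text, D)
        video = self.video_proj(video_features)              # (B, T_vid, D)
        fused = self.fuser(torch.cat([text, video], dim=1))  # attend across both
        # Read the mel frames off the video positions: one frame per video step.
        return self.mel_head(fused[:, -video.size(1):])      # (B, T_vid, mel_dim)

def lip_to_speech(video_features, lip_to_text_model, tts_model, vocoder):
    """Stage 1: a frozen, pre-trained lip-to-text model yields a noisy
    transcript (no ground-truth text needed at inference).
    Stage 2: visual TTS generates a video-synced mel-spectrogram,
    which a neural vocoder turns into a waveform."""
    noisy_tokens = lip_to_text_model(video_features)  # (B, T_text) token ids
    mel = tts_model(noisy_tokens, video_features)
    return vocoder(mel)
```

One design point worth noting: because the transcript is produced by the lip-to-text model itself, the text supervision is only needed as a source of language structure during training; at inference the system runs end-to-end from silent video alone, exactly as the abstract claims.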
- Sindhu Hegde
- Rudrabha Mukhopadhyay
- C. V. Jawahar
- Vinay Namboodiri