
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Published 29 Aug 2023 in eess.AS, cs.AI, cs.LG, and cs.SD | (2308.15256v2)

Abstract: The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, which results in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better address this problem, we employ a flow-based post-net which captures and refines the details of the generated speech. We perform extensive experiments on two datasets and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS.
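The link between a one-to-many mapping and over-smoothed output can be illustrated with a toy example. The sketch below is not the paper's method; it only assumes a model trained with mean squared error and a single input that has two equally valid targets. The target values (100 and 200, loosely thought of as two plausible pitch values in Hz) and the helper names `mse` and `best_constant_prediction` are hypothetical, chosen for illustration.

```python
# Toy illustration (an assumption-laden sketch, not the paper's system):
# when one input has several equally valid targets, the prediction that
# minimises mean squared error is their average, which matches none of them.

def mse(pred, targets):
    """Mean squared error of a single prediction against all valid targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


def best_constant_prediction(targets, candidates):
    """Return the candidate value with the lowest mean squared error."""
    return min(candidates, key=lambda p: mse(p, targets))


# Hypothetical values: two equally plausible targets for the same input.
targets = [100.0, 200.0]

# Search a grid of candidate predictions in steps of 1.
candidates = [float(c) for c in range(80, 221)]

best = best_constant_prediction(targets, candidates)
print("MSE-optimal prediction:", best)
print("MSE at the optimum:", mse(best, targets))
print("MSE when committing to one real target:", mse(100.0, targets))
```

The lowest-error prediction falls midway between the two targets, so a regression loss alone rewards an averaged output over either realistic one. Supplying extra conditioning that separates the cases, as the abstract describes for acoustic variance information, is one way to avoid this averaging.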

