
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis (2005.08209v1)

Published 17 May 2020 in cs.CV, cs.LG, cs.SD, and eess.AS

Abstract: Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is four times more intelligible than previous works in this space. Please check out our demo video for a quick overview of the paper, method, and qualitative results. https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be

Authors (4)
  1. K R Prajwal (11 papers)
  2. Rudrabha Mukhopadhyay (14 papers)
  3. Vinay Namboodiri (25 papers)
  4. C V Jawahar (19 papers)
Citations (98)

Summary

The paper "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis" presents a novel approach in the domain of audiovisual machine learning, focusing on the conversion of silent lip movements into comprehensible speech. Leveraging a data-driven methodology, the authors aim to accurately reproduce speech from lip sequences by accommodating individual speaking styles in unconstrained environments with diverse vocabularies. This research significantly improves upon previous work by focusing on personalized and natural speech synthesis, and introduces several key innovations to achieve this goal.

Methodology

The authors propose a sequence-to-sequence model, Lip2Wav, built around an architecture designed to capture individual speaker variation. A 3D convolutional neural network (3D-CNN) encoder extracts spatio-temporal features from sequences of face frames, ensuring that the model captures critical facial movements, particularly those of the lips, across time. A Tacotron 2-style attention-driven decoder then translates these features into mel-spectrograms, which are finally converted into waveform audio using the Griffin-Lim algorithm.
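
For concreteness, the PyTorch sketch below shows the overall shape of such a pipeline: a 3D-CNN encoder over a window of face frames feeding an attention-driven autoregressive decoder that emits mel-spectrogram frames. The layer sizes, kernel shapes, and the single-LSTM decoder are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a Lip2Wav-style pipeline: 3D-CNN encoder + attention decoder.
# Hyperparameters here are assumptions for illustration, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FaceEncoder3D(nn.Module):
    """Stacked 3D convolutions turning a face-frame sequence into one
    spatio-temporal feature vector per time step."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
        )

    def forward(self, frames):              # frames: (B, 3, T, H, W)
        x = self.conv(frames)               # (B, C, T, H', W')
        x = x.mean(dim=(3, 4))              # global spatial pooling -> (B, C, T)
        return x.transpose(1, 2)            # (B, T, C) encoder outputs


class MelDecoder(nn.Module):
    """Autoregressive decoder (loosely Tacotron 2-style, heavily simplified)
    with content-based attention over the encoder outputs, predicting one
    80-dim mel frame per step."""
    def __init__(self, feat_dim=256, n_mels=80):
        super().__init__()
        self.attn_query = nn.Linear(feat_dim, feat_dim)
        self.lstm = nn.LSTMCell(feat_dim + n_mels, feat_dim)
        self.mel_out = nn.Linear(feat_dim, n_mels)
        self.n_mels = n_mels

    def forward(self, enc_out, mel_targets):          # enc_out: (B, T, C)
        B, T, C = enc_out.shape
        h = enc_out.new_zeros(B, C)
        c = enc_out.new_zeros(B, C)
        prev_mel = enc_out.new_zeros(B, self.n_mels)
        outputs = []
        for t in range(mel_targets.shape[1]):
            # Content-based attention: score each encoder frame against the state.
            scores = torch.bmm(enc_out, self.attn_query(h).unsqueeze(2))   # (B, T, 1)
            context = (F.softmax(scores, dim=1) * enc_out).sum(dim=1)      # (B, C)
            h, c = self.lstm(torch.cat([context, prev_mel], dim=1), (h, c))
            mel = self.mel_out(h)
            outputs.append(mel)
            prev_mel = mel_targets[:, t]   # teacher forcing (always on in this sketch)
        return torch.stack(outputs, dim=1)             # (B, steps, n_mels)


# Shape check on random data: 75 frames (~3 s at 25 fps) of 96x96 face crops.
enc, dec = FaceEncoder3D(), MelDecoder()
mels = dec(enc(torch.randn(2, 3, 75, 96, 96)), torch.randn(2, 240, 80))
print(mels.shape)   # torch.Size([2, 240, 80])
```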

Extensive context windows and a gradual decay of teacher forcing are key design choices in Lip2Wav, significantly improving the model's ability to disambiguate phonemes and reducing overfitting to the training data. The context window, extended to about three seconds compared with the shorter windows of prior work, lets the model exploit longer-range linguistic cues. The gradual teacher-forcing decay, in turn, keeps predictions robust at inference time by weaning the model off the ground-truth previous frames it is fed during training.
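
A minimal sketch of what such a gradual teacher-forcing decay can look like in practice is shown below; the start value, floor, and decay horizon are hypothetical and not the schedule used in the paper.

```python
# Hypothetical teacher-forcing decay: early in training the decoder is mostly
# fed ground-truth mel frames, and the probability is annealed toward
# free-running prediction. All constants are illustrative assumptions.
import random

def teacher_forcing_ratio(step, start=1.0, floor=0.2, decay_steps=200_000):
    """Linearly anneal the probability of feeding the ground-truth previous
    frame (instead of the model's own prediction) to the decoder."""
    frac = min(step / decay_steps, 1.0)
    return start - (start - floor) * frac

def pick_decoder_input(gt_prev_frame, predicted_prev_frame, step):
    """Per-step choice between ground truth and the model's own output."""
    if random.random() < teacher_forcing_ratio(step):
        return gt_prev_frame          # teacher forcing
    return predicted_prev_frame       # free running

print(round(teacher_forcing_ratio(0), 2),
      round(teacher_forcing_ratio(100_000), 2),
      round(teacher_forcing_ratio(400_000), 2))   # 1.0 0.6 0.2
```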

Dataset

An essential contribution of this work is a new dataset tailored to single-speaker lip-to-speech synthesis in unconstrained settings. The Lip2Wav dataset, comprising approximately 120 hours of video across five distinct speakers, is a significant advance over existing corpora such as GRID and TCD-TIMIT in both scale and diversity. It offers a broad vocabulary, substantial per-speaker data, and naturalistic recording conditions, providing a more realistic benchmark for training and evaluation than previously available.
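
As a rough illustration of how training examples can be drawn from such data, the sketch below pairs a roughly three-second window of face frames with its 80-band mel-spectrogram target. The frame rate, hop length, and directory layout are assumptions for illustration, not the released preprocessing code.

```python
# Hedged sketch of assembling one (face window, mel target) training pair.
# Paths, frame rate, and STFT settings are assumptions, not the paper's.
import numpy as np
import librosa
import cv2

FPS, SR = 25, 16000        # assumed video frame rate and audio sample rate
WINDOW_SECS = 3            # ~3-second context window
HOP, N_MELS = 200, 80      # 12.5 ms hop at 16 kHz, 80 mel bands

def load_sample(frames_dir, wav_path, start_frame):
    """Return (face_window, mel_target) for one ~3-second training example."""
    n_frames = FPS * WINDOW_SECS
    faces = np.stack([
        cv2.imread(f"{frames_dir}/{start_frame + i:05d}.jpg")   # hypothetical layout
        for i in range(n_frames)
    ])                                                          # (T, H, W, 3)

    wav, _ = librosa.load(wav_path, sr=SR)
    start = int(start_frame / FPS * SR)
    seg = wav[start:start + WINDOW_SECS * SR]
    mel = librosa.feature.melspectrogram(
        y=seg, sr=SR, n_fft=800, hop_length=HOP, n_mels=N_MELS)
    return faces, np.log(mel + 1e-5).T                          # (T_mel, 80)
```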

Results and Evaluation

The Lip2Wav model demonstrates significant improvements in both constrained and unconstrained settings. On constrained datasets such as GRID and TCD-TIMIT, Lip2Wav achieves superior results on objective metrics, including STOI, ESTOI, and PESQ, as well as word error rate (WER). The gap is even more pronounced in unconstrained settings on the Lip2Wav dataset, where the model's intelligibility is reported to be almost four times that of comparable state-of-the-art methods.
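
For reference, the snippet below shows one way these objective metrics can be computed for a synthesized utterance against its ground-truth recording, using the third-party pystoi and pesq packages; this is a generic recipe rather than the paper's evaluation code, and the file paths are placeholders.

```python
# Compute STOI, ESTOI, and PESQ between reference and synthesized speech.
# Both signals are resampled to 16 kHz so PESQ can run in wide-band mode.
import librosa
from pystoi import stoi
from pesq import pesq

SR = 16000
clean, _ = librosa.load("ground_truth.wav", sr=SR)      # reference recording
synth, _ = librosa.load("lip2wav_output.wav", sr=SR)    # synthesized speech

# Trim to a common length so the frame-wise comparisons line up.
n = min(len(clean), len(synth))
clean, synth = clean[:n], synth[:n]

print("STOI :", stoi(clean, synth, SR, extended=False))  # 0..1, higher is better
print("ESTOI:", stoi(clean, synth, SR, extended=True))   # extended STOI
print("PESQ :", pesq(SR, clean, synth, "wb"))            # ~-0.5..4.5, higher is better
```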

A rigorous set of human evaluations corroborates these quantitative findings, showing substantial reductions in mispronunciation and word-omission rates. Importantly, Lip2Wav's subjective intelligibility and naturalness ratings suggest that speaker individuality is preserved effectively, with the synthesized output aligning more closely with natural human speech.

Implications and Future Directions

This paper lays crucial groundwork for future advancements in lip-to-speech synthesis, offering a robust framework that can adapt to individual speaker characteristics. The implications extend to various practical applications, including improved communication aids for those with hearing impairments and enhanced surveillance capabilities.

Future research should address enduring challenges such as disambiguating homophenes and generalizing to vocabulary beyond a speaker's usual domain. Moreover, tighter integration of visual and auditory cues could further improve synthesis accuracy across diverse environments and contexts.

Overall, the work presents a comprehensive and technically solid advancement in synthesizing speech from lip movements, significantly pushing the frontier of audiovisual speech processing.
