
Large-Scale Visual Speech Recognition (1807.05162v3)

Published 13 Jul 2018 in cs.CV and cs.LG

Abstract: This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on other lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.

Citations (147)

Summary

  • The paper introduces the V2P model and the LSVSR dataset, achieving a 40.9% word error rate that significantly outperforms existing lipreading methods.
  • The authors use 3D convolutional layers and bidirectional LSTMs to extract spatiotemporal features and aggregate temporal information.
  • The research demonstrates practical benefits for assisting speech-impaired individuals while suggesting future work on enhancing system robustness in diverse visual contexts.

Large-Scale Visual Speech Recognition: An Examination

The paper presents a significant advancement in visual speech recognition (VSR): a scalable, open-vocabulary lipreading system. It addresses the main constraints of earlier work, namely narrow vocabularies and small datasets, with an integrated framework that spans data collection, modeling, and decoding.

The paper describes the construction of the LSVSR dataset, currently the largest dataset for visual speech recognition. It comprises 3,886 hours of paired video and text extracted from YouTube videos, providing an unprecedented resource for research in this domain. The dataset also covers a broader and more diverse vocabulary than its predecessors, with 2.2 times more distinct words.

The core architectural contribution is the Vision to Phoneme (V2P) model, which maps raw video sequences to phoneme distributions that a decoder then converts into word sequences. V2P uses a deep neural network composed of 3D convolutional layers for extracting spatiotemporal features and bidirectional LSTMs for temporal aggregation. Predicting phonemes rather than characters defers the disambiguation of homophones, words that look identical on the lips, to the decoder's language model instead of forcing the network to resolve them, a common difficulty for character-level recognition models.
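To make the design concrete, the following is a minimal PyTorch sketch of a V2P-style network: 3D convolutions over stabilized lip crops, bidirectional LSTMs for temporal aggregation, and per-frame phoneme distributions suitable for CTC training. Layer counts, channel widths, kernel sizes, and the 40-phoneme inventory are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a V2P-style model, assuming illustrative layer sizes.
import torch
import torch.nn as nn

class V2PSketch(nn.Module):
    def __init__(self, num_phonemes: int = 40, hidden: int = 256):
        super().__init__()
        # Spatiotemporal feature extractor over (batch, channels, time, H, W).
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(64, 128, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse spatial dims, keep time
        )
        # Bidirectional LSTMs aggregate information across frames.
        self.lstm = nn.LSTM(128, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # +1 output for the CTC blank symbol.
        self.head = nn.Linear(2 * hidden, num_phonemes + 1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, time, height, width) of stabilized lip crops.
        feats = self.conv(video)                 # (batch, 128, time, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)    # (batch, 128, time)
        feats = feats.transpose(1, 2)            # (batch, time, 128)
        out, _ = self.lstm(feats)
        return self.head(out).log_softmax(-1)   # per-frame phoneme log-probs
```

In this sketch, training would pair the per-frame log-probabilities with a CTC-style loss (e.g., `nn.CTCLoss`), while a separate decoder with a language model would turn phoneme posteriors into word sequences, mirroring the division of labor described above.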

One remarkable outcome of this research is the evaluation: the proposed system achieves a word error rate (WER) of 40.9% on a held-out test set. Professional lipreaders, even with access to additional contextual information, reached WERs of 86.4% and 92.9% on the same data. V2P also clearly improves on prior methods: variants of LipNet and of Watch, Attend, and Spell (WAS) yielded WERs of 89.8% and 76.8%, respectively.
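For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference transcript, normalized by the reference length. The function below is a generic illustration of the metric, not the paper's evaluation code.

```python
# Minimal WER: Levenshtein distance over words divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```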

The implications of these results extend beyond academia into practical applications, particularly for individuals with speech impairments. The authors underscore the potential impact of their system in assisting those who have lost the ability to generate voice due to medical conditions, providing them with an alternative means of communication.

Despite these strides, the authors acknowledge limits to generalization across a wider range of visual contexts, noting that the system performs best under certain viewing angles and video qualities. This points to future work on making VSR systems robust to varying recording conditions.

Looking ahead, the authors' approach opens multiple avenues for further exploration within AI and deep learning, especially in the refinement of visual speech recognition technologies. Continued improvements in model architecture and dataset expansion promise to yield even more versatile and accurate systems, potentially integrating seamlessly with existing speech recognition frameworks.

This research lays a foundation for future work in open-vocabulary continuous visual speech recognition, underscoring the importance of large datasets and carefully designed neural network architectures in advancing lipreading technologies.
