- The paper demonstrates that an LSTM model processing raw 40x40 pixel images achieves 79.6% word accuracy, an 11.6% improvement over SVM-based approaches.
- The research contrasts LSTM performance with traditional methods that use Eigenlips and HOG features combined with SVM classifiers.
- It further indicates that LSTM-based lipreading can enhance automatic speech recognition, especially in noisy conditions and for silent speech applications.
Lipreading with Long Short-Term Memory: A Technical Overview
The paper investigates a neural network-based approach to lipreading, motivated by the improvements that visual cues can bring to automatic speech recognition (ASR). Specifically, it applies Long Short-Term Memory (LSTM) networks to perform lipreading directly from video frames, eliminating the traditional feature extraction step. The research is conducted on the GRID audiovisual corpus, an established dataset for evaluating audiovisual speech processing systems.
Methodological Approach
The paper compares conventional feature extraction methods coupled with Support Vector Machines (SVMs) against an LSTM-based pipeline. The traditional approaches extract Eigenlips or Histogram of Oriented Gradients (HOG) features and feed them into an SVM for classification. The LSTM model, by contrast, takes raw 40x40 pixel images of the mouth region as input, bypassing manual feature extraction, and processes the frame sequence over time to classify words.
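The following is a minimal sketch of such a pipeline in PyTorch. The layer sizes, the 51-class output (the size of the GRID vocabulary), and the use of the final LSTM state for classification are illustrative assumptions, not the exact configuration reported in the paper:

```python
import torch
import torch.nn as nn

class LipreadingLSTM(nn.Module):
    """Word classifier over sequences of raw mouth-region frames.

    Hidden size and single-layer LSTM are assumptions for illustration.
    """

    def __init__(self, frame_size: int = 40 * 40, hidden_size: int = 128,
                 num_classes: int = 51):
        super().__init__()
        # Each 40x40 frame is flattened into a 1600-dimensional vector
        # and fed directly to the LSTM -- no hand-crafted features.
        self.lstm = nn.LSTM(input_size=frame_size, hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1600), pixel values scaled to [0, 1]
        outputs, _ = self.lstm(frames)
        # Predict the word label from the LSTM state after the last frame.
        return self.classifier(outputs[:, -1, :])

# Example: a batch of 8 clips, 30 frames each, of flattened 40x40 images.
model = LipreadingLSTM()
logits = model(torch.rand(8, 30, 40 * 40))  # shape: (8, 51)
```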
Experimental Design and Results
Experiments were run on the GRID corpus, which consists of video and audio recordings of 34 speakers, each uttering 1,000 sentences. Speakers were split into a development set, used to tune parameters, and an evaluation set, used to report final results. The LSTM-based lipreader clearly outperformed the SVM classifiers built on conventional features: it achieved a word accuracy of 79.6% on the evaluation speakers, an improvement of 11.6% over the best feature-based method (HOG + SVM).
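The word accuracy quoted above is simply the fraction of words classified correctly. A minimal sketch, using made-up labels drawn from the GRID vocabulary, is:

```python
def word_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of words classified correctly (the metric reported above)."""
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Hypothetical predictions for a handful of GRID words.
ref  = ["bin", "blue", "at", "f", "two", "now"]
pred = ["bin", "blue", "at", "s", "two", "now"]
print(f"word accuracy: {word_accuracy(pred, ref):.1%}")  # 83.3%
```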
Analysis of the confusion matrices shows that the LSTM model was more accurate on longer words than on spelled single letters, which were misclassified more often. This gap is attributed mainly to the few frames available for an individual letter and to the visual similarity of certain phonemes on the lips.
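The per-class comparison described here can be reproduced with a standard confusion matrix. The labels below are hypothetical and chosen only to illustrate why short, visually similar letters fare worse than longer words:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: spelled letters ("f", "s") are short and visually
# similar, so they are confused more often than a longer word ("please").
reference = ["f", "s", "f", "please", "please", "s", "f", "please"]
predicted = ["s", "s", "f", "please", "please", "f", "f", "please"]
classes = ["f", "s", "please"]

cm = confusion_matrix(reference, predicted, labels=classes)
per_class_acc = cm.diagonal() / cm.sum(axis=1)  # recall per class
for cls, acc in zip(classes, per_class_acc):
    print(f"{cls:>7}: {acc:.0%}")
```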
Implications and Future Directions
The paper asserts that neural network architectures, particularly those based on LSTM, provide a compelling alternative to traditional lipreading systems due to their ability to learn complex temporal dynamics directly from raw visual data. The implications for ASR are significant, especially under noisy conditions where visual data can complement acoustic signals, potentially benefiting hearing-impaired individuals or applications requiring silent speech interfaces.
Moving forward, the paper hints at exploring other deep learning architectures, such as Convolutional Neural Networks (CNNs), potentially integrated with LSTMs, to enhance performance. Assessing speaker-independent models could also expand applicability by removing the constraint of speaker-specific training. Both avenues of exploration could further advance the utility of neural networks in visual speech processing.
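A hybrid of the kind hinted at could pair a small convolutional front end with the LSTM. The sketch below is an assumption about what such an architecture might look like, not a design taken from the paper:

```python
import torch
import torch.nn as nn

class ConvLSTMLipreader(nn.Module):
    """Hypothetical CNN+LSTM hybrid; all layer choices are illustrative."""

    def __init__(self, hidden_size: int = 128, num_classes: int = 51):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 40x40 -> 20x20
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 20x20 -> 10x10
            nn.Flatten(),                     # 32 * 10 * 10 = 3200
        )
        self.lstm = nn.LSTM(3200, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 1, 40, 40) grayscale mouth crops
        b, t = clips.shape[:2]
        feats = self.frontend(clips.reshape(b * t, 1, 40, 40)).reshape(b, t, -1)
        outputs, _ = self.lstm(feats)
        return self.classifier(outputs[:, -1, :])

logits = ConvLSTMLipreader()(torch.rand(4, 30, 1, 40, 40))  # shape: (4, 51)
```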
In summary, this research underscores the viability of LSTMs for visual speech recognition, offering a robust framework that outperforms conventional feature-based techniques and holds promise for both practical applications and further research in speech recognition.