An Analysis of "Deep Audio-Visual Speech Recognition"
Overview
The paper "Deep Audio-Visual Speech Recognition" presents a comprehensive paper on recognizing phrases and sentences from visual data alone, specifically focusing on lip reading. In contrast to earlier research, which often restricted itself to a limited set of predetermined phrases, this work adopts an "open-world" approach by recognizing unconstrained natural language sentences in real-world video contexts.
Key Contributions
This research makes several significant contributions to the field of audio-visual speech recognition:
- Model Comparison: The paper compares two models built on the transformer self-attention architecture: one trained with the Connectionist Temporal Classification (CTC) loss and the other with a sequence-to-sequence (seq2seq) loss. Comparing the two on common architectural ground offers insight into the strengths and weaknesses of each training objective (see the sketch after this list).
- Complementary Use of Lip Reading: The paper investigates the extent to which lip reading complements audio-based speech recognition, particularly in noisy environments where audio-only models struggle.
- Introduction of LRS2-BBC Dataset: A new dataset, LRS2-BBC, is released to the research community, comprising thousands of natural sentences from British television and providing a large-scale resource for training audio-visual speech recognition (AVSR) models.
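To make the contrast between the two training objectives concrete, the minimal sketch below applies both a CTC head and an autoregressive seq2seq decoder to the same encoder output. This is illustrative PyTorch, not the paper's code; the shapes, vocabulary size, and layer choices are assumptions, and teacher-forcing details (shifted targets, causal mask) are omitted for brevity.

```python
import torch
import torch.nn as nn

# Assume a shared transformer encoder has produced frame-level features.
# Shapes are illustrative: T video frames, batch B, model width d_model.
T, B, d_model, vocab = 75, 4, 512, 40
enc = torch.randn(T, B, d_model)                       # encoder output
targets = torch.randint(1, vocab, (B, 20))             # character targets (0 = CTC blank)
target_lengths = torch.full((B,), 20, dtype=torch.long)

# (1) CTC objective: per-frame character distributions; the alignment between
# frames and characters is marginalized out by the loss itself.
ctc_head = nn.Linear(d_model, vocab)
log_probs = ctc_head(enc).log_softmax(dim=-1)          # (T, B, vocab)
input_lengths = torch.full((B,), T, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

# (2) seq2seq objective: an autoregressive decoder attends to the encoder output
# and is trained with cross-entropy to predict the character sequence.
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=2)
embed = nn.Embedding(vocab, d_model)
out_proj = nn.Linear(d_model, vocab)
dec_out = decoder(embed(targets).transpose(0, 1), enc)  # (20, B, d_model)
seq2seq_loss = nn.CrossEntropyLoss()(
    out_proj(dec_out).reshape(-1, vocab),
    targets.transpose(0, 1).reshape(-1))

print(float(ctc_loss), float(seq2seq_loss))
```

The practical difference follows from the objectives: the CTC model assumes conditional independence between output frames and relies on an external language model to perform well, while the seq2seq decoder conditions on previously emitted characters and thus learns an implicit language model of its own.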
Results and Implications
The models developed surpass previous benchmarks on lip-reading datasets by a notable margin. The seq2seq transformer model, in particular, achieves a 22% absolute improvement in word error rate (WER) over earlier work. These results demonstrate how effectively modern deep-learning systems can now perform complex audio-visual transcription tasks.
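For reference, WER measures the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference transcript, normalized by the reference length, so a 22% absolute reduction is a large jump. A minimal sketch of the standard computation (not tied to the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Two reference words are deleted from a six-word sentence: WER ≈ 0.33.
print(wer("the cat sat on the mat", "the cat sat mat"))
```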
The paper also highlights practical applications in improving automatic speech recognition by integrating visual information, especially in challenging conditions with high ambient noise. This has clear implications for technologies used where audio clarity cannot be guaranteed, such as on public transport or in crowded places.
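As a rough illustration of how visual information can be integrated with audio (a sketch under simplifying assumptions, not the paper's exact architecture), time-aligned audio and lip features can be projected to a common width, concatenated, and passed through a joint encoder; all module names and dimensions below are hypothetical:

```python
import torch
import torch.nn as nn

class LateFusionAVSR(nn.Module):
    """Toy audio-visual model: project each modality, concatenate, encode jointly,
    and emit per-frame character logits. Dimensions and layer choices are
    illustrative assumptions, not the paper's configuration."""
    def __init__(self, audio_dim=80, video_dim=512, d_model=256, vocab_size=40):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # e.g. log-mel filterbank frames
        self.video_proj = nn.Linear(video_dim, d_model)   # e.g. lip-region CNN features
        layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=4, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(2 * d_model, vocab_size)

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim); video: (batch, time, video_dim), time-aligned
        fused = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=-1)
        return self.classifier(self.fusion_encoder(fused))  # (batch, time, vocab_size)

model = LateFusionAVSR()
logits = model(torch.randn(2, 50, 80), torch.randn(2, 50, 512))
print(logits.shape)  # torch.Size([2, 50, 40])
```

When the audio stream is corrupted by heavy ambient noise, the visual branch still carries information about the spoken content, which is what makes this kind of fusion attractive in the conditions the paper studies.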
Theoretical and Practical Implications
Theoretically, the paper advances the use of transformer architectures in audio-visual processing, validating the application of such models beyond traditional natural-language-processing tasks. Practically, the development and release of the LRS2-BBC dataset provide a resource that could accelerate future research in multi-modal speech recognition.
Speculation on Future Developments
Future developments could refine the transformer models further or integrate more capable language models to improve decoding accuracy. Expanding the datasets to include more diverse and representative speakers and recording conditions could help systems generalize across varied contexts. Combining these models with real-time processing pipelines could also lead to significant advances in live transcription.
Conclusion
The paper's contributions underscore a meaningful advancement in audio-visual speech recognition, pushing the boundaries of what can be achieved with contemporary deep learning architectures. As these models continue to evolve, they hold promise for enhancing accessibility and bridging communication gaps in noisy or sound-restricted environments.