Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
The paper "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction" introduces AV-HuBERT, a novel self-supervised learning framework targeting the integration of audio and visual input for enhanced speech recognition tasks. The framework embodies a multi-step iterative learning process aimed at improving the quality of audio-visual speech representations, leveraging both modalities for greater performance in both lip reading and automatic speech recognition (ASR).
Methodology
AV-HuBERT employs masked multimodal cluster prediction: spans of both the audio and visual streams of a talking-face video are masked, and the model must predict pre-assigned discrete cluster labels for the masked frames. A ResNet front-end encodes the lip regions of the video frames, the resulting visual features are fused with acoustic features, and a transformer encoder maps the fused sequence to predictions over the cluster vocabulary. The cluster assignments themselves are iteratively refined over multiple training cycles.
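To make the training objective concrete, the sketch below shows one masked-prediction step in PyTorch: audio and visual features are projected, fused by concatenation, masked frames are replaced by a learned embedding, and a transformer predicts cluster IDs, with the loss computed only on masked positions. The module sizes, feature dimensions, and masking rate are illustrative assumptions rather than the paper's exact configuration (which uses a ResNet visual front-end over lip regions and filterbank audio features).

```python
# Minimal sketch of AV-HuBERT-style masked multimodal cluster prediction.
# Illustrative only; dimensions, modules, and masking are simplified assumptions.
import torch
import torch.nn as nn

class AVClusterPredictor(nn.Module):
    def __init__(self, audio_dim=104, visual_dim=512, d_model=768,
                 num_clusters=500, num_layers=6, nhead=8):
        super().__init__()
        # Stand-in for the visual front-end: the paper uses a ResNet over
        # lip-ROI frames; here a linear projection represents its output.
        self.visual_proj = nn.Linear(visual_dim, d_model // 2)
        # Audio features (e.g. stacked log filterbanks) projected likewise.
        self.audio_proj = nn.Linear(audio_dim, d_model // 2)
        # Learned embedding that replaces features at masked positions.
        self.mask_emb = nn.Parameter(torch.zeros(d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Classification head over the discrete cluster vocabulary.
        self.head = nn.Linear(d_model, num_clusters)

    def forward(self, audio_feats, visual_feats, mask):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        # mask: (B, T) boolean, True where the frame is masked.
        fused = torch.cat([self.audio_proj(audio_feats),
                           self.visual_proj(visual_feats)], dim=-1)
        fused = torch.where(mask.unsqueeze(-1), self.mask_emb, fused)
        hidden = self.encoder(fused)
        return self.head(hidden)  # (B, T, num_clusters)

# Toy training step on random tensors and random pseudo-labels.
model = AVClusterPredictor()
B, T = 2, 50
audio = torch.randn(B, T, 104)
video = torch.randn(B, T, 512)
targets = torch.randint(0, 500, (B, T))   # frame-level cluster assignments
mask = torch.rand(B, T) < 0.3             # roughly 30% of frames masked
logits = model(audio, video, mask)
# Loss is computed only on masked frames, as in masked prediction.
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
loss.backward()
```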
In the first iteration, the cluster targets come from clustering Mel-frequency cepstral coefficient (MFCC) features. In subsequent iterations, the features learned by the previous AV-HuBERT model are clustered instead, yielding better-quality targets and, in turn, improved feature representations.
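As a rough illustration of how such first-iteration targets could be produced, the sketch below clusters MFCC frames with k-means and returns one cluster ID per frame. The feature dimensionality, number of clusters, and use of librosa/scikit-learn here are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch: derive initial frame-level pseudo-labels by k-means over MFCCs.
# Illustrative assumptions: 39-dim MFCCs (13 coefficients + deltas + delta-deltas),
# 100 clusters; the paper's exact configuration may differ.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_cluster_targets(wav_paths, n_clusters=100, sr=16000):
    """Cluster MFCC frames from a list of audio files into pseudo-labels."""
    feats = []
    for path in wav_paths:
        wav, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)
        delta = librosa.feature.delta(mfcc)
        ddelta = librosa.feature.delta(mfcc, order=2)
        feats.append(np.concatenate([mfcc, delta, ddelta], axis=0).T)
    # Fit k-means on all frames pooled across utterances.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(np.vstack(feats))
    # Per-utterance cluster assignments used as masked-prediction targets.
    return [kmeans.predict(f) for f in feats]
```

In later iterations, the same clustering step would be applied to intermediate features extracted from the previously trained AV-HuBERT model rather than to MFCCs.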
Experimental Results
The empirical evaluation on LRS3, the largest public lip-reading benchmark (433 hours), shows significant improvements in word error rate (WER). AV-HuBERT reaches 32.5% WER using only 30 hours of labeled data, outperforming the previous state of the art (33.6% WER), which was trained on roughly a thousand times more transcribed video. Combining AV-HuBERT with self-training, a semi-supervised technique in which the model's own predictions on unlabeled data serve as additional training targets, further reduces the WER to 26.9% when the full set of labeled LRS3 data is used.
The framework also benefits audio-only speech recognition. Pre-training with audio-visual clusters reduces ASR WER on the same benchmark by about 40% relative to the previous state of the art (1.3% vs. 2.3%), demonstrating the efficacy of the multimodal clustering approach.
Implications and Future Work
The implications of AV-HuBERT are multifaceted. On a practical level, the framework provides a more data-efficient method for training robust lip-reading and ASR models, particularly beneficial for languages and contexts where labeled data is scarce. Theoretically, the paper reinforces the potential of multimodal inputs in enhancing speech recognition systems and encourages further exploration into cross-modal learning paradigms.
Looking ahead, AV-HuBERT's improved representation learning could play a pivotal role in multilingual speech recognition, especially for under-resourced languages. Extending the approach to other applications, such as keyword spotting in sign language or speech enhancement, also offers promising avenues for research.
In summary, AV-HuBERT is a substantial contribution to self-supervised learning for speech and audio-visual processing, demonstrating the potential of multimodal cluster prediction to advance both automatic speech recognition and lip reading.