- The paper proposes Whisper-Flamingo, which integrates visual features into Whisper to enhance noise robustness for speech recognition and translation.
- The method inserts gated cross attention layers into Whisper, freezing the pre-trained audio components and fine-tuning the new layers with visual data, which significantly reduces error rates in noisy environments.
- Experimental results on LRS3 and MuAViC datasets demonstrate lower word error rates and higher BLEU scores compared to audio-only models.
An Expert Overview of "Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation"
The paper "Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation" by Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, and James Glass addresses the challenges and advancements in Audio-Visual Speech Recognition (AVSR) and translation. This research proposes a novel model, Whisper-Flamingo, which integrates visual features to improve noise robustness in speech recognition and translation tasks.
Introduction and Motivation
AVSR aims to improve speech recognition in noisy conditions by incorporating visual input, particularly lip movements. Traditional Automatic Speech Recognition (ASR) models, despite significant advances, degrade sharply in noisy environments. Whisper-Flamingo addresses this by injecting visual features extracted by AV-HuBERT into Whisper, a speech recognition and translation model pre-trained on 680,000 hours of labeled audio.
Methodology
The paper adopts gated cross attention from Flamingo, an architecture originally designed to inject visual features into a frozen language model. The method involves two primary steps:
- Fine-tuning Whisper using audio-only data.
- Integrating visual features through gated cross attention layers while freezing Whisper's audio components (sketched below).
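To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the second stage, assuming the newly added layers carry a `gated_xattn` tag in their parameter names; the tag, the learning rate, and the helper name are illustrative choices, not the authors' implementation.

```python
import torch

def configure_stage2(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Stage 2: freeze the audio-only Whisper weights fine-tuned in stage 1
    and train only the newly inserted gated cross-attention layers."""
    for name, param in model.named_parameters():
        # "gated_xattn" is an illustrative naming convention for the new layers.
        param.requires_grad = "gated_xattn" in name

    trainable = [p for p in model.parameters() if p.requires_grad]
    # Placeholder optimizer and learning rate; the paper's training details may differ.
    return torch.optim.AdamW(trainable, lr=1e-4)
```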
Whisper-Flamingo Architecture
Whisper-Flamingo incorporates visual features through gated cross attention layers inserted into Whisper's decoder blocks. These layers are initialized to act as an identity mapping and learn to attend to visual features during fine-tuning. Because cross attention does not require the audio and video streams to share a feature rate, the model can fuse the two modalities without temporal alignment, offering flexibility for asynchronous data streams.
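The gating mechanism follows Flamingo: each inserted block applies cross attention from decoder states to visual features and scales its output by a tanh gate initialized at zero, so the block starts out as an identity mapping. Below is a minimal sketch of such a block in PyTorch; the class name, layer sizes, and normalization placement are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # tanh gates start at zero, so the block is an identity mapping at
        # initialization and only learns to attend to video during fine-tuning.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x, visual_feats):
        # x: decoder token states (B, T_text, d); visual_feats: (B, T_video, d).
        # Cross attention handles differing sequence lengths, so no alignment
        # between audio and video feature rates is required.
        attn_out, _ = self.xattn(self.norm(x), visual_feats, visual_feats)
        x = x + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x
```

Because both gates are zero at initialization, adding the block leaves Whisper's outputs unchanged; the gates then open gradually as the model learns to exploit the AV-HuBERT video features.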
Experimental Results
The model was evaluated on the LRS3 dataset for English speech recognition and the MuAViC dataset for English-to-multilingual translation.
Speech Recognition
- Clean Conditions: Whisper-Flamingo achieved a Word Error Rate (WER) of 1.5%, comparable to state-of-the-art AV-HuBERT models.
- Noisy Conditions: The model significantly reduced WER to 5.6%, demonstrating strong noise robustness compared to the audio-only Whisper baseline (11.7%).
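As a reminder of what these numbers measure, WER is the word-level edit distance between hypothesis and reference transcripts divided by the number of reference words. A toy computation with the jiwer library (not used in the paper, purely illustrative):

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# 2 substitutions out of 9 reference words -> WER of about 0.222
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```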
Speech Translation
- Clean Conditions: Whisper-Flamingo attained an average BLEU score of 22.9, outperforming audio-only Whisper and remaining competitive with bilingual AV-HuBERT models.
- Noisy Conditions: Whisper-Flamingo again proved more noise-robust, averaging 20.5 BLEU, ahead of audio-only Whisper (18.6) and only slightly behind bilingual AV-HuBERT (20.8).
Implications and Future Work
The proposed Whisper-Flamingo model demonstrates enhanced performance in both AVSR and speech translation under noisy conditions. Its ability to handle multiple languages using a single set of parameters highlights its versatility and potential for practical deployment in real-world scenarios with multilingual data.
Potential Directions
Future advancements could explore:
- Enhancing visual encoders: Utilizing more sophisticated models like u-HuBERT.
- Scaling with larger datasets: Adapting Whisper-Flamingo with models pre-trained on extensive multimodal datasets.
- Extending to other tasks: Applying the gated cross attention mechanism to integrate various modalities beyond audio-visual inputs.
Conclusion
Whisper-Flamingo sets a new benchmark in AVSR by effectively combining the strengths of a strong audio-only model with a dedicated visual feature encoder. Its robust performance in noisy conditions and its ability to handle multilingual tasks with a single model mark a solid step forward for the field. As research progresses, further refinements and broader applications of this methodology could unlock new possibilities in multimodal machine learning.
For more details, the code and models are available at https://github.com/roudimit/whisper-flamingo.