mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition (2502.01547v3)

Published 3 Feb 2025 in eess.AS, cs.CV, and cs.SD

Abstract: Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.

Authors (5)
  1. Andrew Rouditchenko (21 papers)
  2. Samuel Thomas (42 papers)
  3. Hilde Kuehne (69 papers)
  4. Rogerio Feris (105 papers)
  5. James Glass (173 papers)

Summary

  • The paper presents a model that combines Whisper and AV-HuBERT to improve speech recognition in noisy multilingual environments.
  • It introduces decoder modality dropout, training on paired audio-visual inputs as well as audio-only and video-only inputs so the model remains robust when one modality is degraded.
  • Numerical results on the MuAViC dataset demonstrate state-of-the-art word error rates across nine languages, notably aiding low-resource scenarios.

An Assessment of mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

This paper proposes a novel approach to improving the robustness of multilingual audio-visual speech recognition (AVSR) through a model termed mWhisper-Flamingo. The model combines a pre-trained architecture for each modality (Whisper for audio and AV-HuBERT for video) to address the challenges posed by noisy conditions in multilingual automatic speech recognition (ASR).
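As context for how the two pre-trained models are joined, the following is a minimal sketch of Flamingo-style gated cross-attention, the fusion mechanism the Whisper-Flamingo line of work uses to feed video features into the Whisper decoder. The dimensions, module layout, and initialization below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention (illustrative sketch): decoder
    states attend to video features, and a tanh gate initialized at zero
    preserves the pre-trained decoder's behavior at the start of training."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Gate starts at tanh(0) = 0, so video initially contributes nothing.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states, video_feats):
        # text_states: (batch, text_len, d_model) from the Whisper decoder
        # video_feats: (batch, video_len, d_model), e.g. projected AV-HuBERT output
        attn_out, _ = self.cross_attn(self.norm(text_states), video_feats, video_feats)
        return text_states + torch.tanh(self.gate) * attn_out
```

Because the gate starts at zero, the fused model initially behaves exactly like pre-trained Whisper, and the video pathway is learned gradually during fine-tuning.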

Methodological Advances

mWhisper-Flamingo advances AVSR through several design choices. Rather than training from scratch, which the scarcity of large-scale multilingual video data makes impractical, it integrates Whisper, a robust pre-trained multilingual audio recognition model, with AV-HuBERT, a pre-trained model adept at processing lip-movement video. To improve performance across varying noise conditions, the paper introduces decoder modality dropout: the model is trained on paired audio-visual inputs as well as on audio-only and video-only inputs, so it learns to cope when either stream is degraded or missing.
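The mechanics of decoder modality dropout can be made concrete with a short sketch: during training, some batches keep both modalities, while others mask one stream so the decoder learns not to depend on either input being intact. The masking strategy and probabilities below are illustrative assumptions, not the paper's reported settings.

```python
import random
import torch

def apply_modality_dropout(audio_feats, video_feats,
                           p_audio_only=0.25, p_video_only=0.25):
    """Decoder modality dropout (illustrative): with some probability, train
    on audio-only or video-only inputs by zeroing out the other modality.
    The probabilities here are assumptions, not the paper's values."""
    r = random.random()
    if r < p_audio_only:
        video_feats = torch.zeros_like(video_feats)   # drop video stream
    elif r < p_audio_only + p_video_only:
        audio_feats = torch.zeros_like(audio_feats)   # drop audio stream
    # otherwise keep the paired audio-visual inputs unchanged
    return audio_feats, video_feats
```

The effect is that no single modality becomes a crutch: the decoder sees enough single-stream examples to produce useful transcriptions even when audio is buried in noise or the face is occluded.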

Strong Numerical Results

The results show that mWhisper-Flamingo achieves state-of-the-art (SOTA) word error rates (WER) on MuAViC, a multilingual AVSR dataset spanning nine languages. In noisy conditions, audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper across all languages, demonstrating the value of the visual modality. The paper reports WER reductions in both clean and noisy settings, with notable gains for lower-resource languages. These outcomes underscore mWhisper-Flamingo's improvements in noise handling and multilingual coverage over prior models.
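For reference, WER, the metric behind all of these comparisons, is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A standard dynamic-programming computation (not the paper's evaluation code) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sit"))  # 0.333... (one substitution)
```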

Implications and Future Prospects

The model’s ability to maintain robust multilingual performance in noisy conditions has practical implications. It suggests that future ASR systems can handle noise and diverse linguistic inputs by building on pre-trained large-scale models rather than training from scratch. Modality dropout, in particular, points toward ASR systems that adapt dynamically to degraded or missing inputs without requiring exhaustive training on paired data for every condition.

In the context of future developments, this research invites further exploration into optimizing cross-modal interactions within ASR systems and adapting these methodologies to emerging languages and dialects. Furthermore, enhancing the model’s scalability and efficiency can provide additional practical benefits, potentially extending its application to real-time speech recognition and translation tasks across multiple domains.

Conclusion

The mWhisper-Flamingo model and associated techniques mark a valuable contribution to the field of speech recognition, specifically in enhancing the robustness and adaptability of AVSR systems across multilingual and noisy environments. The paper effectively demonstrates significant performance enhancements, providing a foundation for future research aimed at developing more comprehensive and versatile ASR technologies.
