MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation (2303.00628v2)

Published 1 Mar 2023 in cs.CL and eess.AS

Abstract: We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X translation as well as 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.

Citations (25)

Summary

  • The paper introduces MuAViC, a novel dataset for building robust audio-visual speech recognition and speech-to-text translation systems.
  • The corpus features 1200 hours of transcribed TED talks from over 8000 speakers in 9 languages, offering extensive multilingual and multimodal data.
  • Baseline results with the AV-HuBERT model show a marked reduction in word error rates, confirming the benefits of integrating visual inputs in noisy conditions.

Overview of the MuAViC Corpus for Multilingual Audio-Visual Speech Recognition and Translation

MuAViC is a novel multilingual audio-visual corpus introduced by researchers at Meta AI, designed to support the development of robust speech recognition and speech-to-text translation systems. The corpus encompasses 1200 hours of audio-visual speech from TED and TEDx talks, spanning over 8000 speakers in 9 languages: English, Arabic, German, Greek, Spanish, French, Italian, Portuguese, and Russian. It stands out as the largest open benchmark for multilingual audio-visual speech recognition (AVSR) and the first for audio-visual speech-to-text translation (AVST).

Corpus Composition

MuAViC is fully transcribed and includes text translations for six English-to-X and six X-to-English language pairs. The dataset distinguishes itself by its breadth, providing extensive audio-visual resources well beyond traditionally English-centric AVSR datasets; a sketch of a per-segment record follows.
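
As an illustration of how such a corpus can be organized, the sketch below models a single audio-visual segment. The class and field names are assumptions made for exposition, not the actual schema of the facebookresearch/muavic repository.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MuAViCExample:
    """Hypothetical per-segment record; field names are illustrative,
    not the schema used by github.com/facebookresearch/muavic."""
    audio_path: str             # audio track of the talk segment
    video_path: str             # corresponding face/lip video
    language: str               # ISO code of the spoken language, e.g. "es"
    transcript: str             # verbatim transcription in `language`
    translation: Optional[str]  # paired translation, where the direction is covered

# Example usage with made-up paths:
# ex = MuAViCExample("clips/es/0001.wav", "clips/es/0001.mp4",
#                    "es", "hola a todos", "hello everyone")
```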

Methodology and Baseline Findings

The researchers employed the AV-HuBERT model to establish baseline benchmarks for both AVSR and AVST. They demonstrated that this corpus supports building models resilient to the noisy environments typical of real-world applications. Notably, incorporating visual data substantially reduces the word error rate (WER) and improves translation accuracy compared to audio-only models.
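
The noise-robustness evaluations rest on corrupting clean speech with babble noise at a controlled signal-to-noise ratio. Below is a minimal sketch of that mixing step, assuming 1-D float waveforms at a shared sample rate; the function name and the 0 dB example are illustrative, not the paper's exact protocol.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested SNR (in dB)."""
    # Tile or truncate the noise track to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: corrupt a clean utterance with babble noise at 0 dB SNR,
# i.e. speech and noise at equal power.
# noisy = mix_at_snr(clean_waveform, babble_waveform, snr_db=0.0)
```

At low SNRs the acoustic signal degrades sharply, which is precisely where the lip-movement video stream carries complementary information.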

Key Numerical Results

  • For English AVSR, a WER of 2.3% was achieved when utilizing both audio and video inputs, highlighting the efficacy of visual data in reducing errors.
  • In challenging noisy conditions with added multilingual babble noise, audio-visual models demonstrated a notable 32% reduction in WER compared to their audio-only counterparts; a sketch of the WER metric follows this list.
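
For reference, WER is the word-level edit distance between hypothesis and reference transcripts, normalized by the reference length. A minimal sketch is shown below; evaluation pipelines typically rely on standard tools such as sclite or jiwer instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# A 32% relative reduction means, for example, dropping from 50.0 to 34.0 WER:
# (50.0 - 34.0) / 50.0 == 0.32
```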

Implications and Future Directions

The MuAViC corpus presents significant theoretical and practical implications. Theoretically, it advances the study of multilingual AVSR and AVST, promoting the development of models that leverage multimodal data to withstand noise interference. Practically, it paves the way for more robust real-world applications of speech recognition and translation systems across diverse languages and contexts.

Future research could explore enhancements in self-supervised learning algorithms that leverage MuAViC for effective multilingual pre-training. In addition, the availability of this benchmark encourages the exploration of novel architectures and techniques that utilize both audio and video modalities to achieve superior robustness and accuracy.

The publicly available MuAViC corpus offers a valuable resource for researchers aiming to innovate and refine speech processing technologies in multilingual contexts. This work lays a firm foundation for continued advances in how machines process and understand human language in diverse and dynamic environments.