Towards Robust Speech Representation Learning for Thousands of Languages
The research paper presents XEUS, a novel Cross-lingual Encoder for Universal Speech (pronounced "Zeus") that significantly extends language coverage in self-supervised learning (SSL) for speech representation. XEUS is pre-trained on more than 1 million hours of speech spanning 4,057 languages, combining a sizeable portion of public corpora with a newly curated corpus of 7,413 hours drawn from those 4,057 languages, the widest language coverage of any current speech-processing dataset.
A critical component of XEUS's development is the incorporation of a dereverberation task during pre-training. This task improves on conventional SSL by making the model robust to noisy data, a common problem in multilingual datasets: low-resource languages are often recorded in less controlled acoustic environments. The dereverberation task supplements HuBERT-style masked prediction and WavLM-style denoising, strengthening the model's ability to predict clean targets from corrupted audio inputs.
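To make the combined objective concrete, below is a minimal PyTorch sketch of how such a setup could look: the encoder receives audio corrupted with additive noise and simulated reverberation, but is trained to predict discrete targets derived from the clean signal. The module names, tensor shapes, and SNR range here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def corrupt(wave: torch.Tensor, noise: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Corrupt a clean waveform with reverberation (RIR convolution) plus noise."""
    n = wave.numel()
    # Reverberation: full convolution with a room impulse response, truncated
    # back to the original length. conv1d computes cross-correlation, so the
    # kernel is flipped to obtain a true convolution.
    reverbed = F.conv1d(
        wave.view(1, 1, -1), rir.flip(-1).view(1, 1, -1), padding=rir.numel() - 1
    )[..., :n].view(-1)
    # WavLM-style denoising condition: mix in a noise clip at a random SNR.
    # Assumes `noise` is at least as long as `wave`.
    snr_db = torch.empty(1).uniform_(0.0, 20.0)
    scale = (wave.pow(2).mean() / (noise[:n].pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
    return reverbed + scale * noise[:n]

def pretrain_loss(encoder, wave, noise, rir, clean_targets, mask):
    """HuBERT-style masked prediction: the input is corrupted, but the discrete
    targets come from the clean audio, so the model must jointly denoise,
    dereverberate, and predict."""
    logits = encoder(corrupt(wave, noise, rir))  # (T, vocab), illustrative shape
    return F.cross_entropy(logits[mask], clean_targets[mask])
```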
The experimental results show that XEUS sets a new state of the art (SOTA) on the ML-SUPERB benchmark, outperforming leading models such as MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4%, respectively, on its character error rate (CER) and accuracy metrics, despite using fewer parameters or less pre-training data. XEUS also performs strongly across a variety of downstream tasks, including automatic speech recognition (ASR), speech translation (ST), and language identification (LID).
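For reference, CER is the character-level Levenshtein edit distance between the hypothesis and the reference transcript, normalized by reference length. A minimal implementation, not tied to any particular toolkit:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance(ref, hyp) / len(ref)."""
    # Single-row dynamic-programming edit distance over characters.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # `prev` carries the diagonal (previous row/column)
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1] / max(len(ref), 1)

assert cer("speech", "speach") == 1 / 6  # one substitution over six characters
```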
The paper's contributions are fourfold:
- Data Release: Public release of a new corpus of 7,413 hours of speech covering 4,057 languages, broadening the scope of research possible with multilingual datasets.
- New SSL Task: Introduction of dereverberation as a novel self-supervised learning task, crucial for enhancing model robustness against real-world audio conditions.
- Public Release of XEUS: Alongside XEUS itself, the paper promises to make available all training configurations and code, fostering reproducibility and encouraging further research (a usage sketch follows this list).
- Benchmark Performance: Comprehensive evaluation showing that XEUS outperforms existing SOTA models across a wide range of speech-to-text and generative tasks.
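Given the promised public release, downstream use would look roughly like the following. This is a hypothetical sketch: `load_xeus` is a placeholder for whatever entry point the released code actually exposes (e.g. via ESPnet), and only the torch/torchaudio calls are real APIs.

```python
import torch
import torchaudio

def extract_features(model, path: str, target_sr: int = 16_000) -> torch.Tensor:
    """Load an utterance, resample to the model's rate, and return frame features."""
    wave, sr = torchaudio.load(path)                        # (channels, samples)
    wave = torchaudio.functional.resample(wave, sr, target_sr)
    wave = wave.mean(dim=0, keepdim=True)                   # downmix to mono
    with torch.no_grad():
        return model(wave)                                  # e.g. (1, T, hidden)

# model = load_xeus("path/to/xeus_checkpoint")  # placeholder loader; see the release
# feats = extract_features(model, "utterance.wav")
```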
The implications of this work are substantial in both theoretical and practical terms. Theoretically, XEUS pushes the boundaries of SSL by reducing the hardware and data-annotation constraints that have historically limited scaling to lower-resourced languages. Practically, XEUS advances the vision of universal language inclusion, paving the way for more inclusive global communication tools.
Future research may extend XEUS's capabilities to real-time applications and to even noisier, more complex audio environments. Continuing to increase the dataset's diversity will also be crucial for generalizing across all of the world's languages. This work provides a solid foundation, marking an essential step toward truly universal speech models.