- The paper introduces MLS, a multilingual speech dataset derived from LibriVox that significantly advances ASR and TTS research across multiple languages.
- It details the creation process, which uses pseudo-label generation with Time-Depth Separable (TDS) convolution acoustic models to efficiently segment 44.5K hours of English audio and 6K hours across seven other languages.
- Baseline ASR experiments with transformer models and 5-gram language models demonstrate improved performance over existing benchmarks.
An Overview of MLS: A Large-Scale Multilingual Dataset for Speech Research
The paper "MLS: A Large-Scale Multilingual Dataset for Speech Research" introduces the Multilingual LibriSpeech (MLS) dataset, which is a significant contribution to the field of speech research. The dataset is derived from LibriVox audiobooks and encompasses approximately 44.5K hours of English and 6K total hours distributed across seven other languages: German, Dutch, Spanish, French, Portuguese, Italian, and Polish. The availability of such a large and diverse multilingual dataset under an open license represents an important resource for advancing Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) research.
Creation of the MLS Dataset
The creation of the Multilingual LibriSpeech dataset involved several key steps. First, raw audiobook recordings were downloaded from LibriVox and segmented into shorter clips to better suit acoustic-model training. Pseudo-label generation was performed alongside audio segmentation, using acoustic models based on Time-Depth Separable (TDS) convolutions for efficient processing. The text corresponding to the audiobooks was retrieved from multiple sources, with English benefiting substantially from resources such as Project Gutenberg. For the non-English languages, a combination of manual and automated methods was used to retrieve accurate transcripts.
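The segmentation step can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes frame-level speech/silence labels are already available (e.g., derived from an acoustic model's outputs) and cuts long recordings at sustained silences, merging fragments that are too short. The function name and thresholds are illustrative.

```python
def segment_by_silence(frames, min_silence=30, min_segment=100):
    """Split a frame-level activity sequence (1 = speech, 0 = silence)
    into (start, end) speech segments.

    A segment boundary is placed wherever at least `min_silence`
    consecutive silent frames occur; segments shorter than
    `min_segment` frames are merged into the preceding one.
    """
    segments = []
    start = None
    silence_run = 0
    for i, f in enumerate(frames):
        if f:  # speech frame
            if start is None:
                start = i
            silence_run = 0
        else:
            silence_run += 1
            # close the current segment once the silence is long enough
            if start is not None and silence_run >= min_silence:
                segments.append((start, i - silence_run + 1))
                start = None
    if start is not None:
        segments.append((start, len(frames)))
    # merge too-short segments into their predecessor, when one exists
    merged = []
    for seg in segments:
        if merged and seg[1] - seg[0] < min_segment:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

In practice the cut points would also be constrained by the aligned transcript so that words are not split mid-utterance.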
In dealing with transcription errors and inconsistencies, the paper implemented techniques to handle specific challenges such as number normalization and discrepancies arising from hyphens and apostrophes in different scripts. To maintain high-quality outputs, transcripts for the development and test sets underwent human verification.
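As an illustration of this kind of text normalization, here is a minimal Python sketch; the rules shown (spelling out digits, unifying Unicode apostrophes and hyphens, lowercasing) are simplified stand-ins for the paper's language-specific pipeline, not its actual code.

```python
import re

# Illustrative digit map; a real number normalizer handles multi-digit
# numbers, ordinals, and per-language conventions.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize_transcript(text):
    # unify curly apostrophes and en/em dashes to ASCII equivalents
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u2013", "-").replace("\u2014", "-")
    # spell out single digits (toy stand-in for number normalization)
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[m.group()] + " ", text)
    # lowercase and collapse whitespace
    return re.sub(r"\s+", " ", text.lower()).strip()
```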
Dataset Composition and Applications
The MLS dataset's structure includes balanced train, development, and test partitions across languages, ensuring no overlap of speakers among partitions. The researchers also created limited supervision datasets to facilitate a standardized benchmark for low-resource language training. A significant portion of the dataset's appeal is its suitability for large-scale, multilingual ASR model training and testing.
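Speaker-disjoint partitioning can be sketched as below; the function name, split ratios, and utterance format are illustrative assumptions, not the authors' tooling. The key property is that whole speakers, never individual utterances, are assigned to a partition.

```python
import random

def split_by_speaker(utterances, dev_frac=0.05, test_frac=0.05, seed=0):
    """Assign whole speakers to train/dev/test so no speaker spans partitions.

    `utterances` is a list of dicts, each with at least a "speaker" key.
    """
    speakers = sorted({u["speaker"] for u in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n = len(speakers)
    n_dev = max(1, int(n * dev_frac))
    n_test = max(1, int(n * test_frac))
    dev = set(speakers[:n_dev])
    test = set(speakers[n_dev:n_dev + n_test])
    parts = {"train": [], "dev": [], "test": []}
    for u in utterances:
        if u["speaker"] in dev:
            parts["dev"].append(u)
        elif u["speaker"] in test:
            parts["test"].append(u)
        else:
            parts["train"].append(u)
    return parts
```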
Pre-trained n-gram language models (LMs) are provided with the dataset, in both 3-gram and 5-gram configurations for each language. These models were built from text data sourced and normalized from large public-domain book collections. This broadens the MLS dataset's applicability beyond ASR, with the potential to advance multilingual TTS systems by extending datasets like LibriTTS.
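To make the role of such n-gram LMs concrete, here is a toy maximum-likelihood sketch; production LMs (such as those released with MLS, typically built with tools like KenLM) additionally use smoothing and backoff, which are omitted here for brevity.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-gram contexts over tokenized sentences."""
    counts = Counter()
    context_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return counts, context_counts

def ngram_prob(counts, context_counts, gram):
    """Maximum-likelihood P(w | context); no smoothing, so unseen = 0."""
    c = counts.get(gram, 0)
    ctx = context_counts.get(gram[:-1], 0)
    return c / ctx if ctx else 0.0
```

During beam-search decoding, such scores are interpolated with the acoustic model's scores to favor hypotheses that form plausible word sequences.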
Baseline ASR Experiments
The paper details baseline ASR experiments conducted using the wav2letter++ framework. The acoustic models combine Transformer architectures with connectionist temporal classification (CTC) loss. The results, particularly for languages with high out-of-vocabulary (OOV) rates such as Polish, offer insight into the dynamics of multilingual speech recognition. Baseline word error rate (WER) figures are reported for several decoding strategies, illustrating the impact of adding a 5-gram LM during decoding. Models trained on MLS also outperform their LibriSpeech-trained counterparts on established ASR benchmarks.
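The WER metric reported in these baselines is the standard word-level edit distance normalized by reference length. A minimal implementation (the usual definition, not code from the paper) looks like this:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(1, len(ref))
```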
Implications and Future Directions
The introduction of the MLS dataset marks an important step towards enabling comprehensive multilingual speech recognition research. By making this dataset widely accessible, the authors pave the way for innovations in multilingual ASR modeling techniques, allowing for analysis of cross-lingual transfer learning and adaptation methodologies in resource-scarce environments.
The paper lays groundwork for potential enhancements in TTS research through the dataset's scalability. Future developments in AI could leverage the richness of the MLS dataset to refine low-resource language technologies and foster integration of disparate linguistic resources into unified systems.
Overall, the Multilingual LibriSpeech dataset is poised to become an essential tool for researchers aiming to push the boundaries of speech technology across multiple languages.