Multilingual LibriSpeech (MLS) Corpus
- Multilingual LibriSpeech (MLS) is a large-scale public corpus for multilingual ASR featuring diverse languages and standardized train/dev/test splits.
- The dataset employs rigorous preprocessing, including audio segmentation, text normalization, and gender-balanced, speaker-disjoint split construction, to ensure high-quality transcriptions and reliable evaluation.
- MLS serves as a benchmark for multilingual speech models, accelerating research in self-supervised learning, cross-lingual transfer, and domain adaptation.
Multilingual LibriSpeech (MLS) is a large-scale, public-domain corpus designed for multilingual automatic speech recognition (ASR) and related speech processing research. It was introduced by Pratap et al. as a benchmark for training and evaluating end-to-end speech models across typologically and resource-diverse languages using a fixed set of partitions (Pratap et al., 2020). MLS has rapidly become a central resource for benchmarking and advancing methods in multilingual ASR, speech translation, and self-supervised pretraining.
1. Dataset Composition and Structure
MLS consists of read-speech audio and corresponding text transcripts derived from public-domain audiobooks (LibriVox) and open textual sources (Project Gutenberg and others). The release covers eight languages: English (en), German (de), Dutch (nl), French (fr), Spanish (es), Italian (it), Portuguese (pt), and Polish (pl). Each language is partitioned into train/dev/test splits with disjoint speakers per split and, for dev/test, gender balance.
Table: MLS Corpus Size (by Language) (Pratap et al., 2020)
| Language | Train (h) | Dev (h) | Test (h) | Total (h) |
|---|---|---|---|---|
| English | 44,659.7 | 15.75 | 15.55 | 44,691.0 |
| German | 1,966.5 | 14.28 | 14.29 | 1,995.1 |
| Dutch | 1,554.2 | 12.76 | 12.76 | 1,579.8 |
| French | 1,076.6 | 10.07 | 10.07 | 1,096.7 |
| Spanish | 917.7 | 9.99 | 10.00 | 937.7 |
| Italian | 247.4 | 5.18 | 5.27 | 257.8 |
| Portuguese | 160.96 | 3.64 | 3.74 | 168.3 |
| Polish | 103.7 | 2.08 | 2.14 | 107.9 |
All audio is 16 kHz, 16-bit PCM; transcript normalization includes NFKC Unicode normalization, punctuation removal, and language-specific orthographic filtering (Pratap et al., 2020). Segment durations range from 10 to 20 seconds. The corpus also provides limited-supervision subsets (1 h and 10 h) and professionally verified dev/test transcripts, with measured human WER between 1.9% and 12.6% across languages.
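The normalization steps above can be sketched as follows. This is a minimal illustration using Python's standard `unicodedata` module, not the actual MLS pipeline, whose rules are language-specific:

```python
import unicodedata

def normalize_transcript(text: str) -> str:
    """Illustrative transcript normalization in the spirit of MLS:
    NFKC Unicode normalization, lowercasing, hyphen joining,
    punctuation removal, and whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Join words split by hyphens (e.g., line-break hyphenation).
    text = text.replace("-", "")
    # Drop any remaining punctuation, keep letters/digits/whitespace.
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(text.split())

# normalize_transcript("End-of-day “Report”.") -> "endofday report"
```

The exact filtering rules in MLS additionally vary per language (e.g., which orthographic characters are retained), so this sketch only captures the shared core.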
2. Data Collection, Preprocessing, and Quality Control
MLS construction involved semi-automated retrieval, segmentation, and alignment. The pipeline comprises:
- Audio and text download using Libri-Light tools.
- Audio segmentation to ~10–20 sec utterances using streaming inference with language-specific acoustic models (TDS/Auto-Segmentation Criterion).
- Pseudo-label generation using beam-search decoding with a 4-gram LM.
- Text normalization (NFKC, punctuation removal, hyphen joining).
- Transcript retrieval via TF-IDF bigram matching and Smith-Waterman alignment, requiring WER < 40% between pseudo-labels and retrieved text.
- Language-model (LM) training-text preparation from filtered Gutenberg books.
- Gender classification (RBF-SVM over log-filterbank features) for speaker balancing.
- Manual correction of dev/test transcripts.
Filtering eliminates multi-speaker or corrupt recordings, incomplete alignments, and transcript mismatches, and enforces speaker disjointness across splits. MLS dev/test sets undergo human verification.
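The transcript-retrieval step above can be illustrated with a simplified bigram TF-IDF matcher (cosine similarity over bigram counts); the real pipeline additionally applies Smith-Waterman alignment and the WER < 40% acceptance filter:

```python
import math
from collections import Counter

def bigrams(text):
    """Bigram counts over whitespace tokens."""
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))

def tfidf_cosine(query, docs):
    """Score each candidate document against the query using bigram
    TF-IDF cosine similarity (a simplified stand-in for the MLS
    transcript-retrieval step)."""
    doc_grams = [bigrams(d) for d in docs]
    df = Counter(g for dg in doc_grams for g in set(dg))
    n = len(docs)
    def vec(counts):
        # Smoothed IDF weighting so unseen bigrams stay finite.
        return {g: tf * math.log((1 + n) / (1 + df[g]))
                for g, tf in counts.items()}
    q = vec(bigrams(query))
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    scores = []
    for dv in map(vec, doc_grams):
        dot = sum(q.get(g, 0.0) * w for g, w in dv.items())
        norm = q_norm * math.sqrt(sum(w * w for w in dv.values()))
        scores.append(dot / norm if norm else 0.0)
    return scores
```

In use, the candidate book passage with the highest score would be retained and then aligned at the character level before the WER check.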
For paired speech–text translation tasks, Kocabiyikoglu et al. constructed a 236 h English–French parallel extension by aligning LibriSpeech English speech with French translations from matched public-domain e-books using hunAlign and Gentle forced alignment. Human evaluation indicated high alignment quality (average textual-alignment score 3.84/5, Cohen’s κ = 0.76) (Kocabiyikoglu et al., 2018).
3. Baseline Models and Metrics
MLS provides baseline n-gram language models (LMs) and ASR systems for every language. The LMs are 3-gram and 5-gram models trained with KenLM on normalized, de-duplicated corpus text.
Table: Example LM Statistics (English) (Pratap et al., 2020)
| #Books (filtered) | Word Types | Token Count (M) | OOV Rate | 5-gram PPL |
|---|---|---|---|---|
| 36,866 | 4,120,000 | 2,380 | 0.18% | 190.76 |
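The OOV-rate column above is simply the fraction of evaluation tokens absent from the LM vocabulary; a minimal illustration:

```python
def oov_rate(vocab, test_tokens):
    """Fraction of test tokens not covered by the LM vocabulary,
    as reported in the MLS LM statistics table."""
    vocab = set(vocab)
    misses = sum(1 for t in test_tokens if t not in vocab)
    return misses / len(test_tokens)

# One of four tokens ("down") is out of vocabulary -> 0.25
print(oov_rate({"the", "cat", "sat"}, ["the", "cat", "sat", "down"]))
```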
ASR baselines use a wav2letter++ architecture: a 1-D convolutional frontend, 36 Transformer blocks (4-head attention), CTC loss, and a grapheme output layer sized to each language's vocabulary. Models are trained per language with SpecAugment and evaluated using word error rate, WER = (S + D + I) / N, where S, D, and I are the substitution, deletion, and insertion counts and N is the reference word count.
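A straightforward implementation of this WER metric via word-level Levenshtein alignment:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (S + D + I) / N computed by Levenshtein
    alignment over words. ref and hyp are whitespace-separated."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# One substitution out of three reference words -> 1/3
print(wer("the cat sat", "the bat sat"))
```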
Table: Baseline Monolingual WERs (%), Test Sets (Pratap et al., 2020)
| Language | Viterbi | Zero LM | 5-gram LM |
|---|---|---|---|
| English | 6.99 | 6.76 | 5.88 |
| German | 6.93 | 7.10 | 6.49 |
| Dutch | 13.18 | 13.09 | 12.02 |
| French | 6.88 | 6.58 | 5.58 |
| Spanish | 6.90 | 6.68 | 6.07 |
| Italian | 12.35 | 11.78 | 10.54 |
| Portuguese | 21.70 | 20.52 | 19.49 |
| Polish | 19.40 | 21.66 | 20.39 |
4. MLS as a Benchmark for Multilingual and Low-Resource ASR
MLS’s scale, multilinguality, and extreme resource imbalance across languages (≈44.7k h of English vs. roughly 0.1–2k h for each other language) make it a canonical benchmark for scalable multilingual ASR and self-supervised learning.
Key methods and results:
- Citrinet (CTC, English-only): Citrinet-1024 achieves 8.46% WER (greedy decoding) on the MLS English test set. With 6-gram and Transformer-LM rescoring, WER drops to 6.79% and 6.39%, respectively. Training uses 42.97k h of filtered English audio with a 1,024-token SentencePiece tokenizer (Majumdar et al., 2021).
- JUST (Joint Unsupervised and Supervised Training, 8 languages): End-to-end training that jointly optimizes an RNN-T loss with contrastive and masked-LM self-supervision. Across all eight MLS languages, JUST (β=0.07) yields an average WER of 7.2% (vs. an 11.8% monolingual baseline), and with pure fine-tuning achieves 6.5%. On Polish (100 h), JUST reduces WER to 9.1% (β=0.07) and 6.6% (β→0), less than half the monolingual baseline (Bai et al., 2021).
- Massively Multilingual RNN-T (70 languages): Zero-shot evaluation on MLS achieves WER 9.5% with mixed char+subword tokenization and language-specific output heads (vs. monolingual baseline 13.7%). Fine-tuning on MLS reduces WER to 7.5% (Tjandra et al., 2022).
- IPA-Guided HuBERT Pretraining: Multilingual IPA pseudo-labels improve HuBERT pretraining robustness under limited fine-tuning. For 7 MLS languages (excluding en), HuBERT-LARGE-IPA achieves avg WER 8.55% (standard HuBERT: 9.63%, XLSR-53: 10.6%) with 6k h pretrain data (Feng et al., 2023).
5. Tokenization, Model Architecture, and Training Paradigms
MLS has been central in evaluating tokenization and architectural scaling for multilingual ASR. Main findings:
- Tokenization: SentencePiece subword units (e.g., 1,024 tokens for English in Citrinet) improve coverage and mitigate sequence-length variance. In multilingual settings, combining raw characters for languages with large character inventories (zh, ja, ko) with 512-unit language-specific subword vocabularies for the remaining languages balances error rates and decoding efficiency (Tjandra et al., 2022, Majumdar et al., 2021).
- Acoustic Features: All models standardize on 80-dim log-Mel representations (25 ms window, 10 ms stride), per-utterance normalization, and SpecAugment (Tjandra et al., 2022, Majumdar et al., 2021).
- Architectures: Deep residual CTC, RNN-Transducer with Conformer encoder and language-specific output heads, and self-supervised BERT-style pretraining using masked-prediction and/or contrastive objectives. Parameter scales range from ~100M (HuBERT BASE) to ~1B (RNN-T) (Tjandra et al., 2022, Feng et al., 2023).
- Training: MLS supports massively parallel distributed training, e.g., Citrinet on 256 GPUs with batch size 8,192, and RNN-T on 64 GPUs. Regularization (e.g., SpecAugment), fully sharded data parallelism (FSDP), and optimizer scheduling (Adam, NovoGrad) follow best practices in large-scale speech modeling (Majumdar et al., 2021, Tjandra et al., 2022).
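As a sketch of the SpecAugment regularization used throughout these systems, the following masks random frequency and time bands of a NumPy (frames × mel-bins) log-Mel array; the parameter names follow the SpecAugment paper, but the values here are illustrative rather than the published settings:

```python
import numpy as np

def spec_augment(logmel, n_freq_masks=2, F=10, n_time_masks=2, T=20, rng=None):
    """Minimal SpecAugment sketch: zero out random frequency and time
    bands of a (frames, mel_bins) log-Mel spectrogram."""
    rng = rng or np.random.default_rng()
    x = logmel.copy()
    frames, bins = x.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)              # mask width in [0, F]
        f0 = rng.integers(0, max(1, bins - f))  # mask start bin
        x[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)               # mask length in [0, T]
        t0 = rng.integers(0, max(1, frames - t)) # mask start frame
        x[t0:t0 + t, :] = 0.0
    return x

features = np.random.default_rng(0).normal(size=(100, 80))
augmented = spec_augment(features, rng=np.random.default_rng(1))
```

Production pipelines (e.g., in the Citrinet and RNN-T recipes) apply this on-the-fly per batch, typically alongside adaptive mask sizes proportional to utterance length.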
6. Downstream Transfer and Domain Adaptation
MLS-trained acoustic encoders generalize across languages and domains. For example, Citrinet, pretrained only on English MLS, enables rapid fine-tuning to new ASR domains (e.g., TED-LIUM, AISHELL), suggesting that deep separable-convolution plus squeeze-and-excitation (SE) context stacks capture language-universal features with minimal external-LM dependence (Majumdar et al., 2021). Massively multilingual RNN-T encoders, pretrained on >150k h, exhibit strong zero-shot WER on MLS and robust adaptation when fine-tuned (Tjandra et al., 2022).
In speech translation, the augmented English–French extension provides ~236 h of sentence-aligned audio, facilitating direct end-to-end speech translation research. Human evaluation showed a moderate correlation between automatic hunAlign scores and human ratings (Pearson ρ = 0.41), and initial encoder–decoder models yielded BLEU ≈ 15 (Kocabiyikoglu et al., 2018).
7. Licensing, Distribution, and Impact
All MLS audio and text derive from public-domain sources (LibriVox, Project Gutenberg), and the corpus is distributed via OpenSLR under the permissive CC BY 4.0 license (Pratap et al., 2020), with tools, documentation, and splits enabling reproducibility. The MLS preprocessing and partitioning methodology (large, speaker-disjoint, carefully quality-controlled splits with uniform feature pipelines) sets a contemporary standard for multi-language speech benchmarks.
MLS is now the standard validation set for scalable end-to-end ASR, multilingual self-supervised pretraining, and cross-linguistic transfer learning. Its extreme resource imbalance and broad linguistic coverage underpin comparative evaluations of architecture design, tokenization strategies, joint/unified training, and resource-efficient adaptation, accelerating progress in open-vocabulary, language-universal, cross-domain speech recognition.