Multilingual LibriSpeech (MLS) Corpus
- Multilingual LibriSpeech (MLS) is a large-scale public corpus for multilingual ASR featuring diverse languages and standardized train/dev/test splits.
- The dataset employs rigorous preprocessing, including audio segmentation, text normalization, and gender-balanced, speaker-disjoint split construction, to ensure high-quality transcriptions and reliable evaluation.
- MLS serves as a benchmark for multilingual speech models, accelerating research in self-supervised learning, cross-lingual transfer, and domain adaptation.
Multilingual LibriSpeech (MLS) is a large-scale, public-domain corpus designed for multilingual automatic speech recognition (ASR) and related speech processing research. It was introduced by Pratap et al. as a benchmark for training and evaluating end-to-end speech models across typologically and resource-diverse languages using a fixed set of partitions (Pratap et al., 2020). MLS has rapidly become a central resource for benchmarking and advancing methods in multilingual ASR, speech translation, and self-supervised pretraining.
1. Dataset Composition and Structure
MLS consists of read-speech audio and corresponding text transcripts derived from public-domain audiobooks (LibriVox) and open textual sources (Project Gutenberg and others). The release covers eight languages: English (en), German (de), Dutch (nl), French (fr), Spanish (es), Italian (it), Portuguese (pt), and Polish (pl). Each language is partitioned into train/dev/test splits with disjoint speakers per split and, for dev/test, gender balance.
Table: MLS Corpus Size (by Language) (Pratap et al., 2020)
| Language | Train (h) | Dev (h) | Test (h) | Total (h) |
|---|---|---|---|---|
| English | 44,659.7 | 15.75 | 15.55 | 44,691.0 |
| German | 1,966.5 | 14.28 | 14.29 | 1,995.1 |
| Dutch | 1,554.2 | 12.76 | 12.76 | 1,579.8 |
| French | 1,076.6 | 10.07 | 10.07 | 1,096.7 |
| Spanish | 917.7 | 9.99 | 10.00 | 937.7 |
| Italian | 247.4 | 5.18 | 5.27 | 257.8 |
| Portuguese | 160.96 | 3.64 | 3.74 | 168.3 |
| Polish | 103.7 | 2.08 | 2.14 | 107.9 |
All audio is 16 kHz, 16-bit PCM; transcript normalization includes NFKC Unicode normalization, punctuation removal, and language-specific orthographic filtering (Pratap et al., 2020). Segment durations range from 10 to 20 seconds. The corpus also provides limited-supervision subsets (1 h and 10 h) and professionally verified dev/test transcripts, with measured human WER between 1.9% and 12.6% across languages.
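The normalization steps above can be sketched as follows. This is a minimal illustration using Python's standard `unicodedata` module, not the actual MLS pipeline, whose rules are language-specific:

```python
import unicodedata

def normalize_transcript(text: str) -> str:
    """Illustrative transcript normalization in the spirit of MLS:
    NFKC Unicode normalization, lowercasing, hyphen joining,
    punctuation removal, and whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Join words split by hyphens (e.g., line-break hyphenation).
    text = text.replace("-", "")
    # Drop any remaining punctuation, keep letters/digits/whitespace.
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(text.split())

# normalize_transcript("End-of-day “Report”.") -> "endofday report"
```

The exact filtering rules in MLS additionally vary per language (e.g., which orthographic characters are retained), so this sketch only captures the shared core.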
2. Data Collection, Preprocessing, and Quality Control
MLS construction involved semi-automated retrieval, segmentation, and alignment. The pipeline comprises:
- Audio and text download using Libri-Light tools.
- Audio segmentation to ~10–20 sec utterances using streaming inference with language-specific acoustic models (TDS/Auto-Segmentation Criterion).
- Pseudo-label generation using beam-search decoding with a 4-gram LM.
- Text normalization (NFKC, punctuation removal, hyphen joining).
- Transcript retrieval via TF-IDF bigram matching and Smith-Waterman alignment, requiring WER < 40% between pseudo-labels and retrieved text.
- Language-model (LM) training-text preparation from filtered Gutenberg books.
- Gender classification (RBF-SVM over log-filterbank features) for speaker balancing.
- Manual correction of dev/test transcripts.
Filtering eliminates multi-speaker or corrupt recordings, incomplete alignments, and transcript mismatches, and enforces speaker disjointness across splits. MLS dev/test sets undergo human verification.
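The transcript-retrieval step above can be illustrated with a simplified bigram TF-IDF matcher (cosine similarity over bigram counts); the real pipeline additionally applies Smith-Waterman alignment and the WER < 40% acceptance filter:

```python
import math
from collections import Counter

def bigrams(text):
    """Bigram counts over whitespace tokens."""
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))

def tfidf_cosine(query, docs):
    """Score each candidate document against the query using bigram
    TF-IDF cosine similarity (a simplified stand-in for the MLS
    transcript-retrieval step)."""
    doc_grams = [bigrams(d) for d in docs]
    df = Counter(g for dg in doc_grams for g in set(dg))
    n = len(docs)
    def vec(counts):
        # Smoothed IDF weighting so unseen bigrams stay finite.
        return {g: tf * math.log((1 + n) / (1 + df[g]))
                for g, tf in counts.items()}
    q = vec(bigrams(query))
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    scores = []
    for dv in map(vec, doc_grams):
        dot = sum(q.get(g, 0.0) * w for g, w in dv.items())
        norm = q_norm * math.sqrt(sum(w * w for w in dv.values()))
        scores.append(dot / norm if norm else 0.0)
    return scores
```

In use, the candidate book passage with the highest score would be retained and then aligned at the character level before the WER check.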
For paired speech–text translation tasks, Kocabiyikoglu et al. constructed a 236 h English–French parallel extension by aligning LibriSpeech English speech with French translations from matched public-domain e-books using hunAlign and Gentle forced alignment. Human evaluation indicated high alignment quality (average textual-alignment score 3.84/5, Cohen’s κ = 0.76) (Kocabiyikoglu et al., 2018).
3. Baseline Models and Metrics
MLS provides baseline n-gram language models (LMs) and ASR systems for every language. The LMs are 3-gram and 5-gram models trained with KenLM on normalized, de-duplicated corpus text.
Table: Example LM Statistics (English) (Pratap et al., 2020)
| #Books (filtered) | Word Types | Token Count (M) | OOV Rate | 5-gram PPL |
|---|---|---|---|---|
| 36,866 | 4,120,000 | 2,380 | 0.18% | 190.76 |
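The OOV-rate column above is simply the fraction of evaluation tokens absent from the LM vocabulary; a minimal illustration:

```python
def oov_rate(vocab, test_tokens):
    """Fraction of test tokens not covered by the LM vocabulary,
    as reported in the MLS LM statistics table."""
    vocab = set(vocab)
    misses = sum(1 for t in test_tokens if t not in vocab)
    return misses / len(test_tokens)

# One of four tokens ("down") is out of vocabulary -> 0.25
print(oov_rate({"the", "cat", "sat"}, ["the", "cat", "sat", "down"]))
```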
ASR baselines use a wav2letter++ architecture: a 1-D convolutional frontend, 36 Transformer blocks (4-head attention), CTC loss, and a grapheme output layer sized to each language's vocabulary. Models are trained per language with SpecAugment and evaluated using word error rate, WER = (S + D + I) / N, where S, D, and I are the substitution, deletion, and insertion counts and N is the reference word count.
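A straightforward implementation of this WER metric via word-level Levenshtein alignment:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (S + D + I) / N computed by Levenshtein
    alignment over words. ref and hyp are whitespace-separated."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# One substitution out of three reference words -> 1/3
print(wer("the cat sat", "the bat sat"))
```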
Table: Baseline Monolingual WERs (%), Test Sets (Pratap et al., 2020)
| Language | Viterbi | Zero LM | 5-gram LM |
|---|---|---|---|
| English | 6.99 | 6.76 | 5.88 |
| German | 6.93 | 7.10 | 6.49 |
| Dutch | 13.18 | 13.09 | 12.02 |
| French | 6.88 | 6.58 | 5.58 |
| Spanish | 6.90 | 6.68 | 6.07 |
| Italian | 12.35 | 11.78 | 10.54 |
| Portuguese | 21.70 | 20.52 | 19.49 |
| Polish | 19.40 | 21.66 | 20.39 |
4. MLS as a Benchmark for Multilingual and Low-Resource ASR
MLS’s scale, multilinguality, and extreme resource imbalance across languages (≈44.7k h of English vs. roughly 0.1–2k h for each other language) make it a canonical benchmark for scalable multilingual ASR and self-supervised learning.
Key methods and results:
- Citrinet (CTC, English-only): Citrinet-1024 achieves 8.46% WER (greedy decoding) on the MLS English test set. With 6-gram and Transformer-LM rescoring, WER drops to 6.79% and 6.39%, respectively. Training uses 42.97k h of filtered English audio with a 1,024-token SentencePiece tokenizer (Majumdar et al., 2021).
- JUST (Joint Unsupervised and Supervised Training, 8 languages): End-to-end training that jointly optimizes an RNN-T loss with contrastive and masked-LM self-supervision. Across all eight MLS languages, JUST (β=0.07) yields an average WER of 7.2% (vs. an 11.8% monolingual baseline), and with pure fine-tuning achieves 6.5%. On Polish (100 h), JUST reduces WER to 9.1% (β=0.07) and 6.6% (β→0), less than half the monolingual baseline (Bai et al., 2021).
- Massively Multilingual RNN-T (70 languages): Zero-shot evaluation on MLS achieves WER 9.5% with mixed char+subword tokenization and language-specific output heads (vs. monolingual baseline 13.7%). Fine-tuning on MLS reduces WER to 7.5% (Tjandra et al., 2022).
- IPA-Guided HuBERT Pretraining: Multilingual IPA pseudo-labels improve HuBERT pretraining robustness under limited fine-tuning. For 7 MLS languages (excluding en), HuBERT-LARGE-IPA achieves avg WER 8.55% (standard HuBERT: 9.63%, XLSR-53: 10.6%) with 6k h pretrain data (Feng et al., 2023).
5. Tokenization, Model Architecture, and Training Paradigms
MLS has been central in evaluating tokenization and architectural scaling for multilingual ASR. Main findings:
- Tokenization: SentencePiece subword units (e.g., 1,024 tokens for English in Citrinet) improve coverage and mitigate sequence-length variance. In multilingual settings, combining raw characters for languages with large character inventories (zh, ja, ko) with 512-unit language-specific subword vocabularies for the remaining languages balances error rates and decoding efficiency (Tjandra et al., 2022, Majumdar et al., 2021).
- Acoustic Features: All models standardize on 80-dim log-Mel representations (25 ms window, 10 ms stride), per-utterance normalization, and SpecAugment (Tjandra et al., 2022, Majumdar et al., 2021).
- Architectures: Deep residual CTC, RNN-Transducer with Conformer encoder and language-specific output heads, and self-supervised BERT-style pretraining using masked-prediction and/or contrastive objectives. Parameter scales range from ~100M (HuBERT BASE) to ~1B (RNN-T) (Tjandra et al., 2022, Feng et al., 2023).
- Training: MLS supports massively parallel distributed training, e.g., Citrinet on 256 GPUs with batch size 8,192, and RNN-T on 64 GPUs. Regularization (e.g., SpecAugment), fully sharded data parallelism (FSDP), and optimizer scheduling (Adam, NovoGrad) follow best practices in large-scale speech modeling (Majumdar et al., 2021, Tjandra et al., 2022).
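As a sketch of the SpecAugment regularization used throughout these systems, the following masks random frequency and time bands of a NumPy (frames × mel-bins) log-Mel array; the parameter names follow the SpecAugment paper, but the values here are illustrative rather than the published settings:

```python
import numpy as np

def spec_augment(logmel, n_freq_masks=2, F=10, n_time_masks=2, T=20, rng=None):
    """Minimal SpecAugment sketch: zero out random frequency and time
    bands of a (frames, mel_bins) log-Mel spectrogram."""
    rng = rng or np.random.default_rng()
    x = logmel.copy()
    frames, bins = x.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)              # mask width in [0, F]
        f0 = rng.integers(0, max(1, bins - f))  # mask start bin
        x[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)               # mask length in [0, T]
        t0 = rng.integers(0, max(1, frames - t)) # mask start frame
        x[t0:t0 + t, :] = 0.0
    return x

features = np.random.default_rng(0).normal(size=(100, 80))
augmented = spec_augment(features, rng=np.random.default_rng(1))
```

Production pipelines (e.g., in the Citrinet and RNN-T recipes) apply this on-the-fly per batch, typically alongside adaptive mask sizes proportional to utterance length.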
6. Downstream Transfer and Domain Adaptation
MLS-trained acoustic encoders generalize across languages and domains. For example, Citrinet, pretrained only on English MLS, enables rapid fine-tuning to new ASR domains (e.g., TED-LIUM, AISHELL), suggesting that deep separable-convolution plus squeeze-and-excitation (SE) context stacks capture language-universal features with minimal external-LM dependence (Majumdar et al., 2021). Massively multilingual RNN-T encoders, pretrained on >150k h, exhibit strong zero-shot WER on MLS and robust adaptation when fine-tuned (Tjandra et al., 2022).
In speech translation, the augmented English–French extension provides ~236 h of sentence-aligned audio, facilitating direct end-to-end speech translation research. Human evaluation showed a moderate correlation between automatic hunAlign scores and human ratings (Pearson ρ = 0.41), and initial encoder–decoder models yielded BLEU ≈ 15 (Kocabiyikoglu et al., 2018).
7. Licensing, Distribution, and Impact
All MLS audio and text derive from public-domain sources (LibriVox, Project Gutenberg), and the corpus is distributed via OpenSLR under the permissive CC BY 4.0 license (Pratap et al., 2020), with tools, documentation, and splits enabling reproducibility. The MLS preprocessing and partitioning methodology (large, speaker-disjoint, carefully quality-controlled splits with uniform feature pipelines) sets a contemporary standard for multi-language speech benchmarks.
MLS is now the standard validation set for scalable end-to-end ASR, multilingual self-supervised pretraining, and cross-linguistic transfer learning. Its extreme resource imbalance and broad linguistic coverage underpin comparative evaluations of architecture design, tokenization strategies, joint/unified training, and resource-efficient adaptation, accelerating progress in open-vocabulary, language-universal, cross-domain speech recognition.