ML-SUPERB: Multilingual Speech Benchmark

Updated 28 March 2026

ML-SUPERB is a comprehensive benchmark suite that evaluates multilingual ASR and LID tasks using both self-supervised and supervised speech models.
The benchmark introduces configurable evaluation protocols and parameter-efficient adaptation strategies, such as LoRA and adapters, to balance performance with resource constraints.
It covers over 200 language varieties, emphasizing low-resource, dialect, and accent diversity, and utilizes transparent metrics like macro-CER and worst-language CER.

The Multilingual Speech Universal PERformance Benchmark (ML-SUPERB) is a suite of standardized benchmarks and open challenges for evaluating the generalization, language coverage, and efficiency of self-supervised learning (SSL) and supervised speech foundation models on multilingual automatic speech recognition (ASR) and language identification (LID). Originating from the SUPERB initiative, which focused on English-centric tasks, ML-SUPERB addresses the evaluation gap for low-resource and diverse linguistic settings across hundreds of languages, dialects, and accents. It has evolved from a fixed downstream evaluation protocol (ML-SUPERB 1.0) to configurable benchmarks (ML-SUPERB 2.0) and large-scale, unconstrained challenge tracks covering over 200 language varieties.

1. Benchmark Design, Scope, and Evolution

ML-SUPERB extends the SUPERB framework to multilingual settings, with the explicit aim of benchmarking speech SSL and supervised models across a wide linguistic spectrum, including high-resource, low-resource, endangered, dialectal, and accent varieties. The original release (ML-SUPERB 1.0) provided a leaderboard over 143 languages, with fixed training regimes (10 min, 1 h per language) to compare frozen SSL feature extractors plus a shallow downstream model across multilingual ASR and LID tasks (Shi et al., 2023). Subsequent updates (ML-SUPERB 2.0) introduced parameter budgeting, multiple downstream architectures, and efficient fine-tuning/adaptation methods (Shi et al., 2024). The Interspeech 2025 ML-SUPERB 2.0 Challenge advanced the evaluation to over 200 language varieties (149 ISO-coded languages, 93 accents/dialects), relaxing training constraints while introducing robustness and inclusivity metrics (Chen et al., 8 Sep 2025).

The benchmark suite also introduced a "New Language Track" that invites corpus contributions from language-resource researchers, fostering extensibility and adaptation to ever broader speech communities (Shi et al., 2023).

2. Formal Task Definitions and Evaluation Metrics

ML-SUPERB focuses on two primary tasks: language identification (LID) and multilingual automatic speech recognition (ASR).

Language Identification (LID): Given utterance $x$, predict the language label $\ell \in \{1, ..., L\}$:

$\hat{\ell} = \arg\max_{\ell} P(\ell|x)$

Metric: Classification accuracy (ACC):

$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^N 1[\hat{\ell}_i = \ell_i]$
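
The argmax rule and the accuracy formula above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code: the function names, the language codes, and the posterior values are made up for the example.

```python
from typing import Dict, List


def predict_language(posteriors: Dict[str, float]) -> str:
    """Return the label with the highest posterior P(l | x)."""
    return max(posteriors, key=posteriors.get)


def lid_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of utterances whose predicted label equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one utterance is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy posteriors for three utterances (illustrative numbers only).
toy_posteriors = [
    {"eng": 0.7, "deu": 0.2, "urd": 0.1},
    {"eng": 0.1, "deu": 0.3, "urd": 0.6},
    {"eng": 0.5, "deu": 0.4, "urd": 0.1},
]
toy_references = ["eng", "urd", "deu"]
toy_predictions = [predict_language(p) for p in toy_posteriors]
toy_accuracy = lid_accuracy(toy_predictions, toy_references)
```

In this toy case the third utterance is misclassified, so two of three predictions are correct.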

Multilingual ASR: Given input $x$, predict token sequence $y = (y_1, ..., y_T)$, sometimes prepending a language ID token.
- Decoding frameworks: CTC-only, or CTC-attention hybrid (CTC-ATT; Transformer decoder trained jointly with CTC).
- Metric: Character Error Rate (CER):

$\mathrm{CER} = \frac{S + D + I}{N}$

where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions between hypothesis and reference, and $N$ is the total number of reference characters.
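
As a hedged illustration of the CER definition, the sketch below computes the minimum number of character edits with a standard Levenshtein dynamic program and divides by the reference length. The function names are hypothetical, and any text normalization the benchmark applies before scoring is not modeled here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum total of substitutions, deletions, and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length N."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


example_cer = cer("kitten", "sitting")  # 3 edits over 6 reference characters
```

Because insertions count against a fixed reference length, CER computed this way can exceed 1 when the hypothesis is much longer than the reference.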

Aggregation across languages: Key metrics include macro-CER (mean over languages), SD-CER (standard deviation), and worst-language CER (maximum language-wise CER). Dialectal and few-shot track metrics are also reported for accent/low-resource robustness (Shi et al., 2024, Chen et al., 8 Sep 2025).
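
The three aggregate statistics can be sketched as follows. This is an assumption-laden example: the source does not state whether SD-CER is a population or a sample standard deviation, so the population form (`statistics.pstdev`) is used here, and the per-language values are invented for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, NamedTuple


class CerSummary(NamedTuple):
    macro_cer: float      # unweighted mean over languages
    sd_cer: float         # spread of language-wise CER (population SD, assumed)
    worst_language: str   # language with the highest CER
    worst_cer: float      # that language's CER


def summarize_cer(per_language_cer: Dict[str, float]) -> CerSummary:
    """Aggregate language-wise CER so every language counts equally."""
    if not per_language_cer:
        raise ValueError("need at least one language")
    values = list(per_language_cer.values())
    worst_language = max(per_language_cer, key=per_language_cer.get)
    return CerSummary(
        macro_cer=mean(values),
        sd_cer=pstdev(values),
        worst_language=worst_language,
        worst_cer=per_language_cer[worst_language],
    )


# Hypothetical per-language CER values (fractions, not percentages).
toy_summary = summarize_cer({"eng": 0.10, "urd": 0.30, "swa": 0.20})
```

Averaging per language, rather than pooling all characters, keeps a high-resource language with many test utterances from masking poor performance elsewhere.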

3. Downstream Architectures and Adaptation Strategies

The initial ML-SUPERB evaluation (1.0) fixed the downstream architecture: a 2-layer Transformer encoder (~6 M parameters) trained with CTC loss. ML-SUPERB 2.0 instead takes a modular approach, evaluating:

Encoder architectures: Transformer, Conformer, E-Branchformer.
Decoding paradigms: CTC-only or CTC-ATT hybrid.
Parameter budget: All configurations constrained to <100M tunable parameters.
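
For the CTC-only paradigm, the simplest decoder is greedy (best-path) decoding: take the most likely token per frame, merge consecutive repeats, then drop blanks. The sketch below shows that collapse step only. It is a generic illustration rather than the decoder used by the benchmark recipes, and the blank index of 0 is an assumption.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Collapse per-frame argmax token ids into an output token sequence."""
    decoded: List[int] = []
    previous = None
    for token in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded


# A blank between two runs of token 3 keeps them as separate outputs.
toy_tokens = ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0])
```

The CTC-ATT hybrid additionally trains an attention decoder alongside the CTC objective, which this sketch does not cover.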

Partial and efficient adaptation strategies are core to the ML-SUPERB design:

Full vs. partial fine-tuning: Update all or selected blocks of SSL backbone (e.g., only middle 6 layers), balancing adaptation potential vs. overfitting and compute (Shi et al., 2024, Wang et al., 30 May 2025).
Adapters: Inserted as small trainable modules into each pretrained encoder layer (Houlsby configuration; hidden size 64).
LoRA: Low-rank adaptation in attention projections; only the added low-rank matrices $A$ and $B$ are trained.
Downstream encoder depth is reduced to meet the parameter cap when using parameter-efficient mechanisms.
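
A rough parameter count shows why these mechanisms leave room under the 100M cap. The sketch below is illustrative arithmetic only. The bottleneck size of 64 comes from the text above; the model width of 1024, the 24 encoder layers, the two adapters per layer, the LoRA rank of 8, and the choice of two adapted projections per layer are all assumptions made for the example, not values reported by the benchmark.

```python
def adapter_params(d_model: int, bottleneck: int = 64) -> int:
    """Bottleneck adapter: down-projection and up-projection, each with bias."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds a (rank x d_in) and a (d_out x rank) matrix per projection."""
    return rank * (d_in + d_out)


def within_budget(tunable_params: int, budget: int = 100_000_000) -> bool:
    """Check the <100M tunable-parameter constraint."""
    return tunable_params < budget


# Assumed backbone shape for illustration: 24 layers, width 1024.
NUM_LAYERS, D_MODEL = 24, 1024

# Assumption: two adapters per layer (after attention and feed-forward).
total_adapter = NUM_LAYERS * 2 * adapter_params(D_MODEL, bottleneck=64)

# Assumption: rank 8 on two attention projections per layer.
total_lora = NUM_LAYERS * 2 * lora_params(D_MODEL, D_MODEL, rank=8)
```

Under these assumptions the adapters add about 6.3M tunable parameters and LoRA under 1M, so most of the budget remains for the downstream encoder and decoder.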

4. Dataset Construction, Language and Voice-Type Coverage

ML-SUPERB datasets span hundreds of languages, dialects, and accents:

Language coverage: Public sets comprise 143–154 languages; challenge tracks expand to 149 languages plus 93 accents/dialects, spanning over 200 varieties (Chen et al., 8 Sep 2025, Shi et al., 2023).
Speech sources: Multilingual LibriSpeech, CommonVoice, FLEURS, VoxForge, VoxPopuli, regional and endangered language corpora, formal and conversational speech (Shi et al., 2023, Shi et al., 2023).
Split protocol: Strict balancing (e.g., 10 min/1 h per language in public splits, 10 min of dev/test for dialects). Few-shot languages and datasets with <5 utterances are included to probe low-resource ASR.
Accented/dialectal data: Challenge track includes development and hidden test corpora with labeled accent/dialect but hides these labels in evaluation for unbiased scoring.
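
One way to picture a fixed per-language duration budget (such as 10 min or 1 h) is a greedy pass that keeps utterances until each language's budget is full. The source does not describe the actual selection procedure, so the sketch below is only a hypothetical first-fit strategy; the tuple layout, function name, and toy durations are invented.

```python
from typing import Dict, Iterable, List, Tuple

# (utterance_id, language, duration_seconds); layout assumed for illustration.
Utterance = Tuple[str, str, float]


def select_budgeted_subset(
    utterances: Iterable[Utterance], budget_seconds: float
) -> Dict[str, List[str]]:
    """Keep utterances in input order while each language stays within budget."""
    selected: Dict[str, List[str]] = {}
    used: Dict[str, float] = {}
    for utt_id, lang, duration in utterances:
        if used.get(lang, 0.0) + duration <= budget_seconds:
            selected.setdefault(lang, []).append(utt_id)
            used[lang] = used.get(lang, 0.0) + duration
    return selected


toy_utterances = [
    ("a1", "eng", 300.0), ("a2", "eng", 250.0),
    ("a3", "eng", 100.0), ("a4", "eng", 40.0),
    ("u1", "urd", 590.0), ("u2", "urd", 20.0),
]
# 600 seconds corresponds to the 10-minute regime.
toy_subset = select_budgeted_subset(toy_utterances, budget_seconds=600.0)
```

Here `a3` and `u2` are skipped because adding them would exceed the 600-second budget for their language.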

Voice-type variation (read, conversational, singing) is explicitly considered in the evaluation of robustness, and domain mismatch is shown to induce wide CER variability (illustrated by Urdu: 21.8% CER on CommonVoice vs 56.9% on FLEURS) (Shi et al., 2024, Shi et al., 2023).

5. Comparative Results and Performance Trends

Model and algorithmic advances drove measurable improvements under ML-SUPERB protocols:

SSL models outperform classical FBANK baselines (e.g., XLSR-128: CER=29.2% vs. 62.4% for FBANK in 10 min regime; LID ACC ≈66.9%) (Shi et al., 2023).
ML-SUPERB 2.0 downstream upgrades (E-Branchformer + CTC-ATT) raise LID ACC to as high as 94.7% and lower CER to 16.9%, a ~32% relative CER reduction versus ML-SUPERB 1.0 (Shi et al., 2024).
Fine-tuning strategies: Full middle-layer tuning (e.g., layers 9–14 for XLS-R) consistently outperformed bottom/top-layer adaptation. LoRA was competitive with partial fine-tuning.
Parameter-efficient adaptation: LoRA outperformed adapters by ~0.3–0.7 CER points and provided near-parity with partial FT; crucial when compute budgets preclude full parameter updates (Shi et al., 2024, Wang et al., 30 May 2025).
Supervised pretrained models (e.g., Whisper, OWSM): Did not uniformly surpass SSL with parameter constraints; required careful adaptation of decoders (Shi et al., 2024, Wang et al., 30 May 2025).
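
Partial fine-tuning amounts to marking a contiguous block of backbone layers as trainable and freezing the rest. The sketch below builds such a mask for layers 9–14, the range cited above for XLS-R. The 24-layer depth and the 1-based layer indexing are assumptions for illustration, and the helper name is hypothetical.

```python
from typing import Dict, Iterable


def trainable_mask(num_layers: int, tuned_layers: Iterable[int]) -> Dict[int, bool]:
    """Map each 1-based layer index to True if it should be updated."""
    tuned = set(tuned_layers)
    invalid = sorted(i for i in tuned if not 1 <= i <= num_layers)
    if invalid:
        raise ValueError(f"layer indices out of range: {invalid}")
    return {i: (i in tuned) for i in range(1, num_layers + 1)}


# Assumed 24-layer backbone; tune only the middle block, layers 9 to 14.
mask = trainable_mask(num_layers=24, tuned_layers=range(9, 15))
num_tuned = sum(mask.values())
```

In a framework such as PyTorch, a mask like this would typically be applied by setting `requires_grad` to `False` on the parameters of every layer marked `False`.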

Challenge benchmark outcomes: In the ML-SUPERB 2.0 Challenge, the best submissions reduced CER by up to 30.2 pp and improved LID ACC by 23 pp versus XEUS SSL baselines; dialectal robustness remained challenging (e.g., dialect LID ACC as low as 27.2% for WavLM vs. 56.6% for top hybrid systems) (Chen et al., 8 Sep 2025, Alumäe et al., 2 Jun 2025).
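
The results above mix two kinds of improvement figures: absolute differences in percentage points (pp) and relative reductions. The sketch below makes the distinction explicit using the FBANK and XLSR-128 CER values quoted earlier in this section; the function names are hypothetical.

```python
def absolute_reduction_pp(baseline_pct: float, new_pct: float) -> float:
    """Difference between two error rates, in percentage points."""
    return baseline_pct - new_pct


def relative_reduction(baseline_pct: float, new_pct: float) -> float:
    """Fraction of the baseline error that was removed."""
    if baseline_pct <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_pct - new_pct) / baseline_pct


# CER in the 10 min regime: FBANK 62.4% vs. XLSR-128 29.2%.
abs_drop = absolute_reduction_pp(62.4, 29.2)
rel_drop = relative_reduction(62.4, 29.2)
```

For this pair the absolute drop is about 33.2 pp while the relative reduction is about 53%, which is why the two kinds of figures should not be compared directly.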

6. Key Insights, Limitations, and Recommendations

Empirical observations from ML-SUPERB have led to several technical insights:

Downstream architecture selection has a critical impact. E-Branchformer + CTC-ATT dominates in full-data settings, while CTC-only often generalizes better in few-shot settings (Shi et al., 2024).
Mid-layer adaptation is optimal for both LID and ASR, balancing task transfer and retaining generalization (Wang et al., 30 May 2025).
Data augmentation and targeted adaptation policies dramatically improve low-resource/dialect performance, highlighting the necessity for ongoing dataset expansion and tuning (Wang et al., 30 May 2025).
Aggregate reporting: Macro-CER, SD, and worst-language CER are necessary for transparency in multilingual fairness and robustness assessment (Shi et al., 2024, Chen et al., 8 Sep 2025).
Supervised pretraining requires adaptation: Supervised ASR backbones do not always generalize to unseen languages/dialects, especially in settings with few adaptation parameters.
Voice-type and domain mismatch remain dominant contributors to performance variability, motivating the inclusion of conversational, spontaneous, or non-standard speech in both pre-training and evaluation (Shi et al., 2023).
Parameter scaling alone is insufficient: Smaller, well-tuned multilingual SSL models often rival or surpass much larger models, especially under challenging, diverse, or hidden test sets (Shi et al., 2023).

A plausible implication is that the benchmark's systematic analysis of adaptation, efficient fine-tuning, and generalization gaps will continue to shape the evaluation and training of robust multilingual speech models beyond simple scale-up approaches.

7. Future Directions and Community Impact

ML-SUPERB has driven the field toward increasingly inclusive, diverse, and robust speech technology evaluation. Planned and open research directions include:

Further extension to ultra-low-resource languages, code-switching, spoken language understanding, and speech translation.
Robustness to domain variations such as conversational speech, noisy/far-field audio, and speaker/style variability.
Continuous, server-based leaderboards leveraging blind-evaluation protocols (e.g., DynaBench) for scalable and fair benchmarking (Chen et al., 8 Sep 2025).
Integration of new model adaptation and parameter-efficient learning techniques for resource-constrained and edge use cases.

Practitioners are advised to prioritize architectures and adaptation recipes that are validated under ML-SUPERB's stringent parameter, resource, and fairness constraints, using tools and ESPnet recipes provided by the benchmark maintainers (Shi et al., 2024). As the benchmark continues to evolve, it is positioned as the primary reference for evaluating progress in inclusive, universal speech representation learning.