ML-SUPERB 2.0: Multilingual Speech Benchmark

Updated 30 June 2025
  • ML-SUPERB 2.0 Challenge is a comprehensive benchmark that evaluates self-supervised and supervised speech models across 142 languages and 15 datasets.
  • It applies strict parameter constraints and innovative adaptation strategies, including LoRA and adapters, to ensure fair and robust model comparisons.
  • The challenge emphasizes real-world performance with metrics for fairness and efficiency, guiding research in low-resource and diverse acoustic environments.

The ML-SUPERB 2.0 Challenge is an extensive multilingual benchmark designed to objectively evaluate speech processing models—especially self-supervised and supervised foundation models—across diverse languages, datasets, adaptation regimes, and modeling constraints. Building on the lineage of SUPERB and ML-SUPERB, this iteration introduces rigorous methodological extensions, new adaptation protocols, and enhanced fairness metrics, shaping the evaluation landscape for universal speech technologies.

1. Benchmarking Framework and Methodology

ML-SUPERB 2.0 employs a comprehensive evaluation protocol designed to reflect real-world conditions for multilingual ASR (Automatic Speech Recognition) and LID (Language Identification). The evaluation suite spans 142 languages and 15 datasets, using a 1-hour training and 10-minute dev/test split per language-dataset pairing, resulting in a corpus of ≈300 hours. A distinctive few-shot scenario is included, assigning only five training utterances to each of 20 reserved languages to probe data efficiency and generalization.

All model submissions must satisfy a strict parameter constraint: no more than 100 million tunable parameters per configuration. This constraint enables fair comparison—especially when massive foundation models are included via frozen representations.

The primary metrics are macro-averaged Character Error Rate (CER) for ASR and accuracy for LID, with additional focus on:

  • Standard deviation (SD) of CER across languages, highlighting robustness,
  • CER on the worst-performing language (WL),
  • Cross-dataset CER range for languages appearing in multiple corpora,
  • LID accuracy,
  • Performance in few-shot (ultra-low resource) settings.
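As a concrete illustration, a minimal Python sketch of how these aggregates could be computed from per-language results is shown below; the result dictionaries, language codes, and corpus names in the example are hypothetical placeholders, not official scoring code.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_results(per_lang_cer, lid_correct, lid_total, cer_by_lang_dataset):
    """Aggregate per-language scores into ML-SUPERB 2.0-style summary metrics.

    per_lang_cer: dict mapping language -> CER in % (hypothetical inputs)
    lid_correct, lid_total: LID counts for accuracy
    cer_by_lang_dataset: dict mapping (language, dataset) -> CER in %
    """
    cers = list(per_lang_cer.values())
    macro_cer = mean(cers)                       # macro-averaged CER across languages
    cer_sd = stdev(cers)                         # standard deviation across languages (robustness)
    worst_lang, worst_cer = max(per_lang_cer.items(), key=lambda kv: kv[1])

    # Cross-dataset CER range for languages appearing in multiple corpora
    by_lang = defaultdict(list)
    for (lang, _dataset), cer in cer_by_lang_dataset.items():
        by_lang[lang].append(cer)
    cross_dataset_range = {
        lang: max(v) - min(v) for lang, v in by_lang.items() if len(v) > 1
    }

    return {
        "macro_cer": macro_cer,
        "cer_sd": cer_sd,
        "worst_language": (worst_lang, worst_cer),
        "lid_accuracy": 100.0 * lid_correct / lid_total,
        "cross_dataset_cer_range": cross_dataset_range,
    }

# Hypothetical example: one language appears in two corpora with very different CERs.
print(summarize_results(
    per_lang_cer={"urd": 39.4, "lao": 55.0, "eng": 12.1},
    lid_correct=850, lid_total=1000,
    cer_by_lang_dataset={("urd", "commonvoice"): 21.8, ("urd", "fleurs"): 56.9,
                         ("lao", "fleurs"): 55.0, ("eng", "other"): 12.1},
))
```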

Model adaptation and downstream architectures are systematically varied. Configurations include:

  • Shallow two-layer downstream models (as in previous ML-SUPERB benchmarks; see the sketch after this list),
  • Larger architectures such as Transformer, Conformer, and E-Branchformer, used with both pure CTC and hybrid CTC-attention training objectives.
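As a rough sketch of the frozen-encoder configuration with a shallow downstream head trained under CTC, the following PyTorch snippet assumes precomputed upstream features of dimension 1024 and a character vocabulary of 500; these values, and the snippet itself, are illustrative assumptions rather than the challenge's reference implementation.

```python
import torch
import torch.nn as nn

class ShallowCTCHead(nn.Module):
    """Two-layer downstream model over frozen SSL features (illustrative only)."""
    def __init__(self, feat_dim=1024, hidden_dim=256, vocab_size=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size + 1),   # +1 for the CTC blank symbol
        )

    def forward(self, feats):                   # feats: (batch, time, feat_dim)
        return self.net(feats).log_softmax(-1)  # (batch, time, vocab+1) log-probs

# Hypothetical batch of frozen encoder outputs and character targets.
head = ShallowCTCHead()
feats = torch.randn(4, 200, 1024)               # frozen upstream representations
targets = torch.randint(1, 501, (4, 30))        # character indices (0 reserved for blank)
log_probs = head(feats).transpose(0, 1)         # CTC expects (time, batch, vocab)
loss = nn.CTCLoss(blank=0)(
    log_probs,
    targets,
    torch.full((4,), 200),                      # input lengths
    torch.full((4,), 30),                       # target lengths
)
loss.backward()                                 # only the shallow head receives gradients
```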

2. Model Adaptation Approaches and Practical Strategies

ML-SUPERB 2.0 systematically explores adaptation strategies for both SSL and supervised models. The approaches include:

  • Frozen encoder with shallow heads: The backbone remains fixed, only the small downstream model is trained.
  • Full fine-tuning: All model layers are updated, achieving the best results in the standard setting but requiring more resources.
  • Partial fine-tuning: Only specific model layers (often middle layers) are updated, exploiting representational stratification within deep models. Empirically, middle-layer partial tuning outperforms both top-layer and bottom-layer fine-tuning.
  • Adapters: Small trainable networks (e.g., per-layer bottleneck adapters, dimension 64) are inserted into the main model and only those are updated.
  • LoRA (Low-Rank Adaptation): Low-rank matrices are added to certain attention layers, updated during adaptation while the main parameters are kept frozen. For example, LoRA rank 16 with scaling α = 16.

Results show that LoRA slightly outperforms adapters at equivalent parameter budgets and can approach full fine-tuning performance in the standard setting, providing dramatic gains over frozen baselines. Empirically, LoRA and adapters offer favorable performance-cost trade-offs for efficient adaptation, which is especially valuable for rapid deployment or low-resource scenarios.
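To make these parameter-efficient options concrete, the following PyTorch sketch implements a bottleneck adapter (dimension 64) and a LoRA-augmented linear layer (rank 16, α = 16) in the generic form described above; the module names and the wrapped projection are assumptions, not the challenge baselines.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Per-layer bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)            # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r=16, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # backbone weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)         # zero-init so training starts from the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Hypothetical usage: wrap one attention projection of a frozen encoder layer.
proj = nn.Linear(1024, 1024)                       # stands in for a query/value projection
lora_proj = LoRALinear(proj, r=16, alpha=16)
trainable = sum(p.numel() for p in lora_proj.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")   # 2 * 16 * 1024 = 32768
```

Zero-initializing the up-projection (adapter) and the B matrix (LoRA) means the adapted model is initially identical to the frozen backbone, so adaptation starts from the pretrained behavior rather than a random perturbation.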

Data augmentation is employed to compensate for few-shot limitations: supplementing the training data with utterances from external sources greatly improves downstream LID and ASR performance in ultra-low-resource languages.
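In its simplest form, such supplementation pools the handful of official few-shot utterances with same-language utterances drawn from external corpora and oversamples the scarce in-domain data. The sketch below assumes a hypothetical manifest of (audio_path, transcript, language) records and is not any particular team's pipeline.

```python
import random

def build_fewshot_training_set(fewshot_utts, external_utts, language, oversample=20):
    """Pool few-shot utterances with external same-language data (illustrative only).

    fewshot_utts:  the ~5 official training utterances for the few-shot language
    external_utts: list of (audio_path, transcript, language) records from other corpora
    oversample:    assumed repetition factor for the scarce in-domain utterances
    """
    external_same_lang = [u for u in external_utts if u[2] == language]
    combined = fewshot_utts * oversample + external_same_lang
    random.shuffle(combined)
    return combined
```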

3. Impact of Downstream Architecture on Performance

The downstream model design crucially influences both absolute performance and the comparative ranking of backbone models. E-Branchformer architectures consistently outperform both standard Transformers and Conformers, achieving lower CER and improved LID accuracy (e.g., CER: 16.6% for MMS + E-Branchformer vs 24.8% for XLS-R + baseline Transformer under original ML-SUPERB settings). Hybrid CTC-attention frameworks typically outperform pure CTC in full-data regimes, while pure CTC retains an edge in few-shot scenarios.

The parameter constraint (≤100M) ensures the comparability and accessibility of submissions, but further highlights architectural scaling limits: increasing model size confers diminishing returns in low-data regimes, favoring more parameter-efficient architectures and adaptation methods.
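Since the constraint applies to tunable parameters only, verifying compliance reduces to counting parameters that require gradients; a minimal PyTorch helper, assuming the submission is packaged as a single nn.Module, might look as follows.

```python
import torch.nn as nn

PARAM_BUDGET = 100_000_000  # ML-SUPERB 2.0 cap on tunable parameters

def check_tunable_budget(model: nn.Module, budget: int = PARAM_BUDGET) -> int:
    """Count trainable parameters and flag configurations that exceed the cap."""
    tunable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    if tunable > budget:
        raise ValueError(f"{tunable:,} tunable parameters exceed the {budget:,} budget")
    return tunable
```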

4. Language and Dataset Robustness

A key insight of ML-SUPERB 2.0 is the pronounced variability in ASR performance across languages and datasets:

  • Standard deviations of CER across languages range from 10% to as high as 22%, underlining large gaps in model robustness.
  • The worst-performing languages (e.g., Lao, Min Nan Chinese) routinely exhibit CERs more than twice the overall mean.
  • Within-language, cross-dataset variance is also substantial. For example, Urdu receives a CER of 21.8% on one dataset (Common Voice) but 56.9% on another (Fleurs), reflecting major domain and acoustic biases.

These findings emphasize a genuine need for targeted adaptation methods and fairness-aware evaluation metrics, as universal models remain highly sensitive to language typology, training resource coverage, and dataset/domain characteristics.

5. Challenge Results, Baseline Progress, and Techniques

Recent results from ML-SUPERB 2.0 and its associated Interspeech 2025 challenge demonstrate significant advances:

  • The winning TalTech system combines a hybrid LID approach (deep language embeddings from a frozen SeamlessM4T encoder, reranked with a phonotactic, bigram LM-driven ASR system) with per-language modular ASR model selection (fine-tuned SeamlessM4T, MMS-1B-all with language adapters, MMS-zeroshot), achieving 86.8% LID accuracy and 27.4% CER against baselines of 53.8% and 51.9% (arXiv:2506.01458). Language adapters and custom fine-tuning are employed for languages with poor baseline coverage.
  • The runner-up system (Xiaomi et al., arXiv:2505.24200) employs data augmentation for few-shot languages, partial fine-tuning (especially of middle encoder layers), and auxiliary LID CTC regularization (a generic sketch of such an auxiliary loss follows this list), attaining roughly 14% relative improvement in LID accuracy and 30% relative improvement in CER over frozen baselines.
  • Model adaptation techniques and data supplementing strategies—both supervised and self-supervised—are empirically validated. LoRA and partial layer tuning are consistently strong, especially when combined with targeted data augmentation and auxiliary regularization.
  • No single adaptation or modeling approach dominates across all settings; robust performance depends on synergizing model selection, resource-aware adaptation, and aggressive augmentation.
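The auxiliary LID CTC regularization mentioned above can be sketched generically as an interpolation of the ASR CTC objective with a second CTC loss whose target is a single language token per utterance; the interpolation weight and head layout below are assumptions, not the published recipe.

```python
import torch
import torch.nn as nn

def joint_asr_lid_ctc_loss(asr_log_probs, asr_targets, in_lens, asr_tgt_lens,
                           lid_log_probs, lid_labels, aux_weight=0.3):
    """Generic sketch of auxiliary LID CTC regularization.

    asr_log_probs: (time, batch, vocab) log-probs from the ASR output head
    lid_log_probs: (time, batch, n_languages + 1) log-probs from a separate LID head
    lid_labels:    (batch,) language indices in [1, n_languages] (0 reserved for blank)
    aux_weight:    assumed interpolation weight, not taken from the paper
    """
    asr_loss = nn.CTCLoss(blank=0)(asr_log_probs, asr_targets, in_lens, asr_tgt_lens)
    lid_targets = lid_labels.unsqueeze(1)                 # (batch, 1): one language token
    lid_loss = nn.CTCLoss(blank=0)(
        lid_log_probs, lid_targets, in_lens,
        torch.ones_like(lid_labels),                      # each target sequence has length 1
    )
    return asr_loss + aux_weight * lid_loss
```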

6. Practical Considerations, Limitations, and Future Directions

ML-SUPERB 2.0 is explicitly designed to drive methodological rigor and real-world impact:

  • The inclusion of robustness/fairness metrics shifts focus from mean performance to outlier and least-served cases, explicitly addressing the risk of systemic underperformance on low-resource, dialectal, or out-of-domain data.
  • The challenge exposes the limitations of simply scaling model size, confirming that intelligent adaptation, architectural choices, and domain-matched training are required for equitable generalization.
  • Efficient adaptation protocols (adapters, LoRA) are not only critical for real-world system deployment but are also essential for scalable evaluation and transfer learning in resource-constrained environments.
  • Persistent challenges remain, including handling extreme low-resource languages, robustly bridging cross-domain gaps, and addressing the pronounced acoustic and linguistic diversity present in global speech data.

Open research avenues include:

  • Systematic development of language-aware adaptation and learning strategies,
  • Improved fairness metrics and mitigation techniques,
  • Stronger benchmarks for data and model documentation, supply chain transparency, and reproducibility, especially as demanded by large-scale model sharing and foundation model evaluation.

7. Summary Table: ML-SUPERB 2.0 Key Technical Dimensions

| Aspect | Features/Requirements | Key Metric(s) |
|---|---|---|
| Languages | 142 languages, 15 datasets, ≈300 h total | Macro CER, LID acc., SD, WL |
| Downstream models | Shallow baseline, Transformer/Conformer/E-Branchformer (CTC / CTC-attention) | CER, LID acc., parameter count |
| Model adaptation | Frozen, partial/full fine-tuning, adapters, LoRA | CER/LID vs. adaptation type |
| Parameter constraint | ≤100M tunable parameters | - |
| Robustness/fairness | Macro SD, worst-language CER, cross-dataset range | SD, WL, few-shot LID/CER |
| Efficiency | Adaptation cost, methods for low-resource deployment | - |

ML-SUPERB 2.0 establishes a new benchmark paradigm for the evaluation of multilingual speech models—setting explicit standards for comprehensive, equitable, and efficient performance across a global spectrum of languages and real-world domains.