Interspeech 2025 ML-SUPERB 2.0 Challenge

Updated 12 September 2025
  • The ML-SUPERB 2.0 Challenge is a community-driven evaluation that benchmarks multilingual ASR and LID across more than 200 languages and dialects.
  • The test suite comprises a standard multilingual set and an accented/dialectal subsuite, ensuring robust, fairness-oriented measurements.
  • Submitted systems leveraged diverse architectures and data augmentation to achieve notable improvements in error rates and language identification accuracy.

The Interspeech 2025 ML-SUPERB 2.0 Challenge is a community-driven evaluation designed to advance the inclusivity and robustness of multilingual automatic speech recognition (ASR) and language identification (LID) systems. By constructing a comprehensive test suite that spans more than 200 languages, dialects, and accents—and by removing architectural restrictions in system design—the challenge fosters innovations that address underrepresented language varieties, increased fairness, and real-world robustness in speech technology (Chen et al., 8 Sep 2025).

1. Objectives and Scope

The principal objective of the ML-SUPERB 2.0 Challenge is to catalyze improvements in state-of-the-art ASR models across the broadest spectrum of language varieties yet attempted in the field. The design emphasizes:

  • Extensive Linguistic Coverage: More than 200 language varieties, including major world languages, low-resource languages, dialects, and regional accents, to foreground disparities in ASR and LID performance.
  • Inclusive and Open Benchmarking: No restrictions are placed on modeling approaches, training data, or system architectures, thus promoting innovation in data selection, model adaptation, and evaluation strategies.
  • Focus on Robustness and Fairness: Special attention is paid to language and dialectal robustness through targeted test splits, explicit performance metrics for hardest cases, and robust aggregation of metrics.

By leveraging this scope, the challenge aims to push ASR technology toward truly inclusive, global applicability (Chen et al., 8 Sep 2025).

2. Test Suite Design and Data Splits

The test suite comprises two principal components:

| Partition | Languages/Varieties | Data Volume (per language/variety) | Primary Use |
|---|---|---|---|
| Standard Multilingual Set | 149–154 languages | 1 hour train, 10 min dev/test | General evaluation |
| Accented/Dialectal Subsuite | 93 language varieties | 10 min dev/test | Robustness evaluation |
  • The standard multilingual set consists of 149 base languages, with 10 minutes reserved for development and test per language, and additional hidden languages (up to 154 total) in selected splits.
  • The accented/dialectal subsuite evaluates system robustness across 93 distinct accent or dialect groups, each with its own dev/test split, sourced from diverse public speech corpora.
  • All audio and reference transcriptions in the test set are kept strictly hidden; evaluation is administered through a controlled, online DynaBench server where participants upload model inference code and checkpoints (Chen et al., 8 Sep 2025).

This structure ensures a full-spectrum assessment—both across languages and within non-standard language varieties.

3. Evaluation Methodology and Metrics

A comprehensive, multi-dimensional evaluation is deployed:

Core Metrics:

  • ASR Character Error Rate (CER): Calculated as CER = (S + D + I) / N, where S, D, and I are the numbers of substitutions, deletions, and insertions, and N is the reference character count (see the sketch below).
  • LID Accuracy: Percentage of correctly identified languages.
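
To make the core metrics concrete, the following Python sketch computes CER via Levenshtein edit distance and LID accuracy as a simple match rate; the function names and the pure-Python implementation are illustrative assumptions, not the challenge's official scoring code.

```python
# Minimal sketch of the core metrics; names and implementation are illustrative,
# not the official challenge scoring code.
from typing import List


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (S + D + I) / N via Levenshtein edit distance."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,             # deletion
                dp[i][j - 1] + 1,             # insertion
                dp[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def lid_accuracy(true_langs: List[str], predicted_langs: List[str]) -> float:
    """Fraction of utterances whose language label is predicted correctly."""
    correct = sum(t == p for t, p in zip(true_langs, predicted_langs))
    return correct / len(true_langs)
```

For example, `cer("hello", "hallo")` returns 0.2 (one substitution over five reference characters).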

Robustness Metrics:

  • Worst-15-Language CER: Macro-averaged CER computed over the 15 worst-performing languages, directly measuring failures that are not evident in averages alone (an aggregation sketch follows this list).
  • CER Standard Deviation: Across all languages, to gauge cross-language consistency in system performance.
  • Dialectal CER and LID: Computed in the same fashion as above, but on the accented/dialectal part of the evaluation data.
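
Assuming per-language CER scores are available as a mapping from language code to score, the robustness aggregates above could be computed as in the sketch below; the function names are hypothetical, and the constant k = 15 mirrors the worst-15 metric.

```python
# Illustrative aggregation of robustness metrics from per-language CER scores.
import statistics
from typing import Dict


def worst_k_cer(per_language_cer: Dict[str, float], k: int = 15) -> float:
    """Macro-average CER over the k worst-performing (highest-CER) languages."""
    worst = sorted(per_language_cer.values(), reverse=True)[:k]
    return sum(worst) / len(worst)


def cer_std(per_language_cer: Dict[str, float]) -> float:
    """Population standard deviation of CER across languages."""
    return statistics.pstdev(per_language_cer.values())
```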

Scoring and Ranking:

  • For each of the six metrics, participating systems are ranked, and the final ranking is the average of the per-metric ranks; averaging ranks rather than raw scores mitigates dominance by any single metric or skew from disparate dynamic ranges, with micro-average tie-breakers applied if necessary (Chen et al., 8 Sep 2025). A minimal sketch of this aggregation follows.
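
In the sketch below, the metric names, direction flags, and example numbers are made up, and the micro-average tie-breaker is omitted for brevity.

```python
# Hedged sketch of final ranking by averaging per-metric ranks;
# the challenge's micro-average tie-breaker is not modeled here.
from typing import Dict, List


def final_ranking(scores: Dict[str, Dict[str, float]],
                  lower_is_better: Dict[str, bool]) -> List[str]:
    """Rank each system on every metric, then order systems by mean rank."""
    systems = list(scores)
    mean_rank = {s: 0.0 for s in systems}
    for metric, lower in lower_is_better.items():
        ordered = sorted(systems, key=lambda s: scores[s][metric],
                         reverse=not lower)
        for rank, system in enumerate(ordered, start=1):
            mean_rank[system] += rank / len(lower_is_better)
    return sorted(systems, key=mean_rank.get)


# Toy example: CER is lower-is-better, LID accuracy is higher-is-better.
example_scores = {
    "system_a": {"cer": 0.12, "lid_acc": 0.95},
    "system_b": {"cer": 0.15, "lid_acc": 0.91},
}
print(final_ranking(example_scores, {"cer": True, "lid_acc": False}))
# -> ['system_a', 'system_b']
```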

Infrastructure:

  • The DynaBench-based online server ensures blind, fair, and reproducible model evaluation, with no access to test audio or references for participants.

4. Submitted Systems and Performance Outcomes

The 2025 edition attracted five submissions from three teams, each substantially surpassing the official baselines, which included strong SSL models such as WavLM, XLSR-53, MMS-1B, and XEUS:

| System | LID Accuracy (relative improvement) | CER Reduction (relative) | Dialectal CER / LID gains | Notes |
|---|---|---|---|---|
| 1st-ranked submission | +23% (standard set) | -18% | -30.2% / +15.7% | Best submission; markedly more robust to dialects |
| Baselines | – | – | – | WavLM, XLSR-53, MMS-1B, XEUS (SSL baselines) |

Notably, on accented/dialectal data, the best submission demonstrated a 30.2% lower CER and 15.7% higher LID accuracy compared to the top baseline; on the main set, improvements were 23% (LID accuracy) and 18% (CER). All systems improved measured robustness, not just average performance (Chen et al., 8 Sep 2025).

5. Model Design, Adaptation, and Training

The absence of architectural restrictions in the challenge motivated a diversity of approaches:

  • Systems used hybrid and hierarchical architectures, with language identification front-ends feeding downstream language-specific ASR backends.
  • State-of-the-art models such as SeamlessM4T, MMS-1B (with language-specific adapters), and MMS-zeroshot were employed, with adaptive selection per language based on dev set performance (Alumäe et al., 2 Jun 2025).
  • Robust generative LID modules leveraging bigram language models (trained on up to 100,000 samples per language with Kneser-Ney smoothing) were combined with language-embedding classifiers via linear interpolation, improving robustness in low-resource and noisy scenarios (Alumäe et al., 2 Jun 2025); see the sketch after this list.
  • Data augmentation and external corpora (e.g., Common Voice) were integrated to address the well-known gap in few-shot and low-resource ASR/LID settings (Wang et al., 30 May 2025).
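
The sketch below illustrates two of these ideas together: linear interpolation of a generative bigram-LM LID score with an embedding-classifier posterior, and an LID front-end that routes each utterance to a language-specific ASR back-end. All names, the softmax normalization, and the default weight are assumptions rather than the submitted systems' exact implementations.

```python
# Illustrative LID score fusion and LID-to-ASR routing; the weight,
# normalization, and routing logic are assumptions, not the published setups.
import math
from typing import Any, Callable, Dict


def fuse_lid_scores(lm_log_likelihood: Dict[str, float],
                    classifier_posterior: Dict[str, float],
                    weight: float = 0.5) -> str:
    """Pick the language maximizing a convex combination of both models."""
    # Turn bigram-LM log-likelihoods into a posterior with a stable softmax.
    max_ll = max(lm_log_likelihood.values())
    exp_ll = {lang: math.exp(ll - max_ll) for lang, ll in lm_log_likelihood.items()}
    z = sum(exp_ll.values())
    lm_posterior = {lang: v / z for lang, v in exp_ll.items()}
    fused = {
        lang: weight * lm_posterior[lang]
        + (1.0 - weight) * classifier_posterior.get(lang, 0.0)
        for lang in lm_posterior
    }
    return max(fused, key=fused.get)


def transcribe(audio: Any,
               lm_log_likelihood: Dict[str, float],
               classifier_posterior: Dict[str, float],
               asr_backends: Dict[str, Callable[[Any], str]],
               default_backend: Callable[[Any], str]) -> str:
    """Hierarchical pipeline: the fused LID decision selects the ASR back-end."""
    lang = fuse_lid_scores(lm_log_likelihood, classifier_posterior)
    return asr_backends.get(lang, default_backend)(audio)
```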

This variety directly contributed to the observed improvements in both mean and worst-case metrics.

6. Implications for Inclusivity, Fairness, and Future Directions

The ML-SUPERB 2.0 Challenge marks a significant progression toward evaluating ASR systems on inclusivity, language and dialectal fairness, and cross-language robustness:

  • Inclusivity: The evaluation across 149–154 languages and 93 dialects/accents ensures that performance disparities are observed, quantified, and targeted, countering the traditional focus on only high-resource languages (Chen et al., 8 Sep 2025).
  • Fairness-Oriented Design: Ranking based on per-metric averages and robust metrics prevents over-optimizing for global means at the expense of failure cases.
  • Research Catalysis: Open methodology and transparent evaluation have resulted in measurable community progress—with all submitted systems outperforming the strongest contemporary baselines.
  • Broader Impacts: For real-world applications, such as virtual assistants or captioning, improved robustness across language varieties translates directly into equitable technology access and usability.

A plausible implication is that future research will further emphasize transfer learning, adaptation to low-resource varieties, and fairness-aware evaluation strategies, enabled by the continued expansion of inclusive, blind-evaluation community challenges. The ML-SUPERB 2.0 Challenge establishes both methodological and empirical precedents for such work (Chen et al., 8 Sep 2025).
