Interspeech 2025 MLC-SLM Challenge

Updated 7 August 2025
  • The Interspeech 2025 MLC-SLM Challenge is a comprehensive effort advancing multilingual ASR and diarization using a 1,500-hour corpus across 11 languages.
  • It evaluates end-to-end systems with key metrics like WER, DER, and tcpWER, emphasizing modular LLM-based integration and context-aware decoding.
  • The challenge promotes innovative architectures and adapter-based fine-tuning methods to efficiently address real-world issues such as code-switching and speaker overlap.

The Interspeech 2025 Multilingual Conversational Speech LLM Challenge (MLC-SLM) represents a significant collaborative effort in the academic and industrial communities to advance state-of-the-art solutions in multilingual conversational automatic speech recognition (ASR), speaker and language diarization, and related modeling tasks. Building on a lineage of multilingual and multi-speaker benchmarks, this challenge focuses on end-to-end robust speech LLMs capable of addressing diverse real-world conversational phenomena, including code-switching, rapid speaker turn-taking, speaker overlap, and dynamic language boundaries in highly variable acoustic environments.

1. Historical Evolution and Benchmark Datasets

The foundation for the Interspeech 2025 MLC-SLM Challenge is built on past diarization and multilingual challenges such as DISPLACE 2023 and 2024 (Baghel et al., 2023, Kalluri et al., 13 Jun 2024). Early benchmarks emphasized speaker and language diarization in overlapping, code-mixed conversational Indian speech, using natural far-field recordings encompassing Hindi, Kannada, Bengali, Malayalam, Telugu, Tamil, and Indian English. The DISPLACE corpora introduced annotated datasets with detailed segment-level labels for speaker identity, fine-grained language switching, and code-mixed utterances, annotated in RTTM format and designed to stress-test diarization and LID systems in overlapping and noisy speech.

The 2024 iteration further expanded these datasets to over 158 hours (supervised and unsupervised), introduced a broader multilingual spectrum (9 Indian languages plus English), and added a challenging ASR track with multi-accent, code-mixed conversational speech (Kalluri et al., 13 Jun 2024). Each iteration prioritized realistic environments: single-channel far-field microphone configurations, strong environmental noise and reverberation, and conversational sessions lasting 30–60 minutes with 3–5 participants.

This evolution set the stage for MLC-SLM 2025, which—through its base corpus of 1,500 hours covering 11 languages and various world English accents—enabled large-scale model pretraining, adaptation, and rigorous evaluation under conditions likely to be encountered in deployment scenarios.

2. Key Tasks and Metrics

The challenge encompasses several interlocking tasks:

  • Multilingual ASR: Accurate transcription of code-switched, multi-accented speech across a range of languages.
  • Speaker and Language Diarization: Precise segmentation of speech streams into speaker- and language-homogeneous segments, critical for multi-party conversational understanding.
  • Joint Diarization and Recognition: End-to-end systems tasked with both diarization (without oracle segmentation) and ASR, necessitating modeling of natural turn transitions, overlaps, and ambiguous boundaries.
  • Speech Separation and Enhancement: For certain tracks, source separation and quality enhancement are evaluated, especially in noisy, conversational settings.

Principal evaluation metrics include:

  • Diarization Error Rate (DER), defined as $\mathrm{DER} = \frac{\text{Missed} + \text{Falsely Detected} + \text{Speaker Error}}{\text{Total Reference Speech}}$ (a minimal computation sketch follows this list)
  • Word Error Rate (WER) and Character Error Rate (CER)
  • Mix Error Rate (MER): A unified metric, introduced in the NTU Speechlab system description and other publications, that jointly quantifies token- and segmentation-level errors.
  • Time-constrained permutation WER (tcpWER/tcpCER): Captures the impact of both diarization and ASR errors, applying temporal collars to boundary mismatches (Saengthong et al., 26 Jun 2025).
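
To make the DER definition concrete, the following minimal Python sketch computes it from pre-accumulated durations; the function and argument names are illustrative rather than taken from any challenge scoring toolkit.

```python
def diarization_error_rate(missed: float,
                           falsely_detected: float,
                           speaker_error: float,
                           total_reference: float) -> float:
    """DER = (missed + falsely detected + speaker-confused speech time)
    divided by total reference speech time, all in seconds."""
    if total_reference <= 0:
        raise ValueError("total reference speech must be positive")
    return (missed + falsely_detected + speaker_error) / total_reference

# Example: 2.0 s missed, 1.5 s false alarm, 3.0 s speaker confusion
# over 60 s of reference speech -> DER of about 0.108 (10.8%).
print(f"DER = {diarization_error_rate(2.0, 1.5, 3.0, 60.0):.3f}")
```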

These metrics are complemented by leaderboard evaluations, ablation studies, and breakdowns on few-shot languages and dialects, as illustrated in the ML-SUPERB 2.0 results (Wang et al., 30 May 2025, Alumäe et al., 2 Jun 2025).

3. Architectures and Modeling Advances

A variety of architectural strategies are represented among top-ranking teams and academic groups, reflecting ongoing developments in speech foundation models, efficient adaptation methods, and context integration; the table below summarizes representative systems, and a minimal adapter sketch follows it.

| System/Team | Core Architecture | Notable Innovations |
| --- | --- | --- |
| TEA-ASLP (Xue et al., 24 Jul 2025) | Dual encoder (Whisper, MMS) + MoE-mLoRA-adapted LLM | Language-adapted fusion, CTC-token prompting, 180k-hour training |
| Triple X (Gao et al., 23 Jul 2025) | Whisper-large-v3 encoder + adapter + Qwen LLM (LoRA-tuned) | Frame-splicing adapter, staged LLM fine-tuning, 30k-hour augmentation |
| SHNU-mASR (Mei et al., 4 Jul 2025) | Parallel speech encoders (Whisper, mHuBERT) + Qwen2.5-7B | Tri-stage training, LoRA parameterization, language-aware prompting |
| NTU Speechlab (Peng et al., 16 Jun 2025) | Whisper-large-v3 + modality adapter + Gemma-2-2B LLM | Full LLM parameter tuning, language-specific prompts, checkpoint averaging |
| MARS (Mu et al., 2 Aug 2025) | Whisper LLM-ASR + multi-modal retrieval-selection | Near-ideal historical context ranking, acoustic & text similarity |
| Unified Speech LLM (Saengthong et al., 26 Jun 2025) | Whisper + subsampling projector + Llama-3.2-3B | Learnable diarization prompts, context carry-over inference |
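
Most of these systems share an encoder → adapter → LLM pattern. The sketch below illustrates one common adapter variant, a frame-splicing (subsampling) projector that concatenates k consecutive encoder frames and projects them into the LLM embedding space; all dimensions and module names here are assumptions for illustration, not any team's released configuration.

```python
import torch
import torch.nn as nn

class FrameSplicingAdapter(nn.Module):
    """Subsampling projector: concatenate k consecutive speech-encoder
    frames, then project into the LLM embedding space. Dimensions are
    illustrative defaults, not a published configuration."""

    def __init__(self, enc_dim: int = 1280, llm_dim: int = 3584, k: int = 4):
        super().__init__()
        self.k = k
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * k, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, enc_dim), e.g. Whisper encoder output
        b, t, d = x.shape
        t = t - t % self.k                                # drop ragged tail frames
        x = x[:, :t].reshape(b, t // self.k, d * self.k)  # splice k frames
        return self.proj(x)                               # (batch, t/k, llm_dim)

# 30 s of Whisper-large-v3 features (1500 frames) -> 375 LLM-space tokens.
print(FrameSplicingAdapter()(torch.randn(1, 1500, 1280)).shape)
```

The 4x temporal reduction is one reason such adapters keep the LLM input sequence short enough for dialogue-length audio.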

Common traits include:

  • Use of multi-head or parallel encoders (Whisper, mHuBERT, MMS) to exploit both supervised and self-supervised features.
  • Parameter-efficient adaptation strategies: LoRA and mixture-of-experts adapters, with language-specific routing (TEA-ASLP, SHNU, ILT (Meng et al., 11 Jul 2025)); a minimal LoRA attachment sketch follows this list.
  • Explicit context modeling: language or task-specific prompts, and fine-grained contextual augmentation via retrieval/selection, as in MARS.
  • Modular training regimes that stabilize encoder adaptation, adapter optimization, and LLM fine-tuning.
  • Integration of CTC-derived tokens during generation to reduce hallucinations and insertion errors (TEA-ASLP).
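
As one concrete instance of the parameter-efficient adaptation above, the sketch below attaches LoRA adapters to a frozen decoder LLM using Hugging Face's peft library; the base checkpoint and target modules are placeholders, not the exact configurations reported by the teams.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; challenge systems used e.g. Qwen2.5 or Gemma-2.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16
)

# Train only low-rank adapters on the attention projections;
# the foundation model's weights stay frozen.
lora_cfg = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all weights
```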

Notably, the MARS approach (Mu et al., 2 Aug 2025) demonstrates that advanced context retrieval and selection—leveraging both acoustic and semantic history—can yield error rates rivaling or surpassing much larger models trained on orders of magnitude more data.

4. Training Regimes, Adaptation, and Data Utilization

Training pipelines in the challenge converge on diverse strategies to maximize multilingual and multi-accented generalization:

  • Massive-Scale Pretraining: As in TEA-ASLP’s 180k-hour regime, and also in ML-SUPERB 2.0 where MMS-1B-all and SeamlessM4T encoders are fine-tuned or adapted (Wang et al., 30 May 2025, Alumäe et al., 2 Jun 2025).
  • LoRA/Adapter-Based Fine-Tuning: Employed both for model compactness and parameter efficiency—allowing incremental domain/language adaptation while freezing large foundations (Meng et al., 11 Jul 2025, Mei et al., 4 Jul 2025, Wang et al., 30 May 2025).
  • Iterative Training Frameworks: Iterative LoRA Training (ILT) with Focus, Feedback, and Fix stages (MegaAIS; Meng et al., 11 Jul 2025) addresses overfitting and enables ensemble pseudo-label feedback for robust adaptation.
  • Synthetic Data and Transfer Learning: Several entries, such as Instituto de Telecomunicações (Attanasio et al., 20 Jun 2025), leverage high-quality synthetic ST/SQA examples filtered by automatic metrics (e.g., COMETKiwi) and augmented by pseudolabeling for low-resource tasks.
  • Contrastive and Contextual Learning: Prepending conversational context and employing contrastive objectives to couple current utterances with their dialogue histories enhances recognition under ambiguous or overlapping conditions (Concina et al., 25 Jul 2025); a minimal objective sketch follows this list.
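
One way such a contrastive contextual objective can be realized is an InfoNCE-style loss that pulls each utterance embedding toward the embedding of its own dialogue history, with other histories in the batch as negatives. The sketch below is a generic formulation under that assumption, not the Eloquence training code.

```python
import torch
import torch.nn.functional as F

def context_contrastive_loss(utt_emb: torch.Tensor,   # (batch, dim)
                             ctx_emb: torch.Tensor,   # (batch, dim)
                             temperature: float = 0.07) -> torch.Tensor:
    """Match the i-th utterance to the i-th dialogue history; all other
    histories in the batch serve as in-batch negatives."""
    utt = F.normalize(utt_emb, dim=-1)
    ctx = F.normalize(ctx_emb, dim=-1)
    logits = utt @ ctx.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(utt.size(0), device=utt.device)
    return F.cross_entropy(logits, targets)

loss = context_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```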

These approaches coalesce into best practices emphasizing iterative, modular adaptation; selection of high-quality or in-domain data; and context-enhanced, language-tailored decoding.

5. Context Integration and Dialogue-Aware Decoding

One of the defining advances of MLC-SLM 2025 is the explicit and systematic integration of conversational context at both training and inference:

  • Language-Specific Prompts: Shown by NTU Speechlab (Peng et al., 16 Jun 2025) and SHNU-mASR (Mei et al., 4 Jul 2025), prompts such as “Transcribe speech to text” in the target language condition LLM generation, aligning autoregressive output with the language of the utterance and reducing cross-lingual hallucinations.
  • Historical and Future Context: Bi-directional context-enhanced models (Peng et al., 16 Jun 2025) and the MARS retrieval/selection system (Mu et al., 2 Aug 2025) show that intelligently selected context, when judiciously masked or retrieved, yields relative error reductions of up to 18% over competitive baselines; in the case of MARS, it lets models trained on far less data outperform much larger-data systems. A retrieval/selection sketch follows this list.
  • Contrastive Contextual Objectives: The Eloquence submission (Concina et al., 25 Jul 2025) implements contrastive learning between utterance-context pairs, further bolstering context-robust alignment.
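
A minimal sketch of MARS-style context selection under stated assumptions: score each historical utterance by a weighted mix of acoustic and textual cosine similarity to the current utterance, then keep the top-k as decoding context. The weighting and embedding choices here are illustrative, not the published MARS scoring function.

```python
import torch
import torch.nn.functional as F

def select_context(cur_acoustic: torch.Tensor,   # (dim_a,)
                   cur_text: torch.Tensor,       # (dim_t,)
                   hist_acoustic: torch.Tensor,  # (n_hist, dim_a)
                   hist_text: torch.Tensor,      # (n_hist, dim_t)
                   alpha: float = 0.5,
                   top_k: int = 3) -> torch.Tensor:
    """Return indices of the top-k most relevant historical utterances."""
    sim_a = F.cosine_similarity(cur_acoustic.unsqueeze(0), hist_acoustic, dim=-1)
    sim_t = F.cosine_similarity(cur_text.unsqueeze(0), hist_text, dim=-1)
    score = alpha * sim_a + (1.0 - alpha) * sim_t   # blend the two modalities
    return score.topk(min(top_k, score.numel())).indices

idx = select_context(torch.randn(256), torch.randn(768),
                     torch.randn(10, 256), torch.randn(10, 768))
```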

These findings indicate that naive use of the entire conversation history or fixed-length context windows is suboptimal. Instead, context selection guided by multi-modal similarity and near-ideal ranking, as in MARS, supports both computational efficiency and accuracy.

6. Joint Modeling of Diarization and Recognition

Task II of the MLC-SLM Challenge (joint diarization and ASR) prompts direct integration of segmentation and labeling mechanisms:

  • Unified Prompt-Based Diarization and ASR: The approach in (Saengthong et al., 26 Jun 2025) uses ground-truth-derived speaker prompt tokens and timestamp markers in the input sequence, so the LLM decodes the full dialogue with interleaved diarization and ASR labels (one possible target format is sketched after this list).
  • Sliding Window with Context: Windowed inference with dynamic context carry-over maintains discourse coherency and reduces boundary ambiguity in real applications.
  • Global Alignment Post-processing: Outputs from locally windowed inference are further aligned with RTTM diarization outputs to maximize temporal overlap and merge nearly identical boundaries.
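
To make the reformulation concrete, the sketch below builds one possible interleaved target sequence with speaker tokens and timestamp markers; the marker formats (<spkN>, <|t|>) are hypothetical illustrations of the idea, not the exact tokens used by any submission.

```python
def build_joint_target(segments):
    """segments: list of (speaker_id, start_s, end_s, text) tuples,
    assumed sorted by start time. Returns one training target string
    interleaving diarization markers with transcripts."""
    parts = []
    for spk, start, end, text in segments:
        parts.append(f"<|{start:.2f}|><spk{spk}> {text} <|{end:.2f}|>")
    return " ".join(parts)

print(build_joint_target([
    (1, 0.00, 2.40, "hello how are you"),
    (2, 2.10, 4.80, "fine thanks and you"),  # overlapping turn
]))
# <|0.00|><spk1> hello how are you <|2.40|> <|2.10|><spk2> fine thanks and you <|4.80|>
```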

This demonstrates the utility of task reformulation—combining semantic, speaker, and temporal segmentation into a single LLM-driven generation process.

7. Impact, Significance, and Future Directions

The MLC-SLM Challenge catalyzes progress in multilingual conversational speech modeling by providing a realistic, high-diversity evaluation suite, clear task definitions, and rigorous leaderboards. Key impacts include:

  • The demonstrated ability of context-refined and modular models (e.g., MARS, SHNU-mASR, Triple X) to achieve WER/CER below 10% on highly challenging, code-switched conversational data, often outperforming much larger models trained on more extensive data.
  • Clear advances in diarization error rate (DER) and tcpWER, via both architectural and inference innovations (Saengthong et al., 26 Jun 2025, Mei et al., 4 Jul 2025, Kalluri et al., 13 Jun 2024).
  • Enhanced methods for both high-resource and low-resource language adaptation, using augmented and synthetic datasets filtered for quality.
  • Pervasive use of LoRA, mixture-of-experts, and task/language-aware adapters, supporting efficient scaling.

Anticipated future developments include joint end-to-end diarization and recognition architectures; richer context adaptation (potentially leveraging multi-modal cues); and evaluation on even more dynamic, naturalistic conversational scenarios (Kalluri et al., 13 Jun 2024, Peng et al., 16 Jun 2025).

References

Relevant works include (Baghel et al., 2023, Kalluri et al., 13 Jun 2024, Wang et al., 30 May 2025, Alumäe et al., 2 Jun 2025, Peng et al., 16 Jun 2025, Peng et al., 16 Jun 2025, Attanasio et al., 20 Jun 2025, Saengthong et al., 26 Jun 2025, Mei et al., 4 Jul 2025, Meng et al., 11 Jul 2025, Gao et al., 23 Jul 2025, Xue et al., 24 Jul 2025, Concina et al., 25 Jul 2025), and (Mu et al., 2 Aug 2025). Each advances specific aspects of multilingual conversational ASR, diarization, adaptation strategy, or context integration for the Interspeech 2025 MLC-SLM Challenge.


In sum, the Interspeech 2025 Multilingual Conversational Speech LLM Challenge consolidates the latest advances in multilingual, multi-speaker speech modeling, illustrating the efficacy of modular, context-sensitive LLM-based pipelines and robust adaptation strategies. It sets a new benchmark for practical, scalable, and highly accurate conversational speech processing in global, polyglot real-world environments.
