BUT System for the MLC-SLM Challenge

Published 16 Jun 2025 in eess.AS | (2506.13414v1)

Abstract: We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, demonstrating strong generalization. Despite being fine-tuned on English-only data for target-speaker ASR, DiCoW retains solid multilingual performance, indicating that encoder modifications preserve Whisper's multilingual capabilities. We then fine-tune both DiCoW and DiariZen on the MLC-SLM challenge data. The fine-tuned DiariZen continues to outperform the fine-tuned Pyannote baseline, while DiCoW sees further gains from domain adaptation. Our final system achieves a micro-average tcpWER/CER of 16.75% and ranks second in Task 2 of the MLC-SLM challenge. Lastly, we identify several labeling inconsistencies in the training data -- such as missing speech segments and incorrect silence annotations -- which can hinder diarization fine-tuning. We propose simple mitigation strategies to address these issues and improve system robustness.


Authors (6)

Summary

The paper presents a two-speaker ASR system for multilingual, multi-talker speech that combines DiCoW, a diarization-conditioned variant of Whisper, with DiariZen, a Pyannote-based diarization pipeline that applies local EEND.
DiCoW conditions the Whisper encoder on diarization output through Frame-Level Diarization-Dependent Transformations (FDDT), while DiariZen feeds a weighted aggregation of WavLM layer outputs into a Conformer and clusters the resulting speaker embeddings.
Experiments show lower diarization error rates than the Pyannote baseline and a final micro-average tcpWER/CER of 16.75% (second place in Task 2). Labeling inconsistencies in the training data hinder diarization fine-tuning and are addressed with simple mitigation strategies.


Introduction

The paper describes the BUT system for the MLC-SLM challenge, which targets ASR in multilingual, multi-talker conversations. The system pairs DiCoW, a diarization-conditioned variant of Whisper, with DiariZen, a diarization pipeline built on Pyannote. Both components are first evaluated on out-of-domain multilingual data without any fine-tuning and are then fine-tuned on the challenge data. In both settings DiariZen outperforms the Pyannote baseline, and domain adaptation brings further gains for DiCoW.

DiariZen Architecture

DiariZen handles speaker diarization with a local end-to-end neural diarization (EEND) module (Figure 1). The pipeline, built on Pyannote, splits the audio into short chunks and applies EEND to each chunk. Inside the EEND module, a weighted aggregation of WavLM layer outputs forms the input to the Conformer layers, and the pipeline derives per-speaker embeddings for each chunk. Clustering these embeddings yields the final diarization, which outperforms the Pyannote baseline.

Figure 1: Framework of the local EEND module for DiariZen. Figure adapted from Han et al. (2024).
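The weighted aggregation of WavLM layers can be pictured as a learned convex combination of per-layer features. The snippet below is only a minimal sketch of that idea, not DiariZen's implementation: the softmax normalization of the weights, the function names, and the nested-list layout (layers x frames x feature dimension) are all assumptions made for illustration.

```python
import math


def softmax(raw_weights):
    """Turn unconstrained scalars into positive weights that sum to 1."""
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs, raw_weights):
    """Weighted sum over layers (hypothetical helper).

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats.
    raw_weights:   list of L scalars, one per layer (assumed learnable).
    Returns a T x D nested list, the features a downstream encoder
    such as a Conformer would consume.
    """
    if len(layer_outputs) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    weights = softmax(raw_weights)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer[t][d]
    return mixed
```

With equal raw weights the result is the plain average of the layers; training the weights lets the model emphasize whichever layers carry the most speaker information.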

DiCoW: Diarization-Conditioned Whisper

DiCoW modifies Whisper to accept frame-level diarization information through Frame-Level Diarization-Dependent Transformations (FDDT) (Figure 2). For a chosen target speaker, the diarization output is converted into an STNO mask that gives, for every frame, the probabilities of four classes: silence (S), target speaker activity (T), non-target speaker activity (N), and overlap (O). These probabilities weight class-specific transformations applied in each Transformer layer of the Whisper encoder. The rest of the architecture is left unchanged, which preserves Whisper's strengths, including its multilingual capability, while improving its handling of multi-speaker audio.

Figure 2: Overview of the DiCoW model architecture. The model is based on the Whisper architecture, with modifications to incorporate frame-level diarization information through Frame-Level Diarization-Dependent Transformations (FDDT). Figure adapted from Polok et al. (2024).
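The two ideas in this section, the STNO mask and the probability-weighted transformation, can be sketched for a single frame as follows. This is an illustrative sketch under stated assumptions rather than the DiCoW code: speakers are treated as independent when deriving the four class probabilities, T is taken to mean the target speaking alone, the class-specific transformations are simplified to element-wise scale and bias vectors, and the names `stno_probs` and `fddt` are hypothetical.

```python
def stno_probs(activity, target):
    """Per-frame STNO probabilities from diarization output.

    activity: speech probability of every speaker in this frame.
    target:   index of the target speaker.
    Assumes speakers are independent (an illustrative assumption).
    Returns (silence, target_only, non_target, overlap), which sum to 1.
    """
    others_silent = 1.0
    for index, prob in enumerate(activity):
        if index != target:
            others_silent *= 1.0 - prob
    p_target = activity[target]
    silence = (1.0 - p_target) * others_silent
    target_only = p_target * others_silent
    non_target = (1.0 - p_target) * (1.0 - others_silent)
    overlap = p_target * (1.0 - others_silent)
    return (silence, target_only, non_target, overlap)


def fddt(frame, probs, scales, biases):
    """Mix four class-specific affine maps using the STNO probabilities.

    frame:  hidden vector of one frame (list of D floats).
    probs:  the four STNO probabilities for that frame.
    scales: four length-D scale vectors (a diagonal simplification).
    biases: four length-D bias vectors.
    """
    out = [0.0] * len(frame)
    for prob, scale, bias in zip(probs, scales, biases):
        for d, value in enumerate(frame):
            out[d] += prob * (scale[d] * value + bias[d])
    return out
```

If every scale vector is all ones and every bias is zero, `fddt` returns the frame unchanged because the four probabilities sum to one. Starting from such an identity setting is one way a modified encoder could begin with the behavior of the original Whisper model.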

Experimental Setup and Results

Both components were evaluated in zero-shot and fine-tuned settings under multilingual conditions. DiariZen consistently achieved a lower diarization error rate (DER) than Pyannote. Paired with DiCoW, the system also improved tcpWER/CER, which was measured with both ground-truth and diarization-derived segmentation. Labeling inconsistencies in the training data hindered diarization fine-tuning and required mitigation. The final system reached a micro-average tcpWER/CER of 16.75% and ranked second in Task 2 of the challenge.
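A micro-average pools error counts before dividing, so longer recordings carry more weight than shorter ones. The sketch below contrasts it with a macro-average; the choice of pooling unit (for example per recording or per language), the function names, and the numbers in the usage note are illustrative assumptions, not challenge data.

```python
def micro_average(units):
    """Pool errors and reference lengths, then divide once.

    units: list of (error_count, reference_length) pairs.
    """
    total_errors = sum(errors for errors, _ in units)
    total_length = sum(length for _, length in units)
    if total_length == 0:
        raise ValueError("reference is empty")
    return total_errors / total_length


def macro_average(units):
    """Average the per-unit error rates, weighting every unit equally."""
    rates = [errors / length for errors, length in units]
    return sum(rates) / len(rates)
```

For two made-up units `(5, 100)` and `(30, 60)`, the micro-average is 35/160, about 21.9%, while the macro-average is 27.5%, because the shorter but harder unit counts as much as the longer one.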

Impact of Labeling Inconsistencies

Inspection of the dataset annotations revealed substantial inconsistencies, such as missing speech segments and incorrect silence labels, which distort both training and evaluation. The proposed mitigation uses an auxiliary voice activity detection (VAD) model to correct speech and silence boundaries, so that the fine-tuned system is less affected by the flawed labels. The adjustment improved both diarization and ASR output (Figure 3).

Figure 3: Example of the ground truth diarization; our system before fine-tuning on MLC; the same system after fine-tuning; and the fine-tuned system with probabilities merged using auxiliary VAD.
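The exact rule for merging diarization probabilities with the auxiliary VAD is not given here, so the following is only a hypothetical illustration of the general idea. It assumes a per-frame rule: when the VAD reports speech but the fine-tuned diarizer marks every speaker as inactive, the most likely speaker is raised to the VAD probability. The function name, the rule itself, and the 0.5 threshold are all assumptions.

```python
def merge_with_vad(speaker_probs, vad_prob, threshold=0.5):
    """Recover speech the diarizer missed in one frame (hypothetical rule).

    speaker_probs: per-speaker activity probabilities from the diarizer.
    vad_prob:      speech probability from the auxiliary VAD.
    threshold:     decision threshold shared by both models (assumed).
    Returns a new list; the input is not modified.
    """
    merged = list(speaker_probs)
    if not merged:
        return merged
    diarizer_silent = max(merged) < threshold
    vad_speech = vad_prob >= threshold
    if vad_speech and diarizer_silent:
        # Attribute the recovered speech to the most likely speaker.
        best = merged.index(max(merged))
        merged[best] = vad_prob
    return merged
```

Frames where the diarizer already detects a speaker, or where the VAD also reports silence, pass through unchanged, so a rule of this kind can only add missed speech and never removes detected speech.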

Conclusions

The study presents a non-LLM system for multilingual, multi-talker ASR that couples a diarization pipeline with a diarization-conditioned recognizer and improves performance metrics across languages. Despite the annotation issues in the dataset, the combined system performs well, and diarization-conditioned speech LLMs are a possible extension. Future work includes better training procedures for inaccurately labeled data and the integration of more advanced diarization techniques for further gains.
