- The paper introduces SA-DiCoW, integrating speaker attribution and diarization conditioning to enhance transcription accuracy in overlapping speech.
- It leverages Serialized Output Training and frame-level diarization masks to achieve significant cpWER reductions across synthetic and real-world datasets.
- Experimental results indicate that encoder-embedding concatenation and an increased loss weight on speaker timestamp tokens improve performance compared to conventional ASR models.
Adapting Diarization-Conditioned Whisper for Multi-Talker Speech Recognition
Introduction
Recent advances in Automatic Speech Recognition (ASR) have pushed capabilities beyond traditional single-speaker models toward the increasingly complex demands of real-world multi-talker scenarios. This paper addresses the multi-talker ASR problem by building on the Diarization-Conditioned Whisper (DiCoW) framework. By integrating speaker-attributed modeling with Serialized Output Training (SOT), the presented architecture transcribes overlapping speech streams simultaneously, marking each segment with speaker tags and timestamps. The proposed Speaker-Attributed Diarization-Conditioned Whisper (SA-DiCoW) model outperforms existing methods on synthetic and real-world datasets, underscoring its potential for advancing multi-speaker ASR.
Figure 1: Overall architecture of the proposed SA-DiCoW.
Methods
Whisper-based Speech Recognition
Whisper, a state-of-the-art multilingual encoder-decoder ASR model, underpins the proposed architecture. Its Transformer-based encoder-decoder processes input audio as log mel-filterbank features and maps them into hidden embeddings. Despite its efficacy across various single-talker domains, Whisper requires adaptation for multi-talker scenarios, specifically to support speaker attribution. The DiCoW framework, enriched with components for speaker-aware modeling, addresses its performance degradation in overlapping speech environments.
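To make the input pipeline concrete, here is a minimal sketch of how audio enters a Whisper encoder via the Hugging Face transformers interface; the model size, dummy waveform, and printed shape are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: audio -> log-mel features -> Whisper encoder embeddings.
import torch
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperModel.from_pretrained("openai/whisper-small")

# 5 s of dummy 16 kHz audio; the processor pads/trims to Whisper's fixed
# 30-second window of log mel-filterbank features.
audio = torch.randn(16000 * 5).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# The encoder maps (batch, 80 mel bins, 3000 frames) to hidden embeddings
# (batch, 1500 frames, d_model); these are the states DiCoW later conditions.
encoder_out = model.encoder(inputs.input_features).last_hidden_state
print(encoder_out.shape)  # torch.Size([1, 1500, 768]) for whisper-small
```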
Target-Speaker Conditioning with DiCoW
DiCoW's encoder, adapted for target-speaker ASR, consumes diarization masks that encode frame-level speaker activity across four classes: silence (S), target speaker active (T), non-target speaker active (N), and overlap (O). By applying frame-level diarization-dependent transformations that adjust internal representations according to speaker context, DiCoW captures speaker-specific acoustic features within dedicated representations, facilitating joint decoding and improving robustness to overlapping speech.
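The following sketch illustrates one plausible form of such frame-level diarization-dependent conditioning: one affine transform per STNO class, blended per frame by the soft mask probabilities. The dimensions, near-identity initialization, and module layout are assumptions for illustration rather than DiCoW's exact implementation.

```python
import torch
import torch.nn as nn

class FrameLevelDiarizationConditioning(nn.Module):
    def __init__(self, d_model: int, num_classes: int = 4):
        super().__init__()
        # One transform per class (S, T, N, O), initialized near identity so
        # training starts close to the unmodified Whisper encoder.
        self.transforms = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(num_classes)
        )
        for layer in self.transforms:
            nn.init.eye_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, hidden: torch.Tensor, stno_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, frames, d_model)
        # stno_mask: (batch, frames, 4), frame-wise probabilities of S/T/N/O
        out = torch.zeros_like(hidden)
        for c, f_c in enumerate(self.transforms):
            out = out + stno_mask[..., c : c + 1] * f_c(hidden)
        return out
```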
Multi-Talker ASR with Speaker-Attributed Whisper
Incorporating Serialized Output Training (SOT) adapts Whisper for speaker-attributed ASR: the model emits speaker and timestamp tokens that indicate speaker identity and timing within a single serialized output stream. This enforces segment ordering and timestamp progression within each speaker's turn, allowing overlapping speech to be modeled effectively. Because all speakers share one output stream, the decoder can condition on the global conversational context, improving transcription accuracy in complex speech scenarios.
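As an illustration of serialized output construction, the sketch below flattens overlapping segments into a single target stream ordered by start time; the token format (the `<spk:k>` and timestamp markers) is a hypothetical vocabulary, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: int
    start: float  # seconds
    end: float    # seconds
    text: str

def serialize(segments: list[Segment]) -> str:
    # Order segments by start time, then emit speaker tag, start timestamp,
    # transcript, and end timestamp for each turn in one flat stream.
    parts = []
    for seg in sorted(segments, key=lambda s: s.start):
        parts.append(f"<spk:{seg.speaker}><{seg.start:.2f}> {seg.text} <{seg.end:.2f}>")
    return " ".join(parts)

# Two overlapping turns become a single serialized target sequence.
print(serialize([
    Segment(0, 0.00, 2.10, "how are you"),
    Segment(1, 1.40, 3.00, "fine thanks"),
]))
# -> <spk:0><0.00> how are you <2.10> <spk:1><1.40> fine thanks <3.00>
```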
Figure 2: Example of cross-attentions from the last decoder layer; the x-axis depicts time in seconds, the y-axis shows tokens.
Experimental Setup
Datasets
Experiments use English multi-talker datasets, including NOTSOFAR, AMI, and LibriMix, with oracle diarization derived from the available annotations. Together these datasets cover synthetic and real-world conversational settings, enabling comprehensive model evaluation. Data preparation adheres to the segment duration constraints imposed by Whisper's fixed-length input window.
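As a minimal illustration of that duration constraint, the helper below splits long annotated segments into windows no longer than Whisper's standard 30-second input; the chunking policy itself is an assumption for illustration.

```python
MAX_SECONDS = 30.0

def chunk_segment(start: float, end: float, max_len: float = MAX_SECONDS):
    """Split [start, end) into consecutive windows of at most max_len seconds."""
    chunks = []
    t = start
    while t < end:
        chunks.append((t, min(t + max_len, end)))
        t += max_len
    return chunks

print(chunk_segment(0.0, 72.5))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 72.5)]
```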
Evaluation Metrics
Performance is assessed with the concatenated minimum-permutation Word Error Rate (cpWER), a metric that reflects both recognition and diarization errors and is therefore well suited to evaluating speaker-attributed ASR. This standardized evaluation across synthetic and real-world datasets provides a robust benchmark against competing methods.
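For concreteness, here is a self-contained reference sketch of cpWER: each speaker's utterances are concatenated, and the word error rate is taken over the speaker permutation that minimizes it. The brute-force search assumes equal speaker counts in reference and hypothesis (pad with empty hypotheses otherwise).

```python
from itertools import permutations

def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev, d[j] = d[j], cur
    return d[-1]

def cp_wer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    refs = [v.split() for v in ref_by_spk.values()]
    hyps = [v.split() for v in hyp_by_spk.values()]
    total_ref_words = sum(len(r) for r in refs)
    # Brute-force search over all hypothesis-to-reference speaker assignments.
    best = min(
        sum(word_edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_words

print(cp_wer({"A": "hello there", "B": "good morning"},
             {"s1": "good morning", "s2": "hello there"}))  # -> 0.0
```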
Results
Impact of Encoder Embedding Aggregation
An investigation of encoder embedding aggregation strategies shows concatenation to be superior, yielding significant cpWER reductions, especially in multi-speaker settings. Concatenation preserves the acoustic and temporal distinctions needed for accurate speaker attribution and outperforms averaging-based approaches.
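The sketch below contrasts the two aggregation strategies: averaging collapses the per-speaker encoder streams into one vector per frame, while concatenation keeps them separate along the feature axis before projecting back to the model dimension. Shapes and the projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

batch, frames, d_model, n_speakers = 2, 1500, 768, 2
# One encoder embedding stream per speaker: (speakers, batch, frames, d_model)
streams = torch.randn(n_speakers, batch, frames, d_model)

# Averaging: cheap, but speaker-specific acoustic/temporal detail is merged.
avg = streams.mean(dim=0)                   # (batch, frames, d_model)

# Concatenation: keeps per-speaker structure, then projects back to d_model
# so the decoder interface is unchanged.
concat = torch.cat(list(streams), dim=-1)   # (batch, frames, n_speakers * d_model)
fused = nn.Linear(n_speakers * d_model, d_model)(concat)
print(avg.shape, fused.shape)
```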
Improvements and Comparison with Existing Systems
Speaker attribution accuracy improves when the training loss places greater emphasis on speaker timestamp tokens, and the resulting model compares favorably with established ASR systems across the evaluated datasets. An analysis of cross-attention weights confirms that the model dynamically attends to speaker information during decoding, underscoring its robustness under heavily overlapped conditions.
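A plausible form of this loss emphasis is a weighted cross-entropy that up-weights speaker and timestamp target tokens; the weight value and token ids below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def weighted_token_loss(logits, targets, special_token_ids, special_weight=3.0):
    """logits: (batch, seq, vocab); targets: (batch, seq) token ids."""
    # Per-token cross-entropy, kept unreduced so tokens can be reweighted.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # (batch, seq)
    weights = torch.ones_like(per_token)
    for tok in special_token_ids:
        weights = torch.where(
            targets == tok, torch.full_like(weights, special_weight), weights
        )
    return (weights * per_token).sum() / weights.sum()

# Illustrative usage with random logits and a hypothetical speaker-token id.
logits = torch.randn(2, 10, 1000)
targets = torch.randint(0, 1000, (2, 10))
print(weighted_token_loss(logits, targets, special_token_ids=[7]))
```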
Conclusion
Integrating target-speaker modeling and serialized output training within the DiCoW framework marks a significant step forward for multi-talker ASR, enabling speaker-attributed transcription in challenging conversational contexts. Although the proposed SA-DiCoW model performs well in synthetic and controlled settings, its results underscore the need for continued research into speaker-assignment optimization for diverse, real-world applications. Future work may explore more adaptive and semi-supervised learning paradigms to reduce reliance on oracle diarization, moving toward more general and flexible ASR solutions.