DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition (2501.00114v1)

Published 30 Dec 2024 in eess.AS and cs.SD

Abstract: Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head to Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.

Summary

  • The paper presents DiCoW, a novel approach that leverages diarization labels to condition the Whisper ASR model for improved target speaker transcription in overlapping speech.
  • The methodology employs Frame-Level Diarization-Dependent Transformations and Query-Key Biasing to isolate the target speaker and mitigate interference from non-target speakers.
  • Experimental results on AMI, NOTSOFAR-1 (CHiME-8), Libri2Mix, and LibriCSS show significant improvements in speaker-attributed accuracy while reducing reliance on extensive speaker-specific training data.

Overview of DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

The paper introduces Diarization-Conditioned Whisper (DiCoW), a technique for enhancing speaker-attributed automatic speech recognition (ASR) in multi-speaker environments. It addresses a persistent challenge in ASR: accurately transcribing speech and attributing it to the right speaker when several speakers are present and their voice characteristics have not been seen by the system before.

Approach

DiCoW builds on the Whisper ASR model and conditions it on speaker diarization outputs rather than on traditional speaker embeddings. Using diarization labels as the conditioning signal is a marked departure from embedding-based methods, which often demand extensive training data to generalize to new speakers; a sketch of how a diarizer's output can be turned into frame-level conditioning labels follows this paragraph.
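To make the conditioning concrete, here is a minimal sketch, under explicit assumptions, of converting diarization segments into per-frame labels for one target speaker. The four-way class split (silence / target / non-target / overlap), the 20 ms frame shift, and the function name are illustrative choices, not taken from the paper's implementation.

```python
# Hedged sketch (not the paper's exact recipe): converting a diarizer's output
# into per-frame conditioning labels for one target speaker.
import numpy as np

def diarization_to_frame_labels(segments, target_spk, num_frames, frame_shift=0.02):
    """segments: iterable of (speaker_id, start_sec, end_sec) from any diarizer.
    Returns a (num_frames, 4) matrix of [silence, target, non-target, overlap] indicators."""
    target_act = np.zeros(num_frames, dtype=bool)
    other_act = np.zeros(num_frames, dtype=bool)
    for spk, start, end in segments:
        lo = max(int(start / frame_shift), 0)
        hi = min(int(np.ceil(end / frame_shift)), num_frames)
        if spk == target_spk:
            target_act[lo:hi] = True
        else:
            other_act[lo:hi] = True

    labels = np.zeros((num_frames, 4), dtype=np.float32)
    labels[:, 0] = ~target_act & ~other_act   # silence
    labels[:, 1] = target_act & ~other_act    # target speaking alone
    labels[:, 2] = ~target_act & other_act    # only non-target speakers active
    labels[:, 3] = target_act & other_act     # overlapped speech
    return labels
```

For example, `diarization_to_frame_labels([("spk1", 0.0, 2.5), ("spk2", 2.0, 4.0)], "spk1", num_frames=200)` marks frames between 2.0 s and 2.5 s as overlap; these per-frame indicators are the kind of signal the conditioning techniques below can consume.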

  1. Frame-Level Diarization-Dependent Transformations (FDDT): frame-wise transformations driven by the diarization labels that sharpen the model's focus on the target speaker while handling overlapping speech.
  2. Query-Key Biasing (QKb): an additive bias on the attention mechanism that penalizes frames attributed to non-target speakers, so the model more reliably ignores interfering speech.
  3. CTC Integration: augmenting Whisper with a connectionist temporal classification (CTC) head to improve transcription efficiency through hybrid decoding. A hedged sketch of these three components follows the list.
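The sketch below shows, under explicit assumptions, one way the three components could be wired around a Whisper-style encoder: FDDT as a per-class affine transform of hidden states mixed by the frame-level diarization probabilities, QKb as an additive penalty on attention logits for non-target frames, and the CTC head as a linear projection of encoder outputs for hybrid decoding. The class count, the `bias_scale` penalty, and all class and parameter names are illustrative, not the authors' implementation.

```python
# Minimal, hedged PyTorch sketch of FDDT, query-key biasing, and an auxiliary
# CTC head; an illustration of the idea, not the DiCoW code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FDDT(nn.Module):
    """Frame-level diarization-dependent transform: one affine map per
    diarization class, mixed per frame by the soft class probabilities."""
    def __init__(self, d_model, num_classes=4):
        super().__init__()
        # Start near identity so pre-trained Whisper behaviour is initially preserved.
        self.weight = nn.Parameter(torch.eye(d_model).repeat(num_classes, 1, 1))
        self.bias = nn.Parameter(torch.zeros(num_classes, d_model))

    def forward(self, h, stno):                      # h: (B, T, D), stno: (B, T, C)
        per_class = torch.einsum('btd,cde->btce', h, self.weight) + self.bias
        return torch.einsum('btc,btcd->btd', stno, per_class)

def biased_attention(q, k, v, non_target_mask, bias_scale=5.0):
    """Scaled dot-product attention with an additive penalty on keys that the
    diarizer marks as non-target speech (query-key biasing, schematically).
    q, k, v: (B, H, T, D_head); non_target_mask: (B, T) with values in [0, 1]."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    logits = logits - bias_scale * non_target_mask[:, None, None, :]
    return torch.softmax(logits, dim=-1) @ v

class EncoderWithCTC(nn.Module):
    """Auxiliary CTC head on top of a diarization-conditioned encoder, enabling
    hybrid CTC/attention decoding; the vocabulary size is a placeholder."""
    def __init__(self, encoder, d_model, vocab_size):
        super().__init__()
        self.encoder = encoder
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def ctc_log_probs(self, mel, stno):
        enc = self.encoder(mel, stno)                # encoder applies FDDT/QKb internally
        return F.log_softmax(self.ctc_head(enc), dim=-1)
```

Initializing the per-class transforms at identity is one natural way to preserve Whisper's single-speaker behaviour before fine-tuning, consistent with the paper's emphasis on retaining accuracy on single-speaker data; the authors' exact initialization and biasing formulation may differ.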

Key Findings

The researchers validate DiCoW on AMI, NOTSOFAR-1 from CHiME-8, Libri2Mix, and LibriCSS. The results indicate that DiCoW improves target-speaker ASR under overlapping speech while maintaining Whisper's strengths on single-speaker audio, with notable gains in speaker attribution and transcription accuracy on challenging multi-speaker recordings.

Implications and Future Directions

The implementation of DiCoW highlights significant potential for improving the generalization of ASR systems to unseen speakers through the strategic use of diarization information. The approach reduces dependency on large-scale, speaker-specific training data and advances both the practical and theoretical understanding of ASR in complex acoustic environments.

Future work may explore the extension of this framework across diverse language and dialect groups, particularly where speaker data is scarce. Moreover, further optimizations in diarization techniques coupled with advancements in attention mechanisms within ASR systems could pave the way for even more robust and adaptable multi-speaker recognition solutions.

In conclusion, DiCoW represents a significant step toward more flexible and higher-performing speaker-attributed ASR systems. By bridging the gap between diarization and recognition, it offers a promising avenue for future research and development in automatic speech recognition.
