- The paper introduces a diarization-conditioned Whisper model that leverages frame-level diarization outputs to isolate target speaker features in multi-speaker audio.
- The paper employs Frame-Level Diarization Dependent Transformation (FDDT) modules to adapt the ASR model's features, achieving marked improvements in word error rate, especially in overlapping speech scenarios.
- The paper demonstrates robust performance across datasets like Libri2Mix, underscoring its potential to enhance ASR in real-world multi-speaker environments.
Target Speaker ASR with Whisper
Introduction
The paper presents an innovative approach to enhancing Automatic Speech Recognition (ASR) systems for multi-speaker environments using the Whisper architecture. Traditional ASR systems face challenges in real-world scenarios where multiple speakers interact within a single recording. The proposed method shifts from conventional speaker embedding models to a diarization-conditioned model, emphasizing relative differences among speakers without requiring detailed speaker embeddings. By incorporating frame-level diarization outputs, the method transforms single-speaker ASR models into systems capable of generating accurate transcriptions attributed to individual speakers.
Methodology
Diarization-Conditioned Architecture
Central to the method is the adaptation of Whisper-based models to utilize frame-level diarization outputs. Each audio frame is classified into one of four categories: Silence (S), Target speaker alone (T), Non-target speaker(s) alone (N), and Overlapping speech (O). These classifications form the STNO mask, which guides the transformation of the model's internal representations. The system efficiently learns these transformations through Frame-Level Diarization Dependent Transformations (FDDT) modules, which modify the ASR model's features prior to passing them through the transformer layers.
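The STNO classification described above can be derived directly from per-speaker voice-activity outputs. A minimal NumPy sketch (function name and array layout are my own assumptions, not the paper's interface): given a binary speaker-activity matrix and a target speaker index, each frame is mapped to one of the four STNO classes and returned as a one-hot mask.

```python
import numpy as np

def stno_mask(diar, target):
    """Build a per-frame STNO mask from binary diarization activity.

    diar: (num_speakers, num_frames) array of 0/1 speech activity
    target: index of the target speaker
    Returns a (4, num_frames) one-hot array over the classes
    [Silence, Target-only, Non-target-only, Overlap].
    """
    diar = np.asarray(diar)
    tgt = diar[target].astype(bool)
    # any non-target speaker active in this frame
    others = np.delete(diar, target, axis=0).any(axis=0)
    classes = np.select(
        [~tgt & ~others,   # 0: silence
         tgt & ~others,    # 1: target speaker alone
         ~tgt & others],   # 2: non-target speaker(s) alone
        [0, 1, 2],
        default=3,         # 3: overlapping speech
    )
    return np.eye(4, dtype=np.float32)[classes].T
```

With soft diarization posteriors instead of binary decisions, the same four quantities can be computed as per-frame probabilities, which is what allows the model to degrade gracefully under diarization uncertainty.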
The FDDT modules are critical for distinguishing among the STNO categories, suppressing frames dominated by silence or non-target speech, and emphasizing target speaker features. Different initialization strategies are explored for these transformations, including identity and suppressive initializations, which allow the model to adapt robustly without significantly distorting the original representations.
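One way to realize such transformations is a per-class affine map on the encoder features, blended frame by frame according to the STNO mask. The sketch below is a single-layer illustration under my own assumptions (names, shapes, and the per-class gains are illustrative; they are not the paper's actual initialization values, and the paper applies its transformations within the Whisper encoder rather than as a standalone function):

```python
import numpy as np

def apply_fddt(hidden, mask, weights, biases):
    """Apply per-class affine transforms, blended by the STNO mask.

    hidden: (num_frames, dim) encoder features
    mask: (4, num_frames) soft or one-hot STNO probabilities
    weights: (4, dim, dim); biases: (4, dim)
    Returns transformed features of shape (num_frames, dim).
    """
    # (4, num_frames, dim): each class's transform of every frame
    transformed = np.einsum('cij,tj->cti', weights, hidden) + biases[:, None, :]
    # weight each class branch by its per-frame mask value and sum
    return np.einsum('ct,cti->ti', mask, transformed)

def init_fddt(dim, gains=(0.0, 1.0, 0.0, 0.5)):
    """Scaled-identity initialization: identity preserves target frames,
    small gains suppress silence/non-target frames (gains are illustrative)."""
    weights = np.stack([g * np.eye(dim) for g in gains]).astype(np.float32)
    biases = np.zeros((4, dim), dtype=np.float32)
    return weights, biases
```

Initializing all four transforms at the identity leaves the pretrained model's behavior unchanged at the start of fine-tuning, which is one reason such schemes avoid distorting the original representations.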
Experimental Evaluation
The proposed model was fine-tuned on various datasets, including NOTSOFAR-1, AMI, and Libri2Mix, to assess its efficacy across different acoustic environments and speaker arrangements. Experiments demonstrated substantial performance improvements over traditional input masking, achieving notable results in challenging scenarios characterized by overlapping speech.
Performance metrics (e.g., Word Error Rate, ORC-WER) significantly favored the diarization-conditioned approach, particularly on multi-party conversation datasets. The study highlighted the model's resilience to diarization errors and its capability to outperform state-of-the-art systems on synthetic datasets such as Libri2Mix, which mixes pairs of LibriSpeech speakers into overlapping two-speaker recordings.
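ORC-WER (Optimal Reference Combination WER) scores multi-speaker output by assigning each reference utterance to whichever hypothesis stream minimizes the total word errors, so the metric does not penalize an arbitrary speaker-to-stream labeling. A brute-force sketch of that idea (exponential in the number of utterances, for illustration only; the function names are mine, not a standard toolkit API):

```python
from itertools import product

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance with a rolling 1-D table."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution/match
    return d[-1]

def orc_wer(ref_utts, hyp_channels):
    """Assign each reference utterance (a word list) to one hypothesis
    channel, preserving utterance order, so total errors are minimized."""
    n_ref = sum(len(u) for u in ref_utts)
    best = None
    for assign in product(range(len(hyp_channels)), repeat=len(ref_utts)):
        errs = 0
        for c, hyp in enumerate(hyp_channels):
            # concatenate the utterances routed to this channel, in order
            ref = [w for u, a in zip(ref_utts, assign) if a == c for w in u]
            errs += edit_distance(ref, hyp)
        best = errs if best is None else min(best, errs)
    return best / n_ref
```

Practical implementations use dynamic programming rather than enumerating all assignments, but the optimization target is the same.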
Implications and Future Work
The approach delineated in the paper bridges the existing gap between end-to-end ASR models and practical deployment in heterogeneous multi-speaker environments. It introduces a paradigm shift from embedding-based models to diarization-conditioned models, capable of finer speaker differentiation through frame-level analysis. Further evaluation on additional languages and datasets is warranted to establish broader applicability and robustness. Additionally, the potential to extend these techniques to other ASR architectures offers new directions for research and development.
Conclusion
In summary, the study advances target speaker ASR by leveraging diarization-conditioned transformations within Whisper models. The approach avoids reliance on enrollment audio and speaker embeddings while enhancing performance in real-world settings characterized by overlapping speech. This methodology holds promise for refining ASR capabilities and broadening their application scope, with potential future extensions into multi-domain and multilingual settings.