- The paper introduces a diarization-conditioned Whisper model that leverages frame-level diarization outputs to isolate target speaker features in multi-speaker audio.
- The paper employs Frame-Level Diarization Dependent Transformation (FDDT) modules to adapt the ASR model's features, achieving marked improvements in word error rate, especially in overlapping speech scenarios.
- The paper demonstrates robust performance across datasets like Libri2Mix, underscoring its potential to enhance ASR in real-world multi-speaker environments.
Target Speaker ASR with Whisper
Introduction
The paper presents an innovative approach to enhancing Automatic Speech Recognition (ASR) systems for multi-speaker environments using the Whisper architecture. Traditional ASR systems face challenges in real-world scenarios where multiple speakers interact within a single recording. The proposed method shifts from conventional speaker embedding models to a diarization-conditioned model, emphasizing relative differences among speakers without requiring detailed speaker embeddings. By incorporating frame-level diarization outputs, the method transforms single-speaker ASR models into systems capable of generating accurate transcriptions attributed to individual speakers.
Methodology
Diarization-Conditioned Architecture
Central to the method is the adaptation of Whisper-based models to utilize frame-level diarization outputs. Each audio frame is classified into one of four categories: Silence (S), Target speaker alone (T), Non-target speaker(s) alone (N), and Overlapping speech (O). These classifications form the STNO mask, which guides the transformation of the model's internal representations. The system efficiently learns these transformations through Frame-Level Diarization Dependent Transformations (FDDT) modules, which modify the ASR model's features prior to passing them through the transformer layers.
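The STNO classification described above can be derived directly from per-speaker voice-activity outputs. A minimal NumPy sketch (function name and array layout are my own assumptions, not the paper's interface): given a binary speaker-activity matrix and a target speaker index, each frame is mapped to one of the four STNO classes and returned as a one-hot mask.

```python
import numpy as np

def stno_mask(diar, target):
    """Build a per-frame STNO mask from binary diarization activity.

    diar: (num_speakers, num_frames) array of 0/1 speech activity
    target: index of the target speaker
    Returns a (4, num_frames) one-hot array over the classes
    [Silence, Target-only, Non-target-only, Overlap].
    """
    diar = np.asarray(diar)
    tgt = diar[target].astype(bool)
    # any non-target speaker active in this frame
    others = np.delete(diar, target, axis=0).any(axis=0)
    classes = np.select(
        [~tgt & ~others,   # 0: silence
         tgt & ~others,    # 1: target speaker alone
         ~tgt & others],   # 2: non-target speaker(s) alone
        [0, 1, 2],
        default=3,         # 3: overlapping speech
    )
    return np.eye(4, dtype=np.float32)[classes].T
```

With soft diarization posteriors instead of binary decisions, the same four quantities can be computed as per-frame probabilities, which is what allows the model to degrade gracefully under diarization uncertainty.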
The FDDT modules are critical for distinguishing among the STNO categories, suppressing frames dominated by silence or non-target speech, and emphasizing target speaker features. Different initialization strategies are explored for these transformations, including identity and suppressive initializations, which allow the model to adapt robustly without significantly distorting the original representations.
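One way to realize such transformations is a per-class affine map on the encoder features, blended frame by frame according to the STNO mask. The sketch below is a single-layer illustration under my own assumptions (names, shapes, and the per-class gains are illustrative; they are not the paper's actual initialization values, and the paper applies its transformations within the Whisper encoder rather than as a standalone function):

```python
import numpy as np

def apply_fddt(hidden, mask, weights, biases):
    """Apply per-class affine transforms, blended by the STNO mask.

    hidden: (num_frames, dim) encoder features
    mask: (4, num_frames) soft or one-hot STNO probabilities
    weights: (4, dim, dim); biases: (4, dim)
    Returns transformed features of shape (num_frames, dim).
    """
    # (4, num_frames, dim): each class's transform of every frame
    transformed = np.einsum('cij,tj->cti', weights, hidden) + biases[:, None, :]
    # weight each class branch by its per-frame mask value and sum
    return np.einsum('ct,cti->ti', mask, transformed)

def init_fddt(dim, gains=(0.0, 1.0, 0.0, 0.5)):
    """Scaled-identity initialization: identity preserves target frames,
    small gains suppress silence/non-target frames (gains are illustrative)."""
    weights = np.stack([g * np.eye(dim) for g in gains]).astype(np.float32)
    biases = np.zeros((4, dim), dtype=np.float32)
    return weights, biases
```

Initializing all four transforms at the identity leaves the pretrained model's behavior unchanged at the start of fine-tuning, which is one reason such schemes avoid distorting the original representations.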
Experimental Evaluation
The proposed model was fine-tuned on various datasets, including NOTSOFAR-1, AMI, and Libri2Mix, to assess its efficacy across different acoustic environments and speaker arrangements. Experiments demonstrated substantial performance improvements over traditional input masking, achieving notable results in challenging scenarios characterized by overlapping speech.
Performance metrics (e.g., Word Error Rate, ORC-WER) significantly favored the diarization-conditioned approach, particularly on multi-party conversation datasets. The study highlighted the model's resilience to diarization errors and its capability to outperform state-of-the-art systems on synthetic datasets such as Libri2Mix, which mixes pairs of LibriSpeech speakers into overlapping two-speaker recordings.
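ORC-WER (Optimal Reference Combination WER) scores multi-speaker output by assigning each reference utterance to whichever hypothesis stream minimizes the total word errors, so the metric does not penalize an arbitrary speaker-to-stream labeling. A brute-force sketch of that idea (exponential in the number of utterances, for illustration only; the function names are mine, not a standard toolkit API):

```python
from itertools import product

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance with a rolling 1-D table."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution/match
    return d[-1]

def orc_wer(ref_utts, hyp_channels):
    """Assign each reference utterance (a word list) to one hypothesis
    channel, preserving utterance order, so total errors are minimized."""
    n_ref = sum(len(u) for u in ref_utts)
    best = None
    for assign in product(range(len(hyp_channels)), repeat=len(ref_utts)):
        errs = 0
        for c, hyp in enumerate(hyp_channels):
            # concatenate the utterances routed to this channel, in order
            ref = [w for u, a in zip(ref_utts, assign) if a == c for w in u]
            errs += edit_distance(ref, hyp)
        best = errs if best is None else min(best, errs)
    return best / n_ref
```

Practical implementations use dynamic programming rather than enumerating all assignments, but the optimization target is the same.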
Implications and Future Work
The approach delineated in the paper bridges the existing gap between end-to-end ASR models and practical deployment in heterogeneous multi-speaker environments. It introduces a paradigm shift from embedding-based models to diarization-conditioned models, capable of finer speaker differentiation through frame-level analysis. Further evaluation on additional languages and datasets is warranted to establish broader applicability and robustness. Additionally, the potential to extend these techniques to other ASR architectures offers new directions for research and development.
Conclusion
In summary, the study advances target speaker ASR by leveraging diarization-conditioned transformations within Whisper models. The approach avoids reliance on enrollment audio and speaker embeddings while enhancing performance in real-world settings characterized by overlapping speech. This methodology holds promise for refining ASR capabilities and broadening their application scope, with potential future extensions into multi-domain and multilingual settings.