One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition (2310.01688v1)
Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally for each window, transcripts, diarization, and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and "Whisper-style" prompting. The local outputs are then combined into the final SD+ASR result by clustering the speaker embeddings to obtain global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
Authors: Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini