Joint Speech Recognition and Speaker Diarization via Sequence Transduction: A Detailed Analysis
In conversational systems, accurately transcribing spoken words and determining who spoke them are both of central importance. Traditionally, these tasks have been handled by separate automatic speech recognition (ASR) and speaker diarization (SD) systems. El Shafey, Soltau, and Shafran propose a novel approach that integrates ASR and SD into a unified system built on the recurrent neural network transducer (RNN-T). This analysis provides an in-depth evaluation of their approach, particularly in the context of medical conversations between physicians and patients.
Methodology and Implementation
The authors leverage the RNN-T framework to process conversation audio sequentially, using both linguistic and acoustic cues to infer speaker roles. Unlike conventional SD systems, which rely solely on acoustic information, this joint method casts speech and speaker recognition as a single transduction problem. This is achieved by augmenting the output symbol set with speaker role tokens, such as <spk:dr> for doctors and <spk:pt> for patients, enabling the model to generate a speaker-decorated transcript directly from the audio.
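To make this target representation concrete, the Python sketch below shows one way a role-annotated conversation could be linearized into a single token sequence with interleaved speaker role tokens. The <spk:dr> and <spk:pt> spellings come from the paper; the turn data structure, the placement of the role token at the start of each turn, and the word-level tokenization are illustrative assumptions, not the authors' exact serialization.

```python
# Minimal sketch (not the authors' code): linearize a role-annotated
# conversation into a single RNN-T target sequence by interleaving
# speaker-role tokens with the words. Token spellings follow the paper;
# the data structures and token placement are illustrative assumptions.

ROLE_TOKENS = {"doctor": "<spk:dr>", "patient": "<spk:pt>"}

def linearize(turns):
    """turns: list of (role, text) pairs in conversation order."""
    target = []
    for role, text in turns:
        target.append(ROLE_TOKENS[role])   # emit a role token at each speaker change
        target.extend(text.lower().split())
    return target

if __name__ == "__main__":
    turns = [
        ("doctor", "How are you feeling today"),
        ("patient", "Much better thank you"),
    ]
    print(linearize(turns))
    # ['<spk:dr>', 'how', 'are', 'you', 'feeling', 'today',
    #  '<spk:pt>', 'much', 'better', 'thank', 'you']
```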
The model architecture comprises a transcription network that encodes the acoustic input and reduces its time resolution, a prediction network that conditions on previously emitted non-blank symbols, and a joint network that combines the two to produce output predictions. Training uses the Adam optimizer on a large corpus of roughly 15,000 hours of medical conversations. This dataset supplies both audio and speaker-role-annotated transcripts, a combination that conventional, separately trained ASR and SD systems typically lack, and it underpins the integrated model's strong performance.
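As a rough illustration of this three-component structure, the PyTorch sketch below defines a transcription network with a strided convolution for time reduction, a prediction network over previously emitted labels, and a joint network producing logits for every (time, label) position. All layer types, sizes, and the stride factor are assumptions for illustration, not the paper's actual configuration; the vocabulary is assumed to include the blank symbol and the speaker role tokens.

```python
# Minimal PyTorch sketch of the three RNN-T components described above.
# Layer sizes, the stride-2 time reduction, and the use of LSTMs are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class TranscriptionNet(nn.Module):
    """Encodes acoustic frames; a strided convolution reduces time resolution."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.reduce = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2, padding=1)
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, feats):                     # feats: (B, T, feat_dim)
        x = self.reduce(feats.transpose(1, 2)).transpose(1, 2)  # roughly halves T
        out, _ = self.rnn(x)
        return out                                # (B, T', hidden)

class PredictionNet(nn.Module):
    """Autoregressive network over previously emitted non-blank labels."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, labels):                    # labels: (B, U)
        out, _ = self.rnn(self.embed(labels))
        return out                                # (B, U, hidden)

class JointNet(nn.Module):
    """Combines encoder and prediction outputs into per-(t, u) logits.
    vocab_size includes the blank symbol and the speaker role tokens."""
    def __init__(self, hidden=256, vocab_size=1000):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, enc, pred):                 # enc: (B, T, H), pred: (B, U, H)
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)   # (B, T, U, H)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)   # (B, T, U, H)
        joint = torch.tanh(self.proj(torch.cat([t, u], dim=-1)))
        return self.out(joint)                    # (B, T, U, vocab) logits
```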
Experimental Analysis
Experimental evaluation on a dedicated clinical corpus shows substantial improvements. Notably, the paper reports a reduction in word-level diarization error rate (WDER) from 15.8% to 2.2%, an approximately 86% relative improvement over the baseline. This comes at a small cost in ASR accuracy, with the word error rate (WER) rising marginally from 18.7% to 19.3%.
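For context, WDER scores speaker assignment at the word level: roughly speaking, only reference words that were correctly recognized or substituted are scored, since inserted and deleted words have no one-to-one counterpart whose speaker label could be compared. (The 86% figure corresponds to (15.8 - 2.2) / 15.8 ≈ 0.86.) The sketch below computes such a quantity from a precomputed word alignment; the aligned-pair data structure and helper name are illustrative assumptions, not the authors' scoring code.

```python
# Sketch of a WDER-style computation over a word alignment between reference
# and hypothesis. Insertions and deletions are excluded because they have no
# one-to-one reference word whose speaker label could be compared.
# The aligned-pair data structure is an illustrative assumption.

def wder(aligned_pairs):
    """aligned_pairs: list of (ref_word, hyp_word, ref_speaker, hyp_speaker)
    for reference words that were either correctly recognized or substituted."""
    if not aligned_pairs:
        return 0.0
    speaker_errors = sum(1 for _, _, ref_spk, hyp_spk in aligned_pairs
                         if ref_spk != hyp_spk)
    return speaker_errors / len(aligned_pairs)

# Example: four aligned words, one carries the wrong speaker label -> WDER = 0.25
pairs = [
    ("how",  "how",  "dr", "dr"),
    ("are",  "are",  "dr", "dr"),
    ("you",  "your", "dr", "pt"),   # ASR substitution with wrong speaker
    ("fine", "fine", "pt", "pt"),
]
print(wder(pairs))  # 0.25
```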
This performance shift is attributed to the joint model's elimination of the intermediate reconciliation step in conventional pipelines, where word-level transcripts and diarization segments must be aligned after the fact (see the sketch below). The model's robustness is further highlighted by the consistent distribution of WDER across diverse clinical conversations, whereas the baseline exhibits much higher variability.
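To see where such reconciliation errors arise, the sketch below mimics the post-hoc step a conventional pipeline needs: each recognized word is assigned to the speaker of the diarization segment with the greatest time overlap. The word and segment formats, and the overlap heuristic itself, are illustrative assumptions; the point is that any timing mismatch at a turn boundary becomes a word-level speaker error, which the joint model sidesteps by emitting role tokens as part of the transcript.

```python
# Sketch of the reconciliation step in a conventional ASR + diarization
# pipeline: assign each recognized word to the diarization segment with the
# greatest time overlap. Formats and the heuristic are illustrative assumptions.

def overlap(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, segments):
    """words: list of (word, start, end); segments: list of (speaker, start, end)."""
    labeled = []
    for word, w_start, w_end in words:
        best = max(segments, key=lambda s: overlap(w_start, w_end, s[1], s[2]))
        labeled.append((word, best[0]))
    return labeled

words = [("hello", 0.0, 0.4), ("doctor", 0.45, 0.9), ("hi", 1.0, 1.2)]
segments = [("pt", 0.0, 0.95), ("dr", 0.95, 2.0)]
print(assign_speakers(words, segments))
# [('hello', 'pt'), ('doctor', 'pt'), ('hi', 'dr')]
```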
Implications and Future Directions
The presented methodology is well suited to applications in which speakers occupy well-defined conversational roles. However, because the model labels roles rather than speaker identities, it requires training data annotated with the same set of roles as the target domain.
Looking forward, the authors propose extending the approach to other conversational settings with distinct speaker roles. They also discuss enriching the output with additional conversational features, such as punctuation and non-verbal cues, to produce richer transcripts.
This work marks a notable shift toward end-to-end solutions in conversational AI, demonstrating that speech recognition and diarization can be combined effectively within a single, cohesive framework. Such developments promise to improve the accuracy and applicability of conversational analysis in specialized fields like healthcare, and potential extensions to non-verbal cues could open further applications.