Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection (2109.11641v3)

Published 23 Sep 2021 in eess.AS, cs.LG, and cs.SD

Abstract: In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.

Summary

The paper presents Turn-to-Diarize, a novel online speaker diarization system using a transformer transducer for speaker turn detection and constrained spectral clustering.
In the reported evaluations, Turn-to-Diarize reduced Diarization Error Rate (DER) by 12.20% and 35.25% on the inbound and outbound call datasets, respectively.
The system improves efficiency and scalability by deriving speaker embeddings from continuous turns, significantly reducing computational complexity compared to fixed-segment methods.

Evaluation of "Turn-to-Diarize" Speaker Diarization System

The paper "Turn-to-Diarize: Online speaker diarization constrained by transformer transducer speaker turn detection" presents a speaker diarization system designed for streaming on-device applications. Speaker diarization, the process of partitioning an audio stream into homogeneous segments according to speaker identity, is integral to many speech processing tasks. Conventional clustering-based systems carry a substantial computational cost because they cluster a large number of short segments, and supervised systems typically require time-stamped speaker labels for training. The proposed system instead uses a transformer transducer to detect speaker turns, represents each turn by a single speaker embedding, and groups those embeddings with constrained spectral clustering.

Overview of Contributions

The Turn-to-Diarize system introduces several noteworthy advancements:

Transformer Transducer Model: A transformer transducer performs ASR and speaker turn detection jointly, so turn detection can draw on the semantic content of the dialogue. Training only requires special speaker turn tokens to be inserted during transcription, rather than time-stamped speaker labels, which lowers the annotation overhead.
Speaker Embedding Efficiency: Speaker embeddings are derived from continuous speaker turns instead of uniform short segments. Because speaker turns are far fewer than fixed-length segments, there are fewer embeddings to cluster, which reduces the clustering workload compared to conventional segment-wise extraction.
Constrained Spectral Clustering: Incorporation of speaker turn-derived constraints into the spectral clustering process enhances the accuracy of clustering by guiding which embeddings should be linked or separated. This is especially beneficial in minimizing false acceptances or rejections during the clustering phase.
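The three components above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the token string `<st>`, the function names, and the choice to hard-overwrite affinity entries for adjacent segments are all assumptions made for illustration, and the paper's own constraint handling may differ. The sketch splits a token stream into turns, builds a cosine affinity matrix over one embedding per segment, and then marks adjacent segments as "separate" when a turn was detected between them and as "linked" otherwise.

```python
import numpy as np

TURN_TOKEN = "<st>"  # assumed spelling of the speaker turn token


def split_turns(tokens):
    """Group a transcript token stream into turns, splitting on TURN_TOKEN."""
    turns, current = [], []
    for tok in tokens:
        if tok == TURN_TOKEN:
            if current:
                turns.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        turns.append(current)
    return turns


def cosine_affinity(embeddings):
    """Cosine similarity between rows, rescaled from [-1, 1] to [0, 1]."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 0.5 * (1.0 + unit @ unit.T)


def apply_turn_constraints(affinity, turn_between):
    """Overwrite affinities of adjacent segments (a simplification).

    turn_between[i] is True when a speaker turn was detected between
    segment i and segment i + 1 (treated as "separate"), and False when
    both segments belong to the same turn (treated as "linked").
    """
    constrained = affinity.copy()
    for i, is_turn in enumerate(turn_between):
        value = 0.0 if is_turn else 1.0
        constrained[i, i + 1] = value
        constrained[i + 1, i] = value
    return constrained
```

The constrained matrix would then be passed to a spectral clustering step in place of the raw affinity matrix. Only adjacent pairs are touched here; non-adjacent affinities are left as computed from the embeddings.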

Experimental Results and Computational Impact

The experimental analysis presented in the paper validates the efficacy of the Turn-to-Diarize system against baseline dense d-vector systems. Evaluated across datasets including internal Inbound and Outbound call center datasets and the Callhome corpus, the system demonstrates a substantial reduction in Diarization Error Rate (DER). Notably, DER reductions of 12.20% and 35.25% were observed for the inbound and outbound datasets respectively. This decrease translates to more accurate speaker mapping in dialog settings with multiple interlocutors.
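For readers unfamiliar with the metric, the sketch below computes DER in its standard form (false alarm, missed speech, and speaker confusion time divided by total reference speech time) and shows the difference between an absolute and a relative reduction. This summary does not state which kind of reduction the 12.20% and 35.25% figures are, so the helper simply makes both readings explicit. All function names and the example values in the comments are hypothetical.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Standard DER: summed error durations over reference speech duration."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def absolute_reduction(baseline_der, new_der):
    """Difference in DER, in the same units as the inputs."""
    return baseline_der - new_der


def relative_reduction(baseline_der, new_der):
    """Reduction expressed as a fraction of the baseline DER."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - new_der) / baseline_der


# Hypothetical example: 6 s of total error over 60 s of reference speech.
example_der = diarization_error_rate(1.0, 2.0, 3.0, 60.0)
```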

The computational complexity analysis underscores the advantages of handling embeddings at the granularity of speaker turns rather than fixed-length segments. Spectral clustering is traditionally computationally demanding because the affinity matrix grows quadratically with the number of embeddings; with one embedding per speaker turn, that matrix is much smaller. This makes the Turn-to-Diarize system particularly viable for real-time applications where latency reduction is crucial.
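A back-of-the-envelope sketch makes the size difference concrete. The window length, step size, recording duration, and turn count below are hypothetical values chosen for illustration, not figures from the paper. The affinity matrix has one entry per pair of embeddings, so its size grows with the square of the embedding count.

```python
def num_fixed_segments(duration_ms, window_ms, step_ms):
    """Number of sliding-window segments that fit fully inside the audio."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // step_ms + 1


def affinity_entries(num_embeddings):
    """Entries in a dense N x N affinity matrix."""
    return num_embeddings * num_embeddings


# Hypothetical 10-minute call, 1.6 s windows with a 0.8 s step.
dense_count = num_fixed_segments(600_000, 1_600, 800)
# Hypothetical number of detected speaker turns in the same call.
turn_count = 120

dense_size = affinity_entries(dense_count)
turn_size = affinity_entries(turn_count)
print(f"dense: {dense_count} embeddings, {dense_size} affinity entries")
print(f"turns: {turn_count} embeddings, {turn_size} affinity entries")
```

Under these assumed numbers the turn-based matrix is smaller by a factor of several dozen. The saving in the eigendecomposition step of spectral clustering would be larger still, since a dense eigendecomposition generally scales cubically with the matrix dimension.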

Implications and Future Developments

The paper has both practical and theoretical implications for speaker diarization. Practically, improved efficiency and decreased annotation requirements pave the way for more scalable deployment on devices with limited processing capabilities. Theoretically, using semantic information for speaker turn detection opens promising avenues for future research. Incorporating multimodal signals, such as visual data, could further improve the robustness of speaker diarization systems beyond audio-only input.

The Turn-to-Diarize system sets a foundation by combining advanced ASR and speaker diarization techniques in a unified architecture. Future developments may consider integrating multilingual data to extend the applicability of the system across different linguistic domains, potentially enhancing its generalizability across diverse conversational contexts.

In conclusion, the paper presents a relevant contribution to the field of speaker diarization by reducing computational complexity, improving accuracy, and minimizing annotation demands through its innovative approach. As advancements in this domain continue, integrating adaptive machine learning models with real-time responsiveness will likely be at the forefront of upcoming research endeavors.
