- The paper introduces a novel speaker embedding-aware framework that transforms overlapping speech diarization into a single-label prediction problem using power-set encoding.
- It employs dedicated speech and speaker encoders, a similarity module, and a post-network to effectively model speaker combinations per audio frame.
- Experiments in simulated and real meeting scenarios show a 34.11% relative improvement over traditional methods, with textual cues further enhancing accuracy.
Understanding Speaker Embedding-Aware Neural Diarization with Textual Information
Speaker diarization, the task of determining who spoke when, is an essential component of analyzing audio recordings, particularly when multiple participants are present, as in meetings or interviews. Traditional methods face challenges with overlapping speech and require manually tuned decision thresholds. This blog post discusses how a new approach, Speaker Embedding-Aware Neural Diarization (SEND), addresses these challenges and further improves performance by incorporating textual information.
From Multi-Label to Single-Label Prediction Problem
The SEND framework introduces a shift in how overlapping speech diarization is tackled. Rather than treating it as a conventional multi-label classification problem, which models each speaker's activity as an independent event and requires manually set decision thresholds, SEND casts the task as a single-label classification problem through power-set encoding: every combination of simultaneously active speakers (including silence) becomes one class. This formulation captures the dependencies between speakers' activities and removes the need for threshold tuning, while still handling frames with varying numbers of active speakers.
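To make this concrete, here is a minimal sketch of power-set encoding for frame-level speaker activity, assuming a fixed maximum number of enrolled speakers; the function names and the subset enumeration are illustrative, not taken from the paper's implementation:

```python
from itertools import combinations

def powerset_classes(num_speakers):
    """Enumerate every subset of speakers; each subset is one output class.

    With num_speakers = 3 the classes are:
    (), (0,), (1,), (2,), (0,1), (0,2), (1,2), (0,1,2)
    where () represents silence and (0, 2) represents speakers 0 and 2 overlapping.
    """
    classes = []
    for size in range(num_speakers + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes

def encode_frame(active_speakers, num_speakers):
    """Collapse a frame's multi-label target into a single class index.

    Instead of predicting independent per-speaker activities with sigmoids
    and a hand-tuned threshold, the model can use a plain softmax over
    these 2**num_speakers classes.
    """
    classes = powerset_classes(num_speakers)
    return classes.index(tuple(sorted(active_speakers)))

# Example: 3 enrolled speakers, a frame where speakers 0 and 2 overlap.
print(encode_frame({0, 2}, num_speakers=3))  # -> 5, the index of (0, 2)
```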
Architecture of SEND
SEND consists of several key components: a speech encoder, a speaker encoder, a similarity calculation module, and a post-network (post-net). The speech encoder processes the acoustic features, while the speaker encoder handles the given speaker embeddings. An activated dot product then measures the similarity between each speech frame encoding and each speaker encoding. Finally, the post-net predicts the most likely speaker combination (power-set class) for each audio frame.
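As a rough illustration of how these components could fit together, the PyTorch-style sketch below wires a speech encoder, a speaker encoder, an activated dot-product similarity, and a post-net into one forward pass. The layer choices, dimensions, and the sigmoid activation are assumptions made for the sketch, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SendSketch(nn.Module):
    """Minimal, illustrative sketch of the SEND pipeline (not the authors' code)."""

    def __init__(self, feat_dim=80, spk_emb_dim=256, hidden_dim=256, max_speakers=4):
        super().__init__()
        num_classes = 2 ** max_speakers  # power-set classes over the enrolled speakers
        # Speech encoder: acoustic features -> frame-level representations.
        self.speech_encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                                      batch_first=True)
        # Speaker encoder: given speaker embeddings -> the same hidden space.
        self.speaker_encoder = nn.Sequential(nn.Linear(spk_emb_dim, hidden_dim),
                                             nn.Tanh())
        # Post-net: per-frame similarity scores -> power-set class logits.
        self.post_net = nn.Sequential(nn.Linear(max_speakers, hidden_dim),
                                      nn.ReLU(),
                                      nn.Linear(hidden_dim, num_classes))

    def forward(self, feats, spk_embs):
        # feats:    (batch, frames, feat_dim)          acoustic features
        # spk_embs: (batch, max_speakers, spk_emb_dim) enrolled speaker embeddings
        speech, _ = self.speech_encoder(feats)               # (B, T, H)
        speakers = self.speaker_encoder(spk_embs)            # (B, S, H)
        # Activated dot-product similarity between every frame and every speaker.
        sim = torch.sigmoid(speech @ speakers.transpose(1, 2))  # (B, T, S)
        return self.post_net(sim)                            # (B, T, 2**S) logits

# Usage: a batch of 2 utterances, 100 frames each, 4 enrolled speakers.
model = SendSketch()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 4, 256))
print(logits.shape)  # torch.Size([2, 100, 16])
```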
An extension of SEND, named SEND-Ti, also incorporates textual information from transcripts, whether produced automatically by speech recognition or manually. Using a text encoder and a self-attention mechanism, SEND-Ti aligns the acoustic representations with textual embeddings and enriches them, significantly improving diarization accuracy at the word level.
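One way such text fusion could look is sketched below: a transformer text encoder turns transcript tokens into embeddings, and an attention layer lets each acoustic frame attend to those tokens before classification. Note that this sketch uses cross-attention from frames to text as a stand-in for the paper's self-attention alignment, and all module choices, vocabulary size, and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class TextFusionSketch(nn.Module):
    """Illustrative fusion of acoustic and textual streams (not SEND-Ti's exact design)."""

    def __init__(self, hidden_dim=256, vocab_size=5000, num_heads=4):
        super().__init__()
        # Text encoder: transcript token ids -> contextual text embeddings.
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True),
            num_layers=2)
        # Attention: acoustic frames (queries) attend to text tokens (keys/values).
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, acoustic, token_ids):
        # acoustic:  (batch, frames, hidden_dim) frame-level acoustic encodings
        # token_ids: (batch, tokens)             transcript token ids (ASR or manual)
        text = self.text_encoder(self.text_embed(token_ids))  # (B, L, H)
        fused, _ = self.attn(acoustic, text, text)             # (B, T, H)
        return acoustic + fused  # acoustic frames enriched with textual context

# Usage: 100 acoustic frames aligned against a 20-token transcript.
fusion = TextFusionSketch()
out = fusion(torch.randn(2, 100, 256), torch.randint(0, 5000, (2, 20)))
print(out.shape)  # torch.Size([2, 100, 256])
```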
Experimental Results and Improvements
SEND was evaluated on simulated datasets as well as real meeting recordings and compared against several existing diarization approaches. It showed superior performance, especially on overlapping speech where traditional methods falter. Moreover, when textual information was used in conjunction with SEND, diarization errors dropped further, demonstrating the value of integrating textual data into speaker diarization.
In real meeting environments, SEND achieved a 34.11% relative improvement over a traditional clustering algorithm based on a Bayesian hidden Markov model. These gains illustrate how the SEND framework strengthens speaker diarization in multi-speaker audio recordings.
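For readers unfamiliar with the metric, a relative improvement is the reduction in diarization error rate (DER) expressed as a fraction of the baseline's DER. The small helper below illustrates the calculation with placeholder values that are not the paper's reported numbers:

```python
def relative_improvement(baseline_der, system_der):
    """Relative DER reduction: (baseline - system) / baseline."""
    return (baseline_der - system_der) / baseline_der

# Placeholder DER values for illustration only (not results from the paper):
print(f"{relative_improvement(20.0, 15.0):.2%}")  # 25.00%
```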
Conclusion and Future Directions
This approach to speaker diarization marks a significant step forward for audio analysis. By reformulating a traditionally complex multi-label problem as a more manageable single-label task and leveraging textual information, SEND has the potential to change how we interpret multi-speaker recordings.
While the current results are promising, future work will continue to explore and refine how textual information influences diarization in real-world applications. SEND's flexibility and robustness make it a compelling choice for improving the accuracy and utility of speaker diarization systems across domains.