End-to-End Neural Speaker Diarization with Self-Attention
This paper enhances End-to-End Neural Speaker Diarization (EEND) by incorporating self-attention mechanisms, addressing the limitations of traditional clustering-based speaker diarization. Conventional approaches rely on clustering speaker embeddings, which brings two main drawbacks: the clustering objective is not directly optimized for diarization error, and the assignment of one speaker per segment handles overlapping speech poorly. The original EEND method tackled these issues with a bidirectional long short-term memory (BLSTM) network that formulates diarization as frame-wise multi-label classification, directly outputting each speaker's activity from a multi-talker recording and thereby modeling overlap natively. Building on that formulation, this paper replaces the recurrent backbone with a self-attention-based model for more accurate and efficient diarization.
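Because EEND treats diarization as frame-wise multi-label classification, its training loss must be invariant to the arbitrary order of the model's output speakers. Below is a minimal NumPy sketch of such a permutation-invariant binary cross-entropy; the function name and shapes are illustrative, not the paper's exact implementation.

```python
import itertools
import numpy as np

def pit_bce_loss(probs, labels):
    """Permutation-invariant BCE for frame-wise multi-label diarization.

    probs:  (T, S) per-frame speech-activity probabilities, one column
            per output speaker.
    labels: (T, S) reference 0/1 activity labels (overlap = several 1s
            in the same row).
    Returns the minimum frame-averaged BCE over all speaker column
    permutations, since the model's speaker order is arbitrary.
    """
    eps = 1e-7
    _, S = probs.shape
    best = np.inf
    for perm in itertools.permutations(range(S)):
        p = probs[:, perm]
        bce = -(labels * np.log(p + eps)
                + (1 - labels) * np.log(1 - p + eps))
        best = min(best, bce.mean())
    return best
```

Evaluating all S! permutations is feasible here because EEND fixes a small number of output speakers (two in the paper's experiments).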
Proposed Methodology
The core advance of this work is the replacement of the BLSTM blocks in the EEND framework with self-attention blocks, yielding Self-Attentive End-to-End Neural Diarization (SA-EEND). Unlike a BLSTM, whose output at each frame is conditioned only on the hidden states of the neighboring frames, self-attention conditions each output frame on the entire frame sequence. This lets the model capture both local speech-activity dynamics and global speaker characteristics in a single layer. Concretely, the self-attention layer computes pairwise similarities between all frames and uses them to weight a sum over the whole sequence, maintaining the global view of speaker identity that is essential for resolving overlapping speakers.
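The pairwise-similarity computation described above can be sketched as single-head scaled dot-product self-attention. This is a generic illustration of the mechanism, not the paper's multi-head, multi-layer encoder; the weight matrices are assumed inputs.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a frame sequence.

    X: (T, d) frame embeddings. Each output frame is a weighted sum over
    ALL T frames, so the receptive field is the whole recording, unlike a
    BLSTM whose state propagates frame by frame.
    Returns (attended features, attention weights).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (T, T) pairwise similarity
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over all frames
    return weights @ V, weights
```

The (T, T) weight matrix makes the global behavior explicit: a frame can attend to any other frame regardless of distance, which is what allows some heads to track speaker-level traits across the whole recording.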
The paper implements this framework on simulated mixtures, real telephone calls, and dialogue recordings, comparing performance against both BLSTM-based EEND and x-vector clustering methods. The results reveal that the self-attention mechanism significantly improves performance across varied datasets, including better handling of overlapping speech.
Experimental Results
The experimental evaluation uses both simulated and real-world datasets to validate SA-EEND. Quantitatively, the self-attention-based approach outperformed the baselines, with a marked reduction in diarization error rate (DER) across conditions: SA-EEND achieved lower DERs than both BLSTM-based EEND and x-vector clustering, including in scenarios with substantial speaker overlap, highlighting the model's robustness across different audio characteristics.
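For reference, the DER metric used in these comparisons counts missed speech, false alarms, and speaker confusion against total reference speech. A minimal frame-level sketch (assuming hypothesis speakers are already mapped to reference speakers; real scorers also apply a collar and optimal mapping):

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate for 0/1 activity matrices (T, S).

    DER = (missed speech + false alarm + confusion) / total reference speech,
    computed per frame as max(n_ref, n_hyp) - n_correct.
    """
    n_ref = ref.sum(axis=1)                        # active ref speakers/frame
    n_hyp = hyp.sum(axis=1)                        # active hyp speakers/frame
    n_correct = np.minimum(ref, hyp).sum(axis=1)   # active in both
    errors = np.maximum(n_ref, n_hyp) - n_correct
    return errors.sum() / n_ref.sum()
```

Because the counts are per speaker per frame, overlapped regions contribute multiple units of reference speech, so a system that ignores overlap is penalized for every missed concurrent speaker.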
Notably, visualizations of the learned attention weights indicated that some heads track speaker-dependent global characteristics while others follow local temporal features of the speech signal, suggesting the model forms a comprehensive representation of multi-speaker audio.
Implications and Future Directions
This research has significant implications for the future of speaker diarization. By demonstrating that self-attention outperforms sequential models like BLSTM on this task, it points toward architectures that fully exploit non-sequential relationships in audio, and suggests that attention-based models can be further refined and applied to other problems requiring robust multi-entity identification and separation.
Future refinements could explore deeper self-attentive networks and combinations with other architectural innovations such as full transformer-based models, which may offer even greater flexibility and performance in complex audio environments. Integrating more diverse real-world datasets of varying complexity would also provide a broader validation base and could reveal new insights into adapting these models for everyday applications, from automated transcription services to AI-driven communication systems.
This research marks a significant step in advancing automatic speaker diarization, resolving key limitations of current systems and laying a foundation for further exploratory work in AI-driven audio processing.