
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors (2005.09921v3)

Published 20 May 2020 in eess.AS, cs.CL, and cs.SD

Abstract: End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. Then, the generated multiple attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69 % diarization error rate (DER) on simulated mixtures and an 8.07 % DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56 % and 9.54 %, respectively. In unknown numbers of speakers conditions, our method attained a 15.29 % DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43 % DER.

Authors (5)
  1. Shota Horiguchi (45 papers)
  2. Yusuke Fujita (37 papers)
  3. Shinji Watanabe (416 papers)
  4. Yawen Xue (10 papers)
  5. Kenji Nagamatsu (19 papers)
Citations (180)

Summary

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

The paper presents an approach to speaker diarization that addresses a key limitation of existing end-to-end systems: inflexibility in the number of speakers. End-to-End Neural Speaker Diarization (EEND) has surpassed traditional clustering-based diarization in accuracy, but its fixed output architecture caps the number of speakers at a value set during training. This is a significant constraint when the actual speaker count exceeds that maximum.

Proposed Methodology

This paper introduces encoder-decoder based attractor calculation (EDA) to remove this limitation. EDA derives a variable number of speaker attractors from a speech embedding sequence: an LSTM encoder consumes the embedding sequence, and an LSTM decoder then emits attractors one at a time, each accompanied by an existence probability that determines when to stop. Because decoding can in principle continue indefinitely, the number of attractors is theoretically unbounded and is determined adaptively from the input data rather than fixed in advance. Each attractor is then multiplied with the embedding sequence to produce that speaker's frame-wise activity.
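As a rough illustration, the attractor decoding and activity computation can be sketched in NumPy. The minimal LSTM cell, the existence classifier (`w_exist`, `b_exist`), and all dimensions here are illustrative stand-ins for exposition, not the paper's trained components:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal NumPy LSTM cell (single layer), for illustration only."""
    def __init__(self, input_dim, hidden_dim, rng):
        # one stacked weight matrix for the four gates (i, f, g, o)
        scale = 1.0 / np.sqrt(hidden_dim)
        self.W = rng.uniform(-scale, scale, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.h_dim = hidden_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        return h, c

def eda_attractors(E, encoder, decoder, w_exist, b_exist, max_spk=10, tau=0.5):
    """Encode the embedding sequence E (T, D), then decode attractors
    until the existence probability falls below the threshold tau."""
    h = np.zeros(encoder.h_dim)
    c = np.zeros(encoder.h_dim)
    for e_t in E:                       # encoder consumes the embeddings
        h, c = encoder.step(e_t, h, c)
    attractors = []
    zeros = np.zeros(E.shape[1])        # decoder input is a zero vector each step
    for _ in range(max_spk):
        h, c = decoder.step(zeros, h, c)
        p = sigmoid(w_exist @ h + b_exist)
        if p < tau:                     # stop: no further speakers detected
            break
        attractors.append(h)
    if not attractors:
        return np.zeros((0, E.shape[1])), np.zeros((E.shape[0], 0))
    A = np.array(attractors)            # (S, D) attractors
    Y = sigmoid(E @ A.T)                # (T, S) frame-wise speaker activities
    return A, Y
```

The key property is that the loop length, and hence the number of output speakers, is decided at inference time by the existence probabilities rather than by the network architecture.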

Training uses permutation invariant training (PIT): the diarization loss is computed for every permutation of the reference speaker labels, and the minimum is backpropagated, so the model is optimized irrespective of the order in which speakers appear in the references. This permutation-free objective is essential for handling overlapping speech and a flexible number of speakers.
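A minimal sketch of the PIT objective, assuming frame-wise binary cross-entropy as the per-speaker loss (the helper names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np
from itertools import permutations

def bce(y_pred, y_true, eps=1e-7):
    """Frame-wise binary cross-entropy, averaged over frames and speakers."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def pit_loss(Y, L):
    """Y: predicted activities (T, S); L: reference labels (T, S).
    Return the minimum BCE over all speaker permutations and the best
    permutation of the prediction columns."""
    S = Y.shape[1]
    return min(((bce(Y[:, list(perm)], L), perm)
                for perm in permutations(range(S))),
               key=lambda t: t[0])
```

The brute-force search over permutations is exponential in the number of speakers, which is acceptable for the small speaker counts considered here.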

Evaluation and Results

The effectiveness of this approach is demonstrated via rigorous testing on simulated mixtures and real recordings.

  • Simulated Data: Under controlled two-speaker conditions, EDA achieved a diarization error rate (DER) of 2.69% versus 4.56% for vanilla self-attentive EEND (SA-EEND), a solid improvement.
  • Real-World Data: On the two-speaker subset of CALLHOME, EDA reached 8.07% DER versus 9.54% for SA-EEND. With an unknown number of speakers, DER on CALLHOME improved from 19.43% with x-vector clustering to 15.29% with the EDA-based approach, showing that real-world conditions did not diminish the method's advantage over legacy clustering.
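For context, the DER figures above count missed speech, false alarms, and speaker confusion relative to the total reference speech. A toy frame-level version, assuming binary activity matrices with speakers already aligned (and omitting the scoring collar and optimal speaker mapping used by standard evaluation tools), might look like:

```python
import numpy as np

def frame_der(ref, hyp):
    """Toy frame-level DER. ref, hyp: (T, S) binary speaker-activity
    matrices with speakers pre-aligned. Per frame, the error
    miss + false alarm + confusion equals max(n_ref, n_hyp) - n_correct;
    DER normalizes the total error by total reference speech."""
    n_ref = ref.sum(axis=1)                           # active ref speakers per frame
    n_hyp = hyp.sum(axis=1)                           # active hyp speakers per frame
    n_correct = np.logical_and(ref, hyp).sum(axis=1)  # correctly attributed speech
    errors = np.maximum(n_ref, n_hyp) - n_correct
    return errors.sum() / n_ref.sum()
```

Standard scoring (e.g., NIST md-eval) additionally applies a forgiveness collar around reference boundaries and finds the best reference-to-hypothesis speaker mapping before counting errors.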

Implications and Future Directions

EDA offers significant advantages in speaker diarization, with implications for applications including automatic speech recognition (ASR) in multi-talker environments like meetings, telephone conversations, and media content analysis. By advancing flexibility in neural network architectures, this approach promises notable improvements in broader ASR systems' accuracy, especially where overlapping speakers complicate conventional models.

Looking forward, continued exploration into optimizing the EDA framework could enhance its scalability and efficiency, including potential fusion with spatial audio features and multisensory inputs. These avenues hold promise in refining speaker separation in increasingly complex auditory environments, moving towards fully adaptive and real-time processing capabilities without sacrificing computational efficiency.

In summary, this approach represents a meaningful contribution to speaker diarization methodologies, suggesting pathways toward overcoming current limitations and improving ASR performance in real-world scenarios.