End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
This paper addresses a key limitation of existing end-to-end diarization systems: inflexibility in the number of speakers they can handle. End-to-end neural speaker diarization (EEND) has surpassed traditional clustering-based systems in diarization error rate, but its fixed output architecture restricts it to a predetermined maximum number of speakers. This is a significant constraint whenever the actual speaker count exceeds that limit.
Proposed Methodology
This paper introduces encoder-decoder based attractor calculation (EDA) to remove this limitation. EDA computes a flexible number of speaker attractors from a sequence of speech embeddings: an LSTM encoder summarizes the embedding sequence, and an LSTM decoder then emits attractors one at a time, each accompanied by an existence probability that determines when to stop. Because the decoder can in principle emit an arbitrary number of attractors, the method adapts to the input without prior knowledge of the speaker count; frame-level speaker activities are then predicted by comparing each frame embedding against the attractors.
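The attractor loop can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the class name, hidden size, the cap of 10 speakers, and the 0.5 stopping threshold are assumptions for the sketch (the paper's threshold is likewise 0.5 at inference, but other hyperparameters here are placeholders).

```python
import torch
import torch.nn as nn

class EncoderDecoderAttractor(nn.Module):
    """Sketch of EDA: an LSTM encoder summarizes the frame embeddings,
    then an LSTM decoder, fed zero vectors, emits one attractor per step
    until its existence probability drops below a threshold."""

    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.existence = nn.Linear(dim, 1)  # attractor-existence logit

    def forward(self, emb, max_speakers=10, threshold=0.5):
        # emb: (1, T, dim) sequence of frame embeddings
        _, state = self.encoder(emb)          # final (h, c) summarizes the input
        zeros = emb.new_zeros(1, 1, emb.size(-1))
        attractors, probs = [], []
        for _ in range(max_speakers):         # unbounded in theory; capped here
            out, state = self.decoder(zeros, state)
            p = torch.sigmoid(self.existence(out)).item()
            if p < threshold:                 # decoder signals: no more speakers
                break
            attractors.append(out.reshape(-1))
            probs.append(p)
        if attractors:
            return torch.stack(attractors), probs
        return emb.new_zeros(0, emb.size(-1)), probs
```

During training, the existence probabilities receive a binary cross-entropy loss so the decoder learns when to stop; at inference, per-frame activities can then be obtained as `torch.sigmoid(emb @ attractors.T)`.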
Training uses permutation invariant training (PIT): because the order of speakers in the reference labels is arbitrary, the loss is evaluated under every permutation of the speaker labels and the minimum is taken. This permutation-free objective is essential for handling overlapping speech and a flexible number of speakers.
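A minimal sketch of the PIT objective, using binary cross-entropy over speaker activities; the function name and the brute-force enumeration over permutations are illustrative (the paper additionally includes an attractor-existence loss not shown here):

```python
import itertools
import numpy as np

def pit_bce_loss(pred, label):
    """Permutation-invariant BCE: score the predictions against every
    speaker permutation of the labels and keep the minimum loss.
    pred, label: (T, S) arrays of activity probabilities / 0-1 labels."""
    _, S = pred.shape
    eps = 1e-7
    best = np.inf
    for perm in itertools.permutations(range(S)):
        p = pred[:, list(perm)]               # reorder predicted speakers
        bce = -(label * np.log(p + eps)
                + (1 - label) * np.log(1 - p + eps)).mean()
        best = min(best, bce)
    return best
```

Brute-force enumeration costs S! permutations, which is tolerable for the small speaker counts typical of diarization.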
Evaluation and Results
The effectiveness of this approach is demonstrated via rigorous testing on simulated mixtures and real recordings.
- Simulated Data: Under controlled two-speaker conditions, EDA achieved a diarization error rate (DER) of 2.69%, versus 4.56% for the self-attentive EEND baseline, a relative DER reduction of roughly 41%. On mixtures with an unknown, variable number of speakers, DER improved from 19.43% (x-vector clustering) to 15.29% (the EDA-based approach).
- Real-World Data: On real telephone recordings such as the CALLHOME dataset, EDA remained robust, outperforming conventional clustering-based methods.
Implications and Future Directions
EDA offers significant advantages in speaker diarization, with implications for applications including automatic speech recognition (ASR) in multi-talker environments like meetings, telephone conversations, and media content analysis. By advancing flexibility in neural network architectures, this approach promises notable improvements in broader ASR systems' accuracy, especially where overlapping speakers complicate conventional models.
Looking forward, continued work on the EDA framework could improve its scalability and efficiency, including potential fusion with spatial audio features and multimodal inputs. These avenues hold promise for refining speaker separation in increasingly complex acoustic environments, moving toward fully adaptive, real-time processing without sacrificing computational efficiency.
In summary, this approach represents a meaningful contribution to speaker diarization methodologies, suggesting pathways toward overcoming current limitations and improving ASR performance in real-world scenarios.