Online End-to-End Neural Diarization with Speaker-Tracing Buffer (2006.02616v2)

Published 4 Jun 2020 in eess.AS and cs.SD

Abstract: This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently suffers from a speaker permutation problem, since speaker labels can be assigned inconsistently across chunks of the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames carrying the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames of the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we trained SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. With online SA-EEND and variable chunk-size training, the method achieved DERs of 12.54% on CALLHOME and 20.77% on CSJ with 1.4 s of actual latency.
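
The core of the method is the interplay between the buffer and the permutation check. Below is a minimal sketch of that loop, assuming a hypothetical `saeend` callable that maps a sequence of acoustic frames to per-frame, per-speaker activity probabilities; the frame-selection heuristic (simple recency here) and the dot-product similarity are stand-ins for the paper's actual selection and correlation criteria.

```python
import itertools
import numpy as np

def stb_online_diarization(chunks, saeend, n_speakers=2, buffer_size=100):
    """Online diarization with a speaker-tracing buffer (illustrative sketch).

    chunks:  iterable of (T_i, F) feature arrays, one per incoming chunk
    saeend:  hypothetical model call, (T, F) -> (T, n_speakers) activity probs
    """
    buf_feats = None  # buffered input frames from previous chunks
    buf_probs = None  # the activities previously decided for those frames
    outputs = []

    for chunk in chunks:
        # Stack buffered frames in front of the current chunk.
        x = chunk if buf_feats is None else np.concatenate([buf_feats, chunk])
        y = saeend(x)  # shape: (len(x), n_speakers)

        if buf_feats is not None:
            # Resolve the permutation ambiguity: choose the speaker ordering
            # whose outputs on the buffered frames agree best with the
            # activities already stored in the buffer.
            n_buf = len(buf_feats)
            best_perm = max(
                itertools.permutations(range(n_speakers)),
                key=lambda p: float(np.sum(buf_probs * y[:n_buf, p])),
            )
            y = y[:, best_perm]
            outputs.append(y[n_buf:])  # emit only the new chunk's frames
        else:
            outputs.append(y)

        # Refresh the buffer. The paper selects frames that carry speaker
        # permutation information; a simple recency heuristic stands in here.
        start = max(0, len(x) - buffer_size)
        buf_feats, buf_probs = x[start:], y[start:]

    return np.concatenate(outputs)
```

For the two-speaker case there are only two possible label orderings, so the brute-force search over permutations is cheap; for larger speaker counts a linear-assignment solver (e.g., the Hungarian algorithm) would replace the exhaustive loop.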

Authors (5)
  1. Yawen Xue (10 papers)
  2. Shota Horiguchi (45 papers)
  3. Yusuke Fujita (37 papers)
  4. Shinji Watanabe (416 papers)
  5. Kenji Nagamatsu (19 papers)
Citations (44)
