Low-Latency Speaker-Independent Continuous Speech Separation (1904.06478v1)

Published 13 Apr 2019 in eess.AS, cs.CL, and cs.SD

Abstract: Speaker-independent continuous speech separation (SI-CSS) is the task of converting a continuous audio stream, which may contain overlapping voices of unknown speakers, into a fixed number of continuous signals, each of which contains no overlapping speech segments. A separated, or cleaned, version of each utterance is generated nondeterministically on one of the SI-CSS output channels, without being split up and distributed across multiple channels. A typical application scenario is transcribing multi-party conversations, such as meetings, recorded with microphone arrays. The output signals can be sent directly to a speech recognition engine because they contain no speech overlaps. The previous SI-CSS method uses a neural network trained with permutation invariant training and a data-driven beamformer, and thus incurs substantial processing latency. This paper proposes a low-latency SI-CSS method whose performance is comparable to that of the previous method in a microphone-array-based meeting transcription task. This is achieved (1) by using a new speech separation network architecture combined with a double buffering scheme and (2) by performing enhancement with a set of fixed beamformers followed by a neural post-filter.
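The two ingredients above, block-wise processing with double buffering and a bank of fixed beamformers followed by a neural post-filter, can be illustrated with a short sketch. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the block and hop sizes, the number of fixed beams, the random beamformer weights, and the placeholder separation network and post-filter are all hypothetical stand-ins.

```python
# Minimal sketch of a block-wise (double-buffered) SI-CSS pipeline.
# All sizes, weights, and the dummy "networks" below are illustrative assumptions.
import numpy as np

SAMPLE_RATE = 16000
BLOCK = int(0.8 * SAMPLE_RATE)   # hypothetical processing block (0.8 s)
HOP = BLOCK // 2                 # 50% overlap: two half-overlapping buffers
NUM_BEAMS = 8                    # hypothetical number of fixed beamformers
NUM_OUTPUTS = 2                  # fixed number of overlap-free output channels

def fixed_beamformers(block_multichannel):
    """Apply a bank of precomputed, data-independent beamformers.
    Stand-in: deterministic random weighted sums instead of real steering vectors."""
    num_mics = block_multichannel.shape[0]
    rng = np.random.default_rng(0)  # fixed weights, identical on every call
    weights = rng.standard_normal((NUM_BEAMS, num_mics))
    weights /= np.linalg.norm(weights, axis=1, keepdims=True)
    return weights @ block_multichannel  # shape (NUM_BEAMS, BLOCK)

def separation_network(beamformed_block):
    """Placeholder for the trained separation network that maps the beamformed
    signals to NUM_OUTPUTS streams (or masks)."""
    mono = beamformed_block.mean(axis=0)
    return np.stack([mono / NUM_OUTPUTS] * NUM_OUTPUTS)

def neural_post_filter(separated_block):
    """Placeholder for the neural post-filter that enhances each stream."""
    return separated_block  # identity stand-in

def continuous_separation(multichannel_audio):
    """Stream the input in half-overlapping blocks and cross-fade the overlap,
    so each output channel stays continuous (the double-buffering idea)."""
    num_samples = multichannel_audio.shape[1]
    outputs = np.zeros((NUM_OUTPUTS, num_samples))
    window = np.hanning(BLOCK)  # cross-fade window for overlap-add
    for start in range(0, num_samples - BLOCK + 1, HOP):
        block = multichannel_audio[:, start:start + BLOCK]
        beams = fixed_beamformers(block)
        streams = neural_post_filter(separation_network(beams))
        outputs[:, start:start + BLOCK] += streams * window
    return outputs  # NUM_OUTPUTS overlap-free channels

if __name__ == "__main__":
    mics = np.random.randn(7, 10 * SAMPLE_RATE)  # fake 7-mic, 10-second recording
    print(continuous_separation(mics).shape)     # (2, 160000)
```

In a real system the block-wise outputs must also be stitched with a consistent channel ordering from one block to the next so that each stream remains continuous; the identity placeholders above sidestep that issue by construction.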

Authors (6)
  1. Takuya Yoshioka (77 papers)
  2. Zhuo Chen (319 papers)
  3. Changliang Liu (7 papers)
  4. Xiong Xiao (35 papers)
  5. Hakan Erdogan (32 papers)
  6. Dimitrios Dimitriadis (32 papers)
Citations (28)