Continuous Speech Separation: Dataset and Analysis
This paper contributes to the field of speech separation by addressing a critical gap in existing research: the need for evaluation methods suited for continuous audio streams, rather than pre-segmented utterances. The authors introduce the concept of Continuous Speech Separation (CSS), which aims to separate overlapping speech signals from continuous audio streams, distinguishing it from traditional fully overlapped speech separation tasks. By proposing the CSS task and presenting the LibriCSS dataset along with associated evaluation protocols, the research provides a robust framework for advancing speech separation techniques applicable to real-world scenarios.
Motivation and Challenges
Overlapping speech is a common occurrence in natural interactions, posing significant challenges for Automatic Speech Recognition (ASR) and speaker diarization technologies. Traditional speech separation methods focus on fully overlapped segments and are typically evaluated with signal-based metrics such as Signal-to-Distortion Ratio (SDR), which correlate poorly with ASR accuracy. The CSS task proposed in this paper instead takes a continuous audio stream, containing both overlapped and overlap-free regions, and converts it into a fixed set of overlap-free signals, with the goal of improving downstream ASR performance.
The inherent indeterminacy of the source ordering, known as the permutation problem, complicates supervised model training in this domain. Methods such as Deep Clustering (DC) and Permutation Invariant Training (PIT) address this challenge by making the training objective invariant to the ordering of the output speakers. However, real-world application requires evaluating these algorithms beyond mere signal-based improvements, focusing on practical ASR outcomes.
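The PIT idea above can be illustrated with a toy loss: score the estimates against the references under every possible speaker assignment and keep the best one. This is a minimal sketch with an invented function name; real systems typically use SI-SNR or spectrogram-domain losses rather than plain MSE.

```python
import itertools
import numpy as np

def pit_mse(est, ref):
    """Permutation-invariant MSE (illustrative sketch).

    est, ref: arrays of shape (speakers, samples). Tries every
    assignment of estimated channels to reference speakers and
    returns the error of the best pairing, so the loss does not
    depend on the order in which the model emits the speakers.
    """
    n = est.shape[0]
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        err = np.mean((est[list(perm)] - ref) ** 2)
        best = min(best, err)
    return best

# Two-speaker toy: the estimates come back in swapped order,
# yet the PIT loss still finds the correct pairing (loss 0).
ref = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
est = ref[::-1].copy()
```

The factorial number of permutations is affordable for the two- or three-speaker mixtures considered here; for many speakers, practical systems resort to greedy or Hungarian-algorithm assignment instead.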
LibriCSS Dataset
The LibriCSS dataset is a novel contribution designed to simulate real conversational environments while providing a controlled evaluation setting for CSS methods. Derived from the LibriSpeech corpus, it comprises multi-channel recordings of concatenated utterances replayed in a meeting-like environment. Key features include:
- Real-room acoustics, counteracting the oversimplification seen in simulated datasets.
- Varied overlap ratios (from 0% to 40%) to test algorithm performance under different conditions.
- Continuous recording to evaluate CSS, including ground-truth segmentation for detailed analysis.
This dataset allows for both utterance-wise and continuous input evaluation, aligning testing scenarios with realistic applications, thus serving as a practical bridge between academic research and deployment needs.
Evaluation and Methodology
The paper establishes a Kaldi-based ASR evaluation protocol using a multi-conditioned acoustic model to measure CSS performance. Two configurations are employed to assess the algorithms comprehensively: utterance-wise evaluation on pre-segmented input, and continuous input evaluation on unsegmented long-form audio.
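Because the protocol scores separation by its effect on recognition, the metric of interest is word error rate (WER) rather than SDR. A minimal WER computation via Levenshtein distance over word tokens looks like the following; actual scoring uses Kaldi's tooling with text normalization.

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, computed with a
    standard dynamic-programming edit distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[-1][-1] / len(r)
```

One substituted word in a three-word reference gives a WER of 1/3; note that insertions can push WER above 100%.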
Results from the experiments reveal significant ASR improvements from a seven-channel mask-based beamforming approach over traditional single-channel systems, especially in heavily overlapped conditions. However, the paper also highlights residual distortion introduced in non-overlapping regions, particularly when fewer microphones are used, exposing areas for further research.
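Mask-based beamforming works by using time-frequency masks to estimate speech and noise spatial covariance matrices, from which an MVDR filter is derived per frequency. The single-frequency sketch below is my own simplification of this general recipe and is not the paper's exact system.

```python
import numpy as np

def mask_mvdr(Y, speech_mask, noise_mask, ref_mic=0):
    """Mask-driven MVDR beamformer for one frequency bin (sketch).

    Y: complex STFT coefficients, shape (mics, frames).
    speech_mask, noise_mask: per-frame weights in [0, 1], e.g.
    produced by a separation network.
    Returns the enhanced single-channel frame series.
    """
    # Mask-weighted spatial covariance estimates.
    phi_s = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()
    phi_n = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()
    phi_n += 1e-6 * np.eye(Y.shape[0])   # diagonal loading for stability

    # MVDR filter via the trace formulation:
    # w = (phi_n^{-1} phi_s / trace(phi_n^{-1} phi_s)) e_ref
    num = np.linalg.solve(phi_n, phi_s)
    w = num[:, ref_mic] / np.trace(num)
    return w.conj() @ Y
```

The filter is distortionless toward the target as seen at the reference microphone, which is exactly why over-suppression in non-overlapping regions, where the "noise" covariance is poorly defined, remains a failure mode worth studying.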
Implications and Future Directions
This research sets a precedent for future speech separation work by emphasizing the need for datasets and evaluation methods that reflect real acoustic conditions and conversational dynamics. The implications extend beyond achieving higher SDR scores, stressing the importance of reducing distortion in non-overlapping segments and improving ASR integration.
Going forward, the paper suggests exploring more efficient multi-microphone configurations, model architectures that inherently handle speaker crossover between output streams, and deeper integration of these methods within ASR frameworks. As continuous audio processing becomes increasingly critical in AI-driven communication systems, advances in CSS will play a pivotal role in enhancing human-computer interaction. The public availability of LibriCSS and its evaluation scripts gives researchers the tools to push speech separation toward practical, scalable applications.