Wavesplit: End-to-End Speech Separation by Speaker Clustering (2002.08933v2)

Published 20 Feb 2020 in eess.AS, cs.CL, cs.LG, cs.SD, and stat.ML

Abstract: We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the model infers a representation for each source and then estimates each source signal given the inferred representations. The model is trained to jointly perform both tasks from the raw waveform. Wavesplit infers a set of source representations via clustering, which addresses the fundamental permutation problem of separation. For speech separation, our sequence-wide speaker representations provide a more robust separation of long, challenging recordings compared to prior work. Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2/3mix), as well as in noisy and reverberated settings (WHAM/WHAMR). We also set a new benchmark on the recent LibriMix dataset. Finally, we show that Wavesplit is also applicable to other domains, by separating fetal and maternal heart rates from a single abdominal electrocardiogram.

Citations (248)

Summary

  • The paper introduces Wavesplit, an end-to-end framework that separates concurrent speech signals using speaker clustering to address permutation ambiguity.
  • It employs a dual-stage convolutional architecture combining a speaker stack with exponential dilation and a separation stack with FiLM conditioning.
  • The method achieves state-of-the-art SI-SDR and SDR improvements on benchmarks like WSJ0-2mix and WSJ0-3mix, demonstrating robust performance in noisy conditions.

Wavesplit: End-to-End Speech Separation by Speaker Clustering

The paper introduces Wavesplit, an end-to-end speech separation model that uses speaker clustering to perform source separation directly on the raw waveform. Wavesplit addresses the fundamental challenge of separating multiple concurrent speech signals from a single mixed waveform, a problem that is especially difficult when the sources belong to the same class (e.g., several speakers), owing to the inherent permutation ambiguity.

Methodology Overview

Wavesplit employs a dual-stage convolutional neural network architecture composed of a speaker stack and a separation stack. The speaker stack generates, for each speaker present in the mixture, a sequence of vector representations with the same length as the input; these per-frame vectors are then clustered and aggregated into one centroid per speaker. The separation stack uses these centroids to conditionally reconstruct the isolated audio streams. The entire pipeline, from raw waveform input through clustering to output generation, is trained jointly, and the clustering mechanism is what resolves the permutation problem.
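To make the two-stage pipeline concrete, here is a minimal PyTorch sketch. The module names, layer counts, and channel widths are illustrative assumptions, not the paper's architecture; mean pooling over time stands in for the clustering step, and simple concatenation stands in for the FiLM conditioning discussed below.

```python
# Minimal sketch of the Wavesplit-style two-stage pipeline (illustrative only).
import torch
import torch.nn as nn

class WavesplitSketch(nn.Module):
    def __init__(self, n_speakers=2, channels=128, spk_dim=128):
        super().__init__()
        self.n_speakers = n_speakers
        # Speaker stack: maps the raw waveform to one embedding per speaker per frame.
        self.speaker_stack = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, n_speakers * spk_dim, kernel_size=3, padding=1),
        )
        # Separation stack: reconstructs each source conditioned on the speaker centroids.
        self.separation_stack = nn.Sequential(
            nn.Conv1d(1 + n_speakers * spk_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, n_speakers, kernel_size=3, padding=1),
        )

    def forward(self, mixture):                      # mixture: (batch, 1, time)
        b, _, t = mixture.shape
        # Per-frame speaker vectors: (batch, n_speakers, spk_dim, time).
        h = self.speaker_stack(mixture).view(b, self.n_speakers, -1, t)
        # Aggregate over time into one centroid per speaker (stand-in for clustering).
        centroids = h.mean(dim=-1)                   # (batch, n_speakers, spk_dim)
        # Broadcast centroids along time and condition the separation stack on them.
        cond = centroids.reshape(b, -1, 1).expand(-1, -1, t)
        estimates = self.separation_stack(torch.cat([mixture, cond], dim=1))
        return estimates                             # (batch, n_speakers, time)
```

Because the centroids summarize the whole sequence, the same speaker identity conditions every frame of the reconstruction, which is the property that keeps the output-to-speaker assignment stable over long recordings.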

The architecture relies on residual convolutional networks: the speaker stack uses exponentially growing dilation to capture long temporal context efficiently, and the separation stack is conditioned on the speaker centroids via Feature-wise Linear Modulation (FiLM). Wavesplit thereby differs from conventional Permutation Invariant Training (PIT) models and from deep clustering by resolving the permutation ambiguity at the level of speaker representations rather than at the level of output signals, which requires fewer permutations to be evaluated during training and yields predictions that remain consistent across varying input lengths.
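The sketch below illustrates one dilated residual block with FiLM conditioning in the spirit of the separation stack. The kernel size, absence of normalization, and the exact placement of the modulation inside the block are assumptions for illustration; only the exponential dilation pattern and the feature-wise affine modulation follow the description above.

```python
# One dilated residual block with FiLM conditioning (illustrative sketch).
import torch
import torch.nn as nn

class FiLMResidualBlock(nn.Module):
    def __init__(self, channels, cond_dim, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)
        # FiLM generator: maps a speaker centroid to per-channel scale and shift.
        self.film = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, centroid):           # x: (B, C, T), centroid: (B, cond_dim)
        gamma, beta = self.film(centroid).chunk(2, dim=-1)
        h = self.conv(x)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)   # feature-wise modulation
        return x + torch.relu(h)              # residual connection

# Stacking blocks with exponentially growing dilation (1, 2, 4, ...) widens the
# receptive field geometrically while keeping the number of layers small.
blocks = nn.ModuleList(
    FiLMResidualBlock(channels=128, cond_dim=128, dilation=2 ** i) for i in range(8)
)
```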

Empirical Evaluation

The model redefines state-of-the-art performance across several speech separation benchmarks, notably WSJ0-2mix and WSJ0-3mix, with superior SI-SDR and SDR improvements over previous methods. It excels in particular on longer sequences in which speaker dominance changes over time, a regime where PIT-based systems tend to falter, underscoring the robustness of the clustering approach used in Wavesplit.
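For reference, the headline metric can be computed as follows. This is the standard scale-invariant SDR definition (Le Roux et al., 2019), not code from the paper; the reported "improvement" (SI-SDRi) is this value minus the SI-SDR of the unprocessed mixture against the same reference.

```python
# Scale-invariant SDR (SI-SDR) between an estimated and a reference signal.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the "target" component.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    # Ratio of target energy to residual energy, in decibels.
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```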

Additionally, when tested on noisy and reverberated datasets such as WHAM and WHAMR, Wavesplit maintains its superior performance, illustrating its applicability under less controlled conditions. The approach also generalizes across domains: applied to a single abdominal electrocardiogram, it separates fetal and maternal heart signals, showing practical potential beyond speech processing.

Implications and Future Directions

The implications of this research are significant both in theory and practice. It suggests that incorporating speaker identity information during training, even without requiring explicit test-time speaker identity, can enhance source separation efficacy. In situations where source consistency is crucial, such as meeting transcription or other multi-speaker transcription services, this approach offers substantial improvements over existing methods. Future research could extend the model to more varied and complex auditory environments, and investigate its applicability to other domains that require signal separation, including medical signal processing, telecommunications, and astronomy.

Aggregating per-frame speaker vectors into sequence-wide representations limits the permutation search space and yields consistent model behavior, which could inspire new research addressing similar permutation challenges in other domains. Consequently, Wavesplit not only advances the state of the art in source separation but also broadens the potential applications of neural network-based models in diverse multi-source environments.
