Voice Separation with an Unknown Number of Multiple Speakers (2003.01531v4)

Published 29 Feb 2020 in eess.AS, cs.LG, cs.SD, and stat.ML

Abstract: We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.

Citations (168)

Summary

  • The paper introduces a novel model using bi-directional gated RNNs with MulCat blocks to separate voices from mixed audio.
  • It employs a permutation invariant loss and a compound reconstruction error to achieve significant SI-SNR improvements, notably 20.1 dB for two speakers.
  • The method demonstrates robustness in noisy and reverberated environments, offering practical enhancements for telecommunication and voice-activated applications.

Voice Separation with an Unknown Number of Multiple Speakers: An Expert Overview

The paper "Voice Separation with an Unknown Number of Multiple Speakers" by Eliya Nachmani, Yossi Adi, and Lior Wolf addresses the challenging problem of separating mixed audio signals with an unspecified number of concurrent speakers. This work introduces a novel approach leveraging gated recurrent neural networks (RNNs) to achieve state-of-the-art performance in the domain of source separation.

Core Contributions and Methodology

The authors present a model architecture that departs from traditional masking techniques. Instead, they introduce a method based on bi-directional RNNs with a specific type of residual connection, referred to as the MulCat block. Each block consists of two RNN heads that operate in parallel on the same input; their outputs are combined by element-wise multiplication and then concatenated with the block input. This configuration allows the model to process both short-term and long-term audio features within the same framework.
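
The following is a minimal PyTorch sketch of that multiply-and-concatenate idea. The layer sizes, projection layout, and class name are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MulCatBlock(nn.Module):
    """Sketch of a MulCat-style block: two parallel bi-LSTM heads whose
    outputs are multiplied element-wise and then concatenated with the
    block input before a linear projection back to the input size."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Two parallel bi-directional RNN heads applied to the same input.
        self.rnn_a = nn.LSTM(input_size, hidden_size, batch_first=True,
                             bidirectional=True)
        self.rnn_b = nn.LSTM(input_size, hidden_size, batch_first=True,
                             bidirectional=True)
        # Project [gated RNN output ; original input] back to input_size.
        self.proj = nn.Linear(2 * hidden_size + input_size, input_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_size)
        out_a, _ = self.rnn_a(x)               # (batch, time, 2 * hidden_size)
        out_b, _ = self.rnn_b(x)               # (batch, time, 2 * hidden_size)
        gated = out_a * out_b                  # element-wise multiplicative gating
        fused = torch.cat([gated, x], dim=-1)  # residual-style concatenation
        return self.proj(fused)
```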

Several key aspects distinguish this paper:

  • A multi-layer architecture that evaluates reconstruction error after each RNN layer, utilizing a compound loss to improve separation quality.
  • The incorporation of a permutation invariant loss to handle the inherent uncertainty in speaker identity during separation (a minimal sketch of this loss appears after this list).
  • The introduction of a novel loss that leverages a voice representation network, aiming to maintain speaker consistency throughout the separation process.
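
To make the permutation invariant loss concrete, here is a minimal PyTorch sketch of a uPIT-style objective: the per-speaker loss is evaluated under every assignment of output channels to targets, and the best assignment is kept. The function name and the generic pair_loss argument are assumptions for illustration, not the paper's API.

```python
import itertools
import torch

def permutation_invariant_loss(estimates: torch.Tensor,
                               targets: torch.Tensor,
                               pair_loss) -> torch.Tensor:
    """Evaluate pair_loss under every speaker permutation and keep the best.

    estimates, targets: (batch, num_speakers, time)
    pair_loss: callable mapping two (batch, time) tensors to a (batch,) loss.
    """
    num_spk = estimates.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_spk)):
        # Average the per-speaker loss for this assignment of outputs to targets.
        perm_loss = torch.stack(
            [pair_loss(estimates[:, i], targets[:, j])
             for i, j in enumerate(perm)]
        ).mean(dim=0)
        losses.append(perm_loss)
    # Minimum over permutations, then mean over the batch.
    return torch.stack(losses, dim=-1).min(dim=-1).values.mean()
```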

Numerical Results and Evaluation

The numerical outcomes of this research reveal significant improvements over baseline methods. The authors report that their method greatly surpasses existing models, particularly under conditions where the number of speakers exceeds two. For instance, their approach achieves an SI-SNR improvement (SI-SNRi) of 20.1 dB for two speakers, 16.9 dB for three speakers, and 12.9 dB for four speakers, consistently outperforming all prior art.
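
For reference, SI-SNR is the scale-invariant signal-to-noise ratio, and SI-SNRi is its improvement over the unprocessed mixture. The sketch below shows the standard computation (not code from the paper); negated, si_snr can also serve as the pair_loss in the permutation invariant objective above.

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, time) signals."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to remove scale differences.
    scale = (estimate * target).sum(-1, keepdim=True) / \
            (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = projection.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)

def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```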

Additionally, the method is evaluated under noisy and reverberant conditions using the WHAM! and WHAMR! datasets, underscoring its robustness and adaptability.

Implications and Future Prospects

This work has profound implications for both practical applications and theoretical advancements in single-channel source separation. The potential applications range from improving telecommunication systems to enhancing voice-activated devices that need to function in crowded auditory environments.

Theoretically, the introduction of the MulCat RNN block, coupled with the use of embedding-based identity loss, may serve as a foundation for future algorithms addressing other forms of sequential data separation. Moreover, the non-learning-based strategy for selecting the number of speakers demonstrates methodological simplicity while maintaining high accuracy, suggesting a potential for broader applicability across tasks where model adaptability is crucial.
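
That selection strategy amounts to running the model trained for the largest number of speakers and counting the output channels that are not silent. A minimal sketch under that description follows; the energy threshold and function name are assumptions for illustration, not values from the paper.

```python
import torch

def estimate_num_speakers(separated: torch.Tensor,
                          rel_threshold_db: float = -20.0) -> int:
    """Count output channels whose energy lies within rel_threshold_db of the
    loudest channel; quieter channels are treated as silent.

    separated: (num_channels, time) output of the largest-capacity model.
    """
    energies_db = 10 * torch.log10(separated.pow(2).mean(dim=-1) + 1e-8)
    active = energies_db >= energies_db.max() + rel_threshold_db
    return int(active.sum().item())
```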

Speculation on AI Developments

Based on the achievements reported in this paper, future advancements in AI-driven audio analysis may revolve around integrating similar modular and flexible architectures. The reliance on task-specific embeddings, like those employed here for speaker consistency, could extend to other domains requiring robust identity preservation amidst complex interference.

In conclusion, this work represents a significant advancement in audio source separation, characterized by a novel architectural approach that effectively handles the complexity of separating an unknown number of concurrent speakers. As research progresses, these methods might inspire broader efforts in fields requiring scalable and adaptable neural architectures.