Two-Step Sound Source Separation: Training on Learned Latent Targets (1910.09804v2)

Published 22 Oct 2019 in cs.LG, cs.CL, cs.SD, eess.AS, and stat.ML

Abstract: In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. For the second step, we train a separation module that operates on the previously learned space. In order to do so, we also make use of a scale-invariant signal-to-distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bounds the SI-SDR in the time domain. We run various sound separation experiments that show how this approach can obtain better performance as compared to systems that learn the transform and the separation module jointly. The proposed methodology is general enough to be applicable to a large class of neural network end-to-end separation systems.

Citations (63)

Summary

  • The paper presents a two-step method that decouples latent space transformation learning from the separation task, offering clear performance benefits.
  • It employs a scale-invariant signal-to-distortion ratio (SI-SDR) loss in the latent domain and proves that this loss lower-bounds time-domain SI-SDR, yielding both a theoretical guarantee and enhanced separation quality.
  • Experimental results on WSJ0 and ESC50 datasets show SI-SDR improvements ranging from 0.5 to 0.7 dB compared to conventional end-to-end training.

Two-step Sound Source Separation: Training on Learned Latent Targets

The paper by Efthymios Tzinis and colleagues proposes a two-step training approach for sound source separation with deep neural networks. The separation process is split into two distinct phases: first learning a transformation to a latent space, then optimizing the separation module within that space. The core idea is to improve separation by decoupling the transformation learning from the separation task, in contrast with traditional end-to-end training, which optimizes the transform and the separation network jointly in the time domain.

Methodological Framework

The framework starts by learning a transformation to a latent space in which oracle masking-based separation is optimal. This stage leverages a scale-invariant signal-to-distortion ratio (SI-SDR) loss function, a standard measure of separation performance. One of the paper's significant contributions is a proof that the SI-SDR loss applied in the latent space lower-bounds SI-SDR in the time domain: maximizing the latent-space objective therefore raises a guaranteed floor on time-domain performance, which underpins training the separation module on latent targets.
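For reference, the standard time-domain SI-SDR between a reference signal x and an estimate x̂ takes the form below (a sketch of the usual definition; the paper applies the same scale-invariant form to latent representations, and the precise latent-domain expression is given there):

```latex
% Standard SI-SDR: project the estimate onto the reference to remove
% any scaling ambiguity, then measure the energy ratio in dB.
\[
  \alpha = \frac{\hat{x}^{\top} x}{\lVert x \rVert^{2}}, \qquad
  \operatorname{SI\text{-}SDR}(\hat{x}, x)
    = 10 \log_{10} \frac{\lVert \alpha x \rVert^{2}}
                        {\lVert \hat{x} - \alpha x \rVert^{2}}
\]
```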

The latent space learning process centers on identifying transformations that enable straightforward separation via masking. The encoder and decoder modules are therefore pre-trained independently of the separator, learning to map mixtures and clean sources into latent representations. In the second step, the separation module is trained to estimate the latent representations of the clean sources, using a permutation-invariant SI-SDR loss in the latent domain; a sketch of such a loss follows.
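Below is a minimal PyTorch sketch of a two-source permutation-invariant SI-SDR loss that can be applied to latent targets, assuming latent codes are flattened to shape (batch, dim); the function names and the explicit two-permutation enumeration are illustrative assumptions, not taken from the authors' code:

```python
import torch

def neg_si_sdr(estimate, target, eps=1e-8):
    """Per-example negative SI-SDR; inputs have shape (batch, dim).
    Works on time-domain signals or flattened latent representations."""
    # Scale-invariant projection of the estimate onto the target.
    alpha = (estimate * target).sum(-1, keepdim=True) / (
        (target ** 2).sum(-1, keepdim=True) + eps)
    scaled_target = alpha * target
    noise = estimate - scaled_target
    ratio = (scaled_target ** 2).sum(-1) / ((noise ** 2).sum(-1) + eps)
    return -10 * torch.log10(ratio + eps)

def pit_neg_si_sdr(estimates, targets):
    """Two-source permutation-invariant loss: score both source orderings
    per example and keep the better one. Shapes: (batch, 2, dim)."""
    straight = neg_si_sdr(estimates[:, 0], targets[:, 0]) \
             + neg_si_sdr(estimates[:, 1], targets[:, 1])
    swapped = neg_si_sdr(estimates[:, 0], targets[:, 1]) \
            + neg_si_sdr(estimates[:, 1], targets[:, 0])
    return torch.minimum(straight, swapped).mean()
```

With more than two sources, the minimum is taken over all source permutations, typically by exhaustive enumeration when the source count is small.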

Experimental Validation and Results

The experimental evaluation of the proposed method covers diverse sound source separation tasks: separating speech signals, non-speech environmental sounds, and mixtures containing both. The authors conduct assessments on the WSJ0 and ESC50 datasets, employing strong time-domain separation architectures such as TDCN and RTDCN.

Results indicate consistent performance improvements across the separation tasks when employing the two-step approach. Notably, the paper reports SI-SDR improvements of 0.5 to 0.7 dB over the baseline end-to-end time-domain training. Additionally, the learned latent spaces admit oracle-mask upper bounds on separation performance that surpass those achievable with traditional STFT-based masks, underlining the value of learning representations in custom latent spaces.

Implications and Future Directions

The proposed two-step approach demonstrates practical significance by offering tangible improvements in both computational efficiency and separation performance. The empirical results suggest that pre-training can yield more structured and sparse latent representations, enhancing separation quality without requiring modifications to existing architectures.

The theoretical proof provided assures researchers of the approach’s viability and positions it as a competitive alternative to traditional end-to-end solutions. Given its applicability to a wide array of mask-based architectures, the approach holds potential for integration into future audio-visual processing systems that necessitate precise source separation.

Potential future research directions entail refining the latent space SI-SDR loss and exploring its applicability to broader neural network structures tailored for audio analysis tasks. The findings of this paper pave the way for further advancements in source separation methodologies by emphasizing the importance of latent space learning for achieving optimal signal decomposition.
