- The paper presents a joint optimization of deep recurrent neural networks (DRNNs) and time-frequency masks that improves separation quality across several audio tasks.
- It explores multiple DRNN architectures and proposes a discriminative training criterion that reduces interference and raises the source-to-interference ratio (SIR).
- Experiments show consistent gains over NMF baselines in speech separation, singing voice separation, and speech denoising.
Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
The paper addresses monaural source separation with a framework that combines deep recurrent neural networks (DRNNs) and time-frequency masking, targeting speech separation, singing voice separation, and speech denoising. The key idea is to optimize the masking function and the network parameters jointly, rather than applying the mask as a separate post-processing step.
Methodology and Approach
The paper uses DRNNs to model the temporal dependencies in audio that are essential for sequential signals such as speech and music. Several architectural variants are explored: networks with a recurrent connection at a single hidden layer (placed at different depths), and a fully stacked recurrent network in which every hidden layer is recurrent.
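The following PyTorch sketch illustrates the two flavors of architecture described above. Layer sizes, the plain `nn.RNN` cells, and all names are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DRNN(nn.Module):
    """Recurrence at a single hidden layer, feedforward elsewhere."""
    def __init__(self, n_freq=513, n_hidden=1000):
        super().__init__()
        self.in_proj = nn.Sequential(nn.Linear(n_freq, n_hidden), nn.ReLU())
        self.recurrent = nn.RNN(n_hidden, n_hidden, batch_first=True)
        self.out_proj = nn.Linear(n_hidden, 2 * n_freq)  # two sources

    def forward(self, mix_mag):               # (batch, time, n_freq)
        h = self.in_proj(mix_mag)
        h, _ = self.recurrent(h)               # models temporal dependencies
        y1, y2 = self.out_proj(h).chunk(2, dim=-1)
        return torch.relu(y1), torch.relu(y2)  # nonnegative magnitude estimates

class StackedRNN(nn.Module):
    """The fully stacked variant: every hidden layer is recurrent."""
    def __init__(self, n_freq=513, n_hidden=1000, n_layers=3):
        super().__init__()
        self.rnn = nn.RNN(n_freq, n_hidden, num_layers=n_layers,
                          batch_first=True)
        self.out_proj = nn.Linear(n_hidden, 2 * n_freq)

    def forward(self, mix_mag):
        h, _ = self.rnn(mix_mag)
        y1, y2 = self.out_proj(h).chunk(2, dim=-1)
        return torch.relu(y1), torch.relu(y2)
```

In both variants the network emits two nonnegative magnitude estimates per frame, which feed into the masking layer described next.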
A time-frequency masking layer on top of the network enforces a structural constraint on the output: the two predicted magnitude spectra are normalized into a soft mask, so the separated components are guaranteed to sum to the input mixture. This constraint yields smoother and more accurate source reconstruction.
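A minimal NumPy sketch of such a soft (ratio) mask is shown below; the `eps` term guarding against division by zero is an implementation detail assumed here, not taken from the paper:

```python
import numpy as np

def soft_mask_separate(y1, y2, mix_mag, eps=1e-8):
    """Normalize the two network outputs into a soft time-frequency mask
    and apply it to the mixture magnitude spectrogram, so the two
    estimates sum exactly to the mixture."""
    m1 = np.abs(y1) / (np.abs(y1) + np.abs(y2) + eps)
    s1_hat = m1 * mix_mag          # estimated source 1
    s2_hat = (1.0 - m1) * mix_mag  # estimated source 2
    return s1_hat, s2_hat
```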
Additionally, the authors propose a discriminative training criterion that penalizes predictions resembling the competing source's target, reducing interference and further improving the source-to-interference ratio (SIR).
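Such a criterion amounts to a standard reconstruction loss minus a weighted cross-source penalty. The PyTorch function below is a hedged illustration of that structure; the weight `gamma=0.05` is an assumed setting, not necessarily the paper's value:

```python
import torch

def discriminative_loss(y1_hat, y2_hat, y1, y2, gamma=0.05):
    """Reconstruction error for each source, minus a gamma-weighted
    penalty for predictions that resemble the *other* source's target."""
    mse = lambda a, b: ((a - b) ** 2).mean()
    return (mse(y1_hat, y1) + mse(y2_hat, y2)
            - gamma * (mse(y1_hat, y2) + mse(y2_hat, y1)))
```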
Experimental Results
The paper evaluates the proposed methods on several tasks: the TSP corpus for speech separation, MIR-1K for singing voice separation, and TIMIT for speech denoising. Separation quality is reported with BSS Eval metrics: signal-to-distortion ratio (SDR), source-to-interference ratio (SIR), and sources-to-artifacts ratio (SAR); a sketch of how these metrics can be computed follows the list. The DRNN-based approach achieved the following:
- Speech Separation: Achieved 2.30–4.98 dB SDR gain over NMF models, with improvements in SIR and SAR, particularly for challenging scenarios such as separating speakers of the same gender.
- Singing Voice Separation: Yielded a 2.30–2.48 dB gain in global normalized SDR (GNSDR) over existing methods, with substantial global SIR (GSIR) improvements at comparable global SAR (GSAR).
- Speech Denoising: Outperformed NMF and DNN baselines, showing robustness across different signal-to-noise ratios and generalizing well even to unseen noise types.
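As a hypothetical illustration of how such metrics are obtained, the snippet below computes SDR, SIR, and SAR with the open-source mir_eval package; this is an assumed tooling choice for demonstration, and the random signals merely stand in for real audio:

```python
import numpy as np
import mir_eval  # pip install mir_eval

def evaluate(references, estimates):
    """Compute BSS Eval metrics for stacked time-domain signals of
    shape (n_sources, n_samples)."""
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
        np.asarray(references), np.asarray(estimates))
    return sdr, sir, sar

# Toy usage: estimates are the references plus a little noise.
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 16000))
ests = refs + 0.1 * rng.standard_normal((2, 16000))
print(evaluate(refs, ests))
```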
Implications and Future Directions
This work is practically relevant wherever only single-channel recordings are available, such as mobile communications and portable music players. Because the mask is trained jointly with the DRNN, the network learns outputs that behave well under masking, avoiding the mismatch that arises when a mask is applied only as a post-processing step on top of a separately trained model.
Future research could explore more sophisticated recurrent network structures like LSTMs to capture longer temporal dependencies or extend this framework to other signal processing applications. Furthermore, the discriminative training criterion offers a promising avenue for further enhancing separation quality through better handling of interference artifacts.
In conclusion, the paper contributes a coherent methodology for monaural source separation with DRNNs, backed by consistent gains across diverse datasets and tasks. Integrating the masking function directly into the network is a meaningful step toward more effective neural approaches to complex audio processing.