Supervised Speech Separation Based On Deep Learning: An Overview
Introduction
The paper "Supervised Speech Separation Based on Deep Learning: An Overview" by DeLiang Wang and Jitong Chen provides a comprehensive examination of the advancements in supervised speech separation facilitated by deep learning methods. Historically, speech separation was primarily addressed as a signal processing issue. However, recent developments have transitioned the focus to supervised learning approaches where deep neural networks (DNNs) have shown remarkable efficacy. The article traces the evolution of methods in this domain, elaborating on learning machines, training targets, and acoustic features, with detailed reviews of various separation algorithms and generalization issues inherent to supervised learning.
Components of Supervised Separation
Learning Machines
DNNs, owing to their hierarchical structure, effectively capture complex patterns in data that traditional models struggle with. The article reviews several DNN architectures employed in speech separation, including Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs). Each architecture offers distinct benefits; a minimal mask-estimation sketch follows the list:
- MLPs are straightforward to train but cannot capture temporal dependencies beyond a fixed window of input frames.
- CNNs exploit local structure through shared weights, which makes them particularly useful for spectral representations such as spectrograms.
- RNNs, especially those with Long Short-Term Memory (LSTM) cells, retain contextual information over time, making them well suited to modeling speech dynamics.
- GANs use adversarial training to refine enhanced outputs and have proven effective in denoising tasks.
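To make the recurrent case concrete, the following is a minimal sketch, not the paper's exact configuration, of an LSTM-based ratio-mask estimator in PyTorch; the layer sizes and the sigmoid output are illustrative assumptions.

```python
# Minimal sketch of an LSTM-based mask estimator (illustrative sizes, not the
# paper's exact configuration). Input: noisy magnitude spectrogram frames;
# output: a ratio mask in [0, 1] per time-frequency unit.
import torch
import torch.nn as nn

class LstmMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):          # (batch, frames, n_freq)
        h, _ = self.lstm(noisy_mag)
        return torch.sigmoid(self.out(h))  # mask values in [0, 1]

# Training would minimize, e.g., the MSE between the predicted mask and the IRM.
```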
Training Targets
The choice of training targets significantly influences the effectiveness of supervised learning for speech separation. The paper draws a primary distinction between masking-based targets and mapping-based targets:
- Masking-based targets: the Ideal Binary Mask (IBM), Ideal Ratio Mask (IRM), and Spectral Magnitude Mask (SMM) label each time-frequency unit with a binary or ratio value that indicates how strongly speech dominates the mixture (see the sketch below for common definitions).
- Mapping-based targets: these targets aim to estimate the spectral magnitudes of clean speech directly from noisy inputs.
Analyses show that masking-based targets often outperform mapping-based targets in terms of intelligibility improvements, while mapping-based targets offer better quality improvements under certain conditions.
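As a concrete reference, the sketch below computes the IBM, IRM, and SMM from clean-speech, noise, and mixture magnitude spectrograms; the 0 dB local criterion and the 0.5 exponent are common choices in the literature rather than values mandated by the paper.

```python
# Minimal sketch of masking-based training targets computed from clean-speech
# and noise magnitude spectrograms S and N (numpy arrays of shape [freq, frames]).
import numpy as np

def ideal_binary_mask(S, N, lc_db=0.0):
    # 1 where the local SNR exceeds the local criterion (LC), else 0.
    snr_db = 20 * np.log10((S + 1e-8) / (N + 1e-8))
    return (snr_db > lc_db).astype(np.float32)

def ideal_ratio_mask(S, N, beta=0.5):
    # Ratio of speech energy to total energy, raised to a tunable exponent.
    return (S**2 / (S**2 + N**2 + 1e-8)) ** beta

def spectral_magnitude_mask(S, Y):
    # Y is the mixture magnitude; unlike the IRM, the SMM is not bounded above by 1.
    return S / (Y + 1e-8)
```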
Acoustic Features
Surveying a broad array of acoustic features, the paper identifies Gammatone Frequency Cepstral Coefficients (GFCC) and the Multi-Resolution Cochleagram (MRCG) as particularly effective. Highlighted for their discriminative power, these features outperform traditional features in separating noisy speech, especially at low SNRs. Integrating spatial and spectral features further enhances performance in multi-microphone setups.
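As an illustration, a GFCC-style feature can be derived from a precomputed cochleagram by cubic-root compression followed by a DCT across channels; the sketch below assumes the gammatone filtering has already been done elsewhere, and the number of coefficients is an illustrative choice.

```python
# Minimal sketch of GFCC extraction from a precomputed cochleagram
# (e.g., 64 gammatone channels x T frames).
import numpy as np
from scipy.fftpack import dct

def gfcc(cochleagram, n_coeff=31):
    # Cubic-root compression of channel energies, then DCT across channels.
    compressed = np.power(np.maximum(cochleagram, 1e-8), 1.0 / 3.0)
    return dct(compressed, type=2, axis=0, norm='ortho')[:n_coeff]
```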
Algorithms for Monaural and Array-Based Separation
Monaural Separation
Monaural methods leverage deep learning for tasks such as speech enhancement and dereverberation; a mask-based enhancement sketch follows the list:
- Speech Enhancement: DNNs and RNNs improve speech intelligibility and quality across a wide range of noise conditions, with progressive training used to cope with differing SNR levels.
- Dereverberation: spectral mapping approaches recover anechoic speech from reverberant input, and enhancements such as T60-controlled models further improve performance.
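The sketch below illustrates the common mask-based enhancement pipeline: estimate a ratio mask from the noisy magnitude spectrogram, apply it, and resynthesize with the noisy phase. The `mask_model` callable is a placeholder for any trained estimator, such as the LSTM sketch above; the STFT parameters are illustrative.

```python
# Minimal sketch of mask-based speech enhancement: apply an estimated ratio
# mask to the noisy STFT magnitude and resynthesize with the noisy phase.
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, mask_model, fs=16000, nfft=512, hop=256):
    _, _, Y = stft(noisy, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    mag, phase = np.abs(Y), np.angle(Y)
    mask = mask_model(mag)                      # values in [0, 1], same shape as mag
    _, enhanced = istft(mask * mag * np.exp(1j * phase),
                        fs=fs, nperseg=nfft, noverlap=nfft - hop)
    return enhanced
```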
Array-Based Separation
For multi-microphone setups, DNN-based approaches integrate spatial features for enhanced separation capabilities; a mask-informed beamforming sketch follows the list:
- Spatial Feature Extraction: features such as the interaural time difference (ITD), interaural level difference (ILD), and interaural phase difference (IPD), used in conjunction with beamforming techniques, yield substantial improvements over traditional methods.
- Beamforming: combining monaural DNN-based mask estimation with beamformers such as the minimum variance distortionless response (MVDR) and generalized eigenvalue (GEV) beamformers has emerged as a robust strategy, validated on CHiME-3 benchmarks.
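A minimal sketch of mask-informed MVDR beamforming follows, assuming a monaural network has already produced a speech mask shared across channels; approximating the steering vector by the principal eigenvector of the speech covariance matrix is a common heuristic, not necessarily the exact recipe of any particular CHiME-3 system.

```python
# Minimal sketch of mask-informed MVDR beamforming.
# Y: multichannel STFT (channels x freq x frames); speech_mask: estimated
# T-F mask (freq x frames) from a monaural network, shared across channels.
import numpy as np

def mvdr_from_mask(Y, speech_mask, eps=1e-6):
    C, F, T = Y.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[:, f, :]                                   # (C, T)
        m = speech_mask[f]                                # (T,)
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (m * Yf) @ Yf.conj().T / (m.sum() + eps)
        phi_n = ((1 - m) * Yf) @ Yf.conj().T / ((1 - m).sum() + eps)
        phi_n += eps * np.eye(C)                          # regularize
        d = np.linalg.eigh(phi_s)[1][:, -1]               # steering vector estimate
        w = np.linalg.solve(phi_n, d)                     # Phi_n^{-1} d
        w /= (d.conj() @ w).real + eps                    # normalize: d^H Phi_n^{-1} d
        out[f] = w.conj() @ Yf                            # beamformed output
    return out
```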
Generalization
A recurring theme in the paper is the challenge of generalization in supervised learning for speech separation. Effective generalization across different noise conditions, speakers, and environments is critical. The research indicates that large-scale training with diverse datasets, progressive learning structures, and noise-aware training strategies significantly enhance robustness and applicability to unseen conditions.
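One common way to realize large-scale, multi-condition training is to mix each clean utterance with randomly selected noise segments at randomly drawn SNRs, as in the sketch below; the SNR range and sampling scheme are illustrative assumptions, not the paper's prescription.

```python
# Minimal sketch of multi-condition training-set generation: mix clean speech
# with random noise segments at random SNRs to broaden noise and SNR coverage.
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=np.random):
    # Assumes the noise recording is at least as long as the utterance.
    start = rng.randint(0, len(noise) - len(speech) + 1)
    seg = noise[start:start + len(speech)]
    # Scale the noise so that the mixture has the requested SNR.
    scale = np.sqrt((speech**2).sum() / ((seg**2).sum() * 10**(snr_db / 10) + 1e-8))
    return speech + scale * seg

def make_training_pairs(utterances, noises, snr_range=(-5, 10), rng=np.random):
    for speech in utterances:
        noise = noises[rng.randint(len(noises))]
        snr = rng.uniform(*snr_range)
        yield mix_at_snr(speech, noise, snr, rng), speech   # (noisy input, clean target)
```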
Implications and Future Directions
The integration of deep learning techniques into speech separation holds substantial potential both for practical applications (e.g., hearing aids) and for theoretical insights into auditory processing. Future directions could explore tighter integration between CASA principles and advanced neural models, efficient end-to-end systems, and more adaptable frameworks that further improve generalization.
In conclusion, the paper provides both a foundational understanding of supervised speech separation using deep learning and a review of its recent advances. It highlights the importance of well-chosen training targets, effective feature extraction, and robust learning architectures, setting the stage for ongoing research and development in this transformative field.