
SEGAN: Speech Enhancement Generative Adversarial Network

Published 28 Mar 2017 in cs.LG, cs.NE, and cs.SD | arXiv:1703.09452v3

Abstract: Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm the effectiveness of it. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.

Citations (1,095)

Summary

  • The paper introduces an end-to-end GAN that processes raw audio directly to achieve effective speech enhancement.
  • It leverages a fully convolutional generator with skip connections and combines adversarial with L1 loss for precise noise suppression.
  • Experimental results demonstrate improved performance over traditional methods, with subjective tests confirming enhanced speech quality.

Speech Enhancement with Generative Adversarial Networks

Introduction

"SEGAN: Speech Enhancement Generative Adversarial Network" presents a novel approach for improving speech quality using Generative Adversarial Networks (GANs). This method operates directly at the waveform level, a significant departure from traditional spectral-domain techniques. By leveraging deep learning, particularly GANs, a single model with shared parameters handles many speakers and noise conditions, learning complex mappings directly on raw waveforms.

Proposed Model

Speech Enhancement GAN (SEGAN)

The SEGAN operates end-to-end, focusing on raw audio inputs to produce enhanced speech outputs. The architecture consists of a generator (G) and a discriminator (D), where G is designed as a fully convolutional network. The generator's role is to transform noisy input signals into clean speech, whereas the discriminator evaluates the output of G to distinguish between real and fake signals.

Key characteristics of SEGAN include:

  • End-to-End Processing: SEGAN processes raw audio without intermediate feature extraction, allowing the model to learn directly from the waveform.
  • Generative Adversarial Framework: The adversarial setup enables the generator to improve its output iteratively by learning from discriminator feedback, integrating noise reduction directly into the model's goals.
  • Skip Connections: These connections aid in preserving low-level signal details and gradients, enhancing performance and training stability.
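The encoder-decoder structure with skip connections described above can be sketched at the shape level as follows. This is a minimal NumPy illustration only: `downsample` and `upsample` stand in for the strided and transposed convolutions, and the layer count is simplified, not the paper's actual hyperparameters.

```python
import numpy as np

def downsample(x):
    """Halve the temporal length (stands in for a strided conv layer)."""
    return x.reshape(-1, 2).mean(axis=1)

def upsample(x):
    """Double the temporal length (stands in for a transposed conv layer)."""
    return np.repeat(x, 2)

def generator(noisy, n_layers=3):
    """Encoder-decoder with skip connections, in the SEGAN style."""
    skips, h = [], noisy
    for _ in range(n_layers):      # encoder: progressively compress the waveform
        skips.append(h)            # remember each resolution for the skips
        h = downsample(h)
    for _ in range(n_layers):      # decoder: reconstruct at full resolution
        h = upsample(h)
        h = h + skips.pop()        # skip connection re-injects fine detail
    return h

wave = np.random.randn(1024)
enhanced = generator(wave)
assert enhanced.shape == wave.shape  # output length matches the input
```

The skips matter because the deepest code discards fine temporal detail; adding the encoder activations back in during decoding restores it and also shortens the gradient path during training.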

Generative Adversarial Networks Overview

GANs consist of two networks, G and D, that play a minimax game. The generator tries to fool the discriminator by generating realistic samples, while the discriminator aims to correctly identify real versus generated samples. This adversarial process helps G learn to produce improved speech signals that are closer to the distribution of clean speech data.
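The two sides of this minimax game can be sketched as loss functions. The sketch below uses the least-squares formulation (a common substitute for the original cross-entropy GAN loss that SEGAN also adopts); `d_real` and `d_fake` are assumed to be arrays of discriminator scores.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss(d_fake):
    """Least-squares generator loss: make D score fakes as if they were real."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

# A perfect discriminator (real -> 1, fake -> 0) incurs zero loss...
assert d_loss(np.ones(8), np.zeros(8)) == 0.0
# ...and those same fake scores leave the generator with maximal loss.
assert g_loss(np.zeros(8)) == 0.5
```

Minimizing `g_loss` pulls the generated samples toward the region the discriminator scores as real, which is how G gradually matches the clean-speech distribution.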

Experimental Setup

Data Set

The experiments utilize a dataset from the Voice Bank corpus, incorporating multiple speakers and diverse noise conditions. Training includes 28 speakers and 40 noise types, while testing uses a separate set of 2 speakers and 20 noise conditions to evaluate generalizability and robustness to unforeseen scenarios.

SEGAN Setup

Training uses the RMSprop optimizer across multiple GPUs to handle large batch sizes effectively. The generator uses a 22-layer convolutional architecture, with the convolutional filter width and stride chosen to capture temporal structure efficiently. A crucial addition to the training objective is an L1 loss component, which penalizes the distance between generated and clean speech signals and fine-tunes the model alongside the adversarial loss.
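The combined generator objective can be sketched as below. This is a minimal sketch, not the paper's exact code: the least-squares adversarial term and the weight `lam` are assumptions consistent with common GAN practice, and the value 100.0 is illustrative.

```python
import numpy as np

LAMBDA = 100.0  # weight on the L1 term; illustrative, tuned in practice

def segan_g_loss(d_fake, enhanced, clean, lam=LAMBDA):
    """Generator objective: adversarial term plus weighted L1 distance."""
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)  # least-squares adversarial term
    l1 = np.mean(np.abs(enhanced - clean))    # waveform-level L1 distance
    return adv + lam * l1

# If D is fully fooled (scores of 1) and the output matches the clean
# target exactly, both terms vanish.
assert segan_g_loss(np.ones(4), np.zeros(4), np.zeros(4)) == 0.0
```

The L1 term anchors the generator to the clean reference waveform, while the adversarial term pushes its outputs toward the clean-speech distribution; using both stabilizes training and reduces artifacts relative to either loss alone.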

Results

Objective Evaluation

SEGAN demonstrates effective noise reduction, achieving consistent improvements over Wiener filtering across several metrics. While PESQ scores slightly lag, metrics such as CSIG, CBAK, and the segmental SNR illustrate SEGAN's ability to refine speech with less distortion and intrusiveness, proving its potential as a viable alternative for speech enhancement.

Subjective Evaluation

Subjective listening tests show that listeners preferred SEGAN-enhanced speech over both the noisy and Wiener baselines. With SEGAN signals being favored in a majority of test cases, the subjective evaluations confirm the enhancements' perceived improvements in speech quality and noise suppression.

Conclusion

SEGAN introduces a robust approach for speech enhancement, bridging the gap between theoretical advancements in generative modeling and practical applications in audio processing. By adopting an end-to-end generative architecture, it paves the way for future exploration in speech signal manipulation and enhancement techniques. Ongoing research may focus on advancing convolutional structures and integrating perceptual corrections into the GAN framework to further enhance high-frequency accuracy and reduce artifacts.
