Enhancing Speech Through Multi-Stage Generative Adversarial Networks
The paper presents an approach to speech enhancement using generative adversarial networks (GANs), targeting the challenge of recovering clean speech from noise-corrupted recordings. The authors propose a methodology that employs multiple generators within the GAN framework to perform enhancement in stages, in contrast to the traditional single-generator model.
Methodology and Architecture
The paper introduces two main frameworks: Iterated SEGAN (ISEGAN) and Deep SEGAN (DSEGAN). Both utilize a cascade of generators that iteratively refines the noisy input signal. In ISEGAN, all generators share parameters, so a common enhancement mapping is applied at each stage. In DSEGAN, each generator learns an independent mapping, gaining flexibility at the cost of a larger parameter count.
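The difference between the two cascades can be illustrated with a toy sketch. Here each stage is a stand-in for a generator (a simple learnable per-sample gain, not the paper's actual encoder-decoder network): ISEGAN reuses one stage object so all stages share parameters, while DSEGAN instantiates an independent stage per position. The `Stage` class and function names are hypothetical, purely for illustration.

```python
import numpy as np

class Stage:
    """Toy stand-in for one generator: a learnable per-sample gain.
    (The real SEGAN generator is a fully-convolutional encoder-decoder.)"""
    def __init__(self, n_samples):
        self.gain = np.ones(n_samples)

    def __call__(self, x):
        return self.gain * x

def make_cascade(n_samples, n_stages, shared):
    """ISEGAN-style: one Stage reused n_stages times (shared parameters).
    DSEGAN-style: n_stages independently parameterized Stage objects."""
    if shared:
        g = Stage(n_samples)
        return [g] * n_stages
    return [Stage(n_samples) for _ in range(n_stages)]

def enhance(cascade, noisy):
    """Feed the output of stage k as the input of stage k+1."""
    x = noisy
    for g in cascade:
        x = g(x)
    return x

isegan = make_cascade(4, n_stages=3, shared=True)
dsegan = make_cascade(4, n_stages=3, shared=False)

# Distinct parameter sets: ISEGAN stores one, DSEGAN stores one per stage.
print(len({id(g) for g in isegan}))  # 1
print(len({id(g) for g in dsegan}))  # 3
```

The shared-parameter variant keeps the model size of a single generator while still applying the mapping repeatedly; the independent variant multiplies the parameter count by the number of stages.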
The architecture of each generator follows an encoder-decoder structure similar to the original SEGAN, leveraging fully-convolutional layers. This approach facilitates the processing of raw audio signals into enhanced speech outputs. The discriminator aims to distinguish between real and generated audio, providing feedback to the generators to improve their outputs iteratively.
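The shape bookkeeping of such an encoder-decoder can be traced with a short helper. The original SEGAN operates on roughly one-second windows of 16384 raw samples, with eleven strided convolution layers halving the temporal length at each step down to a compact bottleneck, mirrored by transposed convolutions on the decoder side (skip connections link each encoder layer to its mirrored decoder layer). The function below is a hypothetical illustration of that length arithmetic, not the authors' code.

```python
def segan_shapes(n_samples, n_layers, stride=2):
    """Trace feature lengths through a SEGAN-style encoder-decoder.
    Each encoder layer is a strided 1-D convolution that divides the
    length by `stride`; each decoder layer is a transposed convolution
    that multiplies it back, restoring the input resolution."""
    enc = [n_samples]
    for _ in range(n_layers):
        enc.append(enc[-1] // stride)
    dec = [enc[-1]]
    for _ in range(n_layers):
        dec.append(dec[-1] * stride)
    return enc, dec

# SEGAN-like configuration: 16384-sample input, 11 layers of stride 2.
enc, dec = segan_shapes(16384, n_layers=11)
print(enc[-1])  # 8   (bottleneck length)
print(dec[-1])  # 16384  (output matches the input length)
```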
Results and Analysis
The paper provides empirical evidence supporting the efficacy of the proposed multi-stage GAN models. DSEGAN, in particular, demonstrates superior performance across several objective metrics, including Perceptual Evaluation of Speech Quality (PESQ) and Segmental Signal-to-Noise Ratio (SSNR), surpassing the SEGAN baseline and several discriminative models. DSEGAN also remains competitive in speech intelligibility, as measured by the Short-Time Objective Intelligibility (STOI) metric.
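For reference, SSNR is computed frame-wise rather than over the whole utterance. A minimal sketch of the usual definition follows (frame length and clipping bounds vary slightly across implementations; this is not the paper's evaluation code):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, eps=1e-10):
    """Segmental SNR (SSNR) in dB: per-frame SNR between the clean
    reference and the enhanced signal, clipped to [-10, 35] dB and
    averaged over frames, as is conventional in speech enhancement."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        # Frame SNR: signal energy over residual-error energy.
        snr = 10 * np.log10(np.sum(s**2) / (np.sum((s - e)**2) + eps) + eps)
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))
```

A perfect reconstruction saturates at the 35 dB ceiling, while residual noise pulls the frame-wise average down, which is what makes SSNR sensitive to localized enhancement errors that a global SNR would average away.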
Subjective listening tests corroborate the objective findings: both ISEGAN and DSEGAN yield perceptually improved audio quality compared to conventional methods. Notably, DSEGAN shows consistent gains as the signal passes through successive enhancement stages, underscoring the benefit of independent parameterization in multi-stage processing.
Implications and Future Directions
By chaining multiple stages of enhancement, the proposed architectures extend the capability of GANs for speech processing. The paper highlights the value of iterative refinement: later stages can correct residual noise and artifacts left by earlier ones, allowing a more nuanced reconstruction of noise-degraded speech than a single-pass mapping.
Future research could explore optimization techniques for these models to balance computational efficiency with enhancement quality. There is potential for incorporating these frameworks into real-time applications, such as voice communication systems and hearing aids, where enhanced speech clarity is crucial. Additionally, extending these advanced GAN techniques to other audio processing tasks may provide further insights into their applicability across various domains.
In conclusion, the introduction of ISEGAN and DSEGAN marks a significant step in evolving the capabilities of GANs for speech enhancement, potentially paving the way for broader applications within speech technology and artificial intelligence.