- The paper introduces a novel cGAN-based method for enhancing noisy speech using a U-Net generator and PatchGAN discriminator.
- Experimental results demonstrate improved PESQ scores and competitive EERs compared to traditional SE and deep learning approaches.
- The approach paves the way for robust speaker verification and future optimizations of adversarial training in speech enhancement.
Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification
This paper investigates the application of Conditional Generative Adversarial Networks (cGANs) to speech enhancement (SE) and noise-robust speaker verification. It addresses the persistent challenge of maintaining speech-system performance in the presence of noise, which is crucial for applications such as automatic speaker verification (ASV) and speech recognition. The authors adapt the Pix2Pix framework, originally proposed for image-to-image translation, to map noisy speech spectrograms to their enhanced counterparts, a novel application of cGANs within the speech enhancement domain.
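For context, the standard Pix2Pix objective that the paper adapts combines a conditional adversarial loss with an L1 reconstruction term. Writing $x$ for the noisy spectrogram, $y$ for the clean target, and $z$ for the generator's noise input:

$$\mathcal{L}_{\text{cGAN}}(G,D)=\mathbb{E}_{x,y}\big[\log D(x,y)\big]+\mathbb{E}_{x,z}\big[\log\big(1-D(x,G(x,z))\big)\big]$$

$$G^{*}=\arg\min_{G}\max_{D}\;\mathcal{L}_{\text{cGAN}}(G,D)+\lambda\,\mathbb{E}_{x,y,z}\big[\lVert y-G(x,z)\rVert_{1}\big]$$

This is the formulation from the original Pix2Pix work; the loss weighting $\lambda$ used in this paper is not restated here.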
Methodology
The proposed SE system consists of a generator and a discriminator trained adversarially: the generator enhances the noisy spectrogram, while the discriminator judges how closely the enhanced spectrogram resembles clean speech. Following the Pix2Pix design, the generator is a U-Net, whose encoder-decoder skip connections preserve fine time-frequency detail, and the discriminator is a PatchGAN, which classifies local spectrogram patches as real or enhanced rather than scoring the whole input, encouraging the generator to reproduce the high-frequency structure essential for effective speech enhancement (see the sketch below).
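As a concrete illustration of this generator/discriminator pairing, here is a minimal PyTorch sketch of a U-Net generator with skip connections and a conditional PatchGAN discriminator operating on single-channel spectrograms. The depth, channel widths, and normalization choices are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal Pix2Pix-style generator/discriminator sketch for spectrogram
# enhancement. Layer counts and channel widths are illustrative assumptions.
import torch
import torch.nn as nn

def down(in_ch, out_ch, norm=True):
    # Encoder block: strided conv halves the time-frequency resolution.
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

def up(in_ch, out_ch):
    # Decoder block: transposed conv doubles the time-frequency resolution.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

class UNetGenerator(nn.Module):
    """U-Net: skip connections pass encoder features to the decoder,
    preserving fine detail a plain bottleneck would lose."""
    def __init__(self):
        super().__init__()
        self.d1 = down(1, 64, norm=False)
        self.d2 = down(64, 128)
        self.d3 = down(128, 256)
        self.u1 = up(256, 128)
        self.u2 = up(256, 64)   # input is cat(u1 out, d2 out): 128 + 128
        self.u3 = nn.ConvTranspose2d(128, 1, 4, stride=2, padding=1)  # cat: 64 + 64

    def forward(self, x):       # x: (batch, 1, freq, time), dims divisible by 8
        e1 = self.d1(x)
        e2 = self.d2(e1)
        e3 = self.d3(e2)
        h = self.u1(e3)
        h = self.u2(torch.cat([h, e2], dim=1))
        return torch.tanh(self.u3(torch.cat([h, e1], dim=1)))

class PatchGANDiscriminator(nn.Module):
    """Scores overlapping patches instead of the whole spectrogram,
    focusing the adversarial signal on local high-frequency structure."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            down(2, 64, norm=False),   # input: noisy + candidate, stacked
            down(64, 128),
            down(128, 256),
            nn.Conv2d(256, 1, 4, padding=1),  # one logit per patch
        )

    def forward(self, noisy, candidate):
        # Conditional discriminator: sees the noisy input alongside the
        # clean or enhanced spectrogram it must judge.
        return self.net(torch.cat([noisy, candidate], dim=1))
```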
The researchers evaluated the proposed system using the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and the Equal Error Rate (EER) of downstream speaker verification systems, together covering speech quality, intelligibility, and verification robustness.
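The paper does not state which implementations of these metrics were used; as a sketch, all three can be computed with common open-source packages (`pesq`, `pystoi`, and `scikit-learn` are assumptions here), with EER derived from ASV trial scores via the ROC curve:

```python
# Illustrative metric computation; library choices are assumptions,
# not the paper's tooling.
import numpy as np
from pesq import pesq           # pip install pesq
from pystoi import stoi         # pip install pystoi
from sklearn.metrics import roc_curve

def speech_quality(clean, enhanced, fs=16000):
    # PESQ (wideband mode for 16 kHz audio) and STOI, both computed
    # against the clean reference waveform.
    return {
        "pesq": pesq(fs, clean, enhanced, "wb"),
        "stoi": stoi(clean, enhanced, fs),
    }

def equal_error_rate(scores, labels):
    # EER: the operating point where the false acceptance rate equals the
    # false rejection rate, approximated from the ROC curve of ASV scores
    # (labels: 1 for target trials, 0 for impostor trials).
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```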
Results
The experimental results indicate that the cGAN-based speech enhancement method generally outperforms traditional SE techniques such as the short-time spectral amplitude minimum mean square error (STSA-MMSE) estimator and is competitive with deep neural network-based SE methods. Specifically, the cGAN approach achieved superior PESQ scores in most scenarios, indicating an improvement in perceptual speech quality. Although the cGAN's STOI scores were comparable to, and sometimes slightly below, those of DNN-based SE, its gains on speech quality metrics underscore its efficacy. The cGAN framework also achieved competitive EERs, demonstrating its potential for noise-robust speaker verification.
Implications and Future Work
The application of cGANs for speech enhancement opens several avenues for further research. The framework can be extended or refined by exploring alternative architectures or by integrating specific perceptual losses tailored for speech tasks. The paper highlights the need for further evaluation under more challenging SNR conditions and suggests possible modifications to optimize the Pix2Pix architecture specifically for SE tasks.
In conclusion, this work positions cGANs as a promising tool for speech enhancement and noise-robust speaker verification, suggesting that adversarial training paradigms can significantly enhance speech processing applications. Future research may explore the optimization and integration of cGAN methodologies within broader speech and audio processing frameworks.