- The paper introduces an end-to-end GAN that processes raw audio directly to achieve effective speech enhancement.
- It leverages a fully convolutional encoder-decoder generator with skip connections, combining an adversarial loss with an L1 reconstruction loss for noise suppression.
- Experimental results demonstrate improved performance over traditional methods, with subjective tests confirming enhanced speech quality.
Speech Enhancement with Generative Adversarial Networks
Introduction
"SEGAN: Speech Enhancement Generative Adversarial Network" presents a novel approach to improving speech quality using Generative Adversarial Networks (GANs). The method operates directly at the waveform level, marking a significant departure from traditional spectral-domain techniques. The paper tackles diverse noise conditions by leveraging deep learning, particularly GANs, to model complex mappings directly on raw waveforms.
Proposed Model
Speech Enhancement GAN (SEGAN)
SEGAN operates end to end, mapping raw noisy audio to enhanced speech. The architecture consists of a generator (G) and a discriminator (D), where G is designed as a fully convolutional encoder-decoder network. The generator transforms noisy input signals into clean speech, while the discriminator, conditioned on the noisy input, learns to distinguish real (clean, noisy) pairs from enhanced (G output, noisy) pairs.
Key characteristics of SEGAN include:
- End-to-End Processing: SEGAN processes raw audio without intermediate feature extraction, allowing the model to learn directly from the waveform.
- Generative Adversarial Framework: The adversarial setup enables the generator to improve its output iteratively by learning from discriminator feedback, integrating noise reduction directly into the model's goals.
- Skip Connections: These connections aid in preserving low-level signal details and gradients, enhancing performance and training stability.
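The encoder-decoder-with-skips idea above can be illustrated with a deliberately simplified numpy sketch. It replaces SEGAN's learned strided convolutions with plain stride-2 averaging and summation (the actual model concatenates feature maps and learns filters), so the function names and operations here are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def encode(x, n_layers=3):
    """Halve temporal resolution at each layer (stride-2 averaging as a
    stand-in for SEGAN's learned strided convolutions); keep each
    intermediate signal so the decoder can reuse it via a skip connection."""
    skips = []
    h = x
    for _ in range(n_layers):
        skips.append(h)
        h = h.reshape(-1, 2).mean(axis=1)  # stride-2 downsample
    return h, skips

def decode(z, skips):
    """Double resolution at each layer and add back the matching encoder
    signal (the skip connection), restoring fine waveform detail that
    the compressed representation alone has lost."""
    h = z
    for s in reversed(skips):
        h = np.repeat(h, 2)  # stride-2 upsample
        h = h + s            # skip connection (SEGAN concatenates instead)
    return h

x = np.random.randn(16)   # toy "waveform" chunk
z, skips = encode(x)      # compressed representation plus saved features
y = decode(z, skips)      # output has the same length as the input
```

The point of the sketch is structural: every encoder stage passes information directly to its mirror decoder stage, so low-level detail (and, during training, gradients) can bypass the compression bottleneck.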
Generative Adversarial Networks Overview
GANs consist of two networks, G and D, that play a minimax game. The generator tries to fool the discriminator by generating realistic samples, while the discriminator aims to correctly identify real versus generated samples. This adversarial process helps G learn to produce improved speech signals that are closer to the distribution of clean speech data.
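The minimax game described above is conventionally written as follows (SEGAN itself substitutes a least-squares objective for the log terms, in the LSGAN style, for more stable training):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

D is trained to push the first term up (score real samples near 1) while G is trained to push the second term down (make generated samples score as real).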
Experimental Setup
Data Set
The experiments use a dataset derived from the Voice Bank corpus, incorporating multiple speakers and diverse noise conditions. Training covers 28 speakers and 40 noise conditions (10 noise types at 4 SNRs each), while testing uses a separate set of 2 speakers and 20 noise conditions (5 unseen noise types at 4 SNRs each) to evaluate generalization and robustness to unseen scenarios.
SEGAN Setup
Training uses the RMSprop optimizer across multiple GPUs to accommodate large batch sizes. The generator is a 22-layer fully convolutional network (11 encoder and 11 decoder layers), with the convolutional filter width and stride chosen to capture temporal structure efficiently. A crucial addition to model training is an L1 loss component, weighted against the adversarial loss, which penalizes the waveform-level distance between generated and clean speech and complements the adversarial objective.
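The combined objective can be sketched as follows. This is a minimal numpy illustration, assuming the least-squares (LSGAN) adversarial terms used by SEGAN and the L1 weight λ = 100 reported in the paper; the function names are hypothetical and the real model computes these over batches of discriminator outputs and waveform chunks:

```python
import numpy as np

LAMBDA = 100.0  # L1 weight reported in the SEGAN paper

def generator_loss(d_fake, enhanced, clean, lam=LAMBDA):
    """Least-squares adversarial term plus the weighted L1 distance
    between the enhanced and clean waveforms."""
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)  # push D's score toward "real"
    l1 = np.mean(np.abs(enhanced - clean))    # waveform-level fidelity
    return adv + lam * l1

def discriminator_loss(d_real, d_fake):
    """LSGAN discriminator loss: score real pairs toward 1,
    enhanced pairs toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
```

The large λ makes the L1 term dominate early in training, anchoring the generator to the clean reference, while the adversarial term discourages the over-smoothed outputs that a pure regression loss tends to produce.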
Results
Objective Evaluation
SEGAN demonstrates effective noise reduction, with consistent improvements over Wiener filtering on several metrics. While its PESQ score lags slightly behind the Wiener baseline, gains in CSIG, CBAK, and segmental SNR show that SEGAN refines speech with less distortion and background intrusiveness, supporting it as a viable alternative for speech enhancement.
Subjective Evaluation
Subjective listening tests show that listeners preferred SEGAN-enhanced speech over both the noisy signals and the Wiener baseline. SEGAN outputs were favored in a majority of test cases, confirming perceived improvements in speech quality and noise suppression.
Conclusion
SEGAN introduces a robust approach for speech enhancement, bridging the gap between theoretical advancements in generative modeling and practical applications in audio processing. By adopting an end-to-end generative architecture, it paves the way for future exploration in speech signal manipulation and enhancement techniques. Ongoing research may focus on advancing convolutional structures and integrating perceptual corrections into the GAN framework to further enhance high-frequency accuracy and reduce artifacts.