- The paper introduces an end-to-end GAN that processes raw audio directly to achieve effective speech enhancement.
- It leverages a fully convolutional encoder-decoder generator with skip connections, combining an adversarial loss with an L1 reconstruction loss for noise suppression.
- Experimental results demonstrate improved performance over traditional methods, with subjective tests confirming enhanced speech quality.
Speech Enhancement with Generative Adversarial Networks
Introduction
"SEGAN: Speech Enhancement Generative Adversarial Network" presents a novel approach to improving speech quality using Generative Adversarial Networks (GANs). The method operates directly at the waveform level, marking a significant departure from traditional spectral-domain techniques. The paper tackles diverse noise conditions by leveraging deep learning, particularly GANs, to model complex mappings directly on raw waveforms.
Proposed Model
Speech Enhancement GAN (SEGAN)
SEGAN operates end to end, mapping raw noisy audio to enhanced speech. The architecture consists of a generator (G) and a discriminator (D), where G is designed as a fully convolutional encoder-decoder network. The generator transforms noisy input signals into clean speech, while the discriminator, conditioned on the noisy input, learns to distinguish real (clean, noisy) pairs from enhanced (G output, noisy) pairs.
Key characteristics of SEGAN include:
- End-to-End Processing: SEGAN processes raw audio without intermediate feature extraction, allowing the model to learn directly from the waveform.
- Generative Adversarial Framework: The adversarial setup enables the generator to improve its output iteratively by learning from discriminator feedback, integrating noise reduction directly into the model's goals.
- Skip Connections: These connections aid in preserving low-level signal details and gradients, enhancing performance and training stability.
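The encoder-decoder-with-skips idea above can be illustrated with a deliberately simplified numpy sketch. It replaces SEGAN's learned strided convolutions with plain stride-2 averaging and summation (the actual model concatenates feature maps and learns filters), so the function names and operations here are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def encode(x, n_layers=3):
    """Halve temporal resolution at each layer (stride-2 averaging as a
    stand-in for SEGAN's learned strided convolutions); keep each
    intermediate signal so the decoder can reuse it via a skip connection."""
    skips = []
    h = x
    for _ in range(n_layers):
        skips.append(h)
        h = h.reshape(-1, 2).mean(axis=1)  # stride-2 downsample
    return h, skips

def decode(z, skips):
    """Double resolution at each layer and add back the matching encoder
    signal (the skip connection), restoring fine waveform detail that
    the compressed representation alone has lost."""
    h = z
    for s in reversed(skips):
        h = np.repeat(h, 2)  # stride-2 upsample
        h = h + s            # skip connection (SEGAN concatenates instead)
    return h

x = np.random.randn(16)   # toy "waveform" chunk
z, skips = encode(x)      # compressed representation plus saved features
y = decode(z, skips)      # output has the same length as the input
```

The point of the sketch is structural: every encoder stage passes information directly to its mirror decoder stage, so low-level detail (and, during training, gradients) can bypass the compression bottleneck.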
Generative Adversarial Networks Overview
GANs consist of two networks, G and D, that play a minimax game. The generator tries to fool the discriminator by generating realistic samples, while the discriminator aims to correctly identify real versus generated samples. This adversarial process helps G learn to produce improved speech signals that are closer to the distribution of clean speech data.
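The minimax game described above is conventionally written as follows (SEGAN itself substitutes a least-squares objective for the log terms, in the LSGAN style, for more stable training):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

D is trained to push the first term up (score real samples near 1) while G is trained to push the second term down (make generated samples score as real).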
Experimental Setup
Data Set
The experiments use a dataset derived from the Voice Bank corpus, incorporating multiple speakers and diverse noise conditions. Training covers 28 speakers and 40 noise conditions (10 noise types at 4 SNRs each), while testing uses a separate set of 2 speakers and 20 noise conditions (5 unseen noise types at 4 SNRs each) to evaluate generalization and robustness to unseen scenarios.
SEGAN Setup
Training uses the RMSprop optimizer across multiple GPUs to accommodate large batch sizes. The generator is a 22-layer fully convolutional network (11 encoder and 11 decoder layers), with the convolutional filter width and stride chosen to capture temporal structure efficiently. A crucial addition to model training is an L1 loss component, weighted against the adversarial loss, which penalizes the waveform-level distance between generated and clean speech and complements the adversarial objective.
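The combined objective can be sketched as follows. This is a minimal numpy illustration, assuming the least-squares (LSGAN) adversarial terms used by SEGAN and the L1 weight λ = 100 reported in the paper; the function names are hypothetical and the real model computes these over batches of discriminator outputs and waveform chunks:

```python
import numpy as np

LAMBDA = 100.0  # L1 weight reported in the SEGAN paper

def generator_loss(d_fake, enhanced, clean, lam=LAMBDA):
    """Least-squares adversarial term plus the weighted L1 distance
    between the enhanced and clean waveforms."""
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)  # push D's score toward "real"
    l1 = np.mean(np.abs(enhanced - clean))    # waveform-level fidelity
    return adv + lam * l1

def discriminator_loss(d_real, d_fake):
    """LSGAN discriminator loss: score real pairs toward 1,
    enhanced pairs toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
```

The large λ makes the L1 term dominate early in training, anchoring the generator to the clean reference, while the adversarial term discourages the over-smoothed outputs that a pure regression loss tends to produce.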
Results
Objective Evaluation
SEGAN demonstrates effective noise reduction, with consistent improvements over Wiener filtering on several metrics. While its PESQ score lags slightly behind the Wiener baseline, gains in CSIG, CBAK, and segmental SNR show that SEGAN refines speech with less distortion and background intrusiveness, supporting it as a viable alternative for speech enhancement.
Subjective Evaluation
Subjective listening tests show that listeners preferred SEGAN-enhanced speech over both the noisy signals and the Wiener baseline. SEGAN outputs were favored in a majority of test cases, confirming perceived improvements in speech quality and noise suppression.
Conclusion
SEGAN introduces a robust approach for speech enhancement, bridging the gap between theoretical advancements in generative modeling and practical applications in audio processing. By adopting an end-to-end generative architecture, it paves the way for future exploration in speech signal manipulation and enhancement techniques. Ongoing research may focus on advancing convolutional structures and integrating perceptual corrections into the GAN framework to further enhance high-frequency accuracy and reduce artifacts.