Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition (1711.05747v2)

Published 15 Nov 2017 in cs.SD, cs.LG, cs.NE, and eess.AS

Abstract: We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems. Prior work demonstrates that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however this technique was not justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing, we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR). By appending the GAN-enhanced features to the noisy inputs and retraining, we achieve a 7% WER improvement relative to the MTR system.

Citations (197)

Summary

The paper proposes using Generative Adversarial Networks (GANs) on log-Mel filterbank spectra for speech enhancement, improving upon prior waveform-based GAN approaches for Automatic Speech Recognition (ASR).
Experiments show that frequency-domain GANs reduce the word error rate (WER) of a clean-trained ASR system on noisy speech, and that appending GAN-enhanced features to the noisy inputs and retraining yields further gains over multi-style training alone.
The findings suggest GANs are promising as preprocessing or augmentation tools for ASR pipelines, especially when integrated strategically with traditional training methods like multi-style training.

Overview of Speech Enhancement with GANs for ASR Systems

This paper studies Generative Adversarial Networks (GANs) for enhancing speech degraded by both additive and reverberant noise, with the goal of improving the noise robustness of automatic speech recognition (ASR) systems. It evaluates GAN-based enhancement by its effect on ASR accuracy rather than on perceptual quality alone, and proposes a frequency-domain variant of the approach.

Motivation and Approach

The motivation for employing GANs in speech enhancement stems from their success in analogous image processing tasks. A GAN pairs two networks: a generator that maps the distorted signal to an enhanced one, and a discriminator that judges the quality of the enhancement. Prior work had applied GANs only to improving perceptual speech quality under additive noise, without a detailed analysis of ASR-specific outcomes.
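The generator and discriminator roles described above can be sketched as a pair of loss functions. This is a minimal illustration, assuming the cross-entropy adversarial loss plus an L1 regression term that is common in image-translation GANs; the summary does not state the paper's exact objective, and the names `bce`, `discriminator_loss`, `generator_loss`, and `l1_weight` are hypothetical.

```python
import numpy as np

def bce(prob, target):
    """Mean binary cross-entropy between probabilities and a 0/1 target."""
    prob = np.clip(prob, 1e-7, 1.0 - 1e-7)  # avoid log(0)
    return float(np.mean(-(target * np.log(prob)
                           + (1.0 - target) * np.log(1.0 - prob))))

def discriminator_loss(d_real, d_fake):
    """The discriminator should score clean features as 1 and enhanced ones as 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake, enhanced, clean, l1_weight=100.0):
    """Adversarial term (fool the discriminator) plus a weighted L1 regression term.

    l1_weight is an assumed hyperparameter, not a value taken from the paper.
    """
    adversarial = bce(d_fake, 1.0)
    l1 = float(np.mean(np.abs(enhanced - clean)))
    return adversarial + l1_weight * l1
```

Setting `l1_weight` very high makes the generator behave like a plain regression model, while setting it to zero leaves only the adversarial signal.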

Unlike previous waveform-based methods, this work operates the GAN on log-Mel filterbank spectra. According to the authors, this requires less computation and is more robust to reverberant noise. The move to frequency-domain processing is informed by GAN applications in image translation and leverages spectral feature mapping (SFM), so that enhancement is applied to the kind of features an ASR system consumes.
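For readers unfamiliar with the representation, the following is a rough sketch of how log-Mel filterbank features can be computed from a waveform with NumPy. The sample rate, frame length, hop size, and number of Mel bins are illustrative assumptions, not the paper's configuration, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular Mel filters with shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, center, hi = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel(waveform, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """Return log-Mel features with shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum per frame
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-6)  # small floor keeps the log finite
```

Under these assumed settings, one second of 16 kHz audio (16,000 samples) becomes a 97 x 40 feature matrix, which hints at why working in this domain can need less computation than working on the raw waveform.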

Experimental Findings

The experiments show a substantial reduction in the Word Error Rate (WER) of a clean-trained ASR system on noisy speech once GAN enhancement is applied. More specifically, the new frequency-domain approach (FSEGAN) outperformed the time-domain approach, improving ASR-Clean performance by 54% relative. However, conventional multi-style training (MTR) still achieved better results than enhancement alone.
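Relative figures such as the 54% above are ratios of WER values, not differences in percentage points. The sketch below shows one standard way to compute WER (word-level edit distance divided by the number of reference words) and a relative reduction; the function names and the example numbers are hypothetical and are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: going from 20.0% to 18.6% WER is roughly a 7% relative reduction.
example = relative_wer_reduction(20.0, 18.6)
```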

When the GAN-enhanced features were appended to the noisy inputs and the system was retrained, performance surpassed the MTR baseline, with a 7% relative WER improvement.
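One plausible reading of combining enhanced and noisy features is a per-frame concatenation along the feature axis, sketched below with NumPy. The summary does not spell out the exact layout, so the stacking axis, the array shapes, and the name `stack_features` are assumptions.

```python
import numpy as np

def stack_features(noisy, enhanced):
    """Concatenate noisy and enhanced features frame by frame.

    Both inputs are assumed to have shape (n_frames, n_mels); the result has
    shape (n_frames, 2 * n_mels), with the noisy features first.
    """
    if noisy.shape != enhanced.shape:
        raise ValueError("noisy and enhanced features must have the same shape")
    return np.concatenate([noisy, enhanced], axis=-1)

# Usage with placeholder arrays standing in for real log-Mel features.
noisy = np.zeros((97, 40))
enhanced = np.ones((97, 40))
stacked = stack_features(noisy, enhanced)
```

An acoustic model retrained on such stacked inputs sees the original noisy evidence alongside the enhancer's estimate, so it is not forced to trust the enhancer where it makes mistakes.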

Implications and Future Directions

These findings highlight the potential of GANs as a preprocessing or augmentation tool in ASR pipelines, particularly when combined with traditional training regimes. Hybrid retraining, which feeds both the GAN-enhanced and the original features to the recognizer, emerged as a fruitful direction and challenges the reliance on any single training method.

The research also points toward the hypothesis that simpler regression objectives may suffice in place of adversarial training in certain enhancement contexts, suggesting a continuum between naive GAN applications and standard regression techniques. The visual coherence of the FSEGAN-produced spectra hints at broader applicability beyond ASR, for example in telecommunication systems, although there the challenge of recovering phase to invert the spectra back to a waveform remains critical.

Conclusion

This paper compares waveform-domain and frequency-domain GAN approaches to speech enhancement and clarifies how enhancement interacts with robust ASR training. While GANs show promise, the results suggest they work best when combined with regression objectives and conventional training such as MTR rather than used alone. The findings open up fertile ground for further work on hybrid models and on GAN objectives tailored to ASR.
