CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement (2209.11112v3)

Published 22 Sep 2022 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: In this work, we further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. This paper builds on our previous work but takes a more in-depth look by conducting extensive ablation studies on model inputs and architectural design choices. We rigorously tested the generalization ability of the model to unseen noise types and distortions. We have fortified our claims through DNS-MOS measurements and listening tests. Rather than focusing exclusively on the speech denoising task, we extend this work to address the dereverberation and super-resolution tasks. This necessitated exploring various architectural changes, specifically metric discriminator scores and masking techniques. It is essential to highlight that this is among the earliest works that attempted complex TF-domain super-resolution. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in the denoising task using the Voice Bank+DEMAND dataset, CMGAN notably exceeded the performance of prior models, attaining a PESQ score of 3.41 and an SSNR of 11.10 dB. Audio samples and CMGAN implementations are available online.

Citations (37)

Summary

The paper introduces CMGAN, which pairs a conformer-based generator operating on both magnitude and complex spectrogram inputs with a metric discriminator that predicts speech quality scores.
On the Voice Bank+DEMAND denoising benchmark it attains a PESQ of 3.41 and an SSNR of 11.10 dB, and it is also reported to outperform prior methods on dereverberation and super-resolution.
Because it handles several speech enhancement tasks in the time-frequency domain, the approach is relevant to telecommunications, hearing aids, and ASR systems.

The paper presents an approach to monaural speech enhancement (SE) based on a conformer-based metric generative adversarial network (CMGAN). It addresses three SE tasks: denoising, dereverberation, and super-resolution. The model operates in the time-frequency (TF) domain and builds on conformers, an architecture that combines self-attention with convolution.

Methodology Overview

CMGAN integrates a dual-path architecture utilizing conformers, which capture both local and global dependencies in the TF domain. The network comprises a generator and a metric discriminator, with the latter designed to learn and optimize non-differentiable evaluation metric scores like PESQ using a GAN framework. This strategy aims to bridge the gap between traditional loss functions and the subjective quality perceived by human listeners.
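The metric-learning idea above can be illustrated with a small sketch. It follows the general MetricGAN-style formulation, in which the discriminator is asked to output 1 for a clean/clean pair and the normalized metric score for a clean/enhanced pair, while the generator is pushed toward a score of 1. The function names, the squared-error form, and the PESQ range used for normalization are assumptions made for illustration; the paper's exact objective may differ.

```python
import numpy as np

def normalize_pesq(pesq, lo=-0.5, hi=4.5):
    """Map a raw PESQ score to [0, 1].

    The (lo, hi) range is an assumption for this sketch, not a value
    taken from the paper.
    """
    return float(np.clip((pesq - lo) / (hi - lo), 0.0, 1.0))

def discriminator_loss(d_clean_clean, d_clean_enhanced, q_pesq):
    """Squared-error target matching for the metric discriminator.

    d_clean_clean:    discriminator output for the (clean, clean) pair
    d_clean_enhanced: discriminator output for the (clean, enhanced) pair
    q_pesq:           normalized PESQ of the enhanced signal, in [0, 1]
    """
    return (d_clean_clean - 1.0) ** 2 + (d_clean_enhanced - q_pesq) ** 2

def generator_adversarial_loss(d_clean_enhanced):
    """The generator tries to make the predicted score reach its maximum of 1."""
    return (d_clean_enhanced - 1.0) ** 2
```

Since PESQ itself is not differentiable, the gradient that reaches the generator comes from the discriminator's prediction of the score rather than from the metric directly.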

Generator Architecture:
- The generator follows an encoder-decoder scheme. A shared encoder fuses the magnitude and complex spectrogram components, and two separate decoders follow: one produces a magnitude mask and the other refines the complex spectrogram.
- Two-stage conformer blocks model dependencies along both the time and frequency axes, which is intended to help the model generalize across noise and distortion types.

Metric Discriminator:
- This component is trained to predict PESQ scores, so the generator can be optimized toward a metric that correlates with perceived speech quality.

Loss Functions:
- The training objective combines several terms: a linear combination of magnitude and complex spectrogram losses, a time-domain waveform reconstruction loss, and an adversarial loss that targets the metric score.
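As a rough illustration of how the two decoder outputs and the loss terms listed above could fit together, the sketch below applies a magnitude mask to the noisy spectrogram (keeping the noisy phase), adds a complex refinement term, and then forms a weighted loss. All function names, the mean-squared-error form, and every weight value are placeholders chosen for this example, not values confirmed by the source.

```python
import numpy as np

def combine_outputs(noisy_spec, mask, complex_residual):
    """Combine the two decoder outputs into one enhanced complex spectrogram.

    noisy_spec:       complex STFT of the noisy input, shape (freq, time)
    mask:             real-valued magnitude mask from the mask decoder
    complex_residual: complex refinement from the complex decoder
    """
    # Masked magnitude combined with the (unchanged) noisy phase.
    masked = mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    return masked + complex_residual

def tf_loss(est_spec, ref_spec, alpha=0.5):
    """Linear combination of magnitude and complex (real/imaginary) errors.

    alpha is a hypothetical mixing weight.
    """
    mag = np.mean((np.abs(est_spec) - np.abs(ref_spec)) ** 2)
    ri = np.mean((est_spec.real - ref_spec.real) ** 2
                 + (est_spec.imag - ref_spec.imag) ** 2)
    return alpha * mag + (1.0 - alpha) * ri

def total_generator_loss(tf, time, adv, w_tf=1.0, w_time=1.0, w_adv=1.0):
    """Weighted sum of TF-domain, time-domain, and adversarial terms.

    The weights are placeholders, not the paper's settings.
    """
    return w_tf * tf + w_time * time + w_adv * adv
```

With a mask of ones and a zero residual, `combine_outputs` returns the noisy spectrogram unchanged, which is a convenient sanity check when wiring up the two branches.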

Experimental Results

Denoising

CMGAN outperforms prior state-of-the-art models on the Voice Bank+DEMAND dataset, achieving a PESQ (perceptual evaluation of speech quality) score of 3.41 and an SSNR (segmental signal-to-noise ratio) of 11.10 dB. Higher is better for both metrics.
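For readers unfamiliar with SSNR, the following is a minimal sketch of a segmental SNR computation: the signal is split into frames, a per-frame SNR in dB is computed and clamped, and the frame values are averaged. The frame length, hop size, and clamping range are common conventions assumed here; the evaluation scripts behind the reported numbers may use different settings.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, hop=256,
                  min_db=-10.0, max_db=35.0, eps=1e-10):
    """Average per-frame SNR in dB between a clean reference and an estimate.

    frame_len, hop, min_db, and max_db are assumed defaults for this sketch.
    """
    clean = np.asarray(clean, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    n = min(len(clean), len(enhanced))

    frame_snrs = []
    for start in range(0, n - frame_len + 1, hop):
        ref = clean[start:start + frame_len]
        err = ref - enhanced[start:start + frame_len]
        snr_db = 10.0 * np.log10((np.sum(ref ** 2) + eps)
                                 / (np.sum(err ** 2) + eps))
        # Clamp so silent or perfectly matched frames do not dominate the mean.
        frame_snrs.append(np.clip(snr_db, min_db, max_db))

    if not frame_snrs:
        raise ValueError("signals are shorter than one frame")
    return float(np.mean(frame_snrs))
```

Clamping each frame before averaging is what distinguishes this from a single global SNR: frames with near-zero speech energy would otherwise produce extreme values.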

Dereverberation

The model is further evaluated on the REVERB challenge dataset, where it surpasses existing methods on metrics such as cepstral distance (CD) and frequency-weighted segmental SNR (FWSegSNR). The results indicate that CMGAN handles reverberant distortions effectively.

Super-resolution

The paper is among the earliest works to attempt super-resolution on complex TF-domain representations. CMGAN achieves superior scores, particularly in SNR, across different upscaling tasks, which suggests it can extend audio bandwidth more effectively than conventional spectral mapping techniques.

Implications and Future Directions

The introduction of CMGAN sets a precedent for integrating advanced architectures like conformers in speech enhancement tasks. Its approach of combining magnitude and complex spectrum analysis offers a holistic view for future SE developments. Potential extensions include real-time processing capabilities and evaluations on ASR systems to validate improvements in recognition performance.

Moreover, the adaptability of the conformer-based structure hints at broader applicability across numerous audio processing scenarios. Continuous refinements in discriminator objective functions and potential hybridization with other deep learning paradigms may further enhance the model's efficacy and robustness.

Overall, CMGAN provides a comprehensive framework for addressing diverse speech enhancement tasks, with promising implications for practical applications such as telecommunications, hearing aids, and ASR systems.

GitHub

GitHub - ruizhecao96/CMGAN: Conformer-based Metric GAN for speech enhancement (369 stars)