MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement (2104.03538v2)

Published 8 Apr 2021 in cs.SD, cs.AI, and eess.AS

Abstract: The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discriminator. Because only the scores of the target evaluation functions are needed during training, the metrics can even be non-differentiable. In this study, we propose a MetricGAN+ in which three training techniques incorporating domain-knowledge of speech processing are proposed. With these techniques, experimental results on the VoiceBank-DEMAND dataset show that MetricGAN+ can increase PESQ score by 0.3 compared to the previous MetricGAN and achieve state-of-the-art results (PESQ score = 3.15).

Citations (192)

Summary

  • The paper presents an innovative speech enhancement model that optimizes perceptual metrics like PESQ using an adversarial framework.
  • It enhances discriminator training by incorporating noisy speech and an experience replay buffer to stabilize learning.
  • Experiments on the VoiceBank-DEMAND dataset demonstrated a 0.3 PESQ score improvement over MetricGAN, indicating better performance and efficiency.

An Improved Approach to Black-box Speech Enhancement via MetricGAN+

The paper "MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement" presents advancements in the field of speech enhancement to bridge the gap between traditional signal-level cost functions and human auditory perception. The authors leverage MetricGAN+, which enhances objective metrics for speech enhancement models by incorporating domain knowledge of speech processing. This approach optimizes scores such as the perceptual evaluation of speech quality (PESQ) using adversarial techniques similar to GANs, facilitating higher training efficiency than conventional methods.

Improvements from MetricGAN to MetricGAN+

MetricGAN+ improves on its predecessor, MetricGAN, through several key changes. The discriminator is additionally trained on noisy speech, an idea inspired by Kawanaka et al., which stabilizes the learning process. MetricGAN+ also adopts an experience replay buffer, reminiscent of deep Q-network practice, so that the discriminator keeps being trained on previously generated (historical) data and avoids catastrophic forgetting. On the generator side, a learnable sigmoid function replaces the fixed one for mask estimation, giving each frequency bin its own activation slope so that the mask can be shaped differently across frequency bands.
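Two of these changes are easy to make concrete. Below is a minimal, hedged sketch in PyTorch of a per-frequency learnable sigmoid for mask estimation and a small replay buffer from which the discriminator re-samples earlier generator outputs. The class names, the 1.2 mask ceiling, the initialization, and the buffer's capacity and eviction policy are illustrative assumptions, not necessarily the paper's exact implementation.

```python
import random
import torch
import torch.nn as nn

class LearnableSigmoid(nn.Module):
    """Per-frequency learnable sigmoid for mask estimation (sketch).

    Each frequency bin f has its own slope alpha_f, so the activation can
    behave almost linearly in some bands and saturate in others. The 1.2
    scale (letting the mask slightly exceed 1) and the all-ones init are
    assumptions for illustration.
    """
    def __init__(self, n_freq: int, scale: float = 1.2):
        super().__init__()
        self.scale = scale
        self.alpha = nn.Parameter(torch.ones(n_freq))  # one slope per bin

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n_freq) raw mask logits from the generator
        return self.scale * torch.sigmoid(self.alpha * x)


class ReplayBuffer:
    """Tiny experience-replay buffer for discriminator training (sketch).

    Past (enhanced, clean, score) triples are stored and re-sampled so the
    discriminator keeps fitting the scores of earlier generator outputs and
    does not catastrophically forget them. Capacity and random eviction are
    illustrative choices.
    """
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.items = []

    def push(self, enhanced_mag, clean_mag, score):
        if len(self.items) >= self.capacity:
            self.items.pop(random.randrange(len(self.items)))
        self.items.append((enhanced_mag.detach(), clean_mag, score))

    def sample(self, k: int):
        return random.sample(self.items, min(k, len(self.items)))
```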

Experimental Insights and Results

Experiments on the VoiceBank-DEMAND dataset show that MetricGAN+ reaches a PESQ score of 3.15, a 0.3 increase over MetricGAN and a clear gain over competing models. The analysis indicates that MetricGAN+ improves multiple evaluation metrics and that including noisy speech in discriminator training is particularly beneficial. The learnable sigmoid also proves effective for spectral mask estimation: its learned slopes make the activation nearly linear across most frequency bins, which improves both enhancement quality and training efficiency.

Implications and Future Directions

MetricGAN+ exemplifies how incorporating domain expertise into a black-box optimization framework can improve speech enhancement outcomes. Because the discriminator only requires scalar scores, the approach can in principle target metrics beyond auditory quality, such as speech intelligibility or word error rate (WER) under noisy conditions. Future work might refine the discriminator architecture, for example with attention mechanisms, to better approximate more complex target metrics, and incremental learning could address limitations in training efficiency and replay-buffer management.

Overall, the paper makes a compelling case for aligning model training with perceptual metrics, paving the way for further improvements in speech enhancement frameworks.