MetricGAN+: Speech Enhancement via Perceptual Metrics
- The paper introduces MetricGAN+, a framework that directly targets perceptual metrics (e.g., PESQ) using a differentiable surrogate to drive adversarial training.
- MetricGAN+ employs a two-player architecture with a bidirectional LSTM-based generator and a convolutional discriminator to improve objective scores, achieving a PESQ boost from 2.86 to 3.15.
- Its domain-informed training enhancements, including noisy-speech supervision for the discriminator and an experience replay buffer, improve training stability and robustness in challenging noise conditions.
MetricGAN+ is a speech enhancement framework that advances over conventional loss-based and adversarial approaches by directly targeting non-differentiable perceptual metrics, such as PESQ, via a black-box surrogate mechanism. By synthesizing insights from both speech signal processing and modern adversarial training, MetricGAN+ aligns the optimization of enhancement models more closely with human auditory perception, bridging the frequent gap between minimized signal distortions (e.g., L1, L2 losses) and actual perceptual quality as measured by human listeners and objective metrics.
1. Motivation and Theoretical Context
Conventional speech enhancement models are primarily trained to minimize analytic loss functions (e.g., mean squared error) that only loosely correlate with human perception. In practice, this leads to enhanced speech that exhibits lower measured distortion but may remain perceptually unsatisfactory. MetricGAN, as a precursor to MetricGAN+, addressed this disconnect by embedding a non-differentiable metric (such as PESQ or STOI) into a GAN-like adversarial framework, using a trainable surrogate network to approximate the black-box metric and backpropagate its predictions to the enhancement generator. MetricGAN+ extends this foundation with domain-specific techniques that increase both the reliability of metric approximation and the stability of training, resulting in significantly improved metric alignment and robustness (Fu et al., 2021).
2. Architecture and Signal Flow
MetricGAN+ adopts a two-player structure consisting of a generator $G$ and a discriminator $D$, leveraging architectures specialized for speech enhancement:
- Generator ($G$):
- Inputs: log-magnitude noisy spectrogram ($257$ frequency bins $\times$ $T$ temporal frames).
- Composition: Two bidirectional LSTM layers (200 hidden units each), a fully connected 300-unit LeakyReLU layer, and a 257-unit output layer with a frequency-wise learnable sigmoid activation. The learnable sigmoid is parameterized as:

$$y = \frac{1.2}{1 + e^{-\alpha_f x}}$$

where the slope $\alpha_f$ is frequency-specific and learned, permitting soft masking in low frequencies and binary-like masking in high frequencies.
- Output: Enhancement mask $M$, with pointwise clamping (lower bound $0.05$) to preserve some energy in all bands.
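As a concrete illustration, here is a minimal PyTorch sketch of this generator; class and variable names are ours, and details the text does not pin down (dropout, initialization) are omitted:

```python
import torch
import torch.nn as nn

class LearnableSigmoid(nn.Module):
    """Frequency-wise learnable sigmoid: y = 1.2 * sigmoid(alpha_f * x),
    with one trainable slope alpha_f per frequency bin."""
    def __init__(self, n_freq: int = 257, beta: float = 1.2):
        super().__init__()
        self.beta = beta
        self.alpha = nn.Parameter(torch.ones(n_freq))  # alpha_f, learned per bin

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.beta * torch.sigmoid(self.alpha * x)

class Generator(nn.Module):
    """BLSTM mask estimator: (batch, frames, 257) log-magnitude in, mask out."""
    def __init__(self, n_freq: int = 257):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, 200, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(400, 300), nn.LeakyReLU(),  # 2 x 200 BLSTM states in
            nn.Linear(300, n_freq),
        )
        self.act = LearnableSigmoid(n_freq)

    def forward(self, noisy_logmag: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm(noisy_logmag)
        mask = self.act(self.fc(h))
        return mask.clamp(min=0.05)  # floor preserves some energy per band
```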
- Discriminator ($D$):
- Inputs: Pairs of spectrogram features $(\hat{s}, s)$, where $\hat{s}$ is the enhanced output of $G$, the raw noisy speech, or the clean speech itself, and $s$ is the clean reference.
- Architecture: Four 2D Conv layers (15 filters of size $5 \times 5$ each), global average pooling, then three fully connected layers (50 and 10 LeakyReLU units, and a linear scalar output).
- Output: Prediction of the normalized metric (e.g., normalized PESQ). Spectral normalization is applied throughout to enforce 1-Lipschitz continuity.
This combination allows $D$ to act as a differentiable proxy for an otherwise non-differentiable quality metric, thus transmitting usable gradients to $G$ for direct perceptual optimization.
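A minimal PyTorch sketch of this metric-predicting discriminator follows; the two spectrograms are stacked as input channels, and the padding/stride choices here are assumptions the text does not specify:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Discriminator(nn.Module):
    """CNN metric predictor: (processed, clean) magnitude spectrograms in,
    scalar estimate of the normalized metric out."""
    def __init__(self, n_filters: int = 15):
        super().__init__()
        conv = lambda c_in, c_out: spectral_norm(
            nn.Conv2d(c_in, c_out, kernel_size=5, padding=2))
        self.convs = nn.Sequential(
            conv(2, n_filters), nn.LeakyReLU(),
            conv(n_filters, n_filters), nn.LeakyReLU(),
            conv(n_filters, n_filters), nn.LeakyReLU(),
            conv(n_filters, n_filters), nn.LeakyReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
        )
        self.fc = nn.Sequential(
            spectral_norm(nn.Linear(n_filters, 50)), nn.LeakyReLU(),
            spectral_norm(nn.Linear(50, 10)), nn.LeakyReLU(),
            spectral_norm(nn.Linear(10, 1)),  # linear scalar output
        )

    def forward(self, proc: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
        x = torch.stack([proc, clean], dim=1)   # (B, 2, T, F)
        return self.fc(self.convs(x).flatten(1)).squeeze(-1)
```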
3. Domain-Informed Training Enhancements
MetricGAN+ incorporates three training strategies, rooted in speech processing, to realize effective adversarial learning:
- Noisy-Speech Supervision for $D$: Unlike the original MetricGAN, $D$ is trained on an extended set of pairs: clean-clean, enhanced-clean, and noisy-clean. The loss function is:

$$L_D = \mathbb{E}\Big[\big(D(s, s) - 1\big)^2 + \big(D(G(x), s) - Q'(G(x), s)\big)^2 + \big(D(x, s) - Q'(x, s)\big)^2\Big]$$

Here, $s$ denotes clean speech, $x$ noisy speech, and $Q'$ the normalized metric. The inclusion of noisy-clean supervision explicitly anchors the metric proxy across the entire perceptual quality spectrum, preventing bias toward high-quality regions and improving $D$'s calibration.
- Experience Replay Buffer: To counter catastrophic forgetting in $D$, a replay buffer of past enhanced outputs and their metric scores is maintained. During each update, 20% of batch samples are retrieved from this history, sustaining $D$'s robustness over time and across the generator's evolving output distributions (a minimal sketch of such a buffer follows this list).
- Learnable Frequency-wise Sigmoid for $G$: The mask output utilizes a per-band learnable sigmoid, improving spectral adaptability. Empirically, $\alpha_f$ tends to be smaller (soft, near-linear masking) in speech-rich low and mid frequencies, while peaking at higher bands (hard, binary-like masking), aligning with typical speech-noise energy distributions and benefiting gradient flow.
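The replay buffer referenced above could look like the following sketch; the capacity and random-eviction policy are our assumptions, while the 20% mixing ratio follows the text:

```python
import random

class ReplayBuffer:
    """Stores past (enhanced, clean, metric_score) triplets so the
    discriminator keeps seeing the generator's earlier output distributions."""
    def __init__(self, capacity: int = 5000):
        self.capacity = capacity
        self.items = []

    def push(self, enhanced, clean, score):
        if len(self.items) >= self.capacity:
            self.items.pop(random.randrange(len(self.items)))  # evict at random
        self.items.append((enhanced, clean, score))

    def sample(self, k: int):
        return random.sample(self.items, min(k, len(self.items)))

# During each discriminator update, mix ~20% historical samples into the batch:
# batch = fresh_pairs[: int(0.8 * B)] + buffer.sample(int(0.2 * B))
```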
4. Loss Functions and Handling Non-differentiable Metrics
MetricGAN+ enables the training of speech enhancement models with objectives derived directly from non-differentiable metrics:
- Discriminator Loss: $D$ is trained to regress onto the true normalized metric, targeting 1 on clean-clean pairs, the metric score on enhanced-clean pairs, and the metric score on noisy-clean pairs, as given by $L_D$ above.
- Generator Loss: The generator is optimized to maximize predicted quality:

$$L_G = \mathbb{E}\Big[\big(D(G(x), s) - 1\big)^2\Big]$$

where the target score of 1 corresponds to maximal quality under the normalized metric.
By using $D$ to approximate the black-box metric, gradients become available to $G$ despite the target's non-differentiability. This surrogate-based optimization allows MetricGAN+ to directly boost metrics such as PESQ, which are commonly used for objective perceptual evaluation but are not available as differentiable losses (Fu et al., 2021).
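Both objectives are straightforward to express in code. A hedged PyTorch sketch, assuming PESQ's raw range of $[-0.5, 4.5]$ is mapped linearly onto $[0, 1]$ and that the discriminator `D` returns one score per batch item:

```python
import torch

def normalized_pesq(pesq: torch.Tensor) -> torch.Tensor:
    # Raw PESQ lies in [-0.5, 4.5]; map it onto [0, 1] as a bounded target.
    return (pesq + 0.5) / 5.0

def discriminator_loss(D, clean, enhanced, noisy, q_enh, q_noisy):
    # Regress toward 1 on clean-clean pairs and toward the normalized
    # metric scores q_* on enhanced-clean and noisy-clean pairs.
    return ((D(clean, clean) - 1.0) ** 2
            + (D(enhanced, clean) - q_enh) ** 2
            + (D(noisy, clean) - q_noisy) ** 2).mean()

def generator_loss(D, enhanced, clean):
    # Push D's predicted quality of the enhanced output toward the maximum (1).
    return ((D(enhanced, clean) - 1.0) ** 2).mean()
```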
5. Experimental Evaluation and Performance
All experimental results are based on the VoiceBank-DEMAND dataset, following standard protocols (train: 11,572 utterances, 28 speakers, SNRs {0,5,10,15} dB; test: 824 utterances, 2 speakers, SNRs {2.5,7.5,12.5,17.5} dB). In addition to PESQ, the MOS-predictive metrics CSIG, CBAK, and COVL are evaluated, all ranging 1–5.
| Model | PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|
| Noisy | 1.97 | 3.35 | 2.44 | 2.63 |
| BLSTM + MSE | 2.71 | 3.91 | 2.80 | 3.30 |
| SEGAN | 2.42 | 3.61 | 2.61 | 3.01 |
| MetricGAN | 2.86 | 3.99 | 2.94 | 3.42 |
| MetricGAN+ | 3.15 | 4.14 | 3.16 | 3.64 |
MetricGAN+ achieves a PESQ of 3.15 (vs. 2.86 for MetricGAN, a gain of 0.29), matching or surpassing other state-of-the-art systems on all auxiliary metrics. Ablations show that noisy-speech supervision for $D$ yields the largest single gain, with the replay buffer and the learnable sigmoid further stabilizing training and improving results (Fu et al., 2021).
6. Generalization and the MetricGAN+/- Extension
MetricGAN+ has served as the foundation for subsequent generalization-focused extensions such as MetricGAN+/-, which introduces a "de-generator" network $G^{-}$, identical in architecture to $G$, tasked with synthesizing degraded outputs at prescribed intermediate metric scores. By incorporating these intermediate-quality samples into $D$'s training, MetricGAN+/- ensures more robust metric approximation over a wider score range. Empirically, this results in further PESQ improvements and stronger generalization to unseen noise and speaker conditions, as demonstrated on the CHiME3 dataset (Close et al., 2022).
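One plausible formalization of this idea, written as an extension of $L_D$ above and assuming the de-generated samples are scored by the true metric like the other pairs (the exact objective is specified in Close et al., 2022):

```latex
L_D = \mathbb{E}\Big[ (D(s, s) - 1)^2
    + \big(D(G(x), s) - Q'(G(x), s)\big)^2
    + \big(D(x, s) - Q'(x, s)\big)^2
    + \big(D(G^{-}(s), s) - Q'(G^{-}(s), s)\big)^2 \Big]
```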
| Model | PESQ (VB-D) | PESQ (CHiME3 sim) | CSIG (VB-D) | CBAK (VB-D) | COVL (VB-D) |
|---|---|---|---|---|---|
| MetricGAN+ | 3.05–3.17 | 2.14 | 4.05 | 2.91 | 3.59 |
| MetricGAN+/- | 3.22 | 2.38 | 4.05 | 2.94 | 3.62 |
Here, VB-D denotes VoiceBank-DEMAND, and the reported gains are relative (e.g., roughly +3.8% PESQ for MetricGAN+/- over MetricGAN+ on VB-D). Qualitative analyses indicate that MetricGAN+/- avoids over-suppression in non-speech regions and better preserves low-frequency speech energy (Close et al., 2022).
7. Impact and Significance in Speech Enhancement
MetricGAN+ demonstrates that integrating speech-domain knowledge and direct metric surrogacy into adversarial training frameworks yields substantive improvements in both objective metrics and perceived naturalness of enhanced speech. Its architectural simplicity, real-time feasibility, and compatibility with black-box perceptual metrics position it as a foundation for further advancements in robust, perception-aligned speech enhancement. Extension frameworks such as MetricGAN+/- indicate ongoing progress toward more generalizable and robust metric-driven learning, particularly in far-field and unseen noise conditions (Fu et al., 2021, Close et al., 2022).