MP-SENet: Parallel Speech Enhancement
- MP-SENet is a neural speech enhancement model that explicitly estimates compressed magnitude and wrapped phase spectra in parallel in the STFT domain.
- It employs a hybrid convolutional and Transformer-based encoder–decoder design with multi-level losses, including adversarial metric feedback.
- Empirical results show state-of-the-art performance in denoising, dereverberation, and bandwidth extension with a compact 2.26M-parameter model.
MP-SENet is a neural speech enhancement architecture that performs parallel magnitude and phase denoising in the short-time Fourier transform (STFT) domain. Unlike prior approaches that emphasize magnitude estimation or treat phase implicitly, MP-SENet introduces explicit, parallel estimation of compressed magnitude and wrapped phase spectra. Its hybrid convolutional and Transformer-based encoder–decoder design, coupled with multi-level losses and adversarial metric feedback, supports unified and effective solutions for denoising, dereverberation, and bandwidth extension.
1. Architecture and Signal Flow
MP-SENet operates on the STFT representation of a noisy time-domain waveform $y$. The front-end applies a 400-point FFT with a 400-sample window and a 100-sample hop, yielding a per-frame magnitude spectrum $Y_m$ and a wrapped phase spectrum $Y_p \in [-\pi, \pi)$.
Magnitude Compression and Stacking
- Magnitude is compressed: $Y_m^c = (Y_m)^c$ with $c = 0.3$ (power-law compression).
- Input feature is stacked: $X = [Y_m^c, Y_p]$, concatenating the compressed magnitude and wrapped phase along a channel axis.
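A minimal PyTorch sketch of this front-end (the function name and tensor layout are illustrative; a Hann window is assumed):

```python
import torch

def mpsenet_frontend(y: torch.Tensor, n_fft: int = 400, hop: int = 100,
                     c: float = 0.3):
    """Compressed-magnitude/phase input features from a noisy waveform (sketch).

    y: (batch, samples) waveform at 16 kHz.
    Returns X: (batch, frames, freq_bins, 2) stacking [Y_m^c, Y_p].
    """
    window = torch.hann_window(n_fft, device=y.device)
    Y = torch.stft(y, n_fft=n_fft, hop_length=hop, win_length=n_fft,
                   window=window, return_complex=True)   # (B, F, T), complex
    mag = Y.abs()                       # magnitude spectrum Y_m
    phase = Y.angle()                   # wrapped phase Y_p in [-pi, pi)
    mag_c = mag.pow(c)                  # power-law compression Y_m^c
    X = torch.stack([mag_c, phase], dim=-1).transpose(1, 2)  # (B, T, F, 2)
    return X, mag_c, phase
```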
Encoder
- Initial 2D convolutional block (convolution → instance normalization → PReLU) for channel lifting.
- Dilated DenseNet of four convolutional layers with time dilations 1, 2, 4, 8 and dense inter-layer connections.
- Second 2D convolutional block with stride 2 along frequency, halving the frequency dimension.
- Output: an encoded feature $D \in \mathbb{R}^{T \times F' \times C}$, where $F' = F/2$.
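A compact sketch of the dilated DenseNet component, assuming 64 channels and CMGAN-style dense 2D convolutions (layer details are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Four densely connected 2D convs with time dilations 1, 2, 4, 8 (sketch)."""
    def __init__(self, ch: int = 64, depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(depth):
            d = 2 ** i  # dilation grows along the time axis only
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch * (i + 1), ch, kernel_size=(3, 3),
                          dilation=(d, 1), padding=(d, 1)),
                nn.InstanceNorm2d(ch, affine=True),
                nn.PReLU(ch),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, F)
        skip = x
        for layer in self.layers:
            out = layer(skip)
            skip = torch.cat([skip, out], dim=1)  # dense connectivity
        return out
```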
Time–Frequency Transformers ("TF-Transformers", replacing the TS-Conformers of CMGAN)
- Four stacked blocks, each combining:
- Multi-head self-attention over time/frequency grid.
- Depthwise-separable convolution + GLU.
- Feedforward layers with residuals.
- Captures both global temporal/frequency context and local structure.
- Output: a compressed time–frequency representation $D' \in \mathbb{R}^{T \times F' \times C}$.
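The two-stage attention routing can be sketched as below; the residual feed-forward and convolutional sublayers of the real blocks are omitted, and the class name is hypothetical:

```python
import torch
import torch.nn as nn

class TwoStageAttention(nn.Module):
    """Self-attention along time, then along frequency (sketch)."""
    def __init__(self, ch: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(ch, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, F)
        b, c, t, f = x.shape
        # Stage 1: attend over time, independently per frequency bin.
        xt = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        xt = xt + self.time_attn(xt, xt, xt, need_weights=False)[0]
        # Stage 2: attend over frequency, independently per frame.
        xf = xt.reshape(b, f, t, c).permute(0, 2, 1, 3).reshape(b * t, f, c)
        xf = xf + self.freq_attn(xf, xf, xf, need_weights=False)[0]
        return xf.reshape(b, t, f, c).permute(0, 3, 1, 2)  # back to (B, C, T, F)
```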
Parallel Decoders
- Magnitude mask decoder: predicts a compressed mask $\hat{M}$ via a dilated DenseNet, deconvolutional upsampling (restoring the full frequency dimension), a $1 \times 1$ 2D convolution, and a learnable sigmoid ("LSigmoid") activation:

$$\mathrm{LSigmoid}(x) = \beta \cdot \sigma(\alpha_f x), \quad \alpha_f \text{ trainable per frequency bin}.$$

The enhanced magnitude is recovered by masking the compressed spectrum and decompressing:

$$\hat{X}_m = \left(Y_m^c \odot \hat{M}\right)^{1/c}.$$

- Phase decoder: runs a matching upsampling path, with two parallel convolutional heads producing pseudo-real ($\hat{R}$) and pseudo-imaginary ($\hat{I}$) components. Phase is inferred using a modified two-argument arctangent with wrap handling:

$$\hat{X}_p = \arctan\!\left(\frac{\hat{I}}{\hat{R}}\right) - \frac{\pi}{2}\,\mathrm{sgn}^*(\hat{I})\left[\mathrm{sgn}^*(\hat{R}) - 1\right],$$

where $\mathrm{sgn}^*(t) = 1$ if $t \ge 0$ and $-1$ otherwise.

iSTFT reconstructs the enhanced waveform as $\hat{x} = \mathrm{iSTFT}\!\big(\hat{X}_m e^{j \hat{X}_p}\big)$.
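Minimal sketches of the mask activation, phase reconstruction, and final synthesis, assuming $\beta = 2.0$ and spectra shaped (batch, freq, frames); `reconstruct_phase` matches `torch.atan2(i, r)` numerically:

```python
import torch
import torch.nn as nn

class LearnableSigmoid(nn.Module):
    """LSigmoid(x) = beta * sigmoid(alpha_f * x), alpha_f trainable per bin (sketch)."""
    def __init__(self, n_freq: int, beta: float = 2.0):
        super().__init__()
        self.beta = beta
        self.alpha = nn.Parameter(torch.ones(n_freq))  # one slope per frequency bin

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., n_freq)
        return self.beta * torch.sigmoid(self.alpha * x)

def reconstruct_phase(r: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
    """Wrapped phase from pseudo-real/imaginary heads via the formula above."""
    sgn_i = torch.where(i >= 0, torch.ones_like(i), -torch.ones_like(i))
    sgn_r = torch.where(r >= 0, torch.ones_like(r), -torch.ones_like(r))
    r_safe = torch.where(r == 0, torch.full_like(r, 1e-10), r)  # avoid 0-division
    return torch.atan(i / r_safe) - (torch.pi / 2) * sgn_i * (sgn_r - 1)

def reconstruct_waveform(mag_hat: torch.Tensor, phase_hat: torch.Tensor,
                         n_fft: int = 400, hop: int = 100) -> torch.Tensor:
    """iSTFT of the enhanced spectrum X_m * exp(j * X_p); inputs are (B, F, T)."""
    spec = torch.polar(mag_hat, phase_hat)  # complex spectrum from mag/phase
    window = torch.hann_window(n_fft, device=mag_hat.device)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       win_length=n_fft, window=window)
```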
2. Loss Functions and Training Objectives
MP-SENet combines losses at five levels (time-domain waveform, magnitude spectrum, complex spectrum, anti-wrapped phase, and an adversarial metric), balancing magnitude and phase fidelity as well as perceptual quality:
| Loss Type | Functional Form (target $X$ vs. estimate $\hat{X}$) | Purpose/Domain |
|---|---|---|
| Waveform (time) | $\mathcal{L}_{\mathrm{Time}} = \mathbb{E}\big[\lVert x - \hat{x} \rVert_1\big]$ | Time-domain fidelity |
| Magnitude (STFT) | $\mathcal{L}_{\mathrm{Mag}} = \mathbb{E}\big[\lVert X_m - \hat{X}_m \rVert_2^2\big]$ | Magnitude-spectrum fidelity |
| Complex spectrum | $\mathcal{L}_{\mathrm{Com}} = \mathbb{E}\big[\lVert X_r - \hat{X}_r \rVert_2^2 + \lVert X_i - \hat{X}_i \rVert_2^2\big]$ | Real/imaginary consistency |
| Phase (anti-wrapping) | $\mathcal{L}_{\mathrm{Pha}} = \mathcal{L}_{\mathrm{IP}} + \mathcal{L}_{\mathrm{GD}} + \mathcal{L}_{\mathrm{IAF}}$ | Instantaneous phase, group delay, instantaneous angular frequency |
| Adversarial (PESQ proxy) | $\mathcal{L}_{\mathrm{Metric}} = \mathbb{E}\big[(D(X_m, \hat{X}_m) - 1)^2\big]$ | Perceptual metric optimization |
where the phase losses apply the anti-wrapping operator

$$f_{\mathrm{AW}}(t) = \left| t - 2\pi \cdot \mathrm{round}\!\left(\frac{t}{2\pi}\right) \right|$$

to map phase differences into $[0, \pi]$.
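A sketch of the anti-wrapping operator and the resulting phase loss; the equal-weight sum and mean reductions are assumptions:

```python
import torch

def anti_wrap(x: torch.Tensor) -> torch.Tensor:
    """f_AW(t) = |t - 2*pi*round(t / (2*pi))|; maps differences into [0, pi]."""
    return torch.abs(x - 2 * torch.pi * torch.round(x / (2 * torch.pi)))

def phase_loss(p: torch.Tensor, p_hat: torch.Tensor) -> torch.Tensor:
    """Instantaneous-phase + group-delay + IAF losses on (B, T, F) phase spectra."""
    dfreq = lambda q: torch.diff(q, dim=-1)   # finite difference along frequency
    dtime = lambda q: torch.diff(q, dim=-2)   # finite difference along time
    l_ip = anti_wrap(p - p_hat).mean()                  # instantaneous phase
    l_gd = anti_wrap(dfreq(p) - dfreq(p_hat)).mean()    # group delay
    l_iaf = anti_wrap(dtime(p) - dtime(p_hat)).mean()   # instantaneous angular freq.
    return l_ip + l_gd + l_iaf
```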
The total generator loss is

$$\mathcal{L}_G = \lambda_1 \mathcal{L}_{\mathrm{Metric}} + \lambda_2 \mathcal{L}_{\mathrm{Mag}} + \lambda_3 \mathcal{L}_{\mathrm{Pha}} + \lambda_4 \mathcal{L}_{\mathrm{Com}} + \lambda_5 \mathcal{L}_{\mathrm{Time}},$$

with the weights $\lambda_1, \dots, \lambda_5$ set empirically (Lu et al., 2023).
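Put together, a hedged sketch of the generator objective (argument layout and reductions are assumptions; `phase_loss` is defined in the sketch above):

```python
import torch
import torch.nn.functional as F

def generator_loss(x, x_hat, mag, mag_hat, cplx, cplx_hat, p, p_hat,
                   disc_score, weights):
    """Weighted multi-level generator loss (sketch; see the paper for weights).

    cplx / cplx_hat are complex STFTs; disc_score is the metric
    discriminator's output for the (clean, enhanced) magnitude pair.
    """
    l_time = F.l1_loss(x_hat, x)                               # waveform
    l_mag = F.mse_loss(mag_hat, mag)                           # magnitude
    l_com = (F.mse_loss(cplx_hat.real, cplx.real)
             + F.mse_loss(cplx_hat.imag, cplx.imag))           # complex spectrum
    l_pha = phase_loss(p, p_hat)                               # anti-wrapped phase
    l_metric = torch.mean((disc_score - 1.0) ** 2)             # PESQ-proxy adversarial
    w1, w2, w3, w4, w5 = weights
    return (w1 * l_metric + w2 * l_mag + w3 * l_pha
            + w4 * l_com + w5 * l_time)
```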
The adversarial metric loss leverages a discriminator (as in MetricGAN/CMGAN) whose output approximates a scaled PESQ score; discriminator and generator update steps alternate.
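One MetricGAN-style discriminator update might look as follows; `pesq_label` is the true PESQ of the enhanced utterance rescaled to [0, 1], computed outside this function (helper not shown):

```python
import torch

def discriminator_step(disc, d_opt, clean_mag, enh_mag, pesq_label):
    """Regress the metric surrogate toward 1.0 for clean pairs and toward
    the measured PESQ for enhanced pairs (sketch)."""
    d_clean = disc(clean_mag, clean_mag)        # target: 1.0 (maximum score)
    d_enh = disc(clean_mag, enh_mag.detach())   # target: measured PESQ label
    loss = (torch.mean((d_clean - 1.0) ** 2)
            + torch.mean((d_enh - pesq_label) ** 2))
    d_opt.zero_grad()
    loss.backward()
    d_opt.step()
    return loss
```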
3. Training Protocols
Key training configurations for reproducibility:
- Data: VoiceBank+DEMAND (11,572 train utterances, 28 speakers; 872 test utterances, 2 speakers; 16 kHz resampling). Supplementary: DNS Challenge corpus, REVERB Challenge, VCTK (for bandwidth extension scenarios).
- STFT Front-End: FFT size 400, window 400, hop 100, yielding 201 frequency bins.
- Optimizer: AdamW (momentum and weight-decay settings per Lu et al., 2023).
- Learning Rate: exponentially decayed from its initial value (alternatively, halved every 30 epochs); training runs for roughly 100 epochs or 500k steps.
- Batch Size: 4–8.
- Model Size: Approximately 2.26M parameters.
- Magnitude Compressor: power-law with exponent $c = 0.3$.
- LSigmoid Parameters: $\beta = 2.0$, with $\alpha_f$ trainable per frequency bin.
Training jointly minimizes and the discriminator loss.
4. Empirical Performance and Ablation Results
On VoiceBank+DEMAND (16 kHz, seen/unseen conditions), MP-SENet establishes state-of-the-art performance, notably:
| Method | PESQ | CSIG | CBAK | COVL | SSNR | STOI |
|---|---|---|---|---|---|---|
| Noisy | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 0.92 |
| SEGAN | 2.16 | 3.48 | 2.94 | 2.80 | 7.73 | 0.93 |
| MetricGAN+ | 3.15 | 4.14 | 3.47 | 3.61 | 12.08 | 0.94 |
| DPT-FSNet | 3.33 | — | — | — | — | — |
| TridentSE | 3.47 | 4.70 | 3.81 | 4.10 | — | 0.96 |
| CMGAN | 3.41 | 4.63 | 3.94 | 4.12 | 11.10 | 0.96 |
| PHASEN | 2.99 | 4.21 | 3.33 | 3.61 | 11.54 | 0.96 |
| MP-SENet | 3.50 | 4.73 | 3.95 | 4.22 | 10.64 | 0.96 |
MP-SENet achieves the highest PESQ (3.50) and MOS proxies (CSIG, COVL), reflecting improved perceptual speech quality due to explicit and parallel magnitude–phase denoising.
Ablation experiments highlight the architectural and training design choices:
- Removing magnitude compression: PESQ drops to 2.97.
- Replacing LSigmoid with PReLU: PESQ drops to 3.40.
- Omitting the dedicated phase decoder or the explicit phase loss: PESQ drops to 3.31 and 3.39, respectively.
- Disabling the complex-spectrum loss or adversarial training: PESQ drops to 3.44 and 3.39, respectively.
Explicit parallel phase modeling and anti-wrapping losses yield measurable quality improvements, outperforming models that use phase conditioning or implicit complex masking.
Further validation on DNS Challenge (3,000 hours) and REVERB/VCTK demonstrates transferability: denoising PESQ up to 3.62, dereverberation SRMR up to 6.67, bandwidth extension WB-PESQ up to 4.28 (Lu et al., 2023).
5. The Role of Explicit Phase Modeling
MP-SENet's principal innovation is parallel, explicit estimation of magnitude and wrapped phase, avoiding the classic magnitude–phase compensation effect. Its architecture delivers phase estimates via direct prediction and anti-wrapping losses, rather than relying on magnitude-only or complex-valued masking, yielding lower phase distortion (as measured by group delay and related phase metrics).
Empirical ablation supports this approach:
- "Magnitude only" (no phase decoder): increased phase distortion, reduced PESQ.
- "Complex only": does not match the perceptual improvements of explicit decoders.
- "w/o phase loss": increased phase error (by group delay/IAF), reduced quality.
This suggests that parallel treatment of magnitude and phase, along with multi-level objectives, allows for finer control of both perceptual and instrumental enhancement scores, and mitigates compensation artifacts typical of magnitude-only systems.
6. Multi-Task and Task-Transfer Enhancement
The MP-SENet design natively accommodates multiple speech enhancement objectives through decoder and loss reconfiguration:
- Denoising and dereverberation: use learnable sigmoid for mask estimation.
- Bandwidth extension: swap the LSigmoid for a PReLU activation to support unbounded masks (see the sketch after this list).
- Multi-level loss composition and flexible downstream targets enable the same architecture to outperform specialty models over diverse benchmarks without architectural changes, as validated on VoiceBank+DEMAND, REVERB, and VCTK.
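As an illustration of this reconfiguration, a hypothetical helper choosing the mask-decoder output activation per task (`LearnableSigmoid` is defined in the decoder sketch above):

```python
import torch.nn as nn

def mask_activation(task: str, n_freq: int = 201) -> nn.Module:
    """Pick the mask-decoder output activation per task (hypothetical helper).

    Masking tasks use the bounded learnable sigmoid; bandwidth extension
    needs an unbounded activation so the decoder can add missing energy.
    """
    if task in ("denoising", "dereverberation"):
        return LearnableSigmoid(n_freq)  # bounded mask in (0, beta)
    return nn.PReLU()
```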
A plausible implication is that parallel encoder–decoder frameworks with versatile loss aggregation may serve as a robust universal backbone for speech restoration tasks.
7. Context within Neural Speech Enhancement
MP-SENet is distinct from magnitude-domain approaches (e.g., MetricGAN+) and complex-domain approaches (DCCRN, CMGAN, TridentSE) in that it targets the phase denoising bottleneck through explicit parallel pathways and anti-wrapping phase objectives.
Despite its compact size (2.26M parameters), MP-SENet adopts the metric-discriminator training (i.e., PESQ-approximate feedback) of larger adversarially optimized architectures, and achieves state-of-the-art instrumental and perceptual metrics across major speech enhancement benchmarks.
The design encourages further exploration of explicit phase processing, unified magnitude–phase models, and scales favorably due to moderate parameter count and TF-Transformer modularity. Explicit parallel estimation emerges as a principled alternative to implicit or sequential phase handling and is supported by objective and subjective evaluations.
Summary Table: MP-SENet Core Elements
| Component | Description | Key Value/Formula |
|---|---|---|
| Magnitude Compressor | Power-law compression to regularize dynamic range | $Y_m^c = (Y_m)^{0.3}$ |
| LSigmoid Activation | $\beta \cdot \sigma(\alpha_f x)$, $\alpha_f$ trainable per frequency bin | Mask range $(0, \beta)$ |
| Phase Decoder | Parallel 2D-conv heads (real, imag) + arctangent with wrap handling | Formula above |
| Multi-level Loss | Time, magnitude, complex, anti-wrapped phase, adversarial | Per equations above |
| TF-Transformer Stack | Four blocks; long- and short-range TF context | Multi-head attention, depthwise conv |
| Discriminator | Predicts metric surrogate (e.g., PESQ) | MetricGAN-style loss |
| Best Denoising PESQ | Best among cited methods | 3.50 (VB+DEMAND), 3.62 (DNS) |
In sum, MP-SENet leverages parallel, explicit magnitude and phase processing with multi-level loss optimization, achieving leading performance in denoising, dereverberation, and bandwidth extension and demonstrating the efficacy of explicit phase modeling in modern neural speech enhancement (Lu et al., 2023).