
MP-SENet: Parallel Speech Enhancement

Updated 10 November 2025
  • MP-SENet is a neural speech enhancement model that explicitly estimates compressed magnitude and wrapped phase spectra in parallel in the STFT domain.
  • It employs a hybrid convolutional and Transformer-based encoder–decoder design with multi-level losses, including adversarial metric feedback.
  • Empirical results show state-of-the-art performance in denoising, dereverberation, and bandwidth extension with a compact 2.26M-parameter model.

MP-SENet is a neural speech enhancement architecture that performs parallel magnitude and phase denoising in the short-time Fourier transform (STFT) domain. Unlike prior approaches that emphasize magnitude estimation or treat phase implicitly, MP-SENet introduces explicit, parallel estimation of compressed magnitude and wrapped phase spectra. Its hybrid convolutional and Transformer-based encoder–decoder design, coupled with multi-level losses and adversarial metric feedback, supports unified and effective solutions for denoising, dereverberation, and bandwidth extension.

1. Architecture and Signal Flow

MP-SENet operates on the STFT representation of a noisy time-domain waveform y ∈ ℝ^L. The front-end applies a 400-point FFT with a 400-sample window and 100-sample hop, yielding per-frame magnitude Y_m ∈ ℝ^(T×F) and wrapped phase Y_p ∈ [−π, π]^(T×F).

Magnitude Compression and Stacking

  • Magnitude is power-law compressed: Y_m^c = (Y_m)^c, with c = 0.3.
  • The input feature stacks compressed magnitude and phase: Y_in(t, f, :) = [Y_m^c(t, f), Y_p(t, f)], so Y_in ∈ ℝ^(T×F×2).
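As a concrete illustration, the front-end can be sketched in numpy. The framing below uses the paper's 400-point FFT and 100-sample hop; the function name and the no-padding framing are illustrative choices, not the reference implementation.

```python
import numpy as np

def stft_features(y, n_fft=400, hop=100, c=0.3):
    """Frame a waveform, take the FFT, and stack compressed magnitude
    with wrapped phase into a (T, F, 2) input tensor."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop          # T (no padding, for brevity)
    frames = np.stack([y[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=-1)    # (T, F) with F = n_fft//2 + 1
    mag_c = np.abs(spec) ** c                       # compressed magnitude Y_m^c
    phase = np.angle(spec)                          # wrapped phase in (-pi, pi]
    return np.stack([mag_c, phase], axis=-1)        # (T, F, 2)

y = np.random.randn(16000)                          # 1 s of "audio" at 16 kHz
features = stft_features(y)
print(features.shape)                               # (157, 201, 2)
```

With a 400-point FFT the model sees F = 201 frequency bins, matching the training protocol below.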

Encoder

  • Initial 2D Conv → InstanceNorm → PReLU block for channel lifting.
  • Dilated DenseNet with four 1D-conv layers (dilations 1, 2, 4, 8) along time, with dense connections.
  • Second 2D Conv block with stride 2 to downsample in time and frequency.
  • Output: R₀ ∈ ℝ^(T′×F′×C), where T′ and F′ are the downsampled time and frequency dimensions and C is the channel width.

Time/Frequency Transformers ("TS-Conformers" or "TF-Transformers")

  • Four stacked blocks, each combining:
    • Multi-head self-attention over time/frequency grid.
    • Depthwise-separable convolution + GLU.
    • Feedforward layers with residuals.
  • Captures both global temporal/frequency context and local structure.
  • Output: a rich compressed TF representation of the same shape as R₀.

Parallel Decoders

  • Magnitude mask decoder: Predicts a compressed-domain mask M̂ via a dilated DenseNet, deconvolutional upsampling, a 1×1 2D convolution, and a learnable sigmoid ("LSigmoid") activation:

LSigmoid(x) = β / (1 + e^(−αx)), with β = 2.0 and α trainable per frequency bin, bounding the mask in (0, β).

The enhanced magnitude is recovered as:

X̂_m = (Y_m^c ⊙ M̂)^(1/c)

  • Phase decoder: Runs a matching upsampling path, with two parallel 2D conv heads that produce pseudo-real X̂_r and pseudo-imaginary X̂_i outputs. Phase is inferred using a modified two-argument arctangent with wrap handling:

X̂_p = arctan(X̂_i / X̂_r) − (π/2) · sgn*(X̂_i) · [sgn*(X̂_r) − 1]

where sgn*(t) = 1 if t ≥ 0 and sgn*(t) = −1 otherwise.

An inverse STFT (ISTFT) reconstructs the enhanced waveform x̂ from (X̂_m, X̂_p).
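A minimal numpy sketch of how the decoder outputs are post-processed, assuming the decoder activations are given as arrays; `lsigmoid` follows the LSigmoid form above with β = 2.0, and np.arctan2 reproduces the sign-corrected arctangent in one call.

```python
import numpy as np

def lsigmoid(x, alpha, beta=2.0):
    """Learnable sigmoid: bounds the mask in (0, beta); alpha is trainable
    per frequency bin in the model (fixed here for the demo)."""
    return beta / (1.0 + np.exp(-alpha * x))

def decode(mask_logits, pseudo_real, pseudo_imag, y_mag_c, alpha, c=0.3):
    mask = lsigmoid(mask_logits, alpha)              # compressed-domain mask
    x_mag = (y_mag_c * mask) ** (1.0 / c)            # undo power-law compression
    # np.arctan2 resolves the quadrant exactly as the sign-corrected
    # arctangent formula in the text does.
    x_phase = np.arctan2(pseudo_imag, pseudo_real)   # in (-pi, pi]
    return x_mag, x_phase

rng = np.random.default_rng(0)
T, F = 157, 201
alpha = np.ones(F)                                   # per-bin, all ones for the demo
x_mag, x_phase = decode(rng.standard_normal((T, F)),
                        rng.standard_normal((T, F)),
                        rng.standard_normal((T, F)),
                        np.abs(rng.standard_normal((T, F))),  # stands in for Y_m^c
                        alpha)
print(x_mag.shape, x_phase.shape)                    # (157, 201) (157, 201)
```

The bounded mask keeps the compressed enhanced magnitude within a stable dynamic range before decompression, which is one motivation for LSigmoid over an unbounded activation.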

2. Loss Functions and Training Objectives

MP-SENet combines losses at four spectral and one temporal level, balancing magnitude and phase fidelity as well as perceptual metrics:

Loss Type | Functional Form (target x vs. estimate x̂) | Purpose/Domain
L_Time | E[‖x − x̂‖₁] | Waveform (time)
L_Mag | E[‖X_m − X̂_m‖₂²] (on compressed magnitudes) | Magnitude (STFT)
L_Com | E[‖X_r − X̂_r‖₂² + ‖X_i − X̂_i‖₂²] | Complex spectrum
L_Pha | E[f_AW(X_p − X̂_p)], with group-delay and IAF variants | Phase (anti-wrapping)
L_Metric | E[(D(X_m, X̂_m) − 1)²] | Adversarial (PESQ proxy)

where phase losses use the anti-wrap operator:

f_AW(t) = |t − 2π · round(t / (2π))|

which maps any phase difference into [0, π].
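The anti-wrap operator transcribes directly; this sketch assumes the round-based form given above.

```python
import numpy as np

def anti_wrap(t):
    """f_AW(t) = |t - 2*pi*round(t/(2*pi))|: principal absolute phase
    difference in [0, pi], insensitive to 2*pi wrapping."""
    return np.abs(t - 2 * np.pi * np.round(t / (2 * np.pi)))

diffs = np.array([0.1, np.pi, 3 * np.pi, -2 * np.pi + 0.2])
print(anti_wrap(diffs))    # ≈ [0.1, π, π, 0.2]: wrapped copies map together
```

Because 3π and π map to the same value, the loss never penalizes a prediction for landing on an equivalent wrapped copy of the target phase.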

The total generator loss is:

L_G = λ_Metric L_Metric + λ_Mag L_Mag + λ_Pha L_Pha + λ_Com L_Com + λ_Time L_Time

with empirically set weights λ_Metric, λ_Mag, λ_Pha, λ_Com, λ_Time (Lu et al., 2023).

The adversarial metric loss leverages a discriminator D (as in MetricGAN/CMGAN) that outputs a value approximating scaled PESQ. Discriminator and generator update steps alternate.
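A least-squares, MetricGAN-style pair of objectives can be sketched as below; the `D` outputs here are toy scalars standing in for the discriminator's scaled-PESQ predictions, and the exact conditioning of the real discriminator may differ.

```python
import numpy as np

def generator_metric_loss(d_enhanced):
    """Pull D's quality estimate for enhanced speech toward 1 (best score)."""
    return np.mean((d_enhanced - 1.0) ** 2)

def discriminator_loss(d_clean, d_enhanced, q_pesq):
    """Teach D to score clean speech as 1 and enhanced speech as its
    true normalized PESQ q_pesq."""
    return np.mean((d_clean - 1.0) ** 2) + np.mean((d_enhanced - q_pesq) ** 2)

# Toy batch of 4 utterances: D's scores and the true normalized PESQ values.
d_clean = np.array([0.98, 0.97, 0.99, 0.96])
d_enh = np.array([0.70, 0.80, 0.65, 0.75])
q = np.array([0.72, 0.78, 0.70, 0.74])
print(generator_metric_loss(d_enh))                  # 0.07875
print(discriminator_loss(d_clean, d_enh, q))
```

Since PESQ itself is non-differentiable, gradients flow to the generator only through D's learned approximation of it.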

3. Training Protocols

Key training configurations for reproducibility:

  • Data: VoiceBank+DEMAND (11,572 training utterances from 28 speakers; 824 test utterances from 2 speakers; resampled to 16 kHz). Supplementary: DNS Challenge corpus, REVERB Challenge, VCTK (for bandwidth extension scenarios).
  • STFT Front-End: FFT size 400, window 400, hop 100, yielding 201 frequency bins.
  • Optimizer: AdamW (β₁ = 0.8, β₂ = 0.99, weight decay 0.01).
  • Learning Rate: initial 5×10⁻⁴, exponentially decayed each epoch (or halved every 30 epochs in alternative setups), over 100 epochs or 500k steps.
  • Batch Size: 4–8.
  • Model Size: Approximately 2.26M parameters.
  • Magnitude Compressor: power-law exponent c = 0.3.
  • LSigmoid Parameters: β = 2.0; α trainable per frequency bin.

Training alternates between minimizing the generator loss L_G and the discriminator loss.

4. Empirical Performance and Ablation Results

On VoiceBank+DEMAND (16 kHz, seen/unseen conditions), MP-SENet establishes state-of-the-art performance, notably:

Method PESQ CSIG CBAK COVL SSNR STOI
Noisy 1.97 3.35 2.44 2.63 1.68 0.92
SEGAN 2.16 3.48 2.94 2.80 7.73 0.93
MetricGAN+ 3.15 4.14 3.47 3.61 12.08 0.94
DPT-FSNet 3.33 — — — — —
TridentSE 3.47 4.70 3.81 4.10 — 0.96
CMGAN 3.41 4.63 3.94 4.12 11.10 0.96
PHASEN 2.99 4.21 3.33 3.61 11.54 0.96
MP-SENet 3.50 4.73 3.95 4.22 10.64 0.96

MP-SENet achieves the highest PESQ (3.50) and MOS proxies (CSIG, COVL), reflecting improved perceptual speech quality due to explicit and parallel magnitude–phase denoising.

Ablation experiments highlight the architectural and training design choices:

  • Removing magnitude compression: PESQ drops to 2.97.
  • Replacing LSigmoid with PReLU: PESQ to 3.40.
  • Omitting the dedicated phase decoder or the explicit phase loss: PESQ to 3.31 and 3.39, respectively.
  • Disabling complex-spectrum loss or adversarial training: PESQ to 3.44 and 3.39, respectively.

Explicit parallel phase modeling and anti-wrapping losses yield measurable quality improvements, outperforming models that use phase conditioning or implicit complex masking.

Further validation on DNS Challenge (3,000 hours) and REVERB/VCTK demonstrates transferability: denoising PESQ up to 3.62, dereverberation SRMR up to 6.67, bandwidth extension WB-PESQ up to 4.28 (Lu et al., 2023).

5. The Role of Explicit Phase Modeling

MP-SENet's principal innovation is parallel, explicit estimation of magnitude and wrapped phase, avoiding the classic magnitude–phase compensation effect. Its architecture delivers phase estimates via direct prediction and anti-wrapping losses, rather than relying on magnitude-only or complex-valued masking, yielding lower phase distortion (as measured by group delay and related phase metrics).

Empirical ablation supports this approach:

  • "Magnitude only" (no phase decoder): increased phase distortion, reduced PESQ.
  • "Complex only": does not match the perceptual improvements of explicit decoders.
  • "w/o phase loss": increased phase error (by group delay and instantaneous angular frequency metrics), reduced quality.

This suggests that parallel treatment of magnitude and phase, along with multi-level objectives, allows for finer control of both perceptual and instrumental enhancement scores, and mitigates compensation artifacts typical of magnitude-only systems.

6. Multi-Task and Task-Transfer Enhancement

The MP-SENet design natively accommodates multiple speech enhancement objectives through decoder and loss reconfiguration:

  • Denoising and dereverberation: use learnable sigmoid for mask estimation.
  • Bandwidth extension: swap in a PReLU activation for unbounded mask support.
  • Multi-level loss composition and flexible downstream targets enable the same architecture to outperform specialty models over diverse benchmarks without architectural changes, as validated on VoiceBank+DEMAND, REVERB, and VCTK.
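The task-dependent activation swap amounts to exchanging the bounded LSigmoid for an unbounded PReLU-style activation; in this sketch the fixed 0.2 slope stands in for PReLU's learnable parameter, and the task names are illustrative.

```python
import numpy as np

def mask_activation(task, x, alpha, beta=2.0):
    """Bounded mask for denoising/dereverberation, unbounded for BWE."""
    if task in ("denoise", "dereverb"):
        return beta / (1.0 + np.exp(-alpha * x))   # LSigmoid: range (0, beta)
    return np.maximum(0.2 * x, x)                  # PReLU-like: can exceed beta

x = np.linspace(-3, 3, 7)
print(mask_activation("denoise", x, alpha=1.0).max() < 2.0)   # True: bounded
print(mask_activation("bwe", x, alpha=1.0).max())             # 3.0: unbounded
```

Bandwidth extension must synthesize energy in bins where the input has nearly none, so the mask may need to exceed 1 by a large factor, which is why the bounded activation is swapped out.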

A plausible implication is that parallel encoder–decoder frameworks with versatile loss aggregation may serve as a robust universal backbone for speech restoration tasks.

7. Context within Neural Speech Enhancement

MP-SENet is distinct from magnitude-only approaches (e.g., MetricGAN+) and complex-domain approaches (DCCRN, CMGAN, TridentSE) in that it targets the phase denoising bottleneck through explicit parallel pathways and anti-wrapping phase objectives.

The compact model size (2.26M parameters) and the ability to train via metric discriminators (i.e., PESQ-approximating feedback) align it with large-scale, adversarially optimized architectures, yet it achieves state-of-the-art instrumental and perceptual metrics across major speech enhancement benchmarks.

The design encourages further exploration of explicit phase processing, unified magnitude–phase models, and scales favorably due to moderate parameter count and TF-Transformer modularity. Explicit parallel estimation emerges as a principled alternative to implicit or sequential phase handling and is supported by objective and subjective evaluations.

Summary Table: MP-SENet Core Elements

Component Description Key Value/Formula
Magnitude Compressor Power-law, to regularize dynamic range Y_m^c = (Y_m)^0.3
LSigmoid Activation β/(1 + e^(−αx)), β = 2.0, α trainable per bin Mask range (0, β)
Phase Decoder Parallel 2D-conv heads (real, imag) arctan2-style formula with anti-wrapping, above
Multi-level Loss Time, magnitude, complex, anti-wrapped phase, adversarial Per equations above
TF-Transformer Stack Four blocks; captures long/short-range TF context Multi-head, depthwise conv
Discriminator Predicts metric surrogate (e.g., PESQ) MetricGAN-style loss
Best Denoising PESQ 3.50 (VB+DEMAND), 3.62 (DNS); best among cited methods

In sum, MP-SENet leverages parallel, explicit magnitude and phase processing with multi-level loss optimization, achieving leading performance in denoising, dereverberation, and bandwidth extension, and demonstrating the efficacy of explicit phase modeling in modern neural speech enhancement (Lu et al., 2023).
