
MP-SENet: Parallel Speech Enhancement

Updated 10 November 2025
  • MP-SENet is a neural speech enhancement model that explicitly estimates compressed magnitude and wrapped phase spectra in parallel in the STFT domain.
  • It employs a hybrid convolutional and Transformer-based encoder–decoder design with multi-level losses, including adversarial metric feedback.
  • Empirical results show state-of-the-art performance in denoising, dereverberation, and bandwidth extension with a compact 2.26M-parameter model.

MP-SENet is a neural speech enhancement architecture that performs parallel magnitude and phase denoising in the short-time Fourier transform (STFT) domain. Unlike prior approaches that emphasize magnitude estimation or treat phase implicitly, MP-SENet introduces explicit, parallel estimation of compressed magnitude and wrapped phase spectra. Its hybrid convolutional and Transformer-based encoder–decoder design, coupled with multi-level losses and adversarial metric feedback, supports unified and effective solutions for denoising, dereverberation, and bandwidth extension.

1. Architecture and Signal Flow

MP-SENet operates on the STFT representation of a noisy time-domain waveform y ∈ ℝ^L. The front-end applies a 400-point FFT with a 400-sample window and 100-sample hop, yielding per-frame magnitude Y_m ∈ ℝ^(T×F) and wrapped phase Y_p ∈ [−π, π]^(T×F).

Magnitude Compression and Stacking

  • Magnitude is power-law compressed: Y_m^c = (Y_m)^c, with c = 0.3.
  • The input feature stacks compressed magnitude and phase: Y_in(t, f, :) = [Y_m^c(t, f), Y_p(t, f)], so Y_in ∈ ℝ^(T×F×2).
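As a concrete illustration, the front-end can be sketched in numpy. The framing below uses the paper's 400-point FFT and 100-sample hop; the function name and the no-padding framing are illustrative choices, not the reference implementation.

```python
import numpy as np

def stft_features(y, n_fft=400, hop=100, c=0.3):
    """Frame a waveform, take the FFT, and stack compressed magnitude
    with wrapped phase into a (T, F, 2) input tensor."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop          # T (no padding, for brevity)
    frames = np.stack([y[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=-1)    # (T, F) with F = n_fft//2 + 1
    mag_c = np.abs(spec) ** c                       # compressed magnitude Y_m^c
    phase = np.angle(spec)                          # wrapped phase in (-pi, pi]
    return np.stack([mag_c, phase], axis=-1)        # (T, F, 2)

y = np.random.randn(16000)                          # 1 s of "audio" at 16 kHz
features = stft_features(y)
print(features.shape)                               # (157, 201, 2)
```

With a 400-point FFT the model sees F = 201 frequency bins, matching the training protocol below.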

Encoder

  • Initial 2D Conv → InstanceNorm → PReLU block for channel lifting.
  • Dilated DenseNet with four 1D-conv layers (dilations 1, 2, 4, 8) along time, with dense connections.
  • Second 2D Conv block with stride 2 to downsample in time and frequency.
  • Output: R₀ ∈ ℝ^(T′×F′×C), where T′ and F′ are the downsampled time and frequency dimensions and C is the channel width.

Time/Frequency Transformers ("TS-Conformers" or "TF-Transformers")

  • Four stacked blocks, each combining:
    • Multi-head self-attention over time/frequency grid.
    • Depthwise-separable convolution + GLU.
    • Feedforward layers with residuals.
  • Captures both global temporal/frequency context and local structure.
  • Output: a rich compressed TF representation of the same shape as R₀.

Parallel Decoders

  • Magnitude mask decoder: Predicts a compressed-domain mask M̂ via a dilated DenseNet, deconvolutional upsampling, a 1×1 2D convolution, and a learnable sigmoid ("LSigmoid") activation:

LSigmoid(x) = β / (1 + e^(−αx)), with β = 2.0 and α trainable per frequency bin, bounding the mask in (0, β).

The enhanced magnitude is recovered as:

X̂_m = (Y_m^c ⊙ M̂)^(1/c)

  • Phase decoder: Runs a matching upsampling path, with two parallel 2D conv heads that produce pseudo-real X̂_r and pseudo-imaginary X̂_i outputs. Phase is inferred using a modified two-argument arctangent with wrap handling:

X̂_p = arctan(X̂_i / X̂_r) − (π/2) · sgn*(X̂_i) · [sgn*(X̂_r) − 1]

where sgn*(t) = 1 if t ≥ 0 and sgn*(t) = −1 otherwise.

An inverse STFT (ISTFT) reconstructs the enhanced waveform x̂ from (X̂_m, X̂_p).
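A minimal numpy sketch of how the decoder outputs are post-processed, assuming the decoder activations are given as arrays; `lsigmoid` follows the LSigmoid form above with β = 2.0, and np.arctan2 reproduces the sign-corrected arctangent in one call.

```python
import numpy as np

def lsigmoid(x, alpha, beta=2.0):
    """Learnable sigmoid: bounds the mask in (0, beta); alpha is trainable
    per frequency bin in the model (fixed here for the demo)."""
    return beta / (1.0 + np.exp(-alpha * x))

def decode(mask_logits, pseudo_real, pseudo_imag, y_mag_c, alpha, c=0.3):
    mask = lsigmoid(mask_logits, alpha)              # compressed-domain mask
    x_mag = (y_mag_c * mask) ** (1.0 / c)            # undo power-law compression
    # np.arctan2 resolves the quadrant exactly as the sign-corrected
    # arctangent formula in the text does.
    x_phase = np.arctan2(pseudo_imag, pseudo_real)   # in (-pi, pi]
    return x_mag, x_phase

rng = np.random.default_rng(0)
T, F = 157, 201
alpha = np.ones(F)                                   # per-bin, all ones for the demo
x_mag, x_phase = decode(rng.standard_normal((T, F)),
                        rng.standard_normal((T, F)),
                        rng.standard_normal((T, F)),
                        np.abs(rng.standard_normal((T, F))),  # stands in for Y_m^c
                        alpha)
print(x_mag.shape, x_phase.shape)                    # (157, 201) (157, 201)
```

The bounded mask keeps the compressed enhanced magnitude within a stable dynamic range before decompression, which is one motivation for LSigmoid over an unbounded activation.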

2. Loss Functions and Training Objectives

MP-SENet combines losses at four spectral and one temporal level, balancing magnitude and phase fidelity as well as perceptual metrics:

Loss Type | Functional Form (target x vs. estimate x̂) | Purpose/Domain
L_Time | E[‖x − x̂‖₁] | Waveform (time)
L_Mag | E[‖X_m − X̂_m‖₂²] (on compressed magnitudes) | Magnitude (STFT)
L_Com | E[‖X_r − X̂_r‖₂² + ‖X_i − X̂_i‖₂²] | Complex spectrum
L_Pha | E[f_AW(X_p − X̂_p)], with group-delay and IAF variants | Phase (anti-wrapping)
L_Metric | E[(D(X_m, X̂_m) − 1)²] | Adversarial (PESQ proxy)

where phase losses use the anti-wrap operator:

f_AW(t) = |t − 2π · round(t / (2π))|

which maps any phase difference into [0, π].
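The anti-wrap operator transcribes directly; this sketch assumes the round-based form given above.

```python
import numpy as np

def anti_wrap(t):
    """f_AW(t) = |t - 2*pi*round(t/(2*pi))|: principal absolute phase
    difference in [0, pi], insensitive to 2*pi wrapping."""
    return np.abs(t - 2 * np.pi * np.round(t / (2 * np.pi)))

diffs = np.array([0.1, np.pi, 3 * np.pi, -2 * np.pi + 0.2])
print(anti_wrap(diffs))    # ≈ [0.1, π, π, 0.2]: wrapped copies map together
```

Because 3π and π map to the same value, the loss never penalizes a prediction for landing on an equivalent wrapped copy of the target phase.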

The total generator loss is:

L_G = λ_Metric L_Metric + λ_Mag L_Mag + λ_Pha L_Pha + λ_Com L_Com + λ_Time L_Time

with empirically set weights λ_Metric, λ_Mag, λ_Pha, λ_Com, λ_Time (Lu et al., 2023).

The adversarial metric loss leverages a discriminator D (as in MetricGAN/CMGAN) that outputs a value approximating scaled PESQ. Discriminator and generator update steps alternate.
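A least-squares, MetricGAN-style pair of objectives can be sketched as below; the `D` outputs here are toy scalars standing in for the discriminator's scaled-PESQ predictions, and the exact conditioning of the real discriminator may differ.

```python
import numpy as np

def generator_metric_loss(d_enhanced):
    """Pull D's quality estimate for enhanced speech toward 1 (best score)."""
    return np.mean((d_enhanced - 1.0) ** 2)

def discriminator_loss(d_clean, d_enhanced, q_pesq):
    """Teach D to score clean speech as 1 and enhanced speech as its
    true normalized PESQ q_pesq."""
    return np.mean((d_clean - 1.0) ** 2) + np.mean((d_enhanced - q_pesq) ** 2)

# Toy batch of 4 utterances: D's scores and the true normalized PESQ values.
d_clean = np.array([0.98, 0.97, 0.99, 0.96])
d_enh = np.array([0.70, 0.80, 0.65, 0.75])
q = np.array([0.72, 0.78, 0.70, 0.74])
print(generator_metric_loss(d_enh))                  # 0.07875
print(discriminator_loss(d_clean, d_enh, q))
```

Since PESQ itself is non-differentiable, gradients flow to the generator only through D's learned approximation of it.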

3. Training Protocols

Key training configurations for reproducibility:

  • Data: VoiceBank+DEMAND (11,572 training utterances from 28 speakers; 824 test utterances from 2 speakers; resampled to 16 kHz). Supplementary: DNS Challenge corpus, REVERB Challenge, VCTK (for bandwidth extension scenarios).
  • STFT Front-End: FFT size 400, window 400, hop 100, yielding 201 frequency bins.
  • Optimizer: AdamW (β₁ = 0.8, β₂ = 0.99, weight decay 0.01).
  • Learning Rate: initial 5×10⁻⁴, exponentially decayed each epoch (or halved every 30 epochs in alternative setups), over 100 epochs or 500k steps.
  • Batch Size: 4–8.
  • Model Size: Approximately 2.26M parameters.
  • Magnitude Compressor: power-law exponent c = 0.3.
  • LSigmoid Parameters: β = 2.0; α trainable per frequency bin.

Training alternates between minimizing the generator loss L_G and the discriminator loss.

4. Empirical Performance and Ablation Results

On VoiceBank+DEMAND (16 kHz, seen/unseen conditions), MP-SENet establishes state-of-the-art performance, notably:

Method PESQ CSIG CBAK COVL SSNR STOI
Noisy 1.97 3.35 2.44 2.63 1.68 0.92
SEGAN 2.16 3.48 2.94 2.80 7.73 0.93
MetricGAN+ 3.15 4.14 3.47 3.61 12.08 0.94
DPT-FSNet 3.33 — — — — —
TridentSE 3.47 4.70 3.81 4.10 — 0.96
CMGAN 3.41 4.63 3.94 4.12 11.10 0.96
PHASEN 2.99 4.21 3.33 3.61 11.54 0.96
MP-SENet 3.50 4.73 3.95 4.22 10.64 0.96

MP-SENet achieves the highest PESQ (3.50) and MOS proxies (CSIG, COVL), reflecting improved perceptual speech quality due to explicit and parallel magnitude–phase denoising.

Ablation experiments highlight the architectural and training design choices:

  • Removing magnitude compression: PESQ drops to 2.97.
  • Replacing LSigmoid with PReLU: PESQ to 3.40.
  • Omitting the dedicated phase decoder or the explicit phase loss: PESQ to 3.31 and 3.39, respectively.
  • Disabling complex-spectrum loss or adversarial training: PESQ to 3.44 and 3.39, respectively.

Explicit parallel phase modeling and anti-wrapping losses yield measurable quality improvements, outperforming models that use phase conditioning or implicit complex masking.

Further validation on DNS Challenge (3,000 hours) and REVERB/VCTK demonstrates transferability: denoising PESQ up to 3.62, dereverberation SRMR up to 6.67, bandwidth extension WB-PESQ up to 4.28 (Lu et al., 2023).

5. The Role of Explicit Phase Modeling

MP-SENet's principal innovation is parallel, explicit estimation of magnitude and wrapped phase, avoiding the classic magnitude–phase compensation effect. Its architecture delivers phase estimates via direct prediction and anti-wrapping losses, rather than relying on magnitude-only or complex-valued masking, yielding lower phase distortion (as measured by group delay and related phase metrics).

Empirical ablation supports this approach:

  • "Magnitude only" (no phase decoder): increased phase distortion, reduced PESQ.
  • "Complex only": does not match the perceptual improvements of explicit decoders.
  • "w/o phase loss": increased phase error (by group delay and instantaneous angular frequency metrics), reduced quality.

This suggests that parallel treatment of magnitude and phase, along with multi-level objectives, allows for finer control of both perceptual and instrumental enhancement scores, and mitigates compensation artifacts typical of magnitude-only systems.

6. Multi-Task and Task-Transfer Enhancement

The MP-SENet design natively accommodates multiple speech enhancement objectives through decoder and loss reconfiguration:

  • Denoising and dereverberation: use learnable sigmoid for mask estimation.
  • Bandwidth extension: swap in a PReLU activation for unbounded mask support.
  • Multi-level loss composition and flexible downstream targets enable the same architecture to outperform specialty models over diverse benchmarks without architectural changes, as validated on VoiceBank+DEMAND, REVERB, and VCTK.
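The task-dependent activation swap amounts to exchanging the bounded LSigmoid for an unbounded PReLU-style activation; in this sketch the fixed 0.2 slope stands in for PReLU's learnable parameter, and the task names are illustrative.

```python
import numpy as np

def mask_activation(task, x, alpha, beta=2.0):
    """Bounded mask for denoising/dereverberation, unbounded for BWE."""
    if task in ("denoise", "dereverb"):
        return beta / (1.0 + np.exp(-alpha * x))   # LSigmoid: range (0, beta)
    return np.maximum(0.2 * x, x)                  # PReLU-like: can exceed beta

x = np.linspace(-3, 3, 7)
print(mask_activation("denoise", x, alpha=1.0).max() < 2.0)   # True: bounded
print(mask_activation("bwe", x, alpha=1.0).max())             # 3.0: unbounded
```

Bandwidth extension must synthesize energy in bins where the input has nearly none, so the mask may need to exceed 1 by a large factor, which is why the bounded activation is swapped out.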

A plausible implication is that parallel encoder–decoder frameworks with versatile loss aggregation may serve as a robust universal backbone for speech restoration tasks.

7. Context within Neural Speech Enhancement

MP-SENet is distinct from magnitude-only approaches (e.g., MetricGAN+) and complex-domain approaches (DCCRN, CMGAN, TridentSE) in that it targets the phase denoising bottleneck through explicit parallel pathways and anti-wrapping phase objectives.

The compact model size (2.26M parameters) and the ability to train via metric discriminators (i.e., PESQ-approximating feedback) align it with large-scale, adversarially optimized architectures, yet it achieves state-of-the-art instrumental and perceptual metrics across major speech enhancement benchmarks.

The design encourages further exploration of explicit phase processing, unified magnitude–phase models, and scales favorably due to moderate parameter count and TF-Transformer modularity. Explicit parallel estimation emerges as a principled alternative to implicit or sequential phase handling and is supported by objective and subjective evaluations.

Summary Table: MP-SENet Core Elements

Component Description Key Value/Formula
Magnitude Compressor Power-law, to regularize dynamic range Y_m^c = (Y_m)^0.3
LSigmoid Activation β/(1 + e^(−αx)), β = 2.0, α trainable per bin Mask range (0, β)
Phase Decoder Parallel 2D-conv heads (real, imag) arctan2-style formula with anti-wrapping, above
Multi-level Loss Time, magnitude, complex, anti-wrapped phase, adversarial Per equations above
TF-Transformer Stack Four blocks; captures long/short-range TF context Multi-head, depthwise conv
Discriminator Predicts metric surrogate (e.g., PESQ) MetricGAN-style loss
Best Denoising PESQ 3.50 (VB+DEMAND), 3.62 (DNS); best among cited methods

In sum, MP-SENet leverages parallel, explicit magnitude and phase processing with multi-level loss optimization, achieving leading performance in denoising, dereverberation, and bandwidth extension, and demonstrating the efficacy of explicit phase modeling in modern neural speech enhancement (Lu et al., 2023).
