
MP-SENet: Parallel Speech Enhancement

Updated 10 November 2025
  • MP-SENet is a neural speech enhancement model that explicitly estimates compressed magnitude and wrapped phase spectra in parallel in the STFT domain.
  • It employs a hybrid convolutional and Transformer-based encoder–decoder design with multi-level losses, including adversarial metric feedback.
  • Empirical results show state-of-the-art performance in denoising, dereverberation, and bandwidth extension with a compact 2.26M-parameter model.

MP-SENet is a neural speech enhancement architecture that performs parallel magnitude and phase denoising in the short-time Fourier transform (STFT) domain. Unlike prior approaches that emphasize magnitude estimation or treat phase implicitly, MP-SENet introduces explicit, parallel estimation of compressed magnitude and wrapped phase spectra. Its hybrid convolutional and Transformer-based encoder–decoder design, coupled with multi-level losses and adversarial metric feedback, supports unified and effective solutions for denoising, dereverberation, and bandwidth extension.

1. Architecture and Signal Flow

MP-SENet operates on the STFT representation of a noisy time-domain waveform $y \in \mathbb{R}^L$. The front-end applies a 400-point FFT with a 400-sample window and 100-sample hop, yielding per-frame magnitude $Y_m \in \mathbb{R}^{T \times F}$ and wrapped phase $Y_p \in \mathbb{R}^{T \times F}$, with $Y_p \in [-\pi, \pi]$.

Magnitude Compression and Stacking

  • Magnitude is compressed: $Y_m^c = (Y_m)^c$, where $c = 0.3$.
  • The input feature stacks compressed magnitude and phase: $Y_{\mathrm{in}}(t,f,:) = [Y_m^c(t,f),\, Y_p(t,f)] \in \mathbb{R}^{T \times F \times 2}$ (a front-end sketch follows).
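
The front-end can be expressed compactly in PyTorch. The following is a minimal sketch assuming a batched mono waveform at 16 kHz; function and variable names are illustrative, not from the reference code:

```python
import torch

def stft_features(y: torch.Tensor, c: float = 0.3) -> torch.Tensor:
    """Stacked compressed-magnitude / wrapped-phase input features.

    y: (batch, L) time-domain waveform at 16 kHz.
    Returns (batch, T, F, 2) with F = 201 for a 400-point FFT.
    """
    spec = torch.stft(
        y, n_fft=400, hop_length=100, win_length=400,
        window=torch.hann_window(400), return_complex=True,
    )                                  # (batch, F, T), complex-valued
    mag = spec.abs() ** c              # power-law compressed magnitude
    phase = spec.angle()               # wrapped phase in [-pi, pi]
    feats = torch.stack([mag, phase], dim=-1)   # (batch, F, T, 2)
    return feats.permute(0, 2, 1, 3)            # (batch, T, F, 2)
```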

Encoder

  • Initial 2D Conv → InstanceNorm → PReLU block for channel lifting.
  • Dilated DenseNet with four convolutional layers dilated along time (dilations 1, 2, 4, 8) and dense connections (a sketch follows this list).
  • Second 2D Conv block with stride 2 along frequency to halve the spectral resolution.
  • Output: $R_0 \in \mathbb{R}^{T \times F' \times C}$ with $F' \approx F/2$; the time resolution is preserved.
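
A simplified sketch of the dilated dense stage is shown below, assuming 2D convolutions dilated along the time axis and a fixed channel width; the layer count and dilations follow the description above, while details such as normalization placement are assumptions:

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Four densely connected conv layers with time dilations 1, 2, 4, 8."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.ModuleList()
        for i, d in enumerate((1, 2, 4, 8)):
            self.layers.append(nn.Sequential(
                # input channels grow with each dense connection
                nn.Conv2d(channels * (i + 1), channels, kernel_size=(3, 3),
                          dilation=(d, 1), padding=(d, 1)),
                nn.InstanceNorm2d(channels, affine=True),
                nn.PReLU(channels),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, C, T, F)
        skip = x
        for layer in self.layers:
            out = layer(skip)
            skip = torch.cat([skip, out], dim=1)   # dense connection
        return out
```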

Time/Frequency Transformers ("TS-Conformers" or "TF-Transformers")

  • Four stacked blocks, each combining:
    • Multi-head self-attention over time/frequency grid.
    • Depthwise-separable convolution + GLU.
    • Feedforward layers with residuals.
  • Captures both global temporal/frequency context and local structure.
  • Output: $R_4$, a rich compressed time–frequency representation (an attention-only sketch follows this list).
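
The alternating attention pattern can be sketched as follows. This simplification keeps only the two self-attention passes and omits the depthwise-conv/GLU and feedforward sub-modules described above; all module names are illustrative:

```python
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    """One block of self-attention along time, then along frequency."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, T, F, C)
        b, t, f, c = x.shape
        # attend along the time axis: one sequence per frequency bin
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        nt = self.norm_t(xt)
        xt = xt + self.time_attn(nt, nt, nt)[0]
        x = xt.reshape(b, f, t, c).permute(0, 2, 1, 3)
        # attend along the frequency axis: one sequence per frame
        xf = x.reshape(b * t, f, c)
        nf = self.norm_f(xf)
        xf = xf + self.freq_attn(nf, nf, nf)[0]
        return xf.reshape(b, t, f, c)
```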

Parallel Decoders

  • Magnitude mask decoder: predicts a compressed mask $\hat{M}^c \in (0, 2)$ via a DenseNet, deconvolutional upsampling, a $1 \times 1$ 2D convolution, and a learnable sigmoid ("LSigmoid") activation:

$\mathrm{LSigmoid}(t) = \frac{\beta}{1 + \exp(-\alpha t)}, \quad \beta = 2.0,\ \alpha \in \mathbb{R}^F$

The enhanced magnitude is recovered as:

$\hat{X}_m = \left(Y_m^c \odot \hat{M}^c\right)^{1/c}$

  • Phase decoder: runs a matching upsampling path, with two parallel $1 \times 1$ conv heads producing pseudo-real $\hat{X}_p^{(r)}$ and pseudo-imaginary $\hat{X}_p^{(i)}$ outputs. Phase is inferred using a two-argument arctangent with wrap handling:

$\hat{X}_p = \arctan\left(\frac{\hat{X}_p^{(i)}}{\hat{X}_p^{(r)}}\right) - \frac{\pi}{2}\,\mathrm{Sgn}^*\!\left(\hat{X}_p^{(i)}\right)\left[\mathrm{Sgn}^*\!\left(\hat{X}_p^{(r)}\right) - 1\right]$

where $\mathrm{Sgn}^*(t) = +1$ if $t \geq 0$ and $-1$ otherwise.

The ISTFT reconstructs the enhanced waveform $\hat{y}$ from $\hat{X}_m$ and $\hat{X}_p$; a minimal sketch of this decoder post-processing follows.
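
The sketch below assumes the LSigmoid parameterization above; note that `torch.atan2` realizes exactly the two-argument arctangent with quadrant handling, so the explicit $\mathrm{Sgn}^*$ correction need not be coded by hand:

```python
import torch

def recover_magnitude(y_mag_c, mask_logits, alpha, beta=2.0, c=0.3):
    """Apply the LSigmoid mask in the compressed domain, then decompress.

    y_mag_c:     compressed noisy magnitude, (batch, F, T)
    mask_logits: pre-activation decoder output, (batch, F, T)
    alpha:       trainable per-frequency slope, broadcastable, e.g. (F, 1)
    """
    mask = beta * torch.sigmoid(alpha * mask_logits)  # mask values in (0, beta)
    return (y_mag_c * mask) ** (1.0 / c)              # X_m = (Y_m^c . M^c)^(1/c)

def recover_phase(p_real, p_imag):
    """torch.atan2 implements the two-argument arctangent, including the
    quadrant correction that the Sgn* terms express above."""
    return torch.atan2(p_imag, p_real)                # wrapped phase in [-pi, pi]

def resynthesize(x_mag, x_phase, length):
    """ISTFT of the enhanced magnitude/phase pair back to a waveform."""
    spec = x_mag * torch.exp(1j * x_phase)            # complex STFT, (batch, F, T)
    return torch.istft(spec, n_fft=400, hop_length=100, win_length=400,
                       window=torch.hann_window(400), length=length)
```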

2. Loss Functions and Training Objectives

MP-SENet combines losses at four spectral and one temporal level, balancing magnitude and phase fidelity as well as perceptual metrics:

| Loss | Functional Form (target $X$ vs. estimate $\hat{X}$) | Purpose/Domain |
|---|---|---|
| $\mathcal{L}_{\mathrm{Time}}$ | $\mathbb{E}[\lVert x - \hat{y} \rVert_1]$ | Waveform (time) |
| $\mathcal{L}_{\mathrm{Mag}}$ | $\mathbb{E}[\lVert X_m - \hat{X}_m \rVert_2^2]$ | Magnitude (STFT) |
| $\mathcal{L}_{\mathrm{Com}}$ | $\mathbb{E}[\lVert X_r - \hat{X}_r \rVert_2^2] + \mathbb{E}[\lVert X_i - \hat{X}_i \rVert_2^2]$ | Complex spectrum |
| $\mathcal{L}_{\mathrm{Pha}}$ | $\mathcal{L}_{\mathrm{IP}} + \mathcal{L}_{\mathrm{GD}} + \mathcal{L}_{\mathrm{IAF}}$ | Phase (anti-wrapping) |
| $\mathcal{L}_{\mathrm{Metric}}$ | $\mathbb{E}[\lVert D(X_m, \hat{X}_m) - 1 \rVert_2^2]$ | Adversarial (PESQ proxy) |

where the phase losses use the anti-wrapping operator

$f_{\mathrm{AW}}(t) = \left| t - 2\pi \cdot \mathrm{round}(t / 2\pi) \right|$

which maps phase differences into $[0, \pi]$ (a sketch of the phase losses follows).
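
A sketch of the anti-wrapping phase losses, assuming $(\text{batch}, F, T)$ tensors with frequency on dim 1 and time on dim 2; the IP/GD/IAF decomposition follows the table above, while the L1 aggregation and equal weighting are assumptions:

```python
import torch

def anti_wrap(x: torch.Tensor) -> torch.Tensor:
    """f_AW(t) = |t - 2*pi*round(t / 2*pi)|, mapping errors into [0, pi]."""
    two_pi = 2 * torch.pi
    return torch.abs(x - two_pi * torch.round(x / two_pi))

def phase_loss(phase: torch.Tensor, phase_hat: torch.Tensor) -> torch.Tensor:
    """Instantaneous-phase (IP), group-delay (GD), and instantaneous-
    angular-frequency (IAF) terms, each anti-wrapped before averaging."""
    ip = anti_wrap(phase - phase_hat).mean()
    gd = anti_wrap(torch.diff(phase, dim=1) - torch.diff(phase_hat, dim=1)).mean()
    iaf = anti_wrap(torch.diff(phase, dim=2) - torch.diff(phase_hat, dim=2)).mean()
    return ip + gd + iaf
```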

The total generator loss is:

$\mathcal{L}_G = \gamma_1\,\mathcal{L}_{\mathrm{Time}} + \gamma_2\,\mathcal{L}_{\mathrm{Mag}} + \gamma_3\,\mathcal{L}_{\mathrm{Com}} + \gamma_4\,\mathcal{L}_{\mathrm{Metric}} + \gamma_5\,\mathcal{L}_{\mathrm{Pha}}$

with the empirically set weights $\gamma_1 = 0.2$, $\gamma_2 = 0.9$, $\gamma_3 = 0.1$, $\gamma_4 = 0.05$, $\gamma_5 = 0.3$ (Lu et al., 2023); the extended version uses analogous $\lambda_i$ weights.

The adversarial metric loss leverages a discriminator $D$ (as in MetricGAN/CMGAN) that outputs a value approximating a scaled PESQ score; discriminator and generator update steps alternate. A sketch of both objectives follows.
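
The following sketch illustrates MetricGAN-style objectives under the assumption that the discriminator scores a (reference, estimate) magnitude pair on a $[0, 1]$ scale; the signature of `disc` and the name `pesq_scaled` are hypothetical:

```python
import torch

def metric_gan_losses(disc, x_mag, x_mag_hat, pesq_scaled):
    """MetricGAN-style losses.

    Generator side: push D(clean, estimate) toward 1, the ideal score.
    Discriminator side: match D to the true scaled PESQ of the estimate,
    and pin D(clean, clean) to 1.
    """
    d_fake = disc(x_mag, x_mag_hat)
    loss_metric = ((d_fake - 1.0) ** 2).mean()        # generator term
    d_real = disc(x_mag, x_mag)
    loss_disc = ((d_real - 1.0) ** 2).mean() + \
                ((disc(x_mag, x_mag_hat.detach()) - pesq_scaled) ** 2).mean()
    return loss_metric, loss_disc
```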

3. Training Protocols

Key training configurations for reproducibility:

  • Data: VoiceBank+DEMAND (11,572 training utterances from 28 speakers; 824 test utterances from 2 speakers; resampled to 16 kHz). Supplementary: DNS Challenge corpus, REVERB Challenge, and VCTK (for bandwidth extension scenarios).
  • STFT Front-End: FFT size 400, window 400, hop 100, yielding 201 frequency bins.
  • Optimizer: AdamW ($\beta_1 = 0.8$, $\beta_2 = 0.99$, weight decay $10^{-2}$).
  • Learning Rate: initial $5 \times 10^{-4}$, with exponential decay or halving every 30 epochs; training runs for 100 epochs or 500k steps.
  • Batch Size: 4–8.
  • Model Size: Approximately 2.26M parameters.
  • Magnitude Compressor: $c = 0.3$ (power-law).
  • LSigmoid Parameters: $\beta = 2.0$, $\alpha$ trainable per frequency bin.

Training jointly minimizes $\mathcal{L}_G$ and the discriminator loss; a minimal optimizer setup under the configuration above follows.
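
In this sketch, placeholder modules stand in for the actual generator and discriminator, and a StepLR schedule encodes the "halved every 30 epochs" variant:

```python
import torch
import torch.nn as nn

# Placeholder modules; the real generator/discriminator are as described above.
generator = nn.Linear(4, 4)
discriminator = nn.Linear(4, 1)

gen_opt = torch.optim.AdamW(generator.parameters(), lr=5e-4,
                            betas=(0.8, 0.99), weight_decay=1e-2)
disc_opt = torch.optim.AdamW(discriminator.parameters(), lr=5e-4,
                             betas=(0.8, 0.99), weight_decay=1e-2)

# Halve the learning rate every 30 epochs.
gen_sched = torch.optim.lr_scheduler.StepLR(gen_opt, step_size=30, gamma=0.5)
disc_sched = torch.optim.lr_scheduler.StepLR(disc_opt, step_size=30, gamma=0.5)
```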

4. Empirical Performance and Ablation Results

On VoiceBank+DEMAND (16 kHz, seen/unseen conditions), MP-SENet establishes state-of-the-art performance, notably:

| Method | PESQ | CSIG | CBAK | COVL | SSNR | STOI |
|---|---|---|---|---|---|---|
| Noisy | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 0.92 |
| SEGAN | 2.16 | 3.48 | 2.94 | 2.80 | 7.73 | 0.93 |
| MetricGAN+ | 3.15 | 4.14 | 3.47 | 3.61 | 12.08 | 0.94 |
| DPT-FSNet | 3.33 | | | | | |
| TridentSE | 3.47 | 4.70 | 3.81 | 4.10 | | 0.96 |
| CMGAN | 3.41 | 4.63 | 3.94 | 4.12 | 11.10 | 0.96 |
| PHASEN | 2.99 | 4.21 | 3.33 | 3.61 | 11.54 | 0.96 |
| MP-SENet | 3.50 | 4.73 | 3.95 | 4.22 | 10.64 | 0.96 |

MP-SENet achieves the highest PESQ (3.50) and MOS proxies (CSIG, COVL), reflecting improved perceptual speech quality due to explicit and parallel magnitude–phase denoising.

Ablation experiments highlight the architectural and training design choices:

  • Removing magnitude compression: PESQ drops to 2.97.
  • Replacing LSigmoid with PReLU: PESQ drops to 3.40.
  • Omitting the dedicated phase decoder or the explicit phase loss: PESQ drops to 3.31 and 3.39, respectively.
  • Disabling the complex-spectrum loss or adversarial training: PESQ drops to 3.44 and 3.39, respectively.

Explicit parallel phase modeling and anti-wrapping losses yield measurable quality improvements, outperforming models that use phase conditioning or implicit complex masking.

Further validation on DNS Challenge (3,000 hours) and REVERB/VCTK demonstrates transferability: denoising PESQ up to 3.62, dereverberation SRMR up to 6.67, bandwidth extension WB-PESQ up to 4.28 (Lu et al., 2023).

5. The Role of Explicit Phase Modeling

MP-SENet's principal innovation is parallel, explicit estimation of magnitude and wrapped phase, mitigating the classic magnitude–phase compensation effect. The architecture delivers phase estimates via direct prediction and anti-wrapping losses, rather than relying on magnitude-only or complex-valued masking, yielding lower phase distortion (as measured by group delay and related phase metrics).

Empirical ablation supports this approach:

  • "Magnitude only" (no phase decoder): increased phase distortion, reduced PESQ.
  • "Complex only": does not match the perceptual improvements of explicit decoders.
  • "w/o phase loss": increased phase error (by group delay/IAF), reduced quality.

This suggests that parallel treatment of magnitude and phase, along with multi-level objectives, allows for finer control of both perceptual and instrumental enhancement scores, and mitigates compensation artifacts typical of magnitude-only systems.

6. Multi-Task and Task-Transfer Enhancement

The MP-SENet design natively accommodates multiple speech enhancement objectives through decoder and loss reconfiguration:

  • Denoising and dereverberation: use the learnable sigmoid for bounded mask estimation.
  • Bandwidth extension: swap in a PReLU activation for unbounded mask support (see the sketch after this list).
  • Multi-level loss composition and flexible downstream targets let the same architecture outperform specialized models across diverse benchmarks without architectural changes, as validated on VoiceBank+DEMAND, REVERB, and VCTK.
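
A sketch of the task-dependent mask activation described above; the task labels and the `LearnableSigmoid` implementation are illustrative:

```python
import torch
import torch.nn as nn

class LearnableSigmoid(nn.Module):
    """LSigmoid(t) = beta / (1 + exp(-alpha * t)) with trainable per-bin alpha."""
    def __init__(self, n_freq: int, beta: float = 2.0):
        super().__init__()
        self.beta = beta
        self.alpha = nn.Parameter(torch.ones(n_freq))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., n_freq)
        return self.beta * torch.sigmoid(self.alpha * x)

def mask_activation(task: str, n_freq: int = 201) -> nn.Module:
    """Bounded mask for denoising/dereverberation; unbounded for bandwidth
    extension. Task labels here are hypothetical."""
    if task in ("denoise", "dereverb"):
        return LearnableSigmoid(n_freq)    # mask range (0, 2)
    return nn.PReLU(n_freq)                # unbounded mask support
```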

A plausible implication is that parallel encoder–decoder frameworks with versatile loss aggregation may serve as a robust universal backbone for speech restoration tasks.

7. Context within Neural Speech Enhancement

MP-SENet is distinct from magnitude-centric approaches (e.g., MetricGAN+) and complex-spectrum approaches (e.g., DCCRN, CMGAN, TridentSE) in targeting the phase denoising bottleneck through explicit parallel pathways and anti-wrapping phase objectives.

The compact model size ($\approx$2.26M parameters) and the ability to train via metric discriminators (i.e., PESQ-approximating feedback) align it with large-scale, adversarially optimized architectures, while achieving state-of-the-art instrumental and perceptual metrics across major speech enhancement benchmarks.

The design encourages further exploration of explicit phase processing and unified magnitude–phase models, and it scales favorably thanks to its moderate parameter count and TF-Transformer modularity. Explicit parallel estimation emerges as a principled alternative to implicit or sequential phase handling, supported by both objective and subjective evaluations.

Summary Table: MP-SENet Core Elements

| Component | Description | Key Value/Formula |
|---|---|---|
| Magnitude compressor | Power-law, to regularize dynamic range | $c = 0.3$ |
| LSigmoid activation | $\beta / (1 + \exp(-\alpha t))$ ($\beta = 2.0$, $\alpha$ trainable) | Mask range $(0, 2)$ |
| Phase decoder | 2D conv → (real, imag) → arctan2 with wrap handling | $\hat{X}_p$ formula above |
| Multi-level loss | Time, magnitude, complex, anti-wrapped phase, adversarial | Per equations above |
| TF-Transformer stack | Four blocks; long/short-range TF context | Multi-head attention, depthwise conv |
| Discriminator | Predicts metric surrogate (e.g., PESQ) | MetricGAN-style loss |
| Best denoising PESQ | 3.50 (VB+DEMAND), 3.62 (DNS) | Best among cited methods |

In sum, MP-SENet leverages parallel, explicit magnitude and phase processing with multi-level loss optimization, achieving leading performance in denoising, dereverberation, and bandwidth extension, and demonstrating the efficacy of explicit phase modeling in modern neural speech enhancement (Lu et al., 2023).
