MP-SENet: Dual-Domain Parallel Architectures
- MP-SENet names two unrelated architectures: a transformer-based parallel magnitude-phase speech enhancement network in the audio domain, and a multi-branch squeeze-excitation residual network (SENetV2) for visual recognition.
- The speech module uses time-frequency transformers, parallel decoders, and GR-KAN layers to explicitly estimate magnitude and phase, achieving superior PESQ scores.
- The vision module (SENetV2) employs aggregated multi-branch MLPs within residual blocks to enhance global-channel feature recalibration and boost image recognition accuracy.
MP-SENet denotes two distinct architectures in the research literature: (1) a transformer-based parallel magnitude and phase speech enhancement network in the audio domain (Lu et al., 2023, Lu et al., 2023, Li et al., 23 Dec 2024), and (2) a multi-branch squeeze-excitation residual network ("SENetV2") for visual recognition (Narayanan, 2023). Both share the "MP-SENet" designation and the unifying theme of modular parallelism for enhanced representational power, but are unrelated in architectural detail and application domain.
1. Parallel Magnitude-Phase Speech Enhancement Network
MP-SENet for speech enhancement is a time-frequency (TF) domain architecture designed to explicitly estimate and denoise both magnitude and wrapped phase spectra in parallel, addressing the magnitude-phase compensation effect observed in earlier complex-valued or mask-only denoising approaches (Lu et al., 2023, Lu et al., 2023).
Key Components
- Input Representation: Given a noisy waveform y, the STFT yields the magnitude spectrum |Y| and the wrapped phase spectrum ∠Y. After power-law compression of the magnitude, the two features are stacked as a two-channel tensor.
- Encoder: Comprises 2D convolutions and dilated DenseNet blocks, expanding the channel dimension and downsampling the frequency axis (the downsampling factor is set per configuration).
- TF-Transformer Bridge: Stacked "Time-Frequency Transformer" blocks that alternate between temporal self-attention (along the time axis) and frequency self-attention (along the frequency axis), each incorporating bi-directional GRU and position-wise linear layers. These blocks model long-range and local dependencies in both axes.
- Parallel Decoders: Two symmetric decoder branches:
- Magnitude Mask Decoder: Predicts a mask applied to the compressed noisy magnitude, utilizing a learnable sigmoid for mask activation.
- Phase Decoder: Estimates the wrapped phase via two parallel convolutional branches for pseudo-real and pseudo-imaginary channels, combined through a two-argument arctangent (atan2) so the phase is predicted directly in its wrapped form, avoiding unwrapping instabilities.
- Losses: Multi-level objectives including:
- Power-law compressed magnitude MSE
- Phase losses with anti-wrapping, group delay, and instantaneous frequency
- Complex STFT loss
- STFT consistency loss
- Adversarial metric loss via a MetricGAN-like discriminator predicting PESQ.
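The parallel magnitude/phase reconstruction described above can be sketched in a few lines of NumPy; the compression exponent, sigmoid scale, and tensor shapes below are illustrative assumptions rather than the papers' exact settings:

```python
import numpy as np

def enhance_spectrum(noisy_mag, mask_logits, pseudo_real, pseudo_imag,
                     compress=0.3, beta=2.0):
    """Sketch of MP-SENet-style parallel magnitude/phase reconstruction.
    `compress` (power-law exponent) and `beta` (sigmoid scale) are
    illustrative assumptions, not the published hyperparameters."""
    # Power-law compress the noisy magnitude, as at the network input.
    mag_c = noisy_mag ** compress
    # Magnitude mask: scaled sigmoid, bounded in (0, beta); in the paper
    # the sigmoid shape itself is learnable.
    mask = beta / (1.0 + np.exp(-mask_logits))
    est_mag_c = mag_c * mask
    # Undo the compression to return to the linear magnitude domain.
    est_mag = est_mag_c ** (1.0 / compress)
    # Phase decoder: two-argument arctangent of the pseudo-imag/pseudo-real
    # branches yields a wrapped phase in (-pi, pi].
    est_phase = np.arctan2(pseudo_imag, pseudo_real)
    # Recombine into a complex STFT; an iSTFT would give the waveform.
    return est_mag * np.exp(1j * est_phase)

# Toy example on a single (freq, time) spectrum.
F, T = 201, 50
rng = np.random.default_rng(0)
spec = enhance_spectrum(rng.random((F, T)) + 0.1,
                        rng.standard_normal((F, T)),
                        rng.standard_normal((F, T)),
                        rng.standard_normal((F, T)))
```

An iSTFT of the returned complex spectrum would yield the enhanced waveform.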
Performance and Empirical Analysis
On VoiceBank+DEMAND, MP-SENet achieves WB-PESQ 3.60 with 2.26M parameters, outperforming CMGAN (3.41) and PHASEN (2.99) while maintaining competitive scores on CSIG, CBAK, COVL, and STOI (Lu et al., 2023). Results extend to multi-condition datasets (DNS, REVERB, VCTK) and tasks including dereverberation and bandwidth extension, demonstrating robust generalization. Qualitative spectrograms reveal superior preservation of harmonic structure and high-frequency details.
Ablation studies confirm that explicit phase decoding and comprehensive phase loss terms provide significant objective and perceptual gains over magnitude-only or naive phase approaches. Power-law compression and learnable sigmoidal masking are also identified as critical design choices (Lu et al., 2023).
2. Aggregated Squeeze-Excitation Residual Network (SENetV2)
In the context of visual recognition, MP-SENet (also referred to as SENetV2) augments the classical squeeze-excitation (SE) module by replacing the per-channel two-layer MLP with a multi-branch (aggregated) MLP (Narayanan, 2023). This design allows the network to gather richer global and channel-wise representations, improving the recalibration of feature maps in residual blocks.
Architectural Description
- Residual Block: Follows the canonical ResNet bottleneck structure with down/expansion, conv, and SE module.
- SE Recap: After global average pooling produces a channel descriptor z ∈ R^C, a two-layer MLP computes the channel-wise gate s = σ(W2 δ(W1 z)) (Eqn (1)), where δ is ReLU, σ is the sigmoid, and r is the reduction ratio (W1 ∈ R^(C/r)×C, W2 ∈ R^C×(C/r)).
- Multi-Branch Aggregated MLP: SENetV2 introduces K parallel branches, each a two-layer MLP s_k = W2^(k) δ(W1^(k) z).
Aggregated by summation before the gate: s = σ(Σ_k s_k).
The output s rescales the feature maps channel-wise, as in SE.
- Hyperparameters: The branch count and reduction ratio are typically fixed and shared across all stages.
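A minimal NumPy sketch of the aggregated recalibration, assuming summation-based aggregation with the sigmoid applied after the sum; the branch count, reduction ratio, and weight initialization are illustrative:

```python
import numpy as np

def multi_branch_se(x, branch_weights):
    """Sketch of SENetV2-style aggregated SE recalibration.
    `x`: feature map (C, H, W); `branch_weights`: list of (W1, W2)
    pairs, one two-layer MLP per branch (shapes (C/r, C) and (C, C/r))."""
    C = x.shape[0]
    z = x.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    s = np.zeros(C)
    for W1, W2 in branch_weights:           # one two-layer MLP per branch
        s += W2 @ np.maximum(W1 @ z, 0.0)   # ReLU between the two layers
    gate = 1.0 / (1.0 + np.exp(-s))         # sigmoid gating after summation
    return x * gate[:, None, None]          # excite: rescale each channel

# Toy example: C = 64 channels, reduction r = 16, K = 4 branches.
C, r, K = 64, 16, 4
rng = np.random.default_rng(1)
branch_weights = [(rng.standard_normal((C // r, C)) * 0.1,
                   rng.standard_normal((C, C // r)) * 0.1) for _ in range(K)]
y = multi_branch_se(rng.standard_normal((C, 8, 8)), branch_weights)
```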
Computational and Empirical Profile
Parameter increase per SE block is linear in the number of branches; SENetV2-50 carries a modest overhead over SE-ResNet-50 (28.67M vs. 24.90M parameters). On CIFAR-10/100 and Tiny-ImageNet, SENetV2 consistently yields 0.8–1.3% absolute top-1 accuracy gains over SE-ResNet at comparable model sizes.
Visualization of early kernels indicates richer filter diversity and higher-contrast patterns, consistent with improved global-channel feature synthesis. No significant benefit is observed beyond a moderate branch count ("diminishing returns", as noted in the paper).
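The linear parameter growth can be checked with a one-line count; `C = 256` and `r = 16` are illustrative values, not the paper's exact configuration:

```python
def se_added_params(C, r, branches):
    """Added fully connected parameters of an aggregated SE block
    (biases ignored): each branch is a C -> C/r -> C two-layer MLP,
    so the overhead grows linearly in `branches`."""
    return branches * 2 * C * (C // r)

single = se_added_params(256, 16, 1)  # classic single-branch SE block
multi = se_added_params(256, 16, 4)   # four aggregated branches
```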
3. Incorporation of Kolmogorov-Arnold Networks in MP-SENet for Speech
MP-SENet's TF-Transformer blocks and decoders have also been enhanced with group-rational Kolmogorov-Arnold Network (GR-KAN) modules (Li et al., 23 Dec 2024). GR-KAN replaces standard feed-forward sublayers:
- GR-KAN Layer: For an input feature vector, the layer partitions the channels into groups, applies a learnable rational function F(x) = P(x)/Q(x) per group, and follows with a learned linear mixing of the group outputs.
- Integration:
- In each TF-Transformer, the linear + LeakyReLU feed-forward sublayer is replaced with two stacked GR-KAN layers plus linear projections.
- Decoder upsampling blocks substitute PReLU+Conv2D with two branches: RBF-KAN+Conv2D and Swish+Conv2D, summed.
- Empirical Effect:
- On VoiceBank-DEMAND, adding GR-KAN raises PESQ from 3.565 (GELU baseline) to 3.61 at ~8% parameter and FLOPs overhead.
- Qualitative improvement: sharper harmonics, more uniform noise suppression, reduced phase artifacts.
- Ablations confirm that RBF-KAN in decoders and GR-KAN in TF blocks yield consistently higher PESQ relative to conventional activations (Li et al., 23 Dec 2024).
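A rough NumPy sketch of a group-rational layer, assuming a safe Padé form P(x)/(1 + |x·Q(x)|) for the denominator and a final linear mixing; the group count and polynomial degrees are illustrative assumptions:

```python
import numpy as np

def gr_kan_layer(x, group_coeffs, W):
    """Sketch of a group-rational KAN layer. Channels are split into
    len(group_coeffs) groups; each group gets its own rational activation
    (numerator/denominator polynomial coefficients p, q), followed by a
    learned linear mixing matrix W."""
    d = x.shape[-1]
    g = len(group_coeffs)
    out = np.empty_like(x)
    for i, (p, q) in enumerate(group_coeffs):
        sl = slice(i * d // g, (i + 1) * d // g)
        xs = x[..., sl]
        num = np.polyval(p, xs)                      # P(x), elementwise
        den = 1.0 + np.abs(xs * np.polyval(q, xs))   # safe denominator >= 1
        out[..., sl] = num / den
    return out @ W.T                                 # weighted channel mixing

# Toy example: d = 8 channels in g = 4 groups, batch of 5 vectors.
d, g = 8, 4
rng = np.random.default_rng(2)
group_coeffs = [(rng.standard_normal(4) * 0.5, rng.standard_normal(2) * 0.5)
                for _ in range(g)]
y = gr_kan_layer(rng.standard_normal((5, d)), group_coeffs,
                 rng.standard_normal((d, d)))
```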
4. Training Procedures and Objective Functions
MP-SENet is trained using a composite objective that includes:
- Spectral L1/L2 losses on compressed magnitude, anti-wrapping phase, and complex spectrum
- STFT consistency loss (the difference between the predicted spectrum and the STFT of its own inverse-STFT resynthesis)
- GAN-style adversarial loss where a metric discriminator predicts a perceptual quality index (e.g., PESQ)
- For visual MP-SENet (SENetV2), independent, conventional classification protocols are used (Adam/SGD, standard augmentations, tuned batch size, etc.).
Key audio hyperparameters: 16 kHz sampling; STFT with a hop of 100 samples (window/FFT size per configuration); batch size 4 or as tuned; AdamW optimizer; 100–200 epochs; with empirically tuned weights for each loss term (Lu et al., 2023, Li et al., 23 Dec 2024).
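The STFT consistency term can be sketched with SciPy's `stft`/`istft`; the 100-sample hop matches the text, while the 400-sample Hann window is an assumption for illustration:

```python
import numpy as np
from scipy.signal import stft, istft

def stft_consistency_loss(spec, fs=16000, nperseg=400, noverlap=300):
    """L1 distance between a predicted complex spectrum and the STFT of
    its own iSTFT resynthesis (hop = nperseg - noverlap = 100 samples).
    Inconsistent spectra (no generating waveform) incur a penalty."""
    _, wav = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, respec = stft(wav, fs=fs, nperseg=nperseg, noverlap=noverlap)
    T = min(spec.shape[-1], respec.shape[-1])
    return np.mean(np.abs(spec[..., :T] - respec[..., :T]))

# A spectrum obtained from a real waveform is (near-)consistent...
rng = np.random.default_rng(3)
x = rng.standard_normal(16000)
_, _, S = stft(x, fs=16000, nperseg=400, noverlap=300)
loss_consistent = stft_consistency_loss(S)
# ...while a randomly perturbed complex "spectrum" is not.
noise = 0.5 * (rng.standard_normal(S.shape) + 1j * rng.standard_normal(S.shape))
loss_random = stft_consistency_loss(S + noise)
```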
5. Comparison with Related Architectures
MP-SENet is evaluated against both time-domain (SEGAN, DEMUCS) and TF-domain (MetricGAN+, PHASEN, CMGAN, TridentSE) speech enhancement models. The explicit parallel phase estimation, combined with the TF-Transformer backbone, produces higher PESQ and subjective quality than complex masking and magnitude-only approaches, particularly under low-SNR, dereverberation, and bandwidth extension settings (Lu et al., 2023). In the vision domain, aggregated MLP SE modules improve over original SE and ResNeXt-SE on both small- and large-scale classification (Narayanan, 2023).
A plausible implication is that modular parallelism, either at the spectrum-branch or channel-branch level, is a general strategy for enhancing capacity in both audio and image domains.
6. Ablations and Empirical Insights
Ablations on both variants of MP-SENet uniformly indicate that:
- Phase branch and anti-wrapping loss are essential for best perceptual metrics
- Learnable sigmoidal mask and power-law compression are crucial for accurate magnitude recovery
- Aggregated MLP branches beyond a moderate count give marginal returns in SENetV2
- In GR-KAN-augmented speech MP-SENet, RBF-KAN outperforms conventional nonlinearities in decoder upsampling
Representative results from (Narayanan, 2023) and the speech enhancement literature:
| Model | CIFAR-100 Top-1 | Params |
|---|---|---|
| ResNet-50 | 61.72 | 23.62M |
| SE-ResNet-50 | 62.26 | 24.90M |
| SENetV2-50 | 63.52 | 28.67M |

| Model | VoiceBank+DEMAND PESQ | Params |
|---|---|---|
| PHASEN | 2.99 | – |
| CMGAN | 3.41 | 1.83M |
| MP-SENet | 3.50–3.61 | 2.05–2.49M |
7. Applications and Future Directions
MP-SENet (speech) serves as a unified backend for speech denoising, dereverberation, and bandwidth extension, with implications in robust ASR frontends, hearing aids, and communication codecs. Its modular TF encoder-decoder and explicit phase branch are likely to inspire further architectures that decouple spectral components.
MP-SENet (vision/SENetV2) validates multi-branch global-local recalibration in residual networks, indicating broader applicability in deep CNN compression, efficient attention, and possibly hybrid vision transformers.
This suggests that the "MP-SENet" principle—explicit multifactor parallelism within modular blocks—constitutes a practical advance in both signal enhancement and visual representation domains, subject to continued ablation-driven refinement and domain-specific tailoring (Lu et al., 2023, Lu et al., 2023, Li et al., 23 Dec 2024, Narayanan, 2023).