MP-SENet: Dual-Domain Parallel Architectures

Updated 22 November 2025
  • "MP-SENet" denotes two distinct architectures: a transformer-based parallel magnitude-phase speech enhancement network in the audio domain, and a multi-branch squeeze-excitation residual network (SENetV2) in the vision domain.
  • The speech module uses time-frequency transformers, parallel decoders, and GR-KAN layers to explicitly estimate magnitude and phase, achieving superior PESQ scores.
  • The vision module (SENetV2) employs aggregated multi-branch MLPs within residual blocks to enhance global-channel feature recalibration and boost image recognition accuracy.

MP-SENet denotes two distinct architectures in the research literature: (1) a transformer-based parallel magnitude and phase speech enhancement network in the audio domain (Lu et al., 2023, Lu et al., 2023, Li et al., 23 Dec 2024), and (2) a multi-branch squeeze-excitation residual network ("SENetV2") for visual recognition (Narayanan, 2023). Both share the "MP-SENet" designation and the unifying theme of modular parallelism for enhanced representational power, but are unrelated in architectural detail and application domain.

1. Parallel Magnitude-Phase Speech Enhancement Network

MP-SENet for speech enhancement is a time-frequency (TF) domain architecture designed to explicitly estimate and denoise both magnitude and wrapped phase spectra in parallel, addressing the magnitude-phase compensation effect observed in earlier complex-valued or mask-only denoising approaches (Lu et al., 2023, Lu et al., 2023).

Key Components

  • Input Representation: Given a noisy waveform $y(t)$, the STFT $\mathbf{Y}(t,f) = Y_m(t,f)\,e^{j Y_p(t,f)}$ yields magnitude $Y_m$ and wrapped phase $Y_p$. After power-law compression, the features $[Y_m^c, Y_p]$ are stacked as a two-channel $T \times F \times 2$ tensor.
  • Encoder: Comprises 2D convolutions and dilated DenseNet blocks, expanding channels to $C$ and downsampling the frequency axis ($F \to F/2$, or as configured).
  • TF-Transformer Bridge: Stacked "Time-Frequency Transformer" blocks that alternate temporal self-attention (on $(BF', T, C)$-shaped tensors) and frequency self-attention (on $(BT, F', C)$-shaped tensors), each incorporating a bi-directional GRU and position-wise linear layers. These blocks model long-range and local dependencies along both axes.
  • Parallel Decoders: Two symmetric decoder branches:
    • Magnitude Mask Decoder: Predicts a mask $M(t,f) \in [0,2]$ applied to the compressed noisy magnitude, using a learnable sigmoid for mask activation.
    • Phase Decoder: Estimates the wrapped phase via two parallel convolutional branches producing pseudo-real and pseudo-imaginary components, combined with $\mathrm{atan2}$ for a numerically stable wrapped-phase estimate.
  • Losses: Multi-level objectives including:
    • Power-law compressed magnitude MSE
    • Phase losses with anti-wrapping, group delay, and instantaneous frequency
    • Complex STFT $\ell_2$ loss
    • STFT consistency loss
    • Adversarial metric loss via a MetricGAN-like discriminator predicting PESQ.
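The parallel decoding step can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the learnable sigmoid form $\beta/(1 + e^{-\alpha x})$ with $\beta = 2$ and the compression exponent $c = 0.3$ are assumptions chosen for illustration.

```python
import numpy as np

def learnable_sigmoid(x, alpha, beta=2.0):
    """Sigmoid with a trainable slope alpha, scaled to the range (0, beta)."""
    return beta / (1.0 + np.exp(-alpha * x))

def parallel_decode(mag_logits, pseudo_real, pseudo_imag, noisy_mag_c, alpha, c=0.3):
    """Combine the two decoder branches into an enhanced complex spectrum."""
    # Magnitude branch: bounded mask in (0, 2) applied to the compressed noisy magnitude.
    mask = learnable_sigmoid(mag_logits, alpha)            # shape (T, F)
    enhanced_mag_c = mask * noisy_mag_c
    # Phase branch: atan2 of pseudo-imag over pseudo-real gives wrapped phase in (-pi, pi].
    enhanced_phase = np.arctan2(pseudo_imag, pseudo_real)
    # Undo the power-law compression and recombine into a complex spectrum.
    enhanced_mag = enhanced_mag_c ** (1.0 / c)
    return enhanced_mag * np.exp(1j * enhanced_phase)
```

Because magnitude and phase are produced by separate branches and only recombined at the end, neither branch has to compensate for errors in the other, which is the motivation for the parallel design.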

Performance and Empirical Analysis

On VoiceBank+DEMAND, MP-SENet achieves WB-PESQ 3.60 with 2.26M parameters, outperforming CMGAN (3.41) and PHASEN (2.99) while maintaining competitive scores on CSIG, CBAK, COVL, and STOI (Lu et al., 2023). Results extend to multi-condition datasets (DNS, REVERB, VCTK) and tasks including dereverberation and bandwidth extension, demonstrating robust generalization. Qualitative spectrograms reveal superior preservation of harmonic structure and high-frequency details.

Ablation studies confirm that explicit phase decoding and comprehensive phase loss terms provide significant objective and perceptual gains over magnitude-only or naive phase approaches. Power-law compression and learnable sigmoidal masking are also identified as critical design choices (Lu et al., 2023).

2. Aggregated Squeeze-Excitation Residual Network (SENetV2)

In the context of visual recognition, MP-SENet (also referred to as SENetV2) augments the classical squeeze-excitation (SE) module by replacing the per-channel two-layer MLP with a multi-branch (aggregated) MLP (Narayanan, 2023). This design allows the network to gather richer global and channel-wise representations, improving the recalibration of feature maps in residual blocks.

Architectural Description

  • Residual Block: Follows the canonical ResNet bottleneck structure with 1×11\times1 down/expansion, 3×33\times3 conv, and SE module.
  • SE Recap: After global average pooling produces $z$, an MLP computes channel-wise gating as $s = \sigma(W_2\,\delta(W_1 z))$ (Eqn (1)), with $r$ the reduction ratio.
  • Multi-Branch Aggregated MLP: SENetV2 introduces $k$ branches, each a two-layer MLP:

$$u_i = W_2^{(i)}\,\delta(W_1^{(i)} z), \qquad i = 1, \dots, k$$

aggregated by summation:

$$s = \sigma\left(\sum_{i=1}^{k} u_i\right)$$

The output $s$ rescales the feature maps as in SE.

  • Hyperparameters: Typically $r = 32$ and $k = 4$ for all stages.
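The aggregated gating above is easy to express directly. The following is a toy NumPy sketch of the multi-branch SE gate (biases omitted, ReLU as $\delta$); layer shapes and the random toy inputs are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregated_se_gate(z, W1_list, W2_list):
    """Multi-branch SE gating: s = sigma(sum_i W2_i * relu(W1_i * z))."""
    u = sum(W2 @ np.maximum(W1 @ z, 0.0) for W1, W2 in zip(W1_list, W2_list))
    return sigmoid(u)

# Toy usage: C = 8 channels, reduction r = 4, k = 2 branches.
rng = np.random.default_rng(0)
C, r, k = 8, 4, 2
z = rng.normal(size=C)                              # globally pooled channel descriptor
W1s = [rng.normal(size=(C // r, C)) for _ in range(k)]
W2s = [rng.normal(size=(C, C // r)) for _ in range(k)]
s = aggregated_se_gate(z, W1s, W2s)                 # per-channel gate in (0, 1)
```

With $k = 1$ this reduces exactly to the original SE gate, so the aggregated module is a strict generalization.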

Computational and Empirical Profile

Parameter increase per SE block is linear in $k$; for $k = 4$, this is $4\times$ the parameters the original SE block adds (e.g., SENetV2-50: 28.67M vs. SE-ResNet-50: 24.90M). On CIFAR-10/100 and Tiny-ImageNet, SENetV2 consistently yields 0.8–1.3% absolute top-1 accuracy gains over SE-ResNet at comparable model sizes.
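The linear-in-$k$ parameter count follows directly from the bias-free two-layer MLP shape $C \to C/r \to C$; a quick arithmetic sketch (channel width $C = 256$ is an illustrative choice):

```python
def se_params(C, r):
    # Two-layer bias-free MLP: C -> C/r -> C.
    return C * (C // r) + (C // r) * C

def aggregated_se_params(C, r, k):
    # k independent branches, aggregated by summation (no extra parameters).
    return k * se_params(C, r)

# With r = 32 and k = 4, the aggregated module adds exactly 4x the
# single-branch SE parameters per block.
assert aggregated_se_params(256, 32, 4) == 4 * se_params(256, 32)
```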

Visualization of early kernels indicates richer filter diversity and higher-contrast patterns, consistent with improved global-channel feature synthesis. No significant benefit is observed for $k > 4$ ("diminishing returns," as noted in the paper).

3. Incorporation of Kolmogorov-Arnold Networks in MP-SENet for Speech

MP-SENet's TF-Transformer blocks and decoders have also been enhanced with group-rational Kolmogorov-Arnold Network (GR-KAN) modules (Li et al., 23 Dec 2024). GR-KAN replaces standard feed-forward sublayers:

  • GR-KAN Layer: For input $x \in \mathbb{R}^I$, it partitions the channels into $k$ groups, applying a learnable rational function $\varphi_g(\cdot)$ per group, followed by weighted mixing:

$$L(x)_j = \sum_{i=1}^{I} w_{i,j}\,\varphi_{\lfloor (i-1)/(I/k) \rfloor}(x_i)$$

  • Integration:
    • In each TF-Transformer block, the "linear $\to$ LeakyReLU" sublayers are replaced with two stacked GR-KANs plus linear projections.
    • Decoder upsampling blocks substitute PReLU+Conv2D with two branches: RBF-KAN+Conv2D and Swish+Conv2D, summed.
  • Empirical Effect:
    • On VoiceBank-DEMAND, adding GR-KAN raises PESQ from 3.565 (GELU baseline) to 3.61 at ~8% parameter and FLOPs overhead.
    • Qualitative improvement: sharper harmonics, more uniform noise suppression, reduced phase artifacts.
    • Ablations confirm that RBF-KAN in decoders and GR-KAN in TF blocks yield consistently higher PESQ relative to conventional activations (Li et al., 23 Dec 2024).
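A GR-KAN layer as defined above can be sketched as follows. This NumPy illustration assumes a safe Padé-style rational activation $\varphi(x) = P(x)/(1 + |Q(x)|)$, a common parameterization in the rational-activation literature; the exact form and coefficient initialization in (Li et al., 23 Dec 2024) may differ.

```python
import numpy as np

def safe_rational(x, p, q):
    """Rational activation P(x) / (1 + |Q(x)|); the denominator never vanishes."""
    num = sum(pi * x**i for i, pi in enumerate(p))
    den = 1.0 + np.abs(sum(qi * x**(i + 1) for i, qi in enumerate(q)))
    return num / den

def gr_kan_layer(x, W, p_groups, q_groups):
    """Group-rational KAN: channels split into k groups, each with its own
    rational activation, then mixed by a linear layer W of shape (O, I)."""
    I = x.shape[0]
    k = len(p_groups)
    gsize = I // k
    phi = np.empty_like(x)
    for g in range(k):
        sl = slice(g * gsize, (g + 1) * gsize)
        phi[sl] = safe_rational(x[sl], p_groups[g], q_groups[g])
    return W @ phi
```

Sharing one rational function per group (rather than one per channel, as in a full KAN) keeps the parameter and FLOPs overhead small, which is consistent with the ~8% overhead reported above.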

4. Training Procedures and Objective Functions

MP-SENet is trained using a composite objective that includes:

  • Spectral L1/L2 losses on compressed magnitude, anti-wrapping phase, and complex spectrum
  • STFT consistency loss (difference between $\hat{X}$ and $\mathrm{STFT}(\mathrm{iSTFT}(\hat{X}))$)
  • GAN-style adversarial loss where a metric discriminator predicts a perceptual quality index (e.g., PESQ)
  • For the visual MP-SENet (SENetV2), standard independent training protocols are used (Adam/SGD, standard augmentations, batch size, etc.).

Key audio hyperparameters: 16 kHz sampling; STFT with $N_{\mathrm{fft}} = 400$ and hop length 100; batch size 4 (or as tuned); AdamW optimizer; 100–200 epochs; empirically chosen weights for each loss term (Lu et al., 2023, Li et al., 23 Dec 2024).
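The anti-wrapping phase losses deserve a concrete sketch, since naive phase regression fails at the $\pm\pi$ boundary. The following NumPy illustration uses the anti-wrapping function $f_{\mathrm{AW}}(x) = |x - 2\pi\,\mathrm{round}(x/2\pi)|$ and applies it to the phase itself, its frequency-axis derivative (group delay), and its time-axis derivative (instantaneous frequency); the equal weighting of the three terms is an assumption for illustration.

```python
import numpy as np

def anti_wrap(x):
    """Map a phase difference to the magnitude of its principal value, in [0, pi]."""
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))

def phase_loss(phase_hat, phase_ref):
    """Anti-wrapped phase loss over (T, F) wrapped-phase spectrograms."""
    # Instantaneous phase term.
    ip = anti_wrap(phase_hat - phase_ref).mean()
    # Group delay term: anti-wrapped difference of frequency-axis derivatives.
    gd = anti_wrap(np.diff(phase_hat, axis=1) - np.diff(phase_ref, axis=1)).mean()
    # Instantaneous frequency term: same along the time axis.
    iaf = anti_wrap(np.diff(phase_hat, axis=0) - np.diff(phase_ref, axis=0)).mean()
    return ip + gd + iaf
```

Because $f_{\mathrm{AW}}$ is invariant to $2\pi$ shifts, a prediction that differs from the target only by wrapping incurs no penalty, which is exactly the property a wrapped-phase target requires.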

5. Comparative Evaluation

MP-SENet is evaluated against both time-domain (SEGAN, DEMUCS) and TF-domain (MetricGAN+, PHASEN, CMGAN, TridentSE) speech enhancement models. The explicit parallel phase estimation, combined with the TF-Transformer backbone, produces higher PESQ and subjective quality than complex masking and magnitude-only approaches, particularly under low-SNR, dereverberation, and bandwidth extension settings (Lu et al., 2023). In the vision domain, aggregated MLP SE modules improve over original SE and ResNeXt-SE on both small- and large-scale classification (Narayanan, 2023).

A plausible implication is that modular parallelism, either at the spectrum-branch or channel-branch level, is a general strategy for enhancing capacity in both audio and image domains.

6. Ablations and Empirical Insights

Ablations on both variants of MP-SENet uniformly indicate that:

  • Phase branch and anti-wrapping loss are essential for best perceptual metrics
  • Learnable sigmoidal mask and power-law compression are crucial for accurate magnitude recovery
  • Aggregated MLP branches beyond $k = 4$ give marginal returns in SENetV2
  • In GR-KAN-augmented speech MP-SENet, RBF-KAN outperforms conventional nonlinearities in decoder upsampling

Representative results (vision, from Narayanan, 2023):

Model          CIFAR-100 Top-1 (%)   Params
ResNet-50      61.72                 23.62M
SE-ResNet-50   62.26                 24.90M
SENetV2-50     63.52                 28.67M

Representative results (speech, VoiceBank+DEMAND):

Model      WB-PESQ     Params
PHASEN     2.99
CMGAN      3.41        1.83M
MP-SENet   3.50–3.61   2.05–2.49M

7. Applications and Future Directions

MP-SENet (speech) serves as a unified backend for speech denoising, dereverberation, and bandwidth extension, with implications in robust ASR frontends, hearing aids, and communication codecs. Its modular TF encoder-decoder and explicit phase branch are likely to inspire further architectures that decouple spectral components.

MP-SENet (vision/SENetV2) validates multi-branch global-local recalibration in residual networks, indicating broader applicability in deep CNN compression, efficient attention, and possibly hybrid vision transformers.

This suggests that the "MP-SENet" principle—explicit multifactor parallelism within modular blocks—constitutes a practical advance in both signal enhancement and visual representation domains, subject to continued ablation-driven refinement and domain-specific tailoring (Lu et al., 2023, Lu et al., 2023, Li et al., 23 Dec 2024, Narayanan, 2023).
