
MP-SENet: Dual-Domain Parallel Architectures

Updated 22 November 2025
  • MP-SENet names two unrelated architectures: a transformer-based parallel magnitude-phase speech enhancement network and a multi-branch squeeze-excitation residual network for vision.
  • The speech model uses time-frequency transformers and parallel decoders to estimate magnitude and phase explicitly, reaching a WB-PESQ of 3.60 on VoiceBank+DEMAND; a later variant adds GR-KAN layers.
  • The vision module (SENetV2) employs aggregated multi-branch MLPs within residual blocks to enhance global-channel feature recalibration and boost image recognition accuracy.

MP-SENet denotes two distinct architectures in the research literature: (1) a transformer-based parallel magnitude and phase speech enhancement network in the audio domain (Lu et al., 2023, Lu et al., 2023, Li et al., 2024), and (2) a multi-branch squeeze-excitation residual network ("SENetV2") for visual recognition (Narayanan, 2023). Both share the "MP-SENet" designation and the unifying theme of modular parallelism for enhanced representational power, but are unrelated in architectural detail and application domain.

1. Parallel Magnitude-Phase Speech Enhancement Network

MP-SENet for speech enhancement is a time-frequency (TF) domain architecture designed to explicitly estimate and denoise both magnitude and wrapped phase spectra in parallel, addressing the magnitude-phase compensation effect observed in earlier complex-valued or mask-only denoising approaches (Lu et al., 2023, Lu et al., 2023).

Key Components

  • Input Representation: Given a noisy waveform $y(t)$, the input STFT $\mathbf{Y}(t,f) = Y_m(t,f) \exp(j Y_p(t,f))$ yields magnitude $Y_m$ and wrapped phase $Y_p$. After power-law compression, the features $[Y_m^c, Y_p]$ are stacked as a two-channel $T \times F \times 2$ tensor.
  • Encoder: Comprises 2D convolutions and dilated DenseNet blocks, expanding channels to $C$ and downsampling frequency bins ($F \to F/2$, or as per configuration).
  • TF-Transformer Bridge: Stacked "Time-Frequency Transformer" blocks that alternate between temporal self-attention (on tensors of shape $(BF', T, C)$) and frequency self-attention (on tensors of shape $(BT, F', C)$), each incorporating a bi-directional GRU and position-wise linear layers. These blocks model long-range and local dependencies along both axes.
  • Parallel Decoders: Two symmetric decoder branches:
    • Magnitude Mask Decoder: Predicts a mask that is applied to the compressed noisy magnitude, using a learnable sigmoid as the mask activation.
    • Phase Decoder: Estimates the wrapped phase via two parallel convolutional branches that output pseudo-real and pseudo-imaginary channels, followed by a two-argument arctangent ($\operatorname{arctan2}$) that maps them to a wrapped phase value.
  • Losses: Multi-level objectives including:
    • Power-law compressed magnitude MSE
    • Phase losses with anti-wrapping, group delay, and instantaneous frequency
    • Complex STFT loss
    • STFT consistency loss
    • Adversarial metric loss via a MetricGAN-like discriminator predicting PESQ.
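
As a rough illustration of the representation the two decoders work with, the sketch below splits one complex STFT bin into compressed magnitude and wrapped phase, applies a bounded mask to the magnitude, and recovers a wrapped phase from pseudo-real and pseudo-imaginary values with a two-argument arctangent. The compression exponent `c = 0.3`, the mask scale `beta = 2.0`, and all function names are assumptions chosen for illustration, not values taken from the text above.

```python
import cmath
import math

def analyze_bin(Y, c=0.3):
    """Split a complex STFT bin into (compressed magnitude, wrapped phase)."""
    return abs(Y) ** c, cmath.phase(Y)  # phase lies in [-pi, pi]

def learnable_sigmoid(x, alpha=1.0, beta=2.0):
    """Bounded mask activation in (0, beta); alpha would be a trained parameter."""
    return beta / (1.0 + math.exp(-alpha * x))

def decode_phase(pseudo_real, pseudo_imag):
    """Map two unconstrained decoder outputs to a wrapped phase."""
    return math.atan2(pseudo_imag, pseudo_real)

def synthesize_bin(mag_c, phase, c=0.3):
    """Undo the power-law compression and rebuild the complex bin."""
    return cmath.rect(mag_c ** (1.0 / c), phase)

Y = complex(3.0, -4.0)
mag_c, phase = analyze_bin(Y)
mask = learnable_sigmoid(0.0)  # beta / 2 = 1.0, i.e. an identity mask
Y_hat = synthesize_bin(mask * mag_c, decode_phase(Y.real, Y.imag))
print(abs(Y_hat - Y) < 1e-9)  # identity mask and exact phase recover the bin
```

Because `atan2` depends only on the ratio and signs of its two arguments, the phase branch can emit unconstrained real values and still produce an angle in the principal range.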

Performance and Empirical Analysis

On VoiceBank+DEMAND, MP-SENet achieves WB-PESQ 3.60 with 2.26M parameters, outperforming CMGAN (3.41) and PHASEN (2.99) while maintaining competitive scores on CSIG, CBAK, COVL, and STOI (Lu et al., 2023). Results extend to multi-condition datasets (DNS, REVERB, VCTK) and tasks including dereverberation and bandwidth extension, demonstrating robust generalization. Qualitative spectrograms reveal superior preservation of harmonic structure and high-frequency details.

Ablation studies confirm that explicit phase decoding and comprehensive phase loss terms provide significant objective and perceptual gains over magnitude-only or naive phase approaches. Power-law compression and learnable sigmoidal masking are also identified as critical design choices (Lu et al., 2023).
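
The phase loss terms mentioned above can be sketched as follows. This is a minimal illustration, assuming the anti-wrapping function maps a phase difference to its distance from the nearest multiple of $2\pi$, and that the group-delay and instantaneous-frequency terms are first differences of the phase along frequency and time respectively; the equal weighting of the three terms and the helper names are hypothetical.

```python
import math

def anti_wrap(x):
    """Distance from x to the nearest multiple of 2*pi, in [0, pi]."""
    return abs(x - 2.0 * math.pi * round(x / (2.0 * math.pi)))

def diff_freq(phase):
    """First difference along frequency; phase is indexed [t][f]."""
    return [[row[f + 1] - row[f] for f in range(len(row) - 1)] for row in phase]

def diff_time(phase):
    """First difference along time."""
    return [[phase[t + 1][f] - phase[t][f] for f in range(len(phase[0]))]
            for t in range(len(phase) - 1)]

def mean_anti_wrapped(a, b):
    """Mean anti-wrapped difference between two equally shaped 2D lists."""
    vals = [anti_wrap(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(vals) / len(vals)

def phase_loss(est, ref):
    """Instantaneous-phase + group-delay + instantaneous-frequency terms."""
    l_ip = mean_anti_wrapped(est, ref)
    l_gd = mean_anti_wrapped(diff_freq(est), diff_freq(ref))  # sign convention cancels
    l_if = mean_anti_wrapped(diff_time(est), diff_time(ref))
    return l_ip + l_gd + l_if

ref = [[0.1, 3.0], [-3.0, 0.5]]
shifted = [[v + 2.0 * math.pi for v in row] for row in ref]  # same angles
print(phase_loss(shifted, ref) < 1e-9)
print(anti_wrap(3.1 - (-3.1)))  # small, although the raw difference is 6.2
```

The second print shows why a plain L1 or L2 distance on wrapped phase is misleading: two angles on either side of the $\pm\pi$ boundary are numerically far apart but nearly identical as phases.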

2. Aggregated Squeeze-Excitation Residual Network (SENetV2)

In the context of visual recognition, MP-SENet (also referred to as SENetV2) augments the classical squeeze-excitation (SE) module by replacing the per-channel two-layer MLP with a multi-branch (aggregated) MLP (Narayanan, 2023). This design allows the network to gather richer global and channel-wise representations, improving the recalibration of feature maps in residual blocks.

Architectural Description

  • Residual Block: Follows the canonical ResNet bottleneck structure with $1 \times 1$ reduction/expansion convolutions, a $3 \times 3$ conv, and an SE module.
  • SE Recap: After global average pooling produces a channel descriptor $z \in \mathbb{R}^{C}$, an MLP computes the channel-wise gate $s = \sigma(W_2\,\delta(W_1 z))$ (Eqn (1)), where $\delta$ is ReLU, $\sigma$ is the sigmoid, $W_1 \in \mathbb{R}^{(C/r) \times C}$, $W_2 \in \mathbb{R}^{C \times (C/r)}$, and $r$ is the reduction ratio.
  • Multi-Branch Aggregated MLP: SENetV2 introduces $K$ branches, each a two-layer MLP:

$$u_k = W_2^{(k)}\,\delta\big(W_1^{(k)} z\big), \quad k = 1, \dots, K$$

Aggregated by summation:

$$s = \sigma\Big(\sum_{k=1}^{K} u_k\Big)$$

The output $s$ rescales the feature maps as in SE.

  • Hyperparameters: The branch count and the reduction ratio are typically kept the same for all stages.
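
The recalibration path described above can be sketched in a few lines of plain Python. This is an illustrative reading of the description (global average pooling, several two-layer MLP branches summed, then a per-channel gate), not the published implementation: applying the sigmoid after the sum, the sizes `C = 8`, `r = 4`, `K = 3`, and all helper names are assumptions, and other aggregation choices are possible.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def squeeze(feature_maps):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(r) for r in fm) / (len(fm) * len(fm[0])) for fm in feature_maps]

def multi_branch_gate(z, branches):
    """Sum the outputs of K two-layer MLPs, then squash to (0, 1) per channel."""
    total = [0.0] * len(z)
    for W1, W2 in branches:
        out = matvec(W2, relu(matvec(W1, z)))
        total = [a + b for a, b in zip(total, out)]
    return sigmoid(total)

def recalibrate(feature_maps, gate):
    """Scale every channel of the feature maps by its gate value."""
    return [[[g * x for x in row] for row in fm] for fm, g in zip(feature_maps, gate)]

rng = random.Random(0)
C, r, K = 8, 4, 3  # hypothetical channel count, reduction ratio, branch count
x = [[[rng.random() for _ in range(2)] for _ in range(2)] for _ in range(C)]
branches = [(rand_matrix(C // r, C, rng), rand_matrix(C, C // r, rng)) for _ in range(K)]
gate = multi_branch_gate(squeeze(x), branches)
y = recalibrate(x, gate)
print(len(gate), all(0.0 < g < 1.0 for g in gate))
```

With a single branch, `multi_branch_gate` reduces to the ordinary SE gate, which makes the multi-branch variant a strict generalization in this sketch.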

Computational and Empirical Profile

The parameter increase per SE block is linear in the number of branches. For example, relative to a 23.62M-parameter ResNet-50, SE-ResNet-50 (24.90M) adds about 1.28M parameters, while SENetV2-50 (28.67M) adds about 5.05M, roughly four times as many. On CIFAR-10/100 and Tiny-ImageNet, SENetV2 consistently yields 0.8–1.3% absolute top-1 accuracy gains over SE-ResNet for comparable model sizes.
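
The parameter accounting can be checked with a short calculation. The sketch below assumes each branch is a bias-free two-layer MLP mapping C channels to C/r and back; the sizes `C = 256` and `r = 16` are hypothetical, while the model totals (including the 23.62M ResNet-50 baseline from the results table) are the figures quoted in this article.

```python
def se_params(channels, reduction, branches=1):
    """Weights of `branches` two-layer MLPs C -> C/r -> C, biases ignored."""
    hidden = channels // reduction
    return branches * 2 * channels * hidden

# Linear growth in the branch count (hypothetical C=256, r=16).
base = se_params(256, 16)
print([se_params(256, 16, k) // base for k in (1, 2, 4)])  # [1, 2, 4]

# Overhead implied by the quoted totals, in millions of parameters.
resnet50, se_resnet50, senetv2_50 = 23.62, 24.90, 28.67
se_added = se_resnet50 - resnet50
v2_added = senetv2_50 - resnet50
print(round(se_added, 2), round(v2_added, 2), round(v2_added / se_added, 2))  # 1.28 5.05 3.95
```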

Visualization of early kernels indicates richer filter diversity and higher-contrast patterns, consistent with improved global-channel feature synthesis. No significant benefit is observed from increasing the branch count beyond the default setting ("diminishing returns", as noted in the paper).

3. Incorporation of Kolmogorov-Arnold Networks in MP-SENet for Speech

MP-SENet's TF-Transformer blocks and decoders have also been enhanced with group-rational Kolmogorov-Arnold Network (GR-KAN) modules (Li et al., 2024). GR-KAN replaces standard feed-forward sublayers:

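  • GR-KAN Layer: For an input $\mathbf{x} \in \mathbb{R}^{C}$, it partitions the channels into $G$ groups, applying a learnable rational function $\phi_g$ per group, followed by weighted mixing:

$$\mathrm{GR\text{-}KAN}(\mathbf{x}) = \mathbf{W}\,\big[\phi_1(\mathbf{x}_1);\ \dots;\ \phi_G(\mathbf{x}_G)\big]$$

Here $\mathbf{x}_g$ denotes the channels in group $g$, $\phi_g$ acts elementwise, and $\mathbf{W}$ is the mixing matrix.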

  • Integration:
    • In each TF-Transformer, replace "linear $\to$ LeakyReLU" with two stacked GR-KANs plus linear projections.
    • Decoder upsampling blocks substitute PReLU+Conv2D with two branches: RBF-KAN+Conv2D and Swish+Conv2D, summed.
  • Empirical Effect:
    • On VoiceBank-DEMAND, adding GR-KAN raises PESQ from 3.565 (GELU baseline) to 3.61 at ~8% parameter and FLOPs overhead.
    • Qualitative improvement: sharper harmonics, more uniform noise suppression, reduced phase artifacts.
    • Ablations confirm that RBF-KAN in decoders and GR-KAN in TF blocks yield consistently higher PESQ relative to conventional activations (Li et al., 2024).
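
A GR-KAN layer of the kind described above can be sketched as a group-wise rational activation followed by a linear mix. The specific rational form used here, $P(x) / (1 + |Q(x)|)$ with no constant term in $Q$, is an assumption chosen because it cannot divide by zero; the coefficient values, the group sizes, and the function names are likewise hypothetical.

```python
def rational(x, a, b):
    """P(x) / (1 + |Q(x)|), where Q has no constant term."""
    num = sum(ai * x ** i for i, ai in enumerate(a))
    den = 1.0 + abs(sum(bj * x ** (j + 1) for j, bj in enumerate(b)))
    return num / den

def gr_kan(x, groups, W):
    """Group-wise rational activation followed by a linear mix.

    groups[g] = (a, b) holds the coefficients shared by every channel
    in group g; W is the mixing matrix applied after the activation.
    """
    size = len(x) // len(groups)  # channels per group
    act = [rational(v, *groups[i // size]) for i, v in enumerate(x)]
    return [sum(w * h for w, h in zip(row, act)) for row in W]

x = [1.0, -2.0, 0.5, 3.0]          # C = 4 channels
groups = [([0.0, 1.0, 0.5], [0.1]),  # group 0: (x + 0.5 x^2) / (1 + |0.1 x|)
          ([0.0, 1.0], [])]          # group 1: identity
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(gr_kan(x, groups, identity))
```

Sharing one set of coefficients per group keeps the number of activation parameters proportional to the group count rather than to the channel count.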

4. Training Procedures and Objective Functions

MP-SENet is trained using a composite objective that includes:

  • Spectral L1/L2 losses on compressed magnitude, anti-wrapping phase, and complex spectrum
  • STFT consistency loss (difference between the enhanced complex spectrum and the STFT of the waveform resynthesized from it)
  • GAN-style adversarial loss where a metric discriminator predicts a perceptual quality index (e.g., PESQ)
  • For visual MP-SENet (SENetV2), training is unrelated to the above and follows standard image-classification protocols (Adam/SGD, standard augmentations, batch size, etc.).

Key audio hyperparameters: 16 kHz sampling, an STFT with a hop of 100 samples, a batch size of 4 (or as tuned), the AdamW optimizer, 100–200 epochs, and empirically chosen weights for each loss term (Lu et al., 2023, Li et al., 2024).
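
To make the composite objective concrete, the sketch below shows a MetricGAN-style pair of losses together with a weighted sum of loss terms. It is a schematic under stated assumptions: the PESQ normalization range, the least-squares form of the adversarial losses, the loss values, and the weights are all hypothetical placeholders rather than the published settings.

```python
def normalize_pesq(pesq, lo=1.0, hi=4.5):
    """Map a PESQ value to [0, 1]; the range endpoints are an assumption."""
    return (pesq - lo) / (hi - lo)

def discriminator_loss(d_clean, d_enhanced, target):
    """D should output 1 for clean speech and the scaled metric otherwise."""
    return (d_clean - 1.0) ** 2 + (d_enhanced - target) ** 2

def generator_metric_loss(d_enhanced):
    """G is rewarded when D rates the enhanced speech as perfect."""
    return (d_enhanced - 1.0) ** 2

def total_generator_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

target = normalize_pesq(3.60)
terms = {
    "magnitude": 0.02,      # placeholder values for one training step
    "phase": 0.30,
    "complex": 0.05,
    "consistency": 0.01,
    "metric": generator_metric_loss(0.8),
}
weights = {"magnitude": 1.0, "phase": 0.5, "complex": 1.0,
           "consistency": 0.1, "metric": 0.05}  # hypothetical weights
print(round(target, 3), total_generator_loss(terms, weights))
```

The discriminator is trained toward the measured metric while the generator is trained toward the maximum score, which is what lets a non-differentiable metric such as PESQ influence the enhancement network.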

MP-SENet is evaluated against both time-domain (SEGAN, DEMUCS) and TF-domain (MetricGAN+, PHASEN, CMGAN, TridentSE) speech enhancement models. The explicit parallel phase estimation, combined with the TF-Transformer backbone, produces higher PESQ and subjective quality than complex masking and magnitude-only approaches, particularly under low-SNR, dereverberation, and bandwidth extension settings (Lu et al., 2023). In the vision domain, aggregated MLP SE modules improve over original SE and ResNeXt-SE on both small- and large-scale classification (Narayanan, 2023).

A plausible implication is that modular parallelism, either at the spectrum-branch or channel-branch level, is a general strategy for enhancing capacity in both audio and image domains.

6. Ablations and Empirical Insights

Ablations on both variants of MP-SENet uniformly indicate that:

  • Phase branch and anti-wrapping loss are essential for best perceptual metrics
  • Learnable sigmoidal mask and power-law compression are crucial for accurate magnitude recovery
  • Adding aggregated MLP branches beyond the default count gives only marginal returns in SENetV2
  • In GR-KAN-augmented speech MP-SENet, RBF-KAN outperforms conventional nonlinearities in decoder upsampling

Image classification results (Narayanan, 2023):

Model          CIFAR-100 Top-1 (%)   Params
ResNet-50      61.72                 23.62M
SE-ResNet-50   62.26                 24.90M
SENetV2-50     63.52                 28.67M

Speech enhancement results (Lu et al., 2023, Li et al., 2024):

Model      VoiceBank-DEMAND PESQ   Params
PHASEN     2.99                    -
CMGAN      3.41                    1.83M
MP-SENet   3.50–3.61               2.05–2.49M

7. Applications and Future Directions

MP-SENet (speech) serves as a unified backend for speech denoising, dereverberation, and bandwidth extension, with implications in robust ASR frontends, hearing aids, and communication codecs. Its modular TF encoder-decoder and explicit phase branch are likely to inspire further architectures that decouple spectral components.

MP-SENet (vision/SENetV2) validates multi-branch global-local recalibration in residual networks, indicating broader applicability in deep CNN compression, efficient attention, and possibly hybrid vision transformers.

This suggests that the "MP-SENet" principle—explicit multifactor parallelism within modular blocks—constitutes a practical advance in both signal enhancement and visual representation domains, subject to continued ablation-driven refinement and domain-specific tailoring (Lu et al., 2023, Lu et al., 2023, Li et al., 2024, Narayanan, 2023).
