MP-SENet: Dual-Domain Parallel Architectures
- MP-SENet names two unrelated architectures: a transformer-based parallel magnitude-phase speech enhancement network in the audio domain, and a multi-branch squeeze-excitation residual network (SENetV2) for visual recognition.
- The speech module uses time-frequency transformers, parallel decoders, and GR-KAN layers to explicitly estimate magnitude and phase, achieving superior PESQ scores.
- The vision module (SENetV2) employs aggregated multi-branch MLPs within residual blocks to enhance global-channel feature recalibration and boost image recognition accuracy.
MP-SENet denotes two distinct architectures in the research literature: (1) a transformer-based parallel magnitude and phase speech enhancement network in the audio domain (Lu et al., 2023, Lu et al., 2023, Li et al., 23 Dec 2024), and (2) a multi-branch squeeze-excitation residual network ("SENetV2") for visual recognition (Narayanan, 2023). Both share the "MP-SENet" designation and the unifying theme of modular parallelism for enhanced representational power, but are unrelated in architectural detail and application domain.
1. Parallel Magnitude-Phase Speech Enhancement Network
MP-SENet for speech enhancement is a time-frequency (TF) domain architecture designed to explicitly estimate and denoise both magnitude and wrapped phase spectra in parallel, addressing the magnitude-phase compensation effect observed in earlier complex-valued or mask-only denoising approaches (Lu et al., 2023, Lu et al., 2023).
Key Components
- Input Representation: Given a noisy waveform y, the STFT yields the magnitude spectrum |Y| and the wrapped phase spectrum ∠Y. After power-law compression of the magnitude, the two features are stacked as a two-channel tensor.
- Encoder: Comprises 2D convolutions and dilated DenseNet blocks, expanding the channel dimension and downsampling the frequency axis (the downsampling factor is set per configuration).
- TF-Transformer Bridge: Stacked "Time-Frequency Transformer" blocks that alternate between temporal self-attention (along the time axis) and frequency self-attention (along the frequency axis), each incorporating bi-directional GRU and position-wise linear layers. These blocks model long-range and local dependencies in both axes.
- Parallel Decoders: Two symmetric decoder branches:
- Magnitude Mask Decoder: Predicts a mask applied to the compressed noisy magnitude, utilizing a learnable sigmoid for mask activation.
- Phase Decoder: Estimates the wrapped phase via two parallel convolutional branches for pseudo-real and pseudo-imaginary channels, combined through a two-argument arctangent (atan2) so the phase is predicted directly in its wrapped form, avoiding unwrapping instabilities.
- Losses: Multi-level objectives including:
- Power-law compressed magnitude MSE
- Phase losses with anti-wrapping, group delay, and instantaneous frequency
- Complex STFT loss
- STFT consistency loss
- Adversarial metric loss via a MetricGAN-like discriminator predicting PESQ.
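The parallel magnitude/phase reconstruction described above can be sketched in a few lines of NumPy; the compression exponent, sigmoid scale, and tensor shapes below are illustrative assumptions rather than the papers' exact settings:

```python
import numpy as np

def enhance_spectrum(noisy_mag, mask_logits, pseudo_real, pseudo_imag,
                     compress=0.3, beta=2.0):
    """Sketch of MP-SENet-style parallel magnitude/phase reconstruction.
    `compress` (power-law exponent) and `beta` (sigmoid scale) are
    illustrative assumptions, not the published hyperparameters."""
    # Power-law compress the noisy magnitude, as at the network input.
    mag_c = noisy_mag ** compress
    # Magnitude mask: scaled sigmoid, bounded in (0, beta); in the paper
    # the sigmoid shape itself is learnable.
    mask = beta / (1.0 + np.exp(-mask_logits))
    est_mag_c = mag_c * mask
    # Undo the compression to return to the linear magnitude domain.
    est_mag = est_mag_c ** (1.0 / compress)
    # Phase decoder: two-argument arctangent of the pseudo-imag/pseudo-real
    # branches yields a wrapped phase in (-pi, pi].
    est_phase = np.arctan2(pseudo_imag, pseudo_real)
    # Recombine into a complex STFT; an iSTFT would give the waveform.
    return est_mag * np.exp(1j * est_phase)

# Toy example on a single (freq, time) spectrum.
F, T = 201, 50
rng = np.random.default_rng(0)
spec = enhance_spectrum(rng.random((F, T)) + 0.1,
                        rng.standard_normal((F, T)),
                        rng.standard_normal((F, T)),
                        rng.standard_normal((F, T)))
```

An iSTFT of the returned complex spectrum would yield the enhanced waveform.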
Performance and Empirical Analysis
On VoiceBank+DEMAND, MP-SENet achieves WB-PESQ 3.60 with 2.26M parameters, outperforming CMGAN (3.41) and PHASEN (2.99) while maintaining competitive scores on CSIG, CBAK, COVL, and STOI (Lu et al., 2023). Results extend to multi-condition datasets (DNS, REVERB, VCTK) and tasks including dereverberation and bandwidth extension, demonstrating robust generalization. Qualitative spectrograms reveal superior preservation of harmonic structure and high-frequency details.
Ablation studies confirm that explicit phase decoding and comprehensive phase loss terms provide significant objective and perceptual gains over magnitude-only or naive phase approaches. Power-law compression and learnable sigmoidal masking are also identified as critical design choices (Lu et al., 2023).
2. Aggregated Squeeze-Excitation Residual Network (SENetV2)
In the context of visual recognition, MP-SENet (also referred to as SENetV2) augments the classical squeeze-excitation (SE) module by replacing the per-channel two-layer MLP with a multi-branch (aggregated) MLP (Narayanan, 2023). This design allows the network to gather richer global and channel-wise representations, improving the recalibration of feature maps in residual blocks.
Architectural Description
- Residual Block: Follows the canonical ResNet bottleneck structure with down/expansion, conv, and SE module.
- SE Recap: After global average pooling produces a channel descriptor z ∈ R^C, a two-layer MLP computes the channel-wise gate s = σ(W2 δ(W1 z)) (Eqn (1)), where δ is ReLU, σ is the sigmoid, and r is the reduction ratio (W1 ∈ R^(C/r)×C, W2 ∈ R^C×(C/r)).
- Multi-Branch Aggregated MLP: SENetV2 introduces K parallel branches, each a two-layer MLP s_k = W2^(k) δ(W1^(k) z).
Aggregated by summation before the gate: s = σ(Σ_k s_k).
The output s rescales the feature maps channel-wise, as in SE.
- Hyperparameters: The branch count and reduction ratio are typically fixed and shared across all stages.
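A minimal NumPy sketch of the aggregated recalibration, assuming summation-based aggregation with the sigmoid applied after the sum; the branch count, reduction ratio, and weight initialization are illustrative:

```python
import numpy as np

def multi_branch_se(x, branch_weights):
    """Sketch of SENetV2-style aggregated SE recalibration.
    `x`: feature map (C, H, W); `branch_weights`: list of (W1, W2)
    pairs, one two-layer MLP per branch (shapes (C/r, C) and (C, C/r))."""
    C = x.shape[0]
    z = x.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    s = np.zeros(C)
    for W1, W2 in branch_weights:           # one two-layer MLP per branch
        s += W2 @ np.maximum(W1 @ z, 0.0)   # ReLU between the two layers
    gate = 1.0 / (1.0 + np.exp(-s))         # sigmoid gating after summation
    return x * gate[:, None, None]          # excite: rescale each channel

# Toy example: C = 64 channels, reduction r = 16, K = 4 branches.
C, r, K = 64, 16, 4
rng = np.random.default_rng(1)
branch_weights = [(rng.standard_normal((C // r, C)) * 0.1,
                   rng.standard_normal((C, C // r)) * 0.1) for _ in range(K)]
y = multi_branch_se(rng.standard_normal((C, 8, 8)), branch_weights)
```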
Computational and Empirical Profile
Parameter increase per SE block is linear in the number of branches; SENetV2-50 carries a modest overhead over SE-ResNet-50 (28.67M vs. 24.90M parameters). On CIFAR-10/100 and Tiny-ImageNet, SENetV2 consistently yields 0.8–1.3% absolute top-1 accuracy gains over SE-ResNet at comparable model sizes.
Visualization of early kernels indicates richer filter diversity and higher-contrast patterns, consistent with improved global-channel feature synthesis. No significant benefit is observed beyond a moderate branch count ("diminishing returns", as noted in the paper).
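The linear parameter growth can be checked with a one-line count; `C = 256` and `r = 16` are illustrative values, not the paper's exact configuration:

```python
def se_added_params(C, r, branches):
    """Added fully connected parameters of an aggregated SE block
    (biases ignored): each branch is a C -> C/r -> C two-layer MLP,
    so the overhead grows linearly in `branches`."""
    return branches * 2 * C * (C // r)

single = se_added_params(256, 16, 1)  # classic single-branch SE block
multi = se_added_params(256, 16, 4)   # four aggregated branches
```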
3. Incorporation of Kolmogorov-Arnold Networks in MP-SENet for Speech
MP-SENet's TF-Transformer blocks and decoders have also been enhanced with group-rational Kolmogorov-Arnold Network (GR-KAN) modules (Li et al., 23 Dec 2024). GR-KAN replaces standard feed-forward sublayers:
- GR-KAN Layer: For an input feature vector, the layer partitions the channels into groups, applies a learnable rational function F(x) = P(x)/Q(x) per group, and follows with a learned linear mixing of the group outputs.
- Integration:
- In each TF-Transformer, the linear + LeakyReLU feed-forward sublayer is replaced with two stacked GR-KAN layers plus linear projections.
- Decoder upsampling blocks substitute PReLU+Conv2D with two branches: RBF-KAN+Conv2D and Swish+Conv2D, summed.
- Empirical Effect:
- On VoiceBank-DEMAND, adding GR-KAN raises PESQ from 3.565 (GELU baseline) to 3.61 at ~8% parameter and FLOPs overhead.
- Qualitative improvement: sharper harmonics, more uniform noise suppression, reduced phase artifacts.
- Ablations confirm that RBF-KAN in decoders and GR-KAN in TF blocks yield consistently higher PESQ relative to conventional activations (Li et al., 23 Dec 2024).
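A rough NumPy sketch of a group-rational layer, assuming a safe Padé form P(x)/(1 + |x·Q(x)|) for the denominator and a final linear mixing; the group count and polynomial degrees are illustrative assumptions:

```python
import numpy as np

def gr_kan_layer(x, group_coeffs, W):
    """Sketch of a group-rational KAN layer. Channels are split into
    len(group_coeffs) groups; each group gets its own rational activation
    (numerator/denominator polynomial coefficients p, q), followed by a
    learned linear mixing matrix W."""
    d = x.shape[-1]
    g = len(group_coeffs)
    out = np.empty_like(x)
    for i, (p, q) in enumerate(group_coeffs):
        sl = slice(i * d // g, (i + 1) * d // g)
        xs = x[..., sl]
        num = np.polyval(p, xs)                      # P(x), elementwise
        den = 1.0 + np.abs(xs * np.polyval(q, xs))   # safe denominator >= 1
        out[..., sl] = num / den
    return out @ W.T                                 # weighted channel mixing

# Toy example: d = 8 channels in g = 4 groups, batch of 5 vectors.
d, g = 8, 4
rng = np.random.default_rng(2)
group_coeffs = [(rng.standard_normal(4) * 0.5, rng.standard_normal(2) * 0.5)
                for _ in range(g)]
y = gr_kan_layer(rng.standard_normal((5, d)), group_coeffs,
                 rng.standard_normal((d, d)))
```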
4. Training Procedures and Objective Functions
MP-SENet is trained using a composite objective that includes:
- Spectral L1/L2 losses on compressed magnitude, anti-wrapping phase, and complex spectrum
- STFT consistency loss (the difference between the predicted spectrum and the STFT of its own inverse-STFT resynthesis)
- GAN-style adversarial loss where a metric discriminator predicts a perceptual quality index (e.g., PESQ)
- For visual MP-SENet (SENetV2), independent, conventional classification protocols are used (Adam/SGD, standard augmentations, tuned batch size, etc.).
Key audio hyperparameters: 16 kHz sampling; STFT with a hop of 100 samples (window/FFT size per configuration); batch size 4 or as tuned; AdamW optimizer; 100–200 epochs; with empirically tuned weights for each loss term (Lu et al., 2023, Li et al., 23 Dec 2024).
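The STFT consistency term can be sketched with SciPy's `stft`/`istft`; the 100-sample hop matches the text, while the 400-sample Hann window is an assumption for illustration:

```python
import numpy as np
from scipy.signal import stft, istft

def stft_consistency_loss(spec, fs=16000, nperseg=400, noverlap=300):
    """L1 distance between a predicted complex spectrum and the STFT of
    its own iSTFT resynthesis (hop = nperseg - noverlap = 100 samples).
    Inconsistent spectra (no generating waveform) incur a penalty."""
    _, wav = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, respec = stft(wav, fs=fs, nperseg=nperseg, noverlap=noverlap)
    T = min(spec.shape[-1], respec.shape[-1])
    return np.mean(np.abs(spec[..., :T] - respec[..., :T]))

# A spectrum obtained from a real waveform is (near-)consistent...
rng = np.random.default_rng(3)
x = rng.standard_normal(16000)
_, _, S = stft(x, fs=16000, nperseg=400, noverlap=300)
loss_consistent = stft_consistency_loss(S)
# ...while a randomly perturbed complex "spectrum" is not.
noise = 0.5 * (rng.standard_normal(S.shape) + 1j * rng.standard_normal(S.shape))
loss_random = stft_consistency_loss(S + noise)
```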
5. Comparison with Related Architectures
MP-SENet is evaluated against both time-domain (SEGAN, DEMUCS) and TF-domain (MetricGAN+, PHASEN, CMGAN, TridentSE) speech enhancement models. The explicit parallel phase estimation, combined with the TF-Transformer backbone, produces higher PESQ and subjective quality than complex masking and magnitude-only approaches, particularly under low-SNR, dereverberation, and bandwidth extension settings (Lu et al., 2023). In the vision domain, aggregated MLP SE modules improve over original SE and ResNeXt-SE on both small- and large-scale classification (Narayanan, 2023).
A plausible implication is that modular parallelism, either at the spectrum-branch or channel-branch level, is a general strategy for enhancing capacity in both audio and image domains.
6. Ablations and Empirical Insights
Ablations on both variants of MP-SENet uniformly indicate that:
- Phase branch and anti-wrapping loss are essential for best perceptual metrics
- Learnable sigmoidal mask and power-law compression are crucial for accurate magnitude recovery
- Aggregated MLP branches beyond a moderate count give marginal returns in SENetV2
- In GR-KAN-augmented speech MP-SENet, RBF-KAN outperforms conventional nonlinearities in decoder upsampling
Representative results from (Narayanan, 2023) and the speech enhancement literature:
| Model | CIFAR-100 Top-1 | Params |
|---|---|---|
| ResNet-50 | 61.72 | 23.62M |
| SE-ResNet-50 | 62.26 | 24.90M |
| SENetV2-50 | 63.52 | 28.67M |

| Model | VoiceBank+DEMAND PESQ | Params |
|---|---|---|
| PHASEN | 2.99 | – |
| CMGAN | 3.41 | 1.83M |
| MP-SENet | 3.50–3.61 | 2.05–2.49M |
7. Applications and Future Directions
MP-SENet (speech) serves as a unified backend for speech denoising, dereverberation, and bandwidth extension, with implications in robust ASR frontends, hearing aids, and communication codecs. Its modular TF encoder-decoder and explicit phase branch are likely to inspire further architectures that decouple spectral components.
MP-SENet (vision/SENetV2) validates multi-branch global-local recalibration in residual networks, indicating broader applicability in deep CNN compression, efficient attention, and possibly hybrid vision transformers.
This suggests that the "MP-SENet" principle—explicit multifactor parallelism within modular blocks—constitutes a practical advance in both signal enhancement and visual representation domains, subject to continued ablation-driven refinement and domain-specific tailoring (Lu et al., 2023, Lu et al., 2023, Li et al., 23 Dec 2024, Narayanan, 2023).