
ParaNoise-SV Noise-Robust Speaker Verification

Updated 13 August 2025
  • ParaNoise-SV is a unified noise-robust speaker verification model that employs dual U-Net architectures for explicit noise extraction and speech enhancement.
  • It utilizes parallel encoder-level connections to guide speech enhancement, effectively preserving speaker-specific features while suppressing noise.
  • Experiments on the VoxCeleb1 dataset show significant EER reductions, demonstrating its robust performance across varied acoustic conditions.

ParaNoise-SV is a unified model for noise-robust speaker verification that employs a dual U-Net architecture—one dedicated to explicit noise extraction and the other to speech enhancement—connected via parallel, encoder-level interactions. The central motivation is to overcome limitations found in previous joint SE-SV models, which suppress noise implicitly and thus risk degrading speaker-characteristic features. By disentangling noise modeling from speech enhancement and enabling information flow between these operations, ParaNoise-SV attains superior noise resilience and improved verification accuracy across varied acoustic conditions.

1. Integrated Dual U-Net Architecture

ParaNoise-SV consists of three principal components: a Noise Extraction network (NE), a Speech Enhancement network (SE), and a Speaker Verification module. Both NE and SE use U-Net–style encoder–decoder architectures based on SE-ResNet blocks, designed for spectral input. The NE network explicitly estimates the noise spectrogram from the input, while the SE network enhances the speech content by referencing the NE output. Parallel connections between corresponding layers in the NE and SE encoders permit dynamic guidance of the enhancement process by noise-related features, which is critical for preserving speaker-relevant information during denoising.

Summary Table: Network Structure

| Component | Function | Interaction |
| --- | --- | --- |
| Noise Extraction (NE) | Models input noise and produces noise feature maps | Guides SE via parallel encoder connections |
| Speech Enhancement (SE) | Enhances speech from noisy input, preserving speaker cues | Receives NE features during encoding |
| Speaker Verification (SV) | Classifies enhanced speech for speaker identity | Operates on SE output |
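The three-stage flow in the table above can be sketched as plain function composition. This is a minimal illustration, not the paper's implementation: the real NE and SE are U-Nets, and the function names here are hypothetical stand-ins.

```python
import numpy as np

def noise_extraction(spec):
    """Stand-in for the NE U-Net: estimate the noise part of the spectrogram.
    Here a fixed fraction of the input, purely for illustration."""
    return 0.3 * spec

def speech_enhancement(spec, noise_feats):
    """Stand-in for the SE U-Net: enhance speech using the NE estimate."""
    return spec - noise_feats

def speaker_embedding(enhanced):
    """Stand-in for the SV front-end: pool over time to a fixed-size vector."""
    return enhanced.mean(axis=1)

# Toy (freq_bins x frames) magnitude spectrogram
spec = np.abs(np.random.default_rng(0).normal(size=(257, 100)))
noise = noise_extraction(spec)               # NE output guides SE
enhanced = speech_enhancement(spec, noise)   # SE consumes NE features
emb = speaker_embedding(enhanced)            # SV operates on the SE output
print(emb.shape)  # (257,)
```

The key point the sketch preserves is the dataflow: the SE stage receives the NE output explicitly, rather than suppressing noise implicitly inside a single network.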

2. Explicit Noise Extraction Mechanism

The NE network receives the normalized input spectrogram and processes it through consecutive encoder blocks:

  • Encoder: $N_{E,i} = e_N^{(i)}(N_{E,i-1})$ for $i = 1, \ldots, L$, with $L = 4$.
  • Decoder: initialized at the deepest encoder output, decoding proceeds via skip connections: $N_{D,i} = d_N^{(i)}(N_{D,i-1}, N_{E,L-i})$.
  • Output: the estimated noise spectrogram $\hat{N}$ is produced via transposed convolution: $\hat{N} = \mathrm{ConvTranspose}(N_{D,L}, N_{E,0})$.

This explicit modeling allows for precise separation of speaker-irrelevant noise and the generation of reliable guidance features for speaker-preserving enhancement.
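The encoder/decoder recursion above can be traced with a toy numpy sketch. The blocks below are hypothetical stand-ins (simple down/upsampling in place of SE-ResNet blocks and the transposed convolution), chosen only so the skip-connection indexing $N_{D,i} = d_N^{(i)}(N_{D,i-1}, N_{E,L-i})$ is explicit.

```python
import numpy as np

L = 4  # encoder depth, as in the paper

def enc_block(x):
    # stand-in for an SE-ResNet encoder block: downsample along time
    return x[:, ::2]

def dec_block(x, skip):
    # stand-in decoder block: upsample, then fuse the skip connection
    up = np.repeat(x, 2, axis=1)[:, :skip.shape[1]]
    return up + skip

spec = np.ones((64, 32))          # N_{E,0}: input spectrogram
enc = [spec]
for i in range(L):                 # N_{E,i} = e_N^{(i)}(N_{E,i-1})
    enc.append(enc_block(enc[-1]))

dec = enc[-1]                      # decoding starts at the deepest encoder output
for i in range(1, L + 1):          # N_{D,i} = d_N^{(i)}(N_{D,i-1}, N_{E,L-i})
    dec = dec_block(dec, enc[L - i])

noise_hat = dec + enc[0]           # stand-in for ConvTranspose(N_{D,L}, N_{E,0})
print(noise_hat.shape)             # matches the input: (64, 32)
```

Note how the $i$-th decoder step consumes the $(L-i)$-th encoder feature, so the estimated noise spectrogram recovers the input resolution.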

3. Speech Enhancement Informed by Noise Features

The SE network processes the input spectrogram by fusing noise features from the NE network at the encoder stage:

  • Encoder: $S_{E,i} = e_S^{(i)}(S_{E,i-1}, N_{E,i-1})$ for $i = 1, \ldots, L$.
  • Decoder: $S_{D,i} = d_S^{(i)}(S_{D,i-1}, S_{E,L-i})$.
  • Output: enhanced speech spectrogram $\hat{S} = \mathrm{ConvTranspose}(S_{D,L}, S_{E,0})$.

By propagating noise information early (at the encoder level), the SE network achieves robust suppression of noise while avoiding the loss of speaker-discriminative features.
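The encoder-level fusion $S_{E,i} = e_S^{(i)}(S_{E,i-1}, N_{E,i-1})$ can be made concrete with a short sketch. The fusion rule here (subtracting the parallel NE feature before downsampling) is a hypothetical simplification, not the paper's learned operator; it only demonstrates that each SE encoder stage consumes the same-depth NE encoder feature.

```python
import numpy as np

L = 4

def ne_enc_block(x):
    return x[:, ::2]                       # NE encoder stand-in

def se_enc_block(x, noise_feat):
    # fuse the parallel NE feature, then downsample (stand-in for e_S)
    return (x - noise_feat)[:, ::2]

spec = np.ones((64, 32))

# NE encoder features N_{E,0..L}
ne = [spec]
for _ in range(L):
    ne.append(ne_enc_block(ne[-1]))

# SE encoder: S_{E,i} = e_S^{(i)}(S_{E,i-1}, N_{E,i-1})
se = [spec]
for i in range(1, L + 1):
    se.append(se_enc_block(se[-1], ne[i - 1]))

print([f.shape[1] for f in se])  # [32, 16, 8, 4, 2]
```

Because fusion happens at matching depths, the NE feature $N_{E,i-1}$ always has the same resolution as the SE feature it guides.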

4. Significance and Design of Parallel Connections

A key innovation is the parallel interconnection between encoder stages of NE and SE networks. Ablation experiments demonstrate that encoder-level (rather than decoder-level) parallel fusion yields optimal performance. Encoder connections provide the enhancement encoder with real-time noise estimates, allowing it to more effectively disentangle noise from speaker characteristics. Decoder-level fusion, or mixed encoder-decoder fusion, was found to degrade verification performance.

5. Loss Functions and Joint Optimization

ParaNoise-SV is trained with a multi-term objective that integrates noise extraction, speech enhancement, and speaker verification:

  • $L = L_n + L_s + L_C + L_{AP} + L_{AAM}$
    • $L_n$: MSE between the NE output and the true noise spectrogram
    • $L_s$: MSE between the SE output and the clean speech spectrogram
    • $L_C$: cross-entropy on the initial speaker embedding
    • $L_{AP}$, $L_{AAM}$: angular prototypical and additive angular margin losses for robust speaker discrimination

These loss terms ensure the networks learn to extract, suppress, and verify under joint supervision, reinforcing the preservation of critical speech identity features.
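A minimal numpy sketch of the multi-term objective follows. The margin and scale values are illustrative defaults, not the paper's hyperparameters, and the angular prototypical term is omitted because it requires utterance pairs within a batch.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cross_entropy(logits, label):
    # softmax cross-entropy for a single sample, computed stably
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return float(-logp[label])

def aam_logits(emb, weights, label, margin=0.2, scale=30.0):
    # additive angular margin: add margin m to the target-class angle
    cos = weights @ emb / (np.linalg.norm(weights, axis=1) * np.linalg.norm(emb))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    theta[label] += margin
    return scale * np.cos(theta)

rng = np.random.default_rng(0)
noise_hat, noise = rng.normal(size=(2, 64, 32))
speech_hat, speech = rng.normal(size=(2, 64, 32))
emb = rng.normal(size=128)
W = rng.normal(size=(10, 128))   # 10 speakers (toy classifier weights)
label = 3

L_n = mse(noise_hat, noise)                            # NE target
L_s = mse(speech_hat, speech)                          # SE target
L_c = cross_entropy(W @ emb, label)                    # CE on the embedding
L_aam = cross_entropy(aam_logits(emb, W, label), label)
total = L_n + L_s + L_c + L_aam                        # L_AP omitted (needs pairs)
print(total)
```

Each term is non-negative, so minimizing the sum jointly pushes the NE output toward the true noise, the SE output toward clean speech, and the embedding toward its speaker class.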

6. Experimental Evaluation

Evaluation on the VoxCeleb1 dataset—augmented with noise from the MUSAN corpus (SNR 0–20 dB) and out-of-domain noise from NonSpeech100—demonstrates:

  • EER (clean): 1.75%
  • EER (combined clean & noisy): 3.40% (encoder-level fusion)
  • Relative EER reduction: 8.4% (seen noise), 8.2% (unseen noise), compared to previous joint SE-SV models
  • Robustness: Maintains competitive error rates under challenging noise scenarios

These results substantiate the efficacy of explicit noise modeling and parallel network guidance for noise-robust speaker verification.
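Since the headline metric above is EER, a short sketch of how EER is computed from verification trial scores may be useful. This is a generic threshold-sweep implementation, not code from the paper, and the score distributions are synthetic.

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the point where the false-accept rate (nontargets
    scored above threshold) and false-reject rate (targets scored below
    threshold) cross, found by sweeping the threshold over all scores."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()                    # targets rejected
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()      # nontargets accepted
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

rng = np.random.default_rng(1)
tgt = rng.normal(1.0, 0.5, 1000)    # same-speaker trial scores (synthetic)
non = rng.normal(-1.0, 0.5, 1000)   # different-speaker trial scores (synthetic)
print(round(eer(tgt, non), 3))
```

With well-separated score distributions the EER is small; a noise-robust SV system aims to keep these distributions separated even under heavy acoustic corruption.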

7. Implications and Future Directions

ParaNoise-SV’s strategy of explicit noise modeling with parallel encoder connections offers several technical benefits:

  • Enhances noise disentanglement and suppression without sacrificing speaker identity
  • Demonstrates that early, encoder-stage fusion is critical for verification accuracy in noisy environments
  • Achieves strong results with moderate model complexity, facilitating integration with production-scale SV systems

Future directions suggested in the paper include expanding parallel fusion strategies, leveraging self-supervised pretraining for further accuracy improvements without excessive parameter growth, and adopting more adaptive noise synthesis to handle a broader spectrum of real acoustical challenges.


ParaNoise-SV establishes a rigorous, multi-component framework for noise-robust speaker verification, highlighting the importance of explicit noise extraction and encoder-level feature transfer for high-fidelity verification performance in both familiar and challenging noise environments (Kim et al., 10 Aug 2025).
