GTCRN Model for Speech Enhancement
- GTCRN is a lightweight, modular deep learning architecture designed for advanced speech enhancement and target speech extraction in noisy environments.
- It employs grouped temporal convolution blocks and dual-path recurrent modules to efficiently model spectral–temporal features and reduce computational load.
- The model integrates with systems like SEF-PNet and IVA-based preprocessing, demonstrating measurable improvements in SI-SDR, PESQ, and STOI across various tests.
The Grouped Temporal Convolutional Recurrent Network (GTCRN) is a lightweight, modular deep learning architecture designed for speech enhancement and target speech extraction (TSE) in noisy and multi-speaker scenarios. It integrates grouped temporal convolutional blocks and grouped dual-path recurrent modules to achieve efficient spectral–temporal modeling, supporting both single-channel and dual-channel inputs. GTCRN has been deployed as a guiding enhancement stage for advanced TSE backbones such as SEF-PNet and CIE-mDPTNet (Huang et al., 27 Aug 2025), and is also a core component in low-resource hybrid enhancement systems with IVA-based preprocessing under adverse SNR conditions (Wang et al., 26 May 2025).
1. Architectural Composition
GTCRN is built around a sequence of signal domain transformations and learnable modules:
- ERB-band front-end: The input is the complex STFT $X \in \mathbb{C}^{F \times T}$, mapped to $B$ ERB bands via a fixed filterbank $W_{\mathrm{ERB}} \in \mathbb{R}^{B \times F}$ for efficient, perceptually motivated dimensionality reduction. The ERB band-mapped signal is $X_{\mathrm{ERB}} = W_{\mathrm{ERB}}\,X$.
- Encoder: Consists of two standard convolutional blocks (1D or 2D, depending on the channel configuration, with batch normalization and PReLU) followed by three Grouped Temporal Convolution ("GT-Conv") blocks. Within a GT-Conv block, the input channels are split into groups; each group undergoes a pointwise (1×1) grouped convolution, a group-specific dilated convolution, gating, and residual addition, and a channel shuffle spreads information across groups (a minimal sketch is given after this list).
- G-DPRNN Bottleneck: The feature sequence is divided into overlapping chunks; intra-chunk and inter-chunk bidirectional LSTMs (or grouped bidirectional GRUs in some dual-channel variants) model local and global dependencies, while the grouped structure keeps computation efficient (a dual-path chunking sketch follows the comparison table below).
- Decoder: Mirrors the encoder through transposed convolutions and GT-Conv-transpose blocks. The output is projected back to the ERB band domain and reconstructed to the STFT domain via overlap-add and ERB-inverse filtering.
- Dual-channel variants: For microphone-array inputs, GTCRN incorporates feature selection, band merging (ERB + low-frequency bins), subband feature extraction, skip connections, and flexible masking strategies; auxiliary IVA-separated features can be concatenated with the raw mixture.
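Below is a minimal PyTorch sketch of the GT-Conv idea described above. The channel count, kernel size, dilation, grouping factor, and the sigmoid-gating form are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so information mixes between them."""
    b, c, t = x.shape
    return x.view(b, groups, c // groups, t).transpose(1, 2).reshape(b, c, t)


class GTConvBlock(nn.Module):
    """Illustrative grouped temporal convolution block: pointwise grouped conv,
    gated dilated grouped conv, residual addition, then channel shuffle."""

    def __init__(self, channels: int = 16, kernel_size: int = 3,
                 dilation: int = 2, groups: int = 2):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.pointwise = nn.Conv1d(channels, channels, 1, groups=groups)
        self.dilated = nn.Conv1d(channels, channels, kernel_size,
                                 padding=pad, dilation=dilation, groups=groups)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation, groups=groups)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.PReLU()
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        y = self.act(self.norm(self.pointwise(x)))
        y = self.dilated(y) * torch.sigmoid(self.gate(y))  # gated dilated conv
        return channel_shuffle(x + y, self.groups)          # residual + shuffle
```

The channel shuffle at the end is what lets information cross group boundaries despite the grouped convolutions.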
| Component | Single-channel (SEF-PNet) | Dual-channel (Hybrid/IVA) |
|---|---|---|
| Encoder | 2×Conv + 3×GT-Conv blocks | 2×2D-Conv + 3×Grouped T-Conv blocks |
| RNN bottleneck | G-DPRNN (Bi-LSTM) | Grouped Dual-Path GRU |
| Decoder | Symmetric to encoder | Symmetric to encoder |
The exact channel counts, kernel sizes, grouping factors, and dilation schedules are architecture-dependent and, where not reported, should be sourced from original references.
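The dual-path bottleneck listed in the table can be sketched as follows. Chunk length, hop, hidden size, and the use of plain bidirectional LSTMs (rather than grouped recurrent units) are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn


class DualPathRNN(nn.Module):
    """Illustrative dual-path recurrence: split the feature sequence into
    overlapping chunks, run an intra-chunk RNN within each chunk (local
    modelling) and an inter-chunk RNN across chunk indices (global modelling)."""

    def __init__(self, feat_dim: int = 16, hidden: int = 32,
                 chunk_len: int = 32, hop: int = 16):
        super().__init__()
        self.chunk_len, self.hop = chunk_len, hop
        self.intra = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.inter = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feat_dim)
        self.inter_proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, feat)
        # Segment into overlapping chunks: (batch, n_chunks, chunk_len, feat)
        chunks = x.unfold(1, self.chunk_len, self.hop).permute(0, 1, 3, 2)
        b, n, l, f = chunks.shape
        # Intra-chunk pass: local dependencies within each chunk
        intra_out, _ = self.intra(chunks.reshape(b * n, l, f))
        chunks = chunks + self.intra_proj(intra_out).reshape(b, n, l, f)
        # Inter-chunk pass: global dependencies across chunks, per frame position
        inter_in = chunks.permute(0, 2, 1, 3).reshape(b * l, n, f)
        inter_out, _ = self.inter(inter_in)
        chunks = chunks + self.inter_proj(inter_out).reshape(b, l, n, f).permute(0, 2, 1, 3)
        return chunks  # the full model overlap-adds chunks back to a sequence
```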
2. Mathematical Signal Processing Flow
The GTCRN maps a noisy time-domain input $x(n)$ to a denoised estimate $\hat{s}(n)$ via a cascade:
- STFT: $X = \mathrm{STFT}\{x\} \in \mathbb{C}^{F \times T}$
- ERB Mapping: $X_{\mathrm{ERB}} = W_{\mathrm{ERB}}\,X$, where $W_{\mathrm{ERB}} \in \mathbb{R}^{B \times F}$ is the fixed filterbank merging $F$ frequency bins into $B$ ERB bands
- Encoder → G-DPRNN → Decoder: $M_{\mathrm{ERB}} = \mathcal{G}_{\theta}(X_{\mathrm{ERB}})$, with $\mathcal{G}_{\theta}$ the learnable network and $M_{\mathrm{ERB}}$ the band-domain mask
- ERB Inverse & ISTFT: $\hat{S} = \big(W_{\mathrm{ERB}}^{\top} M_{\mathrm{ERB}}\big) \odot X$, $\hat{s} = \mathrm{ISTFT}\{\hat{S}\}$
In compact notation: $\hat{s} = \mathrm{ISTFT}\big\{\big(W_{\mathrm{ERB}}^{\top}\,\mathcal{G}_{\theta}(W_{\mathrm{ERB}}\,\mathrm{STFT}\{x\})\big) \odot \mathrm{STFT}\{x\}\big\}$.
For dual-channel setups with auxiliary IVA features, the microphone inputs $x_{1}(n)$ and $x_{2}(n)$ are STFTed, combined with the IVA coarse estimates, subjected to band merging and subband grouping, and processed as outlined above.
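A compact sketch of this cascade is given below, under simplifying assumptions: a triangular filterbank stands in for the actual ERB design, band features are taken from the STFT magnitude, and a placeholder network `net` predicts a real-valued band mask that is mapped back to linear frequencies and applied to the complex mixture spectrogram (the real model can also consume complex features and predict complex masks).

```python
import torch
import torch.nn as nn


def triangular_filterbank(n_freq: int = 257, n_bands: int = 64) -> torch.Tensor:
    """Stand-in band-mapping matrix W of shape (n_bands, n_freq); the real
    model uses a fixed, perceptually spaced ERB filterbank."""
    centers = torch.linspace(0, n_freq - 1, n_bands + 2)
    f = torch.arange(n_freq, dtype=torch.float32)
    W = torch.zeros(n_bands, n_freq)
    for b in range(n_bands):
        lo, c, hi = centers[b], centers[b + 1], centers[b + 2]
        tri = torch.minimum((f - lo) / (c - lo + 1e-8), (hi - f) / (hi - c + 1e-8))
        W[b] = torch.clamp(tri, min=0.0)
    return W / (W.sum(dim=1, keepdim=True) + 1e-8)


@torch.no_grad()
def enhance(x: torch.Tensor, net: nn.Module, W: torch.Tensor,
            n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """Noisy waveform x of shape (batch, samples) -> denoised waveform."""
    win = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft, hop, window=win, return_complex=True)  # (B, F, T)
    bands = torch.einsum("kf,bft->bkt", W, X.abs())                 # ERB-band features
    mask_bands = net(bands)                                         # (B, n_bands, T) mask
    mask = torch.einsum("kf,bkt->bft", W, mask_bands)               # back to linear bins
    S_hat = mask * X                                                # mask the mixture STFT
    return torch.istft(S_hat, n_fft, hop, window=win, length=x.shape[-1])
```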
3. Training Objectives and Optimization Schedules
GTCRN Standalone (TSE Guidance) (Huang et al., 27 Aug 2025)
- Objective: Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) maximization. For the clean (noise-free) mixture $s$ and enhanced output $\hat{s}$:
  Loss: $\mathcal{L} = -\,\mathrm{SI\text{-}SDR}(s,\hat{s})$, where $\mathrm{SI\text{-}SDR}(s,\hat{s}) = 10\log_{10}\frac{\|\alpha s\|^{2}}{\|\hat{s}-\alpha s\|^{2}}$ with $\alpha = \frac{\langle \hat{s}, s\rangle}{\|s\|^{2}}$ (a minimal implementation sketch follows this list).
- Optimization:
- Adam optimizer; the initial learning rate and decay factor follow the original reference.
- For the first 100 epochs, the learning rate is decayed every 2 epochs.
- For the final 20 epochs, it is decayed every epoch.
- Gradient clipping: max norm is 1.0.
- No regularization (no dropout, no weight decay).
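A minimal implementation sketch of the negative SI-SDR loss defined above (signals are zero-meaned, which is the usual convention):

```python
import torch


def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR, averaged over the batch. est, ref: (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Optimal scaling of the reference onto the estimate
    alpha = (est * ref).sum(dim=-1, keepdim=True) / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    si_sdr = 10 * torch.log10(
        target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -si_sdr.mean()
```

In training, the gradient-clipping step corresponds to calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` after the backward pass.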
Hybrid Dual-Channel System (Wang et al., 26 May 2025)
- Hybrid Loss (Eqs. 9–13): a weighted combination of four terms,
  $\mathcal{L}_{\mathrm{hybrid}} = \lambda_{1}\mathcal{L}_{\mathrm{time}} + \lambda_{2}\mathcal{L}_{\mathrm{mag}} + \lambda_{3}\mathcal{L}_{\mathrm{real}} + \lambda_{4}\mathcal{L}_{\mathrm{imag}}$,
  with weights $\lambda_{i}$ as specified in the original reference.
- $\mathcal{L}_{\mathrm{time}}$ and $\mathcal{L}_{\mathrm{mag}}$ operate on the time-domain signal and the spectral magnitude; $\mathcal{L}_{\mathrm{real}}$ and $\mathcal{L}_{\mathrm{imag}}$ are MSE losses on normalized real and imaginary spectrogram components (a hedged sketch follows this list).
- Adam optimizer with linear warm-up (25k steps) followed by cosine annealing of the learning rate over 200 epochs; the final learning rate follows the original reference. Batch size = 8.
- No reported regularization.
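The sketch below illustrates a four-term hybrid loss of the kind described above; the weights, the normalization, and the exact form of the time-domain and magnitude terms are placeholders rather than the values of Eqs. 9–13. The annealing schedule itself can be realized with `torch.optim.lr_scheduler.CosineAnnealingLR` after a linear warm-up phase.

```python
import torch


def hybrid_loss(est_wav: torch.Tensor, ref_wav: torch.Tensor,
                n_fft: int = 512, hop: int = 256,
                weights=(1.0, 1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of time-domain, magnitude, real-part, and imaginary-part
    MSE terms. Weights and normalization are illustrative placeholders."""
    win = torch.hann_window(n_fft, device=est_wav.device)
    E = torch.stft(est_wav, n_fft, hop, window=win, return_complex=True)
    R = torch.stft(ref_wav, n_fft, hop, window=win, return_complex=True)
    # Simple per-utterance normalization of the spectrograms (stand-in for the
    # normalization used in the reference)
    scale = R.abs().amax(dim=(-2, -1), keepdim=True) + 1e-8
    E, R = E / scale, R / scale
    l_time = torch.mean((est_wav - ref_wav) ** 2)
    l_mag = torch.mean((E.abs() - R.abs()) ** 2)
    l_real = torch.mean((E.real - R.real) ** 2)
    l_imag = torch.mean((E.imag - R.imag) ** 2)
    w1, w2, w3, w4 = weights
    return w1 * l_time + w2 * l_mag + w3 * l_real + w4 * l_imag
```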
4. Integration Within TSE and Hybrid Enhancement Frameworks
LGTSE (Noise-Agnostic Guidance): GTCRN-generated denoised features are used for context-dependent adaptation of the enrollment speech:
- Input to TSE: the original noisy mixture together with the GTCRN-denoised features that guide the enrollment-speech interaction (the exact fusion follows the original reference).
D-LGTSE (Distortion-Aware): The input set is expanded to account for distortion introduced by the enhancement front-end:
- Three modes (concat, on-the-fly, offline) govern how original and GTCRN-denoised mixtures are presented to the TSE backbone.
- Training alternates between original and denoised mixtures to widen the covered noise/distortion range.
IVA-guided Hybrid (Low-SNR): IVA provides auxiliary separated speech/noise features, and GTCRN refines the mixture via complex ratio mask (CRM) estimation:
- The complex ratio mask is applied to the original mixture, not to the IVA output (the "Masking-2" strategy; see the sketch after this list).
- Subband grouping and parallel processing via grouped convolutions and grouped RNNs keep the computational footprint lightweight.
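The following sketch illustrates the "Masking-2" idea: the IVA coarse estimate only conditions the network input, while the predicted complex ratio mask is applied to the original mixture spectrogram. The tensor layouts, feature stacking, and the `net` interface are illustrative assumptions.

```python
import torch


def apply_crm(mix_spec: torch.Tensor, crm: torch.Tensor) -> torch.Tensor:
    """Apply a complex ratio mask of shape (B, F, T, 2) to the original
    complex mixture spectrogram (B, F, T), i.e. the "Masking-2" strategy."""
    mask = torch.complex(crm[..., 0], crm[..., 1])
    return mask * mix_spec


def masking_2(net, mix_spec: torch.Tensor, iva_spec: torch.Tensor) -> torch.Tensor:
    """IVA output only guides the network; the mask is applied to the mixture."""
    # Stack real/imag parts of mixture and IVA coarse estimate as input features
    feats = torch.cat(
        [mix_spec.real, mix_spec.imag, iva_spec.real, iva_spec.imag], dim=1
    )  # (B, 4*F, T), an illustrative feature layout
    crm = net(feats)                 # (B, F, T, 2): real and imaginary mask parts
    return apply_crm(mix_spec, crm)  # enhanced complex spectrogram
```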
5. Quantitative Performance Evaluation
GTCRN integration yields measurable improvements in enhancement and extraction tasks:
- On Libri2Mix 2-speaker + noise:
- Baseline SEF-PNet: SI-SDR=7.43 dB, PESQ=2.14, STOI=80.31%
- +LGTSE (GTCRN): SI-SDR=7.88 (+0.45), PESQ=2.21 (+0.07), STOI=81.27 (+0.96)
- +D-LGTSE (offline): SI-SDR=8.32 (+0.89), PESQ=2.30 (+0.16), STOI=82.28 (+1.97) (Huang et al., 27 Aug 2025)
- CIE-mDPTNet backbone with D-LGTSE: SI-SDR=11.70 (+0.83), PESQ=2.86 (+0.13), STOI=88.83 (+1.57)
- Low-SNR dual-channel hybrid tests: (Wang et al., 26 May 2025)
- PESQ: 1.17–1.71 (no IVA) vs. 1.39–1.71 (hybrid)
- STOI: 61.16%–81.96% (no IVA) vs. 72.38%–81.96% (hybrid)
- DNSMOS-P.808: 2.59–3.39 (no IVA) vs. 3.03–3.39 (hybrid)
Parameter and Complexity:
- Baseline single-channel GTCRN: 23.43k params, 32.07 MMAC/s.
- Dual-channel: 23.91k params (+0.48k), 43.20 MMAC/s.
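Parameter counts of this order are straightforward to verify in PyTorch (small sketch below); MMAC/s figures additionally require a complexity profiler such as `thop` or `ptflops` together with the exact model configuration.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters (reported above in the tens of thousands)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```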
6. Algorithmic Innovations and Implementation Considerations
- Grouped Operations: Grouped convolutions and grouped RNNs strike a balance between parameter reduction and expressive capacity, suitable for low-resource or real-time applications.
- Chunking and Dual-Path Recurrence: Overlapping chunking enables local and global modeling, while grouped channels parallelize computation.
- Contextual Guidance: Denoising prior to enrollment-feature interaction reduces interference; multi-mode training with denoised/distorted mixtures increases robustness.
- Masking strategies: Complex ratio masking applied to the original mixture preserves phase information compared to masking auxiliary separated features.
- Auxiliary IVA Integration: Coarse separation by IVA notably improves enhancement, especially under extreme SNR, at minimal extra computational cost.
A common misconception is that grouped blocks inevitably trade away accuracy; ablation results demonstrate that channel grouping and hybrid input strategies can outperform baseline designs without a substantial accuracy penalty.
7. Limitations and Future Prospects
Parameters such as the exact size and depth of grouped blocks, dilation rates, and RNN hidden dimensions remain subject to further specification from the original references and empirical tuning. While dual-channel and IVA-augmented hybrids offer superior performance under low-SNR conditions, the relative gains diminish at high SNR or in clean conditions. The modularity of GTCRN makes it a suitable foundation for future work in scalable enhancement and joint speech processing tasks; however, optimal integration with newer TSE backbones and alternative front-end feature domains (e.g., beamforming in multi-microphone settings) warrants ongoing investigation. The public availability of reference implementations facilitates reproducibility and extension in academic and applied research.