TF-CorrNet: Dual-Path Multi-Channel Speech Separation
- The paper introduces the explicit use of inter-microphone complex correlation features with PHAT-β weighting, processed via dual-path deep modeling to estimate separation filters.
- A dual-path strategy alternates time and frequency modeling, effectively capturing both the temporal stability of spatial cues and source-dependent spectral patterns.
- TF-CorrNet achieves state-of-the-art performance with lower computational cost and improved metrics over traditional mask-based separation methods.
TF-CorrNet designates a class of neural architectures that leverage time–frequency spatial correlations for multi-channel continuous speech separation. The principal innovation of TF-CorrNet (Shin et al., 20 Sep 2025) is the explicit use of inter-microphone complex correlation features—re-weighted with a generalized phase transform PHAT-β—and their subsequent dual-path deep modeling, to estimate separation filters rather than direct masks. This approach directly embeds the spatial structure of multi-channel audio input while efficiently modeling source-dependent spectral patterns.
1. Spatial Correlation with PHAT-β Weighting
TF-CorrNet inputs are constructed from spatial correlation features for each microphone pair at each time–frequency bin. Let $X_m(t, f)$ denote the short-time Fourier transform (STFT) coefficient at time frame $t$, frequency bin $f$, and microphone $m$. The cross-power spectral density between microphones $m$ and $n$ is:

$$\Phi_{mn}(t, f) = X_m(t, f)\, X_n(t, f)^{*},$$

where $(\cdot)^{*}$ designates complex conjugation.
The real and imaginary components of $\Phi_{mn}(t, f)$ are stacked so that both the per-microphone power ($m = n$) and the inter-microphone spatial relations ($m \neq n$) are represented. However, scale variance in $\Phi_{mn}(t, f)$ impairs stable training. To address this, TF-CorrNet applies PHAT-β weighting:

$$\Phi^{\beta}_{mn}(t, f) = \frac{\Phi_{mn}(t, f)}{\left|\Phi_{mn}(t, f)\right|^{\beta}},$$

with $\beta \in [0, 1]$. At $\beta = 1$, the transform is equivalent to classic PHAT, emphasizing phase and discarding magnitude. At $0 < \beta < 1$, spectral and spatial cues are blended, enabling a tunable trade-off between source spectral information and localization cues.
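The correlation features above can be sketched in a few lines of numpy. This is a minimal illustration of the math, not the paper's implementation; the tensor layout, the pair ordering, and the `eps` stabilizer are assumptions.

```python
import numpy as np

def phat_beta_correlation(X, beta=0.5, eps=1e-8):
    """PHAT-beta weighted spatial correlation features.

    X    : complex STFT tensor of shape (M, T, F) for M microphones.
    beta : weighting exponent in [0, 1]; beta=1 recovers classic PHAT.

    Returns stacked Re/Im parts for all microphone pairs,
    shape (M*M*2, T, F); the stacking order is illustrative.
    """
    M, T, F = X.shape
    # Cross-power spectral density for every pair: Phi[m, n] = X_m * conj(X_n)
    phi = X[:, None] * np.conj(X[None, :])            # (M, M, T, F)
    # PHAT-beta: divide by |Phi|^beta to tame scale variance
    phi = phi / (np.abs(phi) ** beta + eps)
    feats = np.concatenate([phi.real, phi.imag], axis=0)  # stack Re and Im
    return feats.reshape(-1, T, F)

# Example: 4 microphones, 100 frames, 257 frequency bins
X = np.random.randn(4, 100, 257) + 1j * np.random.randn(4, 100, 257)
feats = phat_beta_correlation(X, beta=0.5)
```

With β = 1 every pairwise entry is normalized to (near) unit magnitude, which is exactly the classic PHAT behavior described above.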
2. Dual-Path Time-Frequency Feature Modeling
TF-CorrNet employs a dual-path strategy, processing the spatial correlation tensor alternately along the time and frequency axes:
- Frequency Module: Treats the input as $T$ independent sequences of length $F$. Global–local Transformer blocks model dependencies along the frequency axis, exploiting the time-invariant nature of spatial cues for fixed source locations.
- Temporal Module: Treats the input as $F$ independent sequences of length $T$. This module reinforces the steady spatial structure at each frequency, analogous to the temporal accumulation of classical beamforming.
Alternation between these two modules ensures effective modeling of both spectral interdependencies (important for grouping harmonics) and temporal stability of spatial cues (crucial for speaker localization in mixtures).
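The alternation pattern reduces to an axis-swapping scheme, sketched below with placeholder per-axis layers (a moving average stands in for the global–local Transformer blocks; the residual connections and tensor layout are assumptions):

```python
import numpy as np

def dual_path_block(x, freq_layer, time_layer):
    """One dual-path iteration over a feature tensor x of shape (C, T, F).

    freq_layer / time_layer are placeholder sequence models applied
    along the last axis; in TF-CorrNet these would be global-local
    Transformer blocks. The alternation pattern is the point here.
    """
    # Frequency module: T independent sequences of length F
    x = x + freq_layer(x)                 # model along the frequency axis
    # Temporal module: F independent sequences of length T
    xt = np.swapaxes(x, 1, 2)             # (C, F, T)
    xt = xt + time_layer(xt)              # model along the time axis
    return np.swapaxes(xt, 1, 2)          # back to (C, T, F)

def smooth(x, k=3):
    """Toy stand-in for a Transformer block: moving average on last axis."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), -1, x)

y = dual_path_block(np.random.randn(8, 50, 65), smooth, smooth)
```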
3. Spectral Module for Source Pattern Modeling
Complementing spatial processing, TF-CorrNet incorporates a spectral module that explicitly learns direct time–frequency source patterns, capturing intrinsic speech structures such as harmonics and formants:
- Input features are linearly projected to a lower channel dimension
- Features are further projected along frequency to reduced spectral dimension
- A global–local Transformer extracts dependencies among latent spectral features
- Successive linear projections recover the feature map
This module improves separation by preventing the network from exclusive reliance on spatial cues and fostering robust modeling of source-intrinsic spectral regularities.
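The projection pipeline of the spectral module can be sketched as a channel-and-frequency bottleneck. The latent sizes, weight initialization, and the `tanh` placeholder for the Transformer are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_module(x, c_lat=16, f_lat=32):
    """Bottleneck sketch of the spectral module for x of shape (C, T, F).

    Channels and frequencies are linearly projected down, a sequence
    model (here a placeholder nonlinearity) runs in the latent space,
    and successive projections restore the original feature map.
    """
    C, T, F = x.shape
    Wc_down = rng.standard_normal((c_lat, C)) / np.sqrt(C)
    Wc_up   = rng.standard_normal((C, c_lat)) / np.sqrt(c_lat)
    Wf_down = rng.standard_normal((F, f_lat)) / np.sqrt(F)
    Wf_up   = rng.standard_normal((f_lat, F)) / np.sqrt(f_lat)
    z = np.einsum('lc,ctf->ltf', Wc_down, x)   # project channels C -> c_lat
    z = z @ Wf_down                            # project frequency F -> f_lat
    z = np.tanh(z)                             # stand-in for the Transformer
    z = z @ Wf_up                              # recover frequency dimension
    return np.einsum('cl,ltf->ctf', Wc_up, z)  # recover channel dimension

out = spectral_module(np.random.randn(24, 40, 128))
```

Operating in the reduced latent space is what keeps this module cheap relative to full-resolution time–frequency modeling.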
4. Separation Filter Estimation and Output
The network leverages the processed feature map to estimate linear separation filters for each output stream. Instead of direct mask prediction (as in TF-GridNet), TF-CorrNet’s filter estimation preserves more input structure and allows the model to adaptively attenuate interference and enhance target sources.
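The distinction from masking can be made concrete: a mask scales one reference channel elementwise, whereas a linear separation filter forms a complex-weighted sum over microphones per time–frequency bin, akin to a time-varying beamformer. The filter shape below (one M-tap spatial filter per source and bin) is an assumed parameterization for illustration:

```python
import numpy as np

def apply_separation_filters(X, W):
    """Apply per-bin linear separation filters to a multi-channel STFT.

    X : (M, T, F) complex mixture STFT.
    W : (S, M, T, F) complex filters for S output sources.

    Y[s, t, f] = sum_m conj(W[s, m, t, f]) * X[m, t, f]
    """
    return np.einsum('smtf,mtf->stf', np.conj(W), X)

# Example: 2 sources, 4 microphones, 10 frames, 33 bins
X = np.random.randn(4, 10, 33) + 1j * np.random.randn(4, 10, 33)
W = np.random.randn(2, 4, 10, 33) + 1j * np.random.randn(2, 4, 10, 33)
Y = apply_separation_filters(X, W)
```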
5. Experimental Evaluation and Computational Efficiency
On the LibriCSS dataset, TF-CorrNet yields competitive separation performance with lower computational cost:
- SDRi: Conformer backbone achieves ~11.75 dB, global–local Transformer ~11.38 dB
- Computational cost: 44.5–85.5 Giga MACs/second, 4.2–5.1 million parameters
TF-CorrNet outperformed TF-GridNet in SDR, PESQ, STOI, and word error rate (WER) in both simulated and real continuous speech mixing conditions, including challenging overlap scenarios. Further, efficient filter-based transformation and dual-path module design realize state-of-the-art results in real-time settings.
6. Comparison with Prior Methods
TF-CorrNet differs from traditional approaches (e.g., IPD-magnitude stacking, real–imaginary concatenation):
- Explicit correlation input with PHAT-β weighting captures spatial structure more robustly
- Separation filter estimation preserves input signal fidelity
- Dual-path deep blocks enable more nuanced time–frequency structure exploitation compared to direct mask approaches
- Experimental superiority demonstrated for both overlapping speech and continuous stream merging tasks
7. Significance and Implications
TF-CorrNet demonstrates that spatial correlation features, when processed with dual-path deep networks and complemented by spectral modeling, markedly improve separation performance and computational efficiency for multi-channel continuous speech. The architecture suggests further investigation into filter-based separation and alternate correlation weighting schemes for source separation tasks.