
TF-CorrNet: Dual-Path Multi-Channel Speech Separation

Updated 27 September 2025
  • The paper introduces the explicit use of inter-microphone complex correlation features with PHAT-β weighting, processed via dual-path deep modeling to estimate separation filters.
  • A dual-path strategy alternates time and frequency modeling, effectively capturing both the temporal stability of spatial cues and source-dependent spectral patterns.
  • TF-CorrNet achieves state-of-the-art performance with lower computational cost and improved metrics over traditional mask-based separation methods.

TF-CorrNet refers to a class of neural architectures that use time–frequency spatial correlations for multi-channel continuous speech separation. The principal innovation of TF-CorrNet (Shin et al., 20 Sep 2025) is the explicit use of inter-microphone complex correlation features, re-weighted with a generalized phase transform (PHAT-β), which are then processed by dual-path deep modeling to estimate separation filters rather than direct masks. This approach directly embeds the spatial structure of the multi-channel audio input while efficiently modeling source-dependent spectral patterns.

1. Spatial Correlation with PHAT-β Weighting

TF-CorrNet inputs are constructed from spatial correlation features for each microphone pair at each time–frequency bin. If $X_{tfm}$ is the short-time Fourier transform (STFT) coefficient at time $t$, frequency $f$, and microphone $m$, the cross-power spectral density is:

$$\Phi_{tfmm'} = X_{tfm} \cdot X^*_{tfm'}$$

where $^*$ designates complex conjugation.

The real and imaginary components of $\Phi_{tfmm'}$ are stacked so that both the per-microphone power ($m = m'$) and inter-microphone spatial relation ($m \neq m'$) are represented. However, scale variance in $\Phi_{tfmm'}$ impairs stable training. To address this, TF-CorrNet applies PHAT-β weighting:

$$\Phi_{tfmm'} \leftarrow \frac{\Phi_{tfmm'}}{|\Phi_{tfmm'}|^\beta}$$

with $\beta \in [0,1]$. At $\beta = 1$, the transform is equivalent to classic PHAT, emphasizing phase and discarding magnitude. At $\beta < 1$, spectral and spatial cues are blended, enabling a tunable trade-off between source spectral information and localization cues.
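The feature construction described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `phat_beta_correlation`, the `(T, F, M)` array layout, the small `eps` constant, and the choice to keep all $M \times M$ microphone pairs (rather than, say, only unique pairs) are assumptions made for the example.

```python
import numpy as np

def phat_beta_correlation(X, beta=0.5, eps=1e-8):
    """X: complex STFT of shape (T, F, M). Returns real features (T, F, 2*M*M).

    Hypothetical helper; layout and pair selection are assumptions.
    """
    # Cross-power spectral density for every microphone pair (m, m'):
    # Phi[t, f, m, m'] = X[t, f, m] * conj(X[t, f, m'])
    phi = X[..., :, None] * np.conj(X[..., None, :])
    # PHAT-beta weighting: divide by |Phi|^beta (eps guards against zero bins).
    phi = phi / (np.abs(phi) + eps) ** beta
    T, F, M, _ = phi.shape
    phi = phi.reshape(T, F, M * M)
    # Stack real and imaginary parts along the feature axis.
    return np.concatenate([phi.real, phi.imag], axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 65, 4)) + 1j * rng.standard_normal((10, 65, 4))
feats = phat_beta_correlation(X, beta=1.0)
print(feats.shape)  # (10, 65, 32)
```

With `beta=1.0` every weighted correlation has (approximately) unit magnitude, so only phase survives; with `beta=0.0` the raw cross-power spectral density is returned unchanged.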

2. Dual-Path Time-Frequency Feature Modeling

TF-CorrNet employs a dual-path strategy, processing the spatial correlation tensor alternately along the time and frequency axes:

  • Frequency Module: Treats the input as $T$ independent sequences of length $F$. Global–local Transformer blocks model dependencies along frequency, exploiting the time-invariant nature of spatial cues for fixed source locations.
  • Temporal Module: Treats input as $F$ independent sequences of length $T$. This module reinforces steady spatial structure for each frequency, analogous to classical beamforming temporal accumulation.

Alternation between these two modules ensures effective modeling of both spectral interdependencies (important for grouping harmonics) and temporal stability of spatial cues (crucial for speaker localization in mixtures).
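The alternation between the two paths amounts to viewing the same feature map along different axes. The sketch below shows only that reshaping logic under stated assumptions: the `(C, T, F)` layout, the residual connections, and the callables `freq_fn` and `time_fn` (stand-ins for the sequence models, such as the global–local Transformer blocks) are hypothetical and not taken from the paper.

```python
import numpy as np

def dual_path_block(h, freq_fn, time_fn):
    """h: feature map of shape (C, T, F).

    freq_fn / time_fn are placeholder sequence models that map an array of
    shape (batch, length, C) to an array of the same shape.
    """
    # Frequency path: T independent sequences, each of length F.
    seq_f = h.transpose(1, 2, 0)                 # (T, F, C)
    h = h + freq_fn(seq_f).transpose(2, 0, 1)    # back to (C, T, F), residual assumed
    # Temporal path: F independent sequences, each of length T.
    seq_t = h.transpose(2, 1, 0)                 # (F, T, C)
    h = h + time_fn(seq_t).transpose(2, 1, 0)    # back to (C, T, F), residual assumed
    return h

# Toy stand-in for a sequence model: remove the mean along the sequence axis.
center = lambda s: s - s.mean(axis=1, keepdims=True)

h = np.random.default_rng(0).standard_normal((8, 20, 65))
out = dual_path_block(h, center, center)
print(out.shape)  # (8, 20, 65)
```

Stacking several such blocks lets information propagate across the whole time–frequency plane even though each path only ever sees one axis at a time.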

3. Spectral Module for Source Pattern Modeling

Complementing spatial processing, TF-CorrNet incorporates a spectral module that explicitly learns direct time–frequency source patterns, capturing intrinsic speech structures such as harmonics and formants:

  • Input features are linearly projected to a lower channel dimension $C'$
  • Features are further projected along frequency to reduced spectral dimension $F'$
  • A global–local Transformer extracts dependencies among latent spectral features
  • Successive linear projections recover the $C \times T \times F$ feature map

This module improves separation by preventing the network from exclusive reliance on spatial cues and fostering robust modeling of source-intrinsic spectral regularities.
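The four steps of the spectral module can be traced as a sequence of shape changes. The following is a shape-level sketch only: the weight names (`W_c`, `W_f`, `W_f_back`, `W_c_back`), the order of the two recovery projections, the absence of biases and nonlinearities, and the identity default for `latent_fn` (a placeholder for the global–local Transformer) are all assumptions for illustration.

```python
import numpy as np

def spectral_module(h, W_c, W_f, W_f_back, W_c_back, latent_fn=lambda z: z):
    """h: (C, T, F). W_c: (C', C), W_f: (F', F), W_f_back: (F, F'), W_c_back: (C, C')."""
    z = np.einsum("dc,ctf->dtf", W_c, h)          # channels C -> C'
    z = np.einsum("gf,dtf->dtg", W_f, z)          # frequency F -> F'
    z = latent_fn(z)                              # stand-in for the Transformer on (C', T, F')
    z = np.einsum("fg,dtg->dtf", W_f_back, z)     # frequency F' -> F
    return np.einsum("cd,dtf->ctf", W_c_back, z)  # channels C' -> C

rng = np.random.default_rng(0)
C, T, F, C_low, F_low = 8, 5, 16, 4, 6
h = rng.standard_normal((C, T, F))
out = spectral_module(
    h,
    rng.standard_normal((C_low, C)),
    rng.standard_normal((F_low, F)),
    rng.standard_normal((F, F_low)),
    rng.standard_normal((C, C_low)),
)
print(out.shape)  # (8, 5, 16)
```

Compressing to $C' \times F'$ before the sequence model keeps the cost of the latent Transformer low, which is one plausible reason for the bottleneck design.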

4. Separation Filter Estimation and Output

The network leverages the processed feature map to estimate linear separation filters for each output stream. Instead of direct mask prediction (as in TF-GridNet), TF-CorrNet’s filter estimation preserves more input structure and allows the model to adaptively attenuate interference and enhance target sources.
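To make the contrast with masking concrete, the sketch below applies a filter to a multi-channel mixture. The text above does not specify the filter's exact form, so this example assumes the simplest case: one complex coefficient per output stream, time frame, frequency bin, and microphone, applied as a conjugate-weighted sum over microphones. The names `apply_separation_filters`, `W`, and the array layouts are hypothetical.

```python
import numpy as np

def apply_separation_filters(W, X):
    """W: complex filters (S, T, F, M); X: complex mixture STFT (T, F, M).

    Returns S separated STFTs of shape (S, T, F) via
    Y[s, t, f] = sum_m conj(W[s, t, f, m]) * X[t, f, m]  (assumed form).
    """
    return np.einsum("stfm,tfm->stf", np.conj(W), X)

rng = np.random.default_rng(0)
T, F, M, S = 10, 65, 4, 2
X = rng.standard_normal((T, F, M)) + 1j * rng.standard_normal((T, F, M))

# A filter that simply selects microphone 0 reproduces that channel.
W = np.zeros((S, T, F, M), dtype=complex)
W[..., 0] = 1.0
Y = apply_separation_filters(W, X)
print(Y.shape)  # (2, 10, 65)
```

A mask scales a single reference channel bin by bin, whereas a filter of this kind combines all microphones, so it can exploit spatial differences between sources when forming each output.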

5. Experimental Evaluation and Computational Efficiency

On the LibriCSS dataset, TF-CorrNet yields competitive separation performance with lower computational cost:

  • SDRi: approximately 11.75 dB with a Conformer backbone and approximately 11.38 dB with the global–local Transformer
  • Computational cost: 44.5–85.5 GMAC/s with 4.2–5.1 million parameters

TF-CorrNet outperformed TF-GridNet in SDR, PESQ, STOI, and word error rate (WER) in both simulated and real continuous speech mixing conditions, including challenging overlap scenarios. Further, efficient filter-based transformation and dual-path module design realize state-of-the-art results in real-time settings.
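For readers unfamiliar with the SDRi figures quoted above, the sketch below computes a basic signal-to-distortion ratio improvement. It assumes the plain energy-ratio definition of SDR on time-domain signals; the evaluation in the paper may use a different variant (for example, one that allows a distortion filter), and the function names are hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: reference energy over error energy (assumed definition)."""
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(error**2) + eps))

def sdr_improvement_db(reference, estimate, mixture):
    """SDRi: SDR of the separated output minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)   # toy target signal
n = rng.standard_normal(16000)   # toy interference
# Reducing the interference amplitude tenfold corresponds to a 20 dB gain.
sdri = sdr_improvement_db(s, s + 0.1 * n, s + n)
print(round(sdri, 2))  # 20.0
```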

6. Comparison with Prior Methods

TF-CorrNet differs from traditional approaches (e.g., IPD-magnitude stacking, real–imaginary concatenation) in the following respects:

  • Explicit correlation input with PHAT-β weighting captures spatial structure more robustly
  • Separation filter estimation preserves input signal fidelity
  • Dual-path deep blocks enable more nuanced time–frequency structure exploitation compared to direct mask approaches
  • Experimental superiority demonstrated for both overlapping speech and continuous stream merging tasks

7. Significance and Implications

TF-CorrNet demonstrates that spatial correlation features, when processed with dual-path deep networks and complemented by spectral modeling, markedly improve separation performance and computational efficiency for multi-channel continuous speech. The architecture suggests further investigation into filter-based separation and alternate correlation weighting schemes for source separation tasks.
