
X-SepFormer: Target Speaker Extraction

Updated 8 March 2026
  • X-SepFormer is a novel neural architecture that extends the dual-path SepFormer with repeated speaker-conditioned cross-attention to directly combat speaker confusion.
  • It introduces innovative chunk-level SI-SDR improvement and distribution-based weighted loss schemes to focus learning on confusing segments.
  • Empirical results on WSJ0-2mix show that X-SepFormer achieves state-of-the-art SI-SDRi and PESQ scores, with a significant reduction in speaker confusion rates.

X-SepFormer denotes a family of end-to-end neural architectures for target speaker extraction (TSE) that extend the SepFormer dual-path Transformer backbone with chunk-level error-aware loss objectives and speaker conditioning, aiming to explicitly address speaker confusion (SC) in multi-talker separation. X-SepFormer incorporates both architectural and optimization-level innovations to surpass traditional time-domain mask-based speech separation methods in separation accuracy and perceptual quality, particularly on benchmarks such as WSJ0-2mix (Liu et al., 2023).

1. Architectural Foundation and Model Structure

X-SepFormer is built upon the SepFormer dual-path Transformer, an encoder–masking–decoder pipeline optimized for speech separation. The system takes as input the mixture waveform y ∈ ℝ^T and a reference utterance r_s ∈ ℝ^{T_r} from the target speaker. The key architectural components are as follows:

  • Speaker Embedding: The reference utterance is processed with an ECAPA-TDNN network to obtain a fixed-dimensional speaker embedding e representing the target identity.
  • Encoder: A 1D convolution followed by ReLU encodes the input mixture waveform into frame-level features.
  • Stacked Transformer Blocks: The masking network contains four Intra-Transformer blocks (self-attending within chunks) and four Inter-Transformer blocks (self-attending across chunks at each intra-chunk time index). This is a departure from the original SepFormer, which uses two of each; increasing to four improves representational capacity.
  • Speaker-Conditioned Cross-Attention: At the input of every Intra- and Inter-Transformer block, the target speaker embedding e is incorporated via cross-attention: the block's features act as queries while e is projected to keys and values, enforcing target focus at each depth.
  • Mask Estimation and Decoding: The final masking layer estimates a single-channel mask, which is element-wise multiplied with the encoded mixture, and a transposed 1D convolution reconstructs the extracted signal ŷ.

This repeated, depth-wise cross-attention fusion distinguishes X-SepFormer from previous TSE models employing only single-point fusion or simple concatenation of embeddings (Liu et al., 2023).
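This per-block fusion can be sketched as a single-head cross-attention step. The NumPy code below is illustrative only: the projection matrices Wq, Wk, Wv are placeholder parameters, and the speaker embedding supplies a single key/value token; the authors' actual implementation (multi-head, learned projections) may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def speaker_cross_attention(features, spk_emb, Wq, Wk, Wv):
    """Speaker-conditioned cross-attention fusion (illustrative sketch).

    features: (T, d) block features, used as queries.
    spk_emb:  (d_e,) target speaker embedding e, projected to keys/values.
    """
    Q = features @ Wq            # (T, d) queries from block features
    K = (spk_emb @ Wk)[None, :]  # (1, d) single key from e
    V = (spk_emb @ Wv)[None, :]  # (1, d) single value from e
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (T, 1) attention logits
    attn = softmax(scores, axis=-1)          # with one key, weights are all 1
    return features + attn @ V               # residual injection of e
```

With a single key the softmax is trivially uniform, so this sketch reduces to a residual injection of the projected embedding; its point is the query/key/value roles, matching the description above.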

2. Loss Functions Targeting Speaker Confusion

X-SepFormer introduces two novel, chunk-wise error-aware loss schemes designed to reduce speaker confusion:

  • Chunk-level SI-SDR Improvement (SI-SDR_i^k): The mixture and estimated signals are segmented into overlapping chunks of length L and hop O. For each chunk k, the improvement over the mixture is

SI-SDR_i^k = SI-SDR(ŷ^k, x_s^k) − SI-SDR(y^k, x_s^k),

where x_s is the clean target and y is the mixture. Chunks with SI-SDR_i^k < 0 are labeled as "speaker confusion" chunks, and the overall chunkwise SC-ratio r_scr quantifies the frequency of such confusions.
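The chunkwise metric can be computed directly from waveforms. A minimal NumPy sketch, using the conventional SI-SDR improvement convention (estimate and raw mixture both scored against the clean target); the function names are illustrative, not from the paper:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between an estimate and a reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # optimal scaling
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10((target @ target + eps) / (noise @ noise + eps))

def chunk_si_sdr_improvement(est, target, mixture, chunk, hop):
    """Per-chunk SI-SDR improvement of the estimate over the raw mixture."""
    imps = []
    for start in range(0, len(est) - chunk + 1, hop):
        sl = slice(start, start + chunk)
        imps.append(si_sdr(est[sl], target[sl]) - si_sdr(mixture[sl], target[sl]))
    return np.array(imps)

def sc_ratio(improvements):
    """Fraction of chunks flagged as speaker confusion (negative improvement)."""
    return float((improvements < 0).mean())
```

A perfect estimate yields positive improvement in every chunk (SC-ratio 0), while returning the mixture unchanged yields zero improvement everywhere.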

  • Distribution-based Weighted Loss (L_weight-SI-SDR): The set {SI-SDR_i^k}, k = 1, …, M, is binned into four classes: (−∞, −5], (−5, 0], (0, 5], and (5, ∞). Each class j receives a weight ω_j, with the largest weights assigned to the confusion-prone bins, so the chunkwise loss is dominated by segments likely to exhibit speaker confusion. The weighted loss is

L_weight-SI-SDR = − SI-SDR(ŷ, x_s) / (N_valid · Σ_{j=0}^{3} ω_j s_j),

where s_j is the count of chunks in class j and N_valid is the number of valid (sufficiently energetic) chunks.
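The binning and weighting can be sketched as follows. The bin edges and weights match the values quoted in Section 3, and the final expression mirrors the formula above as written in the text; the helper name is illustrative:

```python
import numpy as np

BIN_EDGES = np.array([-5.0, 0.0, 5.0])    # class boundaries in dB
WEIGHTS = np.array([5.0, 5.0, 1.0, 1.0])  # omega_0..omega_3, heaviest on confusion bins

def weighted_si_sdr_loss(utt_si_sdr, chunk_improvements):
    """Distribution-based weighted loss, following the formula above.

    utt_si_sdr:         utterance-level SI-SDR of the estimate vs. target, in dB.
    chunk_improvements: array of chunk-level SI-SDR improvements, in dB.
    """
    # Assign each chunk to a class j in {0,1,2,3}; right=True gives the
    # half-open bins (-inf,-5], (-5,0], (0,5], (5,inf).
    classes = np.digitize(chunk_improvements, BIN_EDGES, right=True)
    s = np.bincount(classes, minlength=4)  # s_j: chunk counts per class
    n_valid = len(chunk_improvements)
    return -utt_si_sdr / (n_valid * np.dot(WEIGHTS, s))
```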

These losses augment or replace the standard utterance-level SI-SDR loss in a two-stage training scheme: after pretraining with global SI-SDR, the model is finetuned using error-aware objectives with appropriate weighting.

3. Training Procedures and Hyperparameter Choices

Training proceeds in two phases:

  1. Pretraining: 18 epochs using the standard utterance-level SI-SDR loss, with the Adam optimizer at base learning rate 1.5 × 10⁻⁴.
  2. Finetuning: 10–15 further epochs with either the scale- or weight-based speaker confusion loss objectives.

Relevant hyperparameters include chunk length L = 250 ms, hop O = 125 ms, energy threshold η = 15 for chunk validity, and weights (ω_0, ω_1, ω_2, ω_3) = (5, 5, 1, 1). Data augmentation strategies, such as dynamic mixing and speed perturbation, further boost performance (Liu et al., 2023).
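The recipe above can be collected into a small configuration sketch. The dictionary layout and helper function are illustrative, not from the paper; an 8 kHz sampling rate is assumed, as is common for WSJ0-2mix:

```python
# Two-stage training recipe and chunking hyperparameters from the text.
CONFIG = {
    "pretrain": {"epochs": 18, "loss": "utterance_si_sdr",
                 "optimizer": "adam", "lr": 1.5e-4},
    "finetune": {"epochs": (10, 15), "loss": "scale_or_weighted_si_sdr"},
    "chunk_ms": 250,                # chunk length L
    "hop_ms": 125,                  # hop O
    "energy_threshold": 15,         # eta, for chunk validity
    "class_weights": (5, 5, 1, 1),  # omega_0..omega_3
}

def chunk_in_samples(sample_rate=8000):
    """Convert chunk/hop lengths from milliseconds to samples."""
    return (CONFIG["chunk_ms"] * sample_rate // 1000,
            CONFIG["hop_ms"] * sample_rate // 1000)
```

At 8 kHz this gives 2000-sample chunks with a 1000-sample hop.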

4. Empirical Evaluation and Performance on WSJ0-2mix

Experimental results were reported on the standard WSJ0-2mix dataset:

  • Prior benchmark: SepFormer, trained for generic speech separation, yields SI-SDRi = 22.3 dB (upper bound for non-targeted separation).
  • Best prior TSE method: SpEx + PC achieves SI-SDRi ≈ 18.8 dB.
  • X-SepFormer Baseline (no data augmentation): SI-SDRi = 18.9 dB, PESQ = 3.74, r_scr = 9.17%.
  • X-SepFormer + Scale Loss: SI-SDRi = 19.1 dB, r_scr = 8.56%.
  • X-SepFormer + Weight Loss: SI-SDRi = 18.8 dB, r_scr = 8.03%.
  • X-SepFormer + Both Losses + DA (best): SI-SDRi = 19.4 dB, PESQ = 3.81, r_scr = 7.14%.

Compared to the strongest prior TSE system, X-SepFormer achieves a 14.8% relative reduction in the chunk-level speaker confusion rate and establishes new state-of-the-art SI-SDRi and PESQ for TSE (Liu et al., 2023).

5. Mechanisms for Reducing Speaker Confusion

A central aim in X-SepFormer is to minimize speaker confusion, a situation where the output waveform predominantly contains the interfering speaker. By tracking SI-SDR improvement per chunk and assigning greater loss gradients to low- or negative-improvement regions, X-SepFormer enforces localized correction where confusion is likely. This mechanism shifts the empirical distribution of chunkwise SI-SDR_i^k for test utterances, with a pronounced decrease in negative or low-improvement chunks, as evidenced in frequency plots in the source study. Fusion of speaker identity via cross-attention at every block, as opposed to early or late fusion alone, further reduces confusions compared to standard SepFormer or less frequent speaker-injection schemes.

6. Position within the SepFormer Ecosystem and Relationship to Efficient Variants

X-SepFormer operates atop the dual-path SepFormer structure previously shown to outperform Conv-TasNet, DualPathRNN, and related architectures in time-domain separation (Subakan et al., 2022). While resource-efficient variants such as RE-SepFormer (Libera et al., 2022) and AV-SepFormer (Lin et al., 2023) focus on computation/memory or multi-modal fusion respectively, X-SepFormer is characterized by its optimization-centric innovation: directly targeting error modes at chunk scale.

A plausible implication is that architectural elements of X-SepFormer, such as repeated cross-attention speaker fusion and error-weighted training, could be synergistically combined with memory and inference-efficient strategies (e.g., non-overlapping chunking, hierarchical pooling) to further balance separation quality with deployment constraints.

7. Ablations and Design Choices

Empirical ablations confirm several architectural and training design decisions:

  • Stacked Intra/Inter Blocks: Increasing from two to four blocks accelerates convergence and enhances separation accuracy.
  • Cross-Attention Fusion: Injecting the speaker embedding via cross-attention at each block outperforms alternatives such as early concatenation.
  • Loss Function Selection: The scale-based loss reduces the SC-ratio by approximately 5%, whereas the distribution-weighted loss achieves around 15% relative reduction.

These findings underscore the necessity of both deep conditional modeling and explicit optimization against confusion errors for robust TSE.

