X-SepFormer: Target Speaker Extraction
- X-SepFormer is a neural architecture that extends the dual-path SepFormer with repeated speaker-conditioned cross-attention to directly combat speaker confusion.
- It introduces a chunk-level SI-SDR-improvement measure and a distribution-based weighted loss that focus learning on confusion-prone segments.
- Empirical results on WSJ0-2mix show that X-SepFormer achieves state-of-the-art SI-SDRi and PESQ scores, with a significant reduction in speaker confusion rates.
X-SepFormer denotes a family of end-to-end neural architectures for target speaker extraction (TSE) that extend the SepFormer dual-path Transformer backbone with chunk-level error-aware loss objectives and speaker conditioning, explicitly addressing speaker confusion (SC) in multi-talker separation. X-SepFormer combines architectural and optimization-level innovations to surpass prior time-domain mask-based speech separation methods in accuracy and perceptual quality, particularly on benchmarks such as WSJ0-2mix (Liu et al., 2023).
1. Architectural Foundation and Model Structure
X-SepFormer is built upon the SepFormer dual-path Transformer, an encoder–masking–decoder pipeline optimized for speech separation. The system takes as input the mixture waveform and a reference utterance from the target speaker. The key architectural components are as follows:
- Speaker Embedding: The reference utterance is processed with an ECAPA-TDNN network to obtain a fixed-dimensional speaker embedding representing the target identity.
- Encoder: A 1D convolution followed by ReLU encodes the input mixture waveform into frame-level features.
- Stacked Transformer Blocks: The masking network contains four Intra-Transformer blocks (self-attending within chunks) and four Inter-Transformer blocks (self-attending across chunks at each intra-chunk time index). This is a departure from the original SepFormer, which uses two of each; increasing to four improves representational capacity.
- Speaker-Conditioned Cross-Attention: At the input of every Intra- and Inter-Transformer block, the target speaker embedding is incorporated via cross-attention: the block's features act as queries, while the speaker embedding is projected to keys and values, enforcing target focus at each depth.
- Mask Estimation and Decoding: The final masking layer estimates a single-channel mask, which is multiplied element-wise with the encoded mixture; a transposed 1D convolution then reconstructs the extracted target signal.
This repeated, depth-wise cross-attention fusion distinguishes X-SepFormer from previous TSE models employing only single-point fusion or simple concatenation of embeddings (Liu et al., 2023).
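The following is a minimal PyTorch sketch of this per-block speaker-conditioned cross-attention. The module name SpeakerCrossAttention, the dimensions, and the residual placement are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SpeakerCrossAttention(nn.Module):
    """Illustrative fusion layer: block features (queries) attend to the
    projected speaker embedding (keys/values), as applied at the input of
    every Intra- and Inter-Transformer block."""

    def __init__(self, d_model: int, d_spk: int, n_heads: int = 8):
        super().__init__()
        self.spk_proj = nn.Linear(d_spk, d_model)  # embedding -> key/value space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model); spk_emb: (batch, d_spk)
        kv = self.spk_proj(spk_emb).unsqueeze(1)   # (batch, 1, d_model)
        out, _ = self.attn(query=feats, key=kv, value=kv)
        return self.norm(feats + out)              # residual fusion before the block
```

Because the fusion is repeated at every block, the target identity re-anchors the representation at each depth rather than fading after a single injection point.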
2. Loss Functions Targeting Speaker Confusion
X-SepFormer introduces two novel, chunk-wise error-aware loss schemes designed to reduce speaker confusion:
- Chunk-level SI-SDR Improvement ($\Delta_k$): The mixture and estimated signals are segmented into overlapping chunks of length $L$ and hop $H$. For each chunk $k$, the improvement over the mixture is
  $$\Delta_k = \operatorname{SI\text{-}SDR}(\hat{s}_k, s_k) - \operatorname{SI\text{-}SDR}(x_k, s_k),$$
  where $s$ is the clean target, $\hat{s}$ is the extracted estimate, and $x$ is the mixture. Chunks with $\Delta_k < 0$ are labeled as "speaker confusion" chunks, and the overall chunkwise SC-ratio quantifies the frequency of such confusions.
- Distribution-based Weighted Loss ($\mathcal{L}_{\mathrm{wgt}}$): The set $\{\Delta_k\}$ is binned into four classes by the value of $\Delta_k$. Each class $c$ receives a weight $w_c$, with the largest weights assigned to confusion-prone bins, so the chunkwise loss is dominated by segments likely to exhibit speaker confusion. The weighted loss takes the form
  $$\mathcal{L}_{\mathrm{wgt}} = -\sum_{c=1}^{4} \frac{w_c}{N_c} \sum_{k \in c} \Delta_k,$$
  where $N_c$ is the count of chunks in class $c$.
These losses augment or replace the standard utterance-level SI-SDR loss in a two-stage training scheme: after pretraining with global SI-SDR, the model is finetuned using error-aware objectives with appropriate weighting.
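As a concrete illustration of the two objectives, the sketch below computes the chunkwise improvement $\Delta_k$ and a distribution-weighted loss. The bin edges and class weights are placeholder values (the source elides the exact ones), and the helper names are assumptions.

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB over the last axis; est, ref: (..., samples)."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference to obtain the scaled target
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def chunkwise_sisdri(est, mix, ref, chunk: int, hop: int) -> torch.Tensor:
    """Delta_k = SI-SDR(est_k, ref_k) - SI-SDR(mix_k, ref_k) over overlapping chunks."""
    est_c, mix_c, ref_c = (x.unfold(-1, chunk, hop) for x in (est, mix, ref))
    return si_sdr(est_c, ref_c) - si_sdr(mix_c, ref_c)  # shape: (batch, n_chunks)

def weighted_chunk_loss(delta: torch.Tensor,
                        edges=(-float("inf"), 0.0, 5.0, 10.0, float("inf")),
                        weights=(8.0, 4.0, 2.0, 1.0)) -> torch.Tensor:
    """Bin Delta_k into four classes; confusion-prone bins get the largest weights.
    Bin edges and weights here are illustrative, not the paper's values."""
    loss = delta.new_zeros(())
    for lo, hi, w in zip(edges[:-1], edges[1:], weights):
        mask = ((delta >= lo) & (delta < hi)).float()
        n_c = mask.sum().clamp(min=1.0)                # N_c: chunks in class c
        loss = loss - w * (delta * mask).sum() / n_c   # maximize per-class improvement
    return loss
```

The key design point is that the gradient contribution of a chunk depends on which bin its $\Delta_k$ falls into, concentrating learning on segments where extraction is failing.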
3. Training Procedures and Hyperparameter Choices
Training proceeds in two phases:
- Pretraining: 18 epochs using the standard utterance-level SI-SDR loss with the Adam optimizer.
- Finetuning: 10–15 further epochs with either the scale-based or the distribution-weighted speaker-confusion objective.
Relevant hyperparameters include the chunk length $L$ and hop $H$ (both in ms), an energy threshold for chunk validity, and the class weights $w_c$. Data augmentation strategies, such as dynamic mixing and speed perturbation, further boost performance (Liu et al., 2023).
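A minimal sketch of the two-stage schedule, reusing the loss helpers from the sketch in Section 2 and assuming a model with a model(mixture, reference) interface; the learning rate, epoch split, and chunk sizes below are placeholders, since the source's exact values are not reproduced here.

```python
import torch

def train_two_stage(model, loader, pre_epochs=18, ft_epochs=12, lr=1e-4,
                    chunk=4000, hop=2000):
    """Stage 1: utterance-level SI-SDR pretraining.
    Stage 2: error-aware finetuning with the chunkwise weighted loss.
    lr, chunk, and hop are placeholder values."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(pre_epochs + ft_epochs):
        finetune = epoch >= pre_epochs
        for mix, ref_utt, target in loader:
            est = model(mix, ref_utt)               # extracted target estimate
            if not finetune:
                loss = -si_sdr(est, target).mean()  # global SI-SDR objective
            else:
                delta = chunkwise_sisdri(est, mix, target, chunk, hop)
                loss = weighted_chunk_loss(delta)   # chunk-level, error-aware
            opt.zero_grad()
            loss.backward()
            opt.step()
```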
4. Empirical Evaluation and Performance on WSJ0-2mix
Experimental results were reported on the standard WSJ0-2mix dataset:
- Prior benchmark: SepFormer, trained for generic speech separation, yields SI-SDRi = 22.3 dB (upper bound for non-targeted separation).
- Best prior TSE method: SpEx + PC achieves SI-SDRi ≈ 18.8 dB.
- X-SepFormer Baseline (no data augmentation): SI-SDRi = 18.9 dB, PESQ = 3.74.
- X-SepFormer + Scale Loss: SI-SDRi = 19.1 dB.
- X-SepFormer + Weight Loss: SI-SDRi = 18.8 dB.
- X-SepFormer + Both Losses + Data Augmentation (best): SI-SDRi = 19.4 dB, PESQ = 3.81.
Compared to the strongest prior TSE system, X-SepFormer achieves a 14.8% relative reduction in the chunk-level speaker confusion rate and establishes new state-of-the-art SI-SDRi and PESQ for TSE (Liu et al., 2023).
5. Mechanisms for Reducing Speaker Confusion
A central aim of X-SepFormer is to minimize speaker confusion, the failure mode in which the output waveform predominantly contains the interfering speaker. By tracking SI-SDR improvement per chunk and assigning larger loss gradients to low- or negative-improvement regions, X-SepFormer enforces localized correction where confusion is likely. This shifts the empirical distribution of chunkwise $\Delta_k$ on test utterances, with a pronounced decrease in negative or low-SI-SDRi chunks, as shown in the frequency plots of the source study. Fusing speaker identity via cross-attention at every block, rather than through early or late fusion alone, further reduces confusions compared to standard SepFormer or sparser speaker-injection schemes.
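To make the chunkwise confusion metric concrete, here is a minimal sketch of the SC-ratio described above, reusing the chunkwise_sisdri helper from Section 2; the 0 dB threshold follows the labeling rule given there, and the function name sc_ratio is illustrative.

```python
import torch

def sc_ratio(delta: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Chunkwise speaker-confusion rate: the fraction of chunks whose
    SI-SDR improvement Delta_k falls below the threshold (0 dB here,
    matching the labeling rule for "speaker confusion" chunks)."""
    return (delta < threshold).float().mean()
```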
6. Position within the SepFormer Ecosystem and Relationship to Efficient Variants
X-SepFormer operates atop the dual-path SepFormer structure previously shown to outperform Conv-TasNet, DualPathRNN, and related architectures in time-domain separation (Subakan et al., 2022). While resource-efficient variants such as RE-SepFormer (Libera et al., 2022) and AV-SepFormer (Lin et al., 2023) focus on computation/memory or multi-modal fusion respectively, X-SepFormer is characterized by its optimization-centric innovation: directly targeting error modes at chunk scale.
A plausible implication is that architectural elements of X-SepFormer, such as repeated cross-attention speaker fusion and error-weighted training, could be synergistically combined with memory and inference-efficient strategies (e.g., non-overlapping chunking, hierarchical pooling) to further balance separation quality with deployment constraints.
7. Ablations and Design Choices
Empirical ablations confirm several architectural and training design decisions:
- Stacked Intra/Inter Blocks: Increasing from two to four blocks accelerates convergence and enhances separation accuracy.
- Cross-Attention Fusion: Injecting the speaker embedding via cross-attention at each block outperforms alternatives such as early concatenation.
- Loss Function Selection: The scale-based loss reduces the SC-ratio by approximately 5%, whereas the distribution-weighted loss achieves around 15% relative reduction.
These findings underscore the necessity of both deep conditional modeling and explicit optimization against confusion errors for robust TSE.
References
- X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion (Liu et al., 2023)
- Exploring Self-Attention Mechanisms for Speech Separation (Subakan et al., 2022)
- Resource-Efficient Separation Transformer (Libera et al., 2022)
- AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction (Lin et al., 2023)