X-SepFormer: Target Speaker Extraction
- X-SepFormer is a neural architecture that extends the dual-path SepFormer with repeated speaker-conditioned cross-attention to directly combat speaker confusion.
- It introduces a chunk-level SI-SDR-improvement measure and a distribution-based weighted loss that focus learning on confusion-prone segments.
- Empirical results on WSJ0-2mix show that X-SepFormer achieves state-of-the-art SI-SDRi and PESQ scores, with a significant reduction in speaker confusion rates.
X-SepFormer denotes a family of end-to-end neural architectures for target speaker extraction (TSE) that extend the SepFormer dual-path Transformer backbone with chunk-level error-aware loss objectives and speaker conditioning, explicitly addressing speaker confusion (SC) in multi-talker separation. X-SepFormer combines architectural and optimization-level innovations to surpass prior time-domain mask-based speech separation methods in accuracy and perceptual quality, particularly on benchmarks such as WSJ0-2mix (Liu et al., 2023).
1. Architectural Foundation and Model Structure
X-SepFormer is built upon the SepFormer dual-path Transformer, an encoder–masking–decoder pipeline optimized for speech separation. The system takes as input the mixture waveform and a reference utterance from the target speaker. The key architectural components are as follows:
- Speaker Embedding: The reference utterance is processed with an ECAPA-TDNN network to obtain a fixed-dimensional speaker embedding representing the target identity.
- Encoder: A 1D convolution followed by ReLU encodes the input mixture waveform into frame-level features.
- Stacked Transformer Blocks: The masking network contains four Intra-Transformer blocks (self-attending within chunks) and four Inter-Transformer blocks (self-attending across chunks at each intra-chunk time index). This is a departure from the original SepFormer, which uses two of each; increasing to four improves representational capacity.
- Speaker-Conditioned Cross-Attention: At the input of every Intra- and Inter-Transformer block, the target speaker embedding is incorporated via cross-attention: the block's features act as queries, while the speaker embedding is projected to keys and values, enforcing target focus at each depth.
- Mask Estimation and Decoding: The final masking layer estimates a single-channel mask, which is multiplied element-wise with the encoded mixture; a transposed 1D convolution then reconstructs the extracted target signal.
This repeated, depth-wise cross-attention fusion distinguishes X-SepFormer from previous TSE models employing only single-point fusion or simple concatenation of embeddings (Liu et al., 2023).
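The following is a minimal PyTorch sketch of this per-block speaker-conditioned cross-attention. The module name SpeakerCrossAttention, the dimensions, and the residual placement are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SpeakerCrossAttention(nn.Module):
    """Illustrative fusion layer: block features (queries) attend to the
    projected speaker embedding (keys/values), as applied at the input of
    every Intra- and Inter-Transformer block."""

    def __init__(self, d_model: int, d_spk: int, n_heads: int = 8):
        super().__init__()
        self.spk_proj = nn.Linear(d_spk, d_model)  # embedding -> key/value space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model); spk_emb: (batch, d_spk)
        kv = self.spk_proj(spk_emb).unsqueeze(1)   # (batch, 1, d_model)
        out, _ = self.attn(query=feats, key=kv, value=kv)
        return self.norm(feats + out)              # residual fusion before the block
```

Because the fusion is repeated at every block, the target identity re-anchors the representation at each depth rather than fading after a single injection point.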
2. Loss Functions Targeting Speaker Confusion
X-SepFormer introduces two novel, chunk-wise error-aware loss schemes designed to reduce speaker confusion:
- Chunk-level SI-SDR Improvement ($\Delta_k$): The mixture and estimated signals are segmented into overlapping chunks of length $L$ and hop $H$. For each chunk $k$, the improvement over the mixture is
  $$\Delta_k = \operatorname{SI\text{-}SDR}(\hat{s}_k, s_k) - \operatorname{SI\text{-}SDR}(x_k, s_k),$$
  where $s$ is the clean target, $\hat{s}$ is the extracted estimate, and $x$ is the mixture. Chunks with $\Delta_k < 0$ are labeled as "speaker confusion" chunks, and the overall chunkwise SC-ratio quantifies the frequency of such confusions.
- Distribution-based Weighted Loss ($\mathcal{L}_{\mathrm{wgt}}$): The set $\{\Delta_k\}$ is binned into four classes by the value of $\Delta_k$. Each class $c$ receives a weight $w_c$, with the largest weights assigned to confusion-prone bins, so the chunkwise loss is dominated by segments likely to exhibit speaker confusion. The weighted loss takes the form
  $$\mathcal{L}_{\mathrm{wgt}} = -\sum_{c=1}^{4} \frac{w_c}{N_c} \sum_{k \in c} \Delta_k,$$
  where $N_c$ is the count of chunks in class $c$.
These losses augment or replace the standard utterance-level SI-SDR loss in a two-stage training scheme: after pretraining with global SI-SDR, the model is finetuned using error-aware objectives with appropriate weighting.
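As a concrete illustration of the two objectives, the sketch below computes the chunkwise improvement $\Delta_k$ and a distribution-weighted loss. The bin edges and class weights are placeholder values (the source elides the exact ones), and the helper names are assumptions.

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB over the last axis; est, ref: (..., samples)."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference to obtain the scaled target
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def chunkwise_sisdri(est, mix, ref, chunk: int, hop: int) -> torch.Tensor:
    """Delta_k = SI-SDR(est_k, ref_k) - SI-SDR(mix_k, ref_k) over overlapping chunks."""
    est_c, mix_c, ref_c = (x.unfold(-1, chunk, hop) for x in (est, mix, ref))
    return si_sdr(est_c, ref_c) - si_sdr(mix_c, ref_c)  # shape: (batch, n_chunks)

def weighted_chunk_loss(delta: torch.Tensor,
                        edges=(-float("inf"), 0.0, 5.0, 10.0, float("inf")),
                        weights=(8.0, 4.0, 2.0, 1.0)) -> torch.Tensor:
    """Bin Delta_k into four classes; confusion-prone bins get the largest weights.
    Bin edges and weights here are illustrative, not the paper's values."""
    loss = delta.new_zeros(())
    for lo, hi, w in zip(edges[:-1], edges[1:], weights):
        mask = ((delta >= lo) & (delta < hi)).float()
        n_c = mask.sum().clamp(min=1.0)                # N_c: chunks in class c
        loss = loss - w * (delta * mask).sum() / n_c   # maximize per-class improvement
    return loss
```

The key design point is that the gradient contribution of a chunk depends on which bin its $\Delta_k$ falls into, concentrating learning on segments where extraction is failing.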
3. Training Procedures and Hyperparameter Choices
Training proceeds in two phases:
- Pretraining: 18 epochs using the standard utterance-level SI-SDR loss with the Adam optimizer.
- Finetuning: 10–15 further epochs with either the scale-based or the distribution-weighted speaker-confusion objective.
Relevant hyperparameters include the chunk length $L$ and hop $H$ (both in ms), an energy threshold for chunk validity, and the class weights $w_c$. Data augmentation strategies, such as dynamic mixing and speed perturbation, further boost performance (Liu et al., 2023).
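A minimal sketch of the two-stage schedule, reusing the loss helpers from the sketch in Section 2 and assuming a model with a model(mixture, reference) interface; the learning rate, epoch split, and chunk sizes below are placeholders, since the source's exact values are not reproduced here.

```python
import torch

def train_two_stage(model, loader, pre_epochs=18, ft_epochs=12, lr=1e-4,
                    chunk=4000, hop=2000):
    """Stage 1: utterance-level SI-SDR pretraining.
    Stage 2: error-aware finetuning with the chunkwise weighted loss.
    lr, chunk, and hop are placeholder values."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(pre_epochs + ft_epochs):
        finetune = epoch >= pre_epochs
        for mix, ref_utt, target in loader:
            est = model(mix, ref_utt)               # extracted target estimate
            if not finetune:
                loss = -si_sdr(est, target).mean()  # global SI-SDR objective
            else:
                delta = chunkwise_sisdri(est, mix, target, chunk, hop)
                loss = weighted_chunk_loss(delta)   # chunk-level, error-aware
            opt.zero_grad()
            loss.backward()
            opt.step()
```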
4. Empirical Evaluation and Performance on WSJ0-2mix
Experimental results were reported on the standard WSJ0-2mix dataset:
- Prior benchmark: SepFormer, trained for generic speech separation, yields SI-SDRi = 22.3 dB (upper bound for non-targeted separation).
- Best prior TSE method: SpEx + PC achieves SI-SDRi ≈ 18.8 dB.
- X-SepFormer Baseline (no data augmentation): SI-SDRi = 18.9 dB, PESQ = 3.74.
- X-SepFormer + Scale Loss: SI-SDRi = 19.1 dB.
- X-SepFormer + Weight Loss: SI-SDRi = 18.8 dB.
- X-SepFormer + Both Losses + Data Augmentation (best): SI-SDRi = 19.4 dB, PESQ = 3.81.
Compared to the strongest prior TSE system, X-SepFormer achieves a 14.8% relative reduction in the chunk-level speaker confusion rate and establishes new state-of-the-art SI-SDRi and PESQ for TSE (Liu et al., 2023).
5. Mechanisms for Reducing Speaker Confusion
A central aim of X-SepFormer is to minimize speaker confusion, the failure mode in which the output waveform predominantly contains the interfering speaker. By tracking SI-SDR improvement per chunk and assigning larger loss gradients to low- or negative-improvement regions, X-SepFormer enforces localized correction where confusion is likely. This shifts the empirical distribution of chunkwise $\Delta_k$ on test utterances, with a pronounced decrease in negative or low-SI-SDRi chunks, as shown in the frequency plots of the source study. Fusing speaker identity via cross-attention at every block, rather than through early or late fusion alone, further reduces confusions compared to standard SepFormer or sparser speaker-injection schemes.
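To make the chunkwise confusion metric concrete, here is a minimal sketch of the SC-ratio described above, reusing the chunkwise_sisdri helper from Section 2; the 0 dB threshold follows the labeling rule given there, and the function name sc_ratio is illustrative.

```python
import torch

def sc_ratio(delta: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Chunkwise speaker-confusion rate: the fraction of chunks whose
    SI-SDR improvement Delta_k falls below the threshold (0 dB here,
    matching the labeling rule for "speaker confusion" chunks)."""
    return (delta < threshold).float().mean()
```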
6. Position within the SepFormer Ecosystem and Relationship to Efficient Variants
X-SepFormer operates atop the dual-path SepFormer structure previously shown to outperform Conv-TasNet, DualPathRNN, and related architectures in time-domain separation (Subakan et al., 2022). While resource-efficient variants such as RE-SepFormer (Libera et al., 2022) and AV-SepFormer (Lin et al., 2023) focus on computation/memory or multi-modal fusion respectively, X-SepFormer is characterized by its optimization-centric innovation: directly targeting error modes at chunk scale.
A plausible implication is that architectural elements of X-SepFormer, such as repeated cross-attention speaker fusion and error-weighted training, could be synergistically combined with memory and inference-efficient strategies (e.g., non-overlapping chunking, hierarchical pooling) to further balance separation quality with deployment constraints.
7. Ablations and Design Choices
Empirical ablations confirm several architectural and training design decisions:
- Stacked Intra/Inter Blocks: Increasing from two to four blocks accelerates convergence and enhances separation accuracy.
- Cross-Attention Fusion: Injecting the speaker embedding via cross-attention at each block outperforms alternatives such as early concatenation.
- Loss Function Selection: The scale-based loss reduces the SC-ratio by approximately 5%, whereas the distribution-weighted loss achieves around 15% relative reduction.
These findings underscore the necessity of both deep conditional modeling and explicit optimization against confusion errors for robust TSE.
References
- X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion (Liu et al., 2023)
- Exploring Self-Attention Mechanisms for Speech Separation (Subakan et al., 2022)
- Resource-Efficient Separation Transformer (Libera et al., 2022)
- AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction (Lin et al., 2023)