Conformer-based CSS Architectures
- Conformer-based CSS architectures are time-domain neural models that combine local convolution with long-range self-attention for effective extractive speech separation.
- They integrate a learned speaker embedding within an encoder-separator-decoder pipeline, utilizing designs like Conformer-FFN and TCN-Conformer to isolate a target speaker.
- Empirical evaluations reveal that TCN-Conformer architectures notably improve SI-SDR performance, underscoring the advantages of architectural hybridization in challenging audio conditions.
Conformer-based Conditional Speaker Separation (CSS) architectures constitute a class of time-domain neural models for extractive speech separation, exploiting both local convolutional context and long-range self-attention. In the context of target speaker extraction, these architectures integrate a learned speaker embedding into the separator network, enabling extraction of a desired speaker from a single-channel input mixture. The dominant approaches—Conformer-FFN and TCN-Conformer—jointly train a speaker embedder and separator, and are characterized by the systematic combination of Conformer blocks (comprising multi-head self-attention, convolution, and feed-forward layers) with either additional feed-forward modules or temporal convolutional network (TCN) blocks to maximize separation performance in diverse and challenging audio conditions (Sinha et al., 2022).
1. Architecture and System Overview
Conformer-based CSS systems operate within a three-stage, time-domain extraction pipeline. The initial encoder converts waveform frames into latent vectors $\mathbf{W} = \mathrm{Encoder}(\mathbf{x})$. A speaker embedder, a ResNet-based network, outputs a fixed embedding $\mathbf{e}$ for the target speaker, which is tile-repeated across the sequence dimension. The separator network, parameterized by either the Conformer-FFN or TCN-Conformer design, receives the concatenated input $[\mathbf{W}; \mathbf{e}]$ and estimates a mask $\mathbf{M}$. The decoder, realized by a transposed convolution, reconstructs the target time-domain signal as

$$\hat{\mathbf{s}} = \mathrm{Decoder}(\mathbf{M} \odot \mathbf{W}).$$
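As a concrete illustration, the following PyTorch sketch wires this pipeline up end to end. The class name `TargetExtractor`, the filter counts, kernel/stride values, and the trivial convolutional separator stand-in are illustrative assumptions, not the paper's configuration; the real separator is one of the two designs described below.

```python
import torch
import torch.nn as nn

class TargetExtractor(nn.Module):
    """Hypothetical encoder-separator-decoder pipeline for target extraction."""
    def __init__(self, n_filters=256, kernel=32, stride=16, emb_dim=128):
        super().__init__()
        # Encoder: 1-D convolution maps waveform frames to latent vectors W.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
        # Separator stand-in: the paper uses Conformer-FFN or TCN-Conformer here.
        self.separator = nn.Sequential(
            nn.Conv1d(n_filters + emb_dim, n_filters, 1), nn.PReLU(),
            nn.Conv1d(n_filters, n_filters, 1), nn.Sigmoid(),  # mask M in [0, 1]
        )
        # Decoder: transposed convolution inverts the masked representation.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)

    def forward(self, mix, spk_emb):
        # mix: (B, 1, samples); spk_emb: (B, emb_dim) from the ResNet embedder
        w = torch.relu(self.encoder(mix))                       # W: (B, N, frames)
        e = spk_emb.unsqueeze(-1).expand(-1, -1, w.shape[-1])   # tile over time
        m = self.separator(torch.cat([w, e], dim=1))            # mask M
        return self.decoder(m * w)                              # s_hat = Dec(M ⊙ W)

model = TargetExtractor()
mix = torch.randn(2, 1, 16000)          # batch of two 1 s mixtures at 16 kHz
emb = torch.randn(2, 128)               # placeholder speaker embeddings
print(model(mix, emb).shape)            # torch.Size([2, 1, 16000])
```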
Conformer-FFN Separator
This design consists of repeated stacks, each containing:
- A Conformer block as originally formulated in Gulati et al. (2020).
- An external feed-forward network (FFN) comprising two linear layers, Swish activation, and dropout.
- At each step, the output is concatenated with the speaker embedding $\mathbf{e}$ before being passed to the next Conformer block.
(Block diagram of the Conformer-FFN separator omitted; a code sketch follows below.)
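The loop below is a minimal sketch of this stack, using `torchaudio.models.Conformer` as a stand-in for the Conformer block of Section 2; the feature and embedding dimensions, head count, dropout rate, and FFN sizes are assumed values.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class ConformerFFNSeparator(nn.Module):
    """Sketch: K stacks of (Conformer block + external FFN) with re-injected
    speaker embedding between stacks; all sizes are illustrative."""
    def __init__(self, feat_dim=256, emb_dim=128, K=4):
        super().__init__()
        d = feat_dim + emb_dim                       # width after concatenation
        self.blocks = nn.ModuleList([
            Conformer(input_dim=d, num_heads=4, ffn_dim=4 * d, num_layers=1,
                      depthwise_conv_kernel_size=9)
            for _ in range(K)
        ])
        # External FFN after each Conformer block: Linear, Swish (SiLU), dropout.
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Dropout(0.1),
                          nn.Linear(d, feat_dim))
            for _ in range(K)
        ])

    def forward(self, w, spk_emb):
        # w: (B, T, feat_dim) encoded mixture; spk_emb: (B, emb_dim)
        lengths = torch.full((w.shape[0],), w.shape[1], dtype=torch.long)
        x = w
        for conf, ffn in zip(self.blocks, self.ffns):
            e = spk_emb.unsqueeze(1).expand(-1, x.shape[1], -1)
            h = torch.cat([x, e], dim=-1)            # re-inject speaker embedding
            h, _ = conf(h, lengths)                  # Conformer block
            x = ffn(h)                               # project back to feat_dim
        return x                                     # features for mask estimation

sep = ConformerFFNSeparator()
print(sep(torch.randn(2, 100, 256), torch.randn(2, 128)).shape)  # (2, 100, 256)
```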
TCN-Conformer Separator
This architecture stacks $K$ blocks of the following sequence:
- Temporal Convolutional Network (TCN) block: dilated 1D convolution, PReLU and group LayerNorm (gLN), depthwise separable structure.
- Conformer block as above.
- Concatenation with the speaker embedding $\mathbf{e}$ before the next TCN block.
(Block diagram of the TCN-Conformer separator omitted; a sketch of the TCN block follows below.)
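Below is a hedged sketch of one such TCN block in the Conv-TasNet style; the hidden width, dilation value, and residual connection are assumptions consistent with standard TCN designs rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GlobalLayerNorm(nn.Module):
    """gLN: normalize over both the channel and time dimensions."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, x):                             # x: (B, C, T)
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / (var + 1e-8).sqrt() + self.beta

class TCNBlock(nn.Module):
    def __init__(self, channels=256, hidden=512, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2            # keep time length fixed
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1), nn.PReLU(), GlobalLayerNorm(hidden),
            # depthwise (groups=hidden) dilated conv, then pointwise projection
            nn.Conv1d(hidden, hidden, kernel, padding=pad, dilation=dilation,
                      groups=hidden),
            nn.PReLU(), GlobalLayerNorm(hidden),
            nn.Conv1d(hidden, channels, 1),
        )

    def forward(self, x):                             # x: (B, C, T)
        return x + self.net(x)                        # residual connection (assumed)

print(TCNBlock()(torch.randn(2, 256, 100)).shape)     # torch.Size([2, 256, 100])
```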
2. Conformer Block Formulation
Each Conformer block combines four key submodules:
- Feed-Forward Network (FFN), pre-MHSA:

  $$\tilde{\mathbf{x}} = \mathbf{x} + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x})$$

  The first FFN employs residual scaling by 0.5.
- Multi-Head Self-Attention (MHSA):

  $$\mathbf{Q}_i = \tilde{\mathbf{X}}\mathbf{W}_i^{Q}, \quad \mathbf{K}_i = \tilde{\mathbf{X}}\mathbf{W}_i^{K}, \quad \mathbf{V}_i = \tilde{\mathbf{X}}\mathbf{W}_i^{V}$$

  $$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i\mathbf{K}_i^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_i$$

  $$\mathrm{MHSA}(\tilde{\mathbf{X}}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,\mathbf{W}^{O}$$

  Heads are concatenated and linearly projected, followed by dropout and residual addition.
- Convolutional Module:
Involves a pointwise convolution (expanding the channel dimension to $2D$), gated linear unit (GLU) activation, a 1D depthwise separable convolution (kernel size 9), batch normalization, Swish nonlinearity, a second pointwise convolution, and dropout.
- FFN Post-Conv:
Another instance of the FFN as above, again with 0.5 residual scaling.
Full forward pass:
- $\tilde{\mathbf{x}} = \mathbf{x} + \tfrac{1}{2}\,\mathrm{FFN}_1(\mathbf{x})$
- $\mathbf{x}' = \tilde{\mathbf{x}} + \mathrm{MHSA}(\tilde{\mathbf{x}})$
- $\mathbf{x}'' = \mathbf{x}' + \mathrm{Conv}(\mathbf{x}')$
- $\mathbf{x}''' = \mathbf{x}'' + \tfrac{1}{2}\,\mathrm{FFN}_2(\mathbf{x}'')$
- Output $= \mathrm{LayerNorm}(\mathbf{x}''')$
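This forward pass transcribes almost line for line into PyTorch. The module below is a minimal stand-alone rendering; the model dimension, head count, and dropout rate are assumed values (only the kernel size 9 and the 0.5 residual scaling come from the text above).

```python
import torch
import torch.nn as nn

def ffn(d, expansion=4, p=0.1):
    # Pre-norm FFN: two linear layers with Swish (SiLU) activation and dropout.
    return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, expansion * d), nn.SiLU(),
                         nn.Dropout(p), nn.Linear(expansion * d, d), nn.Dropout(p))

class ConformerBlock(nn.Module):
    def __init__(self, d=256, heads=4, kernel=9, p=0.1):
        super().__init__()
        self.ffn1 = ffn(d)
        self.ffn2 = ffn(d)
        self.norm = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, heads, dropout=p, batch_first=True)
        self.conv = nn.Sequential(                    # convolutional module
            nn.Conv1d(d, 2 * d, 1),                   # pointwise, expand to 2D
            nn.GLU(dim=1),                            # gate back down to D
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, 1), nn.Dropout(p),        # second pointwise + dropout
        )
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x):                             # x: (B, T, D)
        x = x + 0.5 * self.ffn1(x)                    # x~   = x   + 1/2 FFN(x)
        h = self.norm(x)
        x = x + self.mhsa(h, h, h, need_weights=False)[0]  # x' = x~ + MHSA(x~)
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)  # x'' = x' + Conv(x')
        x = x + 0.5 * self.ffn2(x)                    # x''' = x'' + 1/2 FFN(x'')
        return self.out_norm(x)                       # LayerNorm(x''')

print(ConformerBlock()(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```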
3. Mask Estimation and Reconstruction
Both separator types parameterize a mask estimator $g_\theta$ computing

$$\mathbf{M} = g_\theta([\mathbf{W}; \mathbf{e}]).$$

This mask is applied element-wise to the encoded mixture $\mathbf{W}$, and the masked representation $\mathbf{M} \odot \mathbf{W}$ is inverted by the decoder to yield the separated time-domain waveform.
4. Training Objective and Optimization
Training is end-to-end and supervised, jointly optimizing the speaker embedder and separator via a multi-task loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{sep}} + \lambda\,\mathcal{L}_{\mathrm{emb}},$$

with

- Separator loss: negative multi-scale SI-SNR over three encoder scales,

  $$\mathcal{L}_{\mathrm{sep}} = -\sum_{i=1}^{3} w_i\,\mathrm{SI\text{-}SNR}(\hat{\mathbf{s}}_i, \mathbf{s}),$$

  where the scale weights $w_i$ are typically uniform.
- SI-SNR is computed as

  $$\mathrm{SI\text{-}SNR}(\hat{\mathbf{s}}, \mathbf{s}) = 10\log_{10}\frac{\|\alpha\mathbf{s}\|^2}{\|\hat{\mathbf{s}} - \alpha\mathbf{s}\|^2}, \qquad \alpha = \frac{\langle \hat{\mathbf{s}}, \mathbf{s}\rangle}{\|\mathbf{s}\|^2}.$$

- Embedder loss: cross-entropy over the $C$ training speakers,

  $$\mathcal{L}_{\mathrm{emb}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c.$$
Adam optimization is used for 150 epochs on 4 s audio segments, with early stopping (patience of 6 epochs).
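A short sketch of this objective, assuming the standard SI-SNR definition above; the loss weight `lam` and the uniform scale weighting are illustrative, and `multitask_loss` is a hypothetical helper, not the authors' code.

```python
import torch
import torch.nn.functional as F

def neg_si_snr(est, ref, eps=1e-8):
    """Negative SI-SNR for one encoder scale; est, ref: (B, T) waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Optimal scaling alpha = <est, ref> / ||ref||^2 projects est onto ref.
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target, noise = alpha * ref, est - alpha * ref
    si_snr = 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
    return -si_snr.mean()

def multitask_loss(estimates, reference, spk_logits, spk_ids, lam=1.0):
    # Uniform weights over the three encoder scales plus speaker cross-entropy.
    sep = sum(neg_si_snr(e, reference) for e in estimates) / len(estimates)
    return sep + lam * F.cross_entropy(spk_logits, spk_ids)

est = [torch.randn(2, 16000) for _ in range(3)]        # three encoder scales
print(multitask_loss(est, torch.randn(2, 16000),
                     torch.randn(2, 500), torch.randint(0, 500, (2,))))
```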
5. Empirical Evaluation and Results
Table: SI-SDR gain (dB) relative to the input mixture, for systems trained on 2-mix only ($K=3$); the 'Mixture (input)' row lists the raw input SI-SDR.
| System | 2-mix | 3-mix | noisy-mix |
|---|---|---|---|
| Mixture (input) | 2.51 | -1.27 | -3.21 |
| TCN baseline [20] | 16.15 | 4.18 | -2.30 |
| Conformer-FFN | 15.60 | 4.08 | -3.64 |
| TCN-Conformer | 16.85 | 4.56 | -0.24 |
In the joint test scenario (2/3/noisy-mix):
| System | K | 2-mix | 3-mix | noisy-mix |
|---|---|---|---|---|
| TCN baseline | – | 14.87 | 8.43 | 7.92 |
| Conformer-FFN | 4 | 14.07 | 7.67 | 7.56 |
| TCN-Conformer | 4 | 17.51 | 10.70 | 9.32 |
TCN-Conformer with $K=4$ achieves absolute SI-SDR improvements over the TCN baseline of +2.64 dB (2-mix), +2.27 dB (3-mix), and +1.40 dB (noisy-mix). Conformer-FFN yields modest gains as $K$ increases but does not surpass the TCN baseline.
6. Ablation Studies and Hyperparameter Configurations
Ablations varying $K$ (the number of stacks) demonstrate that Conformer-FFN yields limited improvement, while TCN-Conformer shows monotonically increasing SI-SDR as $K$ grows from 1 through 3 to 4. Core hyperparameters include:
- Encoder/decoder filter lengths: [2.5 ms, 10 ms, 20 ms].
- Separator dimension $D$ (after concatenation with the tiled speaker embedding); external FFN output size: 256.
- Conformer: multi-head self-attention, depthwise convolution kernel size 9 (Section 2), FFN expansion factor 4.
- TCN block: two pointwise ($1 \times 1$) convolutions, PReLU + gLN, depthwise separable CNN (kernel size 3, dilated).
- External FFN: two linear layers, Swish activation, and dropout.
- $K$ stacks (up to $K = 4$).
This suggests that deeper (larger $K$) TCN-Conformer architectures, by alternating convolutional and Conformer modules, more effectively capture both the short-term and long-range dependencies required for robust CSS.
7. Context and Significance
The Conformer-based CSS architectures interleave convolutional and self-attention mechanisms, allowing simultaneous modeling of local structure (via TCN/dilated CNN) and global dependencies (via MHSA). The TCN-Conformer design, by alternating TCN and Conformer blocks, consistently outperforms baselines in all tested conditions. These findings establish the utility of Conformer-based designs for speaker-conditioned separation, highlighting the importance of architectural hybridization for tackling complex audio mixtures (Sinha et al., 2022).