Conformer-based CSS Architectures
- Conformer-based CSS architectures are time-domain neural models that combine local convolution with long-range self-attention for effective extractive speech separation.
- They integrate a learned speaker embedding within an encoder-separator-decoder pipeline, utilizing designs like Conformer-FFN and TCN-Conformer to isolate a target speaker.
- Empirical evaluations reveal that TCN-Conformer architectures notably improve SI-SDR performance, underscoring the advantages of architectural hybridization in challenging audio conditions.
Conformer-based Conditional Speaker Separation (CSS) architectures constitute a class of time-domain neural models for extractive speech separation, exploiting both local convolutional context and long-range self-attention. In the context of target speaker extraction, these architectures integrate a learned speaker embedding into the separator network, enabling extraction of a desired speaker from a single-channel input mixture. The dominant approaches—Conformer-FFN and TCN-Conformer—jointly train a speaker embedder and separator, and are characterized by the systematic combination of Conformer blocks (comprising multi-head self-attention, convolution, and feed-forward layers) with either additional feed-forward modules or temporal convolutional network (TCN) blocks to maximize separation performance in diverse and challenging audio conditions (Sinha et al., 2022).
1. Architecture and System Overview
Conformer-based CSS systems operate within a three-stage, time-domain extraction pipeline. The initial encoder converts waveform frames into latent vectors $\mathbf{W} = \mathrm{Encoder}(\mathbf{x})$. A speaker embedder, a ResNet-based network, outputs a fixed embedding $\mathbf{e}$ for the target speaker, which is tile-repeated across the sequence dimension. The separator network, parameterized by either the Conformer-FFN or TCN-Conformer design, receives the concatenated input $[\mathbf{W}; \mathbf{e}]$ and estimates a mask $\mathbf{M}$. The decoder, realized by a transposed convolution, reconstructs the target time-domain signal as

$$\hat{\mathbf{s}} = \mathrm{Decoder}(\mathbf{M} \odot \mathbf{W}).$$
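As a concrete illustration, the following PyTorch sketch wires this pipeline up end to end. The class name `TargetExtractor`, the filter counts, kernel/stride values, and the trivial convolutional separator stand-in are illustrative assumptions, not the paper's configuration; the real separator is one of the two designs described below.

```python
import torch
import torch.nn as nn

class TargetExtractor(nn.Module):
    """Hypothetical encoder-separator-decoder pipeline for target extraction."""
    def __init__(self, n_filters=256, kernel=32, stride=16, emb_dim=128):
        super().__init__()
        # Encoder: 1-D convolution maps waveform frames to latent vectors W.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
        # Separator stand-in: the paper uses Conformer-FFN or TCN-Conformer here.
        self.separator = nn.Sequential(
            nn.Conv1d(n_filters + emb_dim, n_filters, 1), nn.PReLU(),
            nn.Conv1d(n_filters, n_filters, 1), nn.Sigmoid(),  # mask M in [0, 1]
        )
        # Decoder: transposed convolution inverts the masked representation.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)

    def forward(self, mix, spk_emb):
        # mix: (B, 1, samples); spk_emb: (B, emb_dim) from the ResNet embedder
        w = torch.relu(self.encoder(mix))                       # W: (B, N, frames)
        e = spk_emb.unsqueeze(-1).expand(-1, -1, w.shape[-1])   # tile over time
        m = self.separator(torch.cat([w, e], dim=1))            # mask M
        return self.decoder(m * w)                              # s_hat = Dec(M ⊙ W)

model = TargetExtractor()
mix = torch.randn(2, 1, 16000)          # batch of two 1 s mixtures at 16 kHz
emb = torch.randn(2, 128)               # placeholder speaker embeddings
print(model(mix, emb).shape)            # torch.Size([2, 1, 16000])
```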
Conformer-FFN Separator
This design consists of repeated stacks, each containing:
- A Conformer block as originally formulated in Gulati et al. (2020).
- An external feed-forward network (FFN) comprising two linear layers, Swish activation, and dropout.
- At each step, the output is concatenated with the speaker embedding $\mathbf{e}$ before being passed to the next Conformer block.
(Block diagram of the Conformer-FFN separator omitted; a code sketch follows below.)
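The loop below is a minimal sketch of this stack, using `torchaudio.models.Conformer` as a stand-in for the Conformer block of Section 2; the feature and embedding dimensions, head count, dropout rate, and FFN sizes are assumed values.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class ConformerFFNSeparator(nn.Module):
    """Sketch: K stacks of (Conformer block + external FFN) with re-injected
    speaker embedding between stacks; all sizes are illustrative."""
    def __init__(self, feat_dim=256, emb_dim=128, K=4):
        super().__init__()
        d = feat_dim + emb_dim                       # width after concatenation
        self.blocks = nn.ModuleList([
            Conformer(input_dim=d, num_heads=4, ffn_dim=4 * d, num_layers=1,
                      depthwise_conv_kernel_size=9)
            for _ in range(K)
        ])
        # External FFN after each Conformer block: Linear, Swish (SiLU), dropout.
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Dropout(0.1),
                          nn.Linear(d, feat_dim))
            for _ in range(K)
        ])

    def forward(self, w, spk_emb):
        # w: (B, T, feat_dim) encoded mixture; spk_emb: (B, emb_dim)
        lengths = torch.full((w.shape[0],), w.shape[1], dtype=torch.long)
        x = w
        for conf, ffn in zip(self.blocks, self.ffns):
            e = spk_emb.unsqueeze(1).expand(-1, x.shape[1], -1)
            h = torch.cat([x, e], dim=-1)            # re-inject speaker embedding
            h, _ = conf(h, lengths)                  # Conformer block
            x = ffn(h)                               # project back to feat_dim
        return x                                     # features for mask estimation

sep = ConformerFFNSeparator()
print(sep(torch.randn(2, 100, 256), torch.randn(2, 128)).shape)  # (2, 100, 256)
```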
TCN-Conformer Separator
This architecture stacks $K$ blocks of the following sequence:
- Temporal Convolutional Network (TCN) block: dilated 1D convolution, PReLU and group LayerNorm (gLN), depthwise separable structure.
- Conformer block as above.
- Concatenation with the speaker embedding $\mathbf{e}$ before the next TCN block.
(Block diagram of the TCN-Conformer separator omitted; a sketch of the TCN block follows below.)
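Below is a hedged sketch of one such TCN block in the Conv-TasNet style; the hidden width, dilation value, and residual connection are assumptions consistent with standard TCN designs rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GlobalLayerNorm(nn.Module):
    """gLN: normalize over both the channel and time dimensions."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, x):                             # x: (B, C, T)
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / (var + 1e-8).sqrt() + self.beta

class TCNBlock(nn.Module):
    def __init__(self, channels=256, hidden=512, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2            # keep time length fixed
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1), nn.PReLU(), GlobalLayerNorm(hidden),
            # depthwise (groups=hidden) dilated conv, then pointwise projection
            nn.Conv1d(hidden, hidden, kernel, padding=pad, dilation=dilation,
                      groups=hidden),
            nn.PReLU(), GlobalLayerNorm(hidden),
            nn.Conv1d(hidden, channels, 1),
        )

    def forward(self, x):                             # x: (B, C, T)
        return x + self.net(x)                        # residual connection (assumed)

print(TCNBlock()(torch.randn(2, 256, 100)).shape)     # torch.Size([2, 256, 100])
```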
2. Conformer Block Formulation
Each Conformer block combines four key submodules:
- Feed-Forward Network (FFN), pre-MHSA:

  $$\tilde{\mathbf{x}} = \mathbf{x} + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x})$$

  The first FFN employs residual scaling by 0.5.
- Multi-Head Self-Attention (MHSA):

  $$\mathbf{Q}_i = \tilde{\mathbf{X}}\mathbf{W}_i^{Q}, \quad \mathbf{K}_i = \tilde{\mathbf{X}}\mathbf{W}_i^{K}, \quad \mathbf{V}_i = \tilde{\mathbf{X}}\mathbf{W}_i^{V}$$

  $$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i\mathbf{K}_i^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_i$$

  $$\mathrm{MHSA}(\tilde{\mathbf{X}}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,\mathbf{W}^{O}$$

  Heads are concatenated and linearly projected, followed by dropout and residual addition.
- Convolutional Module:
Involves a pointwise convolution (expanding the channel dimension to $2D$), gated linear unit (GLU) activation, a 1D depthwise separable convolution (kernel size 9), batch normalization, Swish nonlinearity, a second pointwise convolution, and dropout.
- FFN Post-Conv:
Another instance of the FFN as above, again with 0.5 residual scaling.
Full forward pass:
- $\tilde{\mathbf{x}} = \mathbf{x} + \tfrac{1}{2}\,\mathrm{FFN}_1(\mathbf{x})$
- $\mathbf{x}' = \tilde{\mathbf{x}} + \mathrm{MHSA}(\tilde{\mathbf{x}})$
- $\mathbf{x}'' = \mathbf{x}' + \mathrm{Conv}(\mathbf{x}')$
- $\mathbf{x}''' = \mathbf{x}'' + \tfrac{1}{2}\,\mathrm{FFN}_2(\mathbf{x}'')$
- Output $= \mathrm{LayerNorm}(\mathbf{x}''')$
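This forward pass transcribes almost line for line into PyTorch. The module below is a minimal stand-alone rendering; the model dimension, head count, and dropout rate are assumed values (only the kernel size 9 and the 0.5 residual scaling come from the text above).

```python
import torch
import torch.nn as nn

def ffn(d, expansion=4, p=0.1):
    # Pre-norm FFN: two linear layers with Swish (SiLU) activation and dropout.
    return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, expansion * d), nn.SiLU(),
                         nn.Dropout(p), nn.Linear(expansion * d, d), nn.Dropout(p))

class ConformerBlock(nn.Module):
    def __init__(self, d=256, heads=4, kernel=9, p=0.1):
        super().__init__()
        self.ffn1 = ffn(d)
        self.ffn2 = ffn(d)
        self.norm = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, heads, dropout=p, batch_first=True)
        self.conv = nn.Sequential(                    # convolutional module
            nn.Conv1d(d, 2 * d, 1),                   # pointwise, expand to 2D
            nn.GLU(dim=1),                            # gate back down to D
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, 1), nn.Dropout(p),        # second pointwise + dropout
        )
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x):                             # x: (B, T, D)
        x = x + 0.5 * self.ffn1(x)                    # x~   = x   + 1/2 FFN(x)
        h = self.norm(x)
        x = x + self.mhsa(h, h, h, need_weights=False)[0]  # x' = x~ + MHSA(x~)
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)  # x'' = x' + Conv(x')
        x = x + 0.5 * self.ffn2(x)                    # x''' = x'' + 1/2 FFN(x'')
        return self.out_norm(x)                       # LayerNorm(x''')

print(ConformerBlock()(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```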
3. Mask Estimation and Reconstruction
Both separator types parameterize a mask estimator $g_\theta$ computing

$$\mathbf{M} = g_\theta([\mathbf{W}; \mathbf{e}]).$$

This mask is applied element-wise to the encoded mixture $\mathbf{W}$, and the masked representation $\mathbf{M} \odot \mathbf{W}$ is inverted by the decoder to yield the separated time-domain waveform.
4. Training Objective and Optimization
Training is end-to-end and supervised, jointly optimizing the speaker embedder and separator via a multi-task loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{sep}} + \lambda\,\mathcal{L}_{\mathrm{emb}},$$

with

- Separator loss: negative multi-scale SI-SNR over three encoder scales,

  $$\mathcal{L}_{\mathrm{sep}} = -\sum_{i=1}^{3} w_i\,\mathrm{SI\text{-}SNR}(\hat{\mathbf{s}}_i, \mathbf{s}),$$

  where the scale weights $w_i$ are typically uniform.
- SI-SNR is computed as

  $$\mathrm{SI\text{-}SNR}(\hat{\mathbf{s}}, \mathbf{s}) = 10\log_{10}\frac{\|\alpha\mathbf{s}\|^2}{\|\hat{\mathbf{s}} - \alpha\mathbf{s}\|^2}, \qquad \alpha = \frac{\langle \hat{\mathbf{s}}, \mathbf{s}\rangle}{\|\mathbf{s}\|^2}.$$

- Embedder loss: cross-entropy over the $C$ training speakers,

  $$\mathcal{L}_{\mathrm{emb}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c.$$
Adam optimization is used for 150 epochs on 4 s audio segments, with early stopping (patience of 6 epochs).
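A short sketch of this objective, assuming the standard SI-SNR definition above; the loss weight `lam` and the uniform scale weighting are illustrative, and `multitask_loss` is a hypothetical helper, not the authors' code.

```python
import torch
import torch.nn.functional as F

def neg_si_snr(est, ref, eps=1e-8):
    """Negative SI-SNR for one encoder scale; est, ref: (B, T) waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Optimal scaling alpha = <est, ref> / ||ref||^2 projects est onto ref.
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target, noise = alpha * ref, est - alpha * ref
    si_snr = 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
    return -si_snr.mean()

def multitask_loss(estimates, reference, spk_logits, spk_ids, lam=1.0):
    # Uniform weights over the three encoder scales plus speaker cross-entropy.
    sep = sum(neg_si_snr(e, reference) for e in estimates) / len(estimates)
    return sep + lam * F.cross_entropy(spk_logits, spk_ids)

est = [torch.randn(2, 16000) for _ in range(3)]        # three encoder scales
print(multitask_loss(est, torch.randn(2, 16000),
                     torch.randn(2, 500), torch.randint(0, 500, (2,))))
```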
5. Empirical Evaluation and Results
Table: SI-SDR gain (dB) relative to the input mixture, for systems trained on 2-mix only ($K=3$); the 'Mixture (input)' row lists the raw input SI-SDR.
| System | 2-mix | 3-mix | noisy-mix |
|---|---|---|---|
| Mixture (input) | 2.51 | -1.27 | -3.21 |
| TCN baseline [20] | 16.15 | 4.18 | -2.30 |
| Conformer-FFN | 15.60 | 4.08 | -3.64 |
| TCN-Conformer | 16.85 | 4.56 | -0.24 |
In the joint test scenario (2/3/noisy-mix):
| System | K | 2-mix | 3-mix | noisy-mix |
|---|---|---|---|---|
| TCN baseline | – | 14.87 | 8.43 | 7.92 |
| Conformer-FFN | 4 | 14.07 | 7.67 | 7.56 |
| TCN-Conformer | 4 | 17.51 | 10.70 | 9.32 |
TCN-Conformer with $K=4$ achieves absolute SI-SDR improvements over the TCN baseline of +2.64 dB (2-mix), +2.27 dB (3-mix), and +1.40 dB (noisy-mix). Conformer-FFN yields modest gains as $K$ increases but does not surpass the TCN baseline.
6. Ablation Studies and Hyperparameter Configurations
Ablations varying $K$ (the number of stacks) demonstrate that Conformer-FFN yields limited improvement, while TCN-Conformer shows monotonically increasing SI-SDR as $K$ grows from 1 through 3 to 4. Core hyperparameters include:
- Encoder/decoder filter lengths: [2.5 ms, 10 ms, 20 ms].
- Separator dimension $D$ (after concatenation with the tiled speaker embedding); external FFN output size: 256.
- Conformer: multi-head self-attention, depthwise convolution kernel size 9 (Section 2), FFN expansion factor 4.
- TCN block: two pointwise ($1 \times 1$) convolutions, PReLU + gLN, depthwise separable CNN (kernel size 3, dilated).
- External FFN: two linear layers, Swish activation, and dropout.
- $K$ stacks (up to $K = 4$).
This suggests that deeper (larger $K$) TCN-Conformer architectures, by alternating convolutional and Conformer modules, more effectively capture both the short-term and long-range dependencies required for robust CSS.
7. Context and Significance
The Conformer-based CSS architectures interleave convolutional and self-attention mechanisms, allowing simultaneous modeling of local structure (via TCN/dilated CNN) and global dependencies (via MHSA). The TCN-Conformer design, by alternating TCN and Conformer blocks, consistently outperforms baselines in all tested conditions. These findings establish the utility of Conformer-based designs for speaker-conditioned separation, highlighting the importance of architectural hybridization for tackling complex audio mixtures (Sinha et al., 2022).