
Conformer-based CSS Architectures

Updated 2 May 2026
  • Conformer-based CSS architectures are time-domain neural models that combine local convolution with long-range self-attention for effective extractive speech separation.
  • They integrate a learned speaker embedding within an encoder-separator-decoder pipeline, utilizing designs like Conformer-FFN and TCN-Conformer to isolate a target speaker.
  • Empirical evaluations reveal that TCN-Conformer architectures notably improve SI-SDR performance, underscoring the advantages of architectural hybridization in challenging audio conditions.

Conformer-based Conditional Speaker Separation (CSS) architectures constitute a class of time-domain neural models for extractive speech separation, exploiting both local convolutional context and long-range self-attention. In the context of target speaker extraction, these architectures integrate a learned speaker embedding into the separator network, enabling extraction of a desired speaker from a single-channel input mixture. The dominant approaches—Conformer-FFN and TCN-Conformer—jointly train a speaker embedder and separator, and are characterized by the systematic combination of Conformer blocks (comprising multi-head self-attention, convolution, and feed-forward layers) with either additional feed-forward modules or temporal convolutional network (TCN) blocks to maximize separation performance in diverse and challenging audio conditions (Sinha et al., 2022).

1. Architecture and System Overview

Conformer-based CSS systems operate within a three-stage, time-domain extraction pipeline. The initial encoder converts waveform frames $y(t)$ into latent vectors $Z \in \mathbb{R}^{T \times D}$. A speaker embedder—a ResNet-based network—outputs a fixed embedding $E_{spk} \in \mathbb{R}^{256}$ for the target, which is tile-repeated across the sequence dimension. The separator network, parameterized by either the Conformer-FFN or TCN-Conformer design, receives the concatenated input $[Z \,\|\, E_{spk}]$ and estimates a mask $\hat{M} \in [0,1]^{T \times D}$. The decoder, realized by a transposed convolution, reconstructs the target time-domain signal $\hat{s}(t)$ as

\hat{S} = \text{Decoder}(\hat{M} \odot Z)
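A minimal PyTorch sketch of this encoder-separator-decoder pipeline follows. It uses a single encoder scale for brevity (the paper uses three), and the module sizes, kernel, stride, and the placeholder separator are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TimeDomainExtractor(nn.Module):
    """Encoder -> speaker-conditioned separator -> masked decoder (sketch)."""
    def __init__(self, dim=256, emb_dim=256, kernel=32, stride=16):
        super().__init__()
        # Learned analysis transform: waveform -> latent frames Z in R^{T x D}
        self.encoder = nn.Conv1d(1, dim, kernel_size=kernel, stride=stride)
        # Placeholder separator; the paper uses Conformer-FFN or TCN-Conformer
        self.separator = nn.Sequential(
            nn.Linear(dim + emb_dim, dim), nn.PReLU(), nn.Linear(dim, dim)
        )
        # Learned synthesis transform: masked latents -> waveform
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=kernel, stride=stride)

    def forward(self, y, e_spk):
        # y: (B, 1, samples); e_spk: (B, emb_dim) fixed target-speaker embedding
        z = self.encoder(y)                                   # (B, D, T)
        z_t = z.transpose(1, 2)                               # (B, T, D)
        e = e_spk.unsqueeze(1).expand(-1, z_t.size(1), -1)    # tile across time
        # Mask in [0, 1]; sigmoid bounding is an assumption consistent
        # with M in [0,1]^{T x D} above
        m = torch.sigmoid(self.separator(torch.cat([z_t, e], dim=-1)))
        return self.decoder(m.transpose(1, 2) * z)            # mask, resynthesize
```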

Conformer-FFN Separator

This design consists of $K$ repeated stacks, each containing:

  • A Conformer block as originally formulated in Gulati et al. (2020).
  • An external feed-forward network (FFN) comprising two linear layers, Swish activation, and dropout.
  • At each step, the output is concatenated with $E_{spk}$ before input to the next Conformer block.

Block diagram for $K = 3$: three repetitions of [Conformer block → external FFN → concatenation with $E_{spk}$] between the input $Z$ and the mask estimate (diagram not reproduced; a PyTorch sketch follows).
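A hedged PyTorch sketch of the Conformer-FFN stacking scheme. Here `conformer_block_cls` is assumed to be a Conformer block implementation such as the `ConformerBlock` sketched in section 2, and the linear projections that restore the model dimension after each concatenation are an assumption of this sketch:

```python
import torch
import torch.nn as nn

class ConformerFFNSeparator(nn.Module):
    """K stacks of [Conformer block -> external FFN], re-concatenating
    E_spk before each Conformer block (sketch)."""
    def __init__(self, conformer_block_cls, dim=256, emb_dim=256, K=3, dropout=0.1):
        super().__init__()
        self.proj = nn.ModuleList()    # map concatenated input back to model dim
        self.blocks = nn.ModuleList()
        self.ffns = nn.ModuleList()
        for _ in range(K):
            self.proj.append(nn.Linear(dim + emb_dim, dim))
            self.blocks.append(conformer_block_cls(dim))
            # External FFN: two linear layers, Swish (SiLU), dropout
            self.ffns.append(nn.Sequential(
                nn.Linear(dim, dim), nn.SiLU(),
                nn.Dropout(dropout), nn.Linear(dim, dim),
            ))

    def forward(self, z, e_spk):
        # z: (B, T, D); e_spk: (B, emb_dim)
        e = e_spk.unsqueeze(1).expand(-1, z.size(1), -1)
        x = z
        for proj, block, ffn in zip(self.proj, self.blocks, self.ffns):
            x = proj(torch.cat([x, e], dim=-1))  # concat E_spk before each block
            x = ffn(block(x))
        return x
```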

TCN-Conformer Separator

This architecture stacks $K$ blocks of the following sequence:

  • Temporal Convolutional Network (TCN) block: dilated 1D convolution, PReLU and group LayerNorm (gLN), depthwise separable structure.
  • Conformer block as above.
  • Concatenation with $E_{spk}$ before the next TCN block.

Block diagram for $K$ stacks: alternating [TCN block → Conformer block] units with $E_{spk}$ re-concatenated before each TCN block (diagram not reproduced; a PyTorch sketch follows).
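A sketch of the TCN-Conformer variant. The residual connection, hidden channel width, and exponentially increasing dilation are Conv-TasNet-style assumptions, since the source only specifies the block's ingredients (dilated 1D convolution, PReLU, gLN, depthwise separable structure):

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """Dilated depthwise-separable 1-D conv block with PReLU and global
    LayerNorm (gLN), in the spirit of Conv-TasNet (sketch)."""
    def __init__(self, dim=256, hidden=512, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, 1),                     # pointwise expand
            nn.PReLU(),
            nn.GroupNorm(1, hidden),                       # gLN
            nn.Conv1d(hidden, hidden, kernel, padding=pad,
                      dilation=dilation, groups=hidden),   # depthwise, dilated
            nn.PReLU(),
            nn.GroupNorm(1, hidden),
            nn.Conv1d(hidden, dim, 1),                     # pointwise project
        )

    def forward(self, x):                                  # x: (B, T, D)
        y = self.net(x.transpose(1, 2)).transpose(1, 2)
        return x + y                                       # residual (assumed)

class TCNConformerSeparator(nn.Module):
    """K stacks of [TCN block -> Conformer block], re-concatenating E_spk
    before each TCN block (sketch)."""
    def __init__(self, conformer_block_cls, dim=256, emb_dim=256, K=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim + emb_dim, dim) for _ in range(K)])
        self.tcns = nn.ModuleList([TCNBlock(dim, dilation=2 ** i) for i in range(K)])
        self.confs = nn.ModuleList([conformer_block_cls(dim) for _ in range(K)])

    def forward(self, z, e_spk):
        e = e_spk.unsqueeze(1).expand(-1, z.size(1), -1)
        x = z
        for proj, tcn, conf in zip(self.proj, self.tcns, self.confs):
            x = proj(torch.cat([x, e], dim=-1))
            x = conf(tcn(x))
        return x
```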

2. Conformer Block Formulation

Each Conformer block combines four key submodules:

  • Feed-Forward Network (FFN) Pre-MHSA:

\tilde{X} = X + \frac{1}{2}\,\mathrm{FFN}(X)

First FFN employs residual scaling by 0.5.

  • Multi-Head Self-Attention (MHSA):

Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V

\mathrm{head}_i = \mathrm{softmax}\!\left( \frac{Q_i K_i^{\top}}{\sqrt{d_k}} \right) V_i

Heads are concatenated and linearly projected, followed by dropout and residual addition.

  • Convolutional Module:

Involves pointwise convolution (expanding to $2D$ channels), gated linear unit (GLU) activation, 1D depthwise separable convolution, batch normalization, Swish nonlinearity, a second pointwise convolution, and dropout.

  • FFN Post-Conv:

Another instance of the FFN as above, again with 0.5 residual scaling.

Full forward pass (a PyTorch sketch follows the steps):

  1. \tilde{X} = X + \frac{1}{2}\,\mathrm{FFN}(X)
  2. X' = \tilde{X} + \mathrm{MHSA}(\tilde{X})
  3. X'' = X' + \mathrm{Conv}(X')
  4. X''' = X'' + \frac{1}{2}\,\mathrm{FFN}(X'')
  5. Output = \mathrm{LayerNorm}(X''')
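The following sketch assembles these submodules into the five-step forward pass. It simplifies the Gulati et al. (2020) formulation by using standard MHSA in place of relative-position MHSA, and the head count, kernel size, expansion factor, and dropout rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Conformer block in the style of Gulati et al. (2020); simplified sketch."""
    def __init__(self, dim=256, heads=4, kernel=31, expansion=4, dropout=0.1):
        super().__init__()
        def ffn():  # pre-norm FFN: Linear -> Swish -> Dropout -> Linear -> Dropout
            return nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, expansion * dim), nn.SiLU(), nn.Dropout(dropout),
                nn.Linear(expansion * dim, dim), nn.Dropout(dropout),
            )
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.norm_mhsa = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.norm_conv = nn.LayerNorm(dim)
        self.conv_module = nn.Sequential(             # operates on (B, D, T)
            nn.Conv1d(dim, 2 * dim, 1),               # pointwise, expand to 2D
            nn.GLU(dim=1),                            # gated linear unit
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),
            nn.BatchNorm1d(dim), nn.SiLU(),           # batch norm, Swish
            nn.Conv1d(dim, dim, 1), nn.Dropout(dropout),
        )
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, x):                             # x: (B, T, D)
        x = x + 0.5 * self.ffn1(x)                    # 1. half-step FFN
        h = self.norm_mhsa(x)
        x = x + self.mhsa(h, h, h, need_weights=False)[0]   # 2. MHSA + residual
        c = self.norm_conv(x).transpose(1, 2)         # to (B, D, T)
        x = x + self.conv_module(c).transpose(1, 2)   # 3. convolution module
        x = x + 0.5 * self.ffn2(x)                    # 4. half-step FFN
        return self.norm_out(x)                       # 5. final LayerNorm
```

This class can be passed as `conformer_block_cls` to the separator sketches in section 1, e.g. `TCNConformerSeparator(ConformerBlock, K=4)`.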

3. Mask Estimation and Reconstruction

Both separator types parameterize a mask estimator computing a bounded mask

\hat{M} \in [0,1]^{T \times D}

from the concatenated input $[Z \,\|\, E_{spk}]$. This mask is applied element-wise to the encoded mixture $Z$, and the masked representation is inverted by the decoder to yield the separated time-domain waveform.

4. Training Objective and Optimization

Training is end-to-end and supervised, jointly optimizing the speaker embedder and separator via a multi-task loss:

\mathcal{L} = \mathcal{L}_{\mathrm{sep}} + \lambda\,\mathcal{L}_{\mathrm{emb}}

with

  • Separator loss: Negative multi-scale SI-SNR over three encoder scales,

\mathcal{L}_{\mathrm{sep}} = -\sum_{i=1}^{3} w_i\, \mathrm{SI\text{-}SNR}(\hat{s}_i, s)

The scale weights $w_i$ are typically uniform. SI-SNR is computed as

\mathrm{SI\text{-}SNR}(\hat{s}, s) = 10 \log_{10} \frac{\|\alpha s\|^2}{\|\hat{s} - \alpha s\|^2}, \qquad \alpha = \frac{\langle \hat{s}, s \rangle}{\|s\|^2}

  • Embedder loss: cross-entropy loss over the $N$ training speakers,

\mathcal{L}_{\mathrm{emb}} = -\sum_{j=1}^{N} t_j \log p_j

where $t$ is the one-hot target-speaker label and $p$ the predicted speaker posterior.
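A sketch of this multi-task objective in PyTorch. The uniform scale weights, the loss weight `emb_weight` (the $\lambda$ above), and the tensor shapes are assumptions consistent with the formulas:

```python
import torch
import torch.nn.functional as F

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; est, ref: (B, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)   # zero-mean (common convention)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Scaling factor alpha = <est, ref> / ||ref||^2 projects est onto ref
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

def multitask_loss(estimates, reference, speaker_logits, speaker_ids,
                   scale_weights=(1/3, 1/3, 1/3), emb_weight=1.0):
    """L = -sum_i w_i * SI-SNR_i + lambda * CE over speaker identities.
    `estimates` holds one separated waveform per encoder scale."""
    sep_loss = -sum(w * si_snr(est, reference).mean()
                    for w, est in zip(scale_weights, estimates))
    emb_loss = F.cross_entropy(speaker_logits, speaker_ids)
    return sep_loss + emb_weight * emb_loss
```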

Adam optimization is used for 150 epochs on 4 s audio segments, with early stopping (patience of 6 epochs).
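A minimal training-loop skeleton matching this recipe. Only the epoch count and the patience come from the text; the learning rate and the `validate` helper are hypothetical placeholders:

```python
import torch

model = TimeDomainExtractor()            # sketch from section 1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

best_val, patience, bad_epochs = float("inf"), 6, 0
for epoch in range(150):
    # ... train one epoch on 4 s segments, updating via `optimizer` ...
    val_loss = validate(model)           # hypothetical validation helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # early stopping at 6 stale epochs
            break
```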

5. Empirical Evaluation and Results

Table: SI-SDR gain (dB) relative to the input mixture, for systems trained on 2-mix only ($K = 3$):

System 2-mix 3-mix noisy-mix
Mixture (input) 2.51 -1.27 -3.21
TCN baseline [20] 16.15 4.18 -2.30
Conformer-FFN 15.60 4.08 -3.64
TCN-Conformer 16.85 4.56 -0.24

In the joint test scenario (2/3/noisy-mix):

System K 2-mix 3-mix noisy-mix
TCN baseline – 14.87 8.43 7.92
Conformer-FFN 4 14.07 7.67 7.56
TCN-Conformer 4 17.51 10.70 9.32

TCN-Conformer with $K = 4$ achieves absolute SI-SDR improvements over the TCN baseline of +2.64 dB (2-mix), +2.27 dB (3-mix), and +1.40 dB (noisy-mix). Conformer-FFN yields modest gains up to $K = 4$ but does not surpass the TCN baseline.

6. Ablation Studies and Hyperparameter Configurations

Ablations varying $K$ (the number of stacks) demonstrate that Conformer-FFN yields limited improvement, while TCN-Conformer shows monotonically increasing SI-SDR as $K$ grows from 1 to 3 to 4. Core hyperparameters include:

  • Encoder/decoder filter lengths: [2.5 ms, 10 ms, 20 ms].
  • Separator dimension: $D + 256$ after concatenation with $E_{spk}$; external FFN output size: 256.
  • Conformer: multi-head self-attention (head count $n_h$), convolution kernel size $P$, FFN expansion factor: 4.
  • TCN block: two 1×1 (pointwise) convolutions, PReLU + gLN, depthwise separable CNN (kernel = 3, block-dependent dilation).
  • External FFN: two linear layers, Swish activation, dropout.
  • $K$ stacks, varied from 1 to 4 in the ablations (a configuration sketch follows this list).
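These settings can be collected into a single configuration object, as sketched below. Values marked as assumed were not recoverable from the source and are illustrative placeholders only:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SeparatorConfig:
    """Hyperparameters from section 6; 'assumed' values are placeholders."""
    encoder_filter_lengths_ms: List[float] = field(
        default_factory=lambda: [2.5, 10.0, 20.0])  # three encoder scales
    emb_dim: int = 256          # speaker embedding size E_spk
    ffn_out_dim: int = 256      # external FFN output size
    ffn_expansion: int = 4      # Conformer FFN expansion factor
    num_heads: int = 4          # assumed
    conv_kernel: int = 31       # assumed (Gulati et al. default)
    tcn_kernel: int = 3
    dropout: float = 0.1        # assumed
    num_stacks: int = 4         # K; best-performing value in the ablations
```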

This suggests that deeper (larger $K$) TCN-Conformer architectures, through alternating convolutional and Conformer modules, more effectively capture both the short-term and long-range dependencies required for robust CSS.

7. Context and Significance

The Conformer-based CSS architectures interleave convolutional and self-attention mechanisms, allowing simultaneous modeling of local structure (via TCN/dilated CNN) and global dependencies (via MHSA). The TCN-Conformer design, by alternating TCN and Conformer blocks, consistently outperforms baselines in all tested conditions. These findings establish the utility of Conformer-based designs for speaker-conditioned separation, highlighting the importance of architectural hybridization for tackling complex audio mixtures (Sinha et al., 2022).
