Kaldi-Based ASR Protocol
- The Kaldi-based ASR protocol combines traditional Kaldi techniques with advanced deep learning models for target speaker extraction.
- It employs Conformer and TCN architecture blocks, multi-scale encoding, and speaker conditioning to enhance speech separation in multi-speaker and noisy scenarios.
- Empirical evaluations demonstrate that the TCN-Conformer separator improves SI-SDR by roughly 2–3 dB over baselines under joint-dataset training, underscoring its effectiveness in adverse acoustic conditions.
A Kaldi-based ASR protocol refers to end-to-end speech processing systems that leverage protocol components or modeling strategies akin to the Kaldi toolkit, with a particular focus on architectures, workflows, and system evaluation typical of state-of-the-art research. In the context of target speaker extraction (TSE), recent approaches integrate conformer-based and temporal convolutional network (TCN) separators, speaker conditioning, and multi-scale encoders for robust performance in adverse speech mixtures. Two principal separator architectures—the Conformer-FFN and the TCN-Conformer—demonstrate how such protocols combine deep learning blocks with established practices in the Kaldi ecosystem to enable accurate, speaker-aware speech separation and robust automatic speech recognition in multi-speaker and noisy environments (Sinha et al., 2022).
1. Conformer Layer: Structure and Computation
Each Conformer block comprises four sub-layers: (1) a position-wise feed-forward module (FFN), (2) multi-head self-attention (MHSA), (3) a depthwise convolutional module, and (4) a second FFN. All sub-layers use pre-layer normalization and residual connections.
Given input $x \in \mathbb{R}^{T \times D}$:
- FFN₁: $x_1 = x + \tfrac{1}{2}\,\mathrm{FFN}(x)$, where $\mathrm{FFN}(\cdot) = \mathrm{Dropout}\big(W_2\,\mathrm{Swish}(W_1\,\mathrm{LayerNorm}(\cdot))\big)$ with hidden expansion (typically 4×).
- MHSA: LayerNorm normalizes $x_1$; queries, keys, and values are projected into $h$ heads of dimension $D/h$, with standard scaled dot-product attention and an output projection back to dimension $D$: $x_2 = x_1 + \mathrm{Dropout}\big(\mathrm{MHSA}(\mathrm{LayerNorm}(x_1))\big)$.
- Convolutional Module: a point-wise convolution (expansion factor 2, as required by the GLU), gated linear unit (GLU), depthwise convolution (kernel size = 31, padding = 15, groups = $D$), batch normalization, Swish activation, another point-wise convolution, dropout, and a residual link: $x_3 = x_2 + \mathrm{Conv}(\mathrm{LayerNorm}(x_2))$.
- FFN₂: a second half-step FFN and residual connection as in FFN₁, followed by a final LayerNorm, yields the block output $x_4 = \mathrm{LayerNorm}\big(x_3 + \tfrac{1}{2}\,\mathrm{FFN}(x_3)\big)$.
LayerNorm and Dropout occur within each residual block, following best practices for stability and regularization.
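As a concrete illustration, the block above can be sketched in PyTorch. Dimensions, dropout rate, and expansion factors below are illustrative defaults, not necessarily the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN with pre-LayerNorm, Swish activation, and dropout."""
    def __init__(self, d_model, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),                      # Swish
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    """Pointwise conv -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv."""
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)   # 2x expansion for the GLU
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)
        self.drop = nn.Dropout(dropout)
    def forward(self, x):                   # x: (B, T, D)
        y = self.norm(x).transpose(1, 2)    # to (B, D, T) for 1-D convs
        y = nn.functional.glu(self.pw1(y), dim=1)
        y = self.pw2(nn.functional.silu(self.bn(self.dw(y))))
        return self.drop(y.transpose(1, 2))

class ConformerBlock(nn.Module):
    """Half-step FFN -> MHSA -> Conv -> half-step FFN, each with a residual."""
    def __init__(self, d_model=256, n_heads=4, dropout=0.1):
        super().__init__()
        self.ffn1 = FeedForward(d_model, dropout=dropout)
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.conv = ConvModule(d_model, dropout=dropout)
        self.ffn2 = FeedForward(d_model, dropout=dropout)
        self.norm_out = nn.LayerNorm(d_model)
    def forward(self, x):                   # x: (B, T, D)
        x = x + 0.5 * self.ffn1(x)
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        return self.norm_out(x + 0.5 * self.ffn2(x))
```

The depthwise convolution's padding of `kernel_size // 2` preserves the sequence length, so every residual addition is shape-compatible.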
2. Separator Architectures: Conformer-FFN and TCN-Conformer
Both separators process features derived from a multi-scale encoder. For frame lengths $L_s$, the encoder outputs features $E_s \in \mathbb{R}^{D \times T}$ per scale $s$. A ResNet-based speaker embedder extracts an embedding $e$ from a reference utterance of the target speaker, which is tiled across time and concatenated to the encoder features.
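A minimal sketch of such a multi-scale encoder, assuming parallel 1-D convolutions with different kernel sizes over the raw waveform; the specific kernel sizes, stride, and channel count here are illustrative:

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """Parallel 1-D conv encoders with different frame lengths (kernel sizes).

    Each branch maps the raw waveform (B, 1, N) to a feature map (B, D, T);
    padding is chosen so all scales produce (nearly) the same T, and any
    leftover frames are trimmed to the shortest scale.
    """
    def __init__(self, d_model=256, kernel_sizes=(16, 32, 64), stride=8):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, d_model, k, stride=stride, padding=(k - stride) // 2)
            for k in kernel_sizes
        )
    def forward(self, wav):                       # wav: (B, 1, N)
        feats = [torch.relu(b(wav)) for b in self.branches]
        T = min(f.shape[-1] for f in feats)       # align lengths across scales
        return [f[..., :T] for f in feats]        # list of (B, D, T), one per scale
```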
Conformer-FFN Network Workflow:
- Input projection: the encoder features concatenated with the tiled embedding $e$ are projected to $D$ dimensions.
- K Conformer stacks: for $k = 1, \dots, K$, apply $\mathrm{ConformerBlock}_k$ (input $z_{k-1}$), followed by an ExternalFFN, then concatenation with the tiled embedding $e$.
- Mask estimation: a final projection with a nonlinearity produces the mask $M$.
- Reconstruction: after masking the encoder features with $M$, the multi-scale decoder reconstructs the estimated waveform $\hat{s}$.
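The workflow above can be sketched as a skeleton module. Here `TransformerEncoderLayer` stands in for the full Conformer block purely so the skeleton runs, and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class ConformerFFNSeparator(nn.Module):
    """Skeleton of the Conformer-FFN pathway: input projection, K stages,
    re-concatenating the tiled speaker embedding between stages.

    A TransformerEncoderLayer is used as a placeholder for the Conformer
    block of Section 1; all sizes are illustrative.
    """
    def __init__(self, d_feat=256, d_emb=128, d_model=256, num_blocks=4):
        super().__init__()
        self.proj_in = nn.Linear(d_feat + d_emb, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True,
                                       norm_first=True)
            for _ in range(num_blocks)
        )
        self.reproj = nn.ModuleList(
            nn.Linear(d_model + d_emb, d_model) for _ in range(num_blocks - 1)
        )
        self.mask_head = nn.Sequential(nn.Linear(d_model, d_feat), nn.ReLU())
    def forward(self, feats, emb):            # feats: (B, T, d_feat), emb: (B, d_emb)
        e = emb.unsqueeze(1).expand(-1, feats.shape[1], -1)   # tile over time
        z = self.proj_in(torch.cat([feats, e], dim=-1))
        for k, block in enumerate(self.blocks):
            z = block(z)
            if k < len(self.reproj):          # re-inject the speaker embedding
                z = self.reproj[k](torch.cat([z, e], dim=-1))
        mask = self.mask_head(z)              # nonnegative mask, (B, T, d_feat)
        return mask * feats                   # masked encoder features
```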
TCN-Conformer Network Workflow:
- Initialization: the encoder features are concatenated with the tiled embedding $e$ and projected to match the TCN input width.
- K block stacks: for $k = 1, \dots, K$, apply a TCNBlock (per Luo & Mesgarani, 2019) followed by a ConformerBlock with the tiled embedding $e$ concatenated to its input. Mask generation and decoding are identical to the Conformer-FFN pathway.
The TCNBlock consists of one 1×1 convolution followed by PReLU and global layer normalization (gLN), a depthwise convolution (kernel size 3, configurable dilation), and another 1×1 convolution, aggregated via a residual connection.
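A sketch of this TCN block following the Conv-TasNet design; channel widths are illustrative, and `GroupNorm(1, C)` is used to realize global layer normalization:

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """Conv-TasNet-style block (Luo & Mesgarani, 2019): 1x1 conv -> PReLU ->
    gLN -> depthwise conv (kernel 3, configurable dilation) -> PReLU -> gLN
    -> 1x1 conv, aggregated via a residual connection."""
    def __init__(self, channels=256, hidden=512, dilation=1):
        super().__init__()
        self.in_conv = nn.Conv1d(channels, hidden, 1)
        self.prelu1 = nn.PReLU()
        self.gln1 = nn.GroupNorm(1, hidden)   # GroupNorm(1, C) == global LN over (C, T)
        self.dw_conv = nn.Conv1d(hidden, hidden, 3, padding=dilation,
                                 dilation=dilation, groups=hidden)
        self.prelu2 = nn.PReLU()
        self.gln2 = nn.GroupNorm(1, hidden)
        self.out_conv = nn.Conv1d(hidden, channels, 1)
    def forward(self, x):                     # x: (B, C, T)
        y = self.gln1(self.prelu1(self.in_conv(x)))
        y = self.gln2(self.prelu2(self.dw_conv(y)))
        return x + self.out_conv(y)           # residual aggregation
```

Setting `padding=dilation` with a kernel of 3 keeps the sequence length fixed, so blocks with increasing dilation can be stacked to grow the receptive field.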
3. End-to-End Time-Domain Target Speaker Extraction Workflow
The architecture forms a pipeline comprising:
- Multi-scale encoder: applied to the mixture waveform $y$, yielding features $E$.
- Speaker embedder: processes the reference utterance $r$ to produce the embedding $e$.
- Separator (Conformer-FFN or TCN-Conformer): accepts $(E, e)$ and outputs masks $M$.
- Masking and decoding: the masked features $M \odot E$ are reconstructed into the waveform $\hat{s}$.
Speaker conditioning is performed by concatenating the tiled embedding $e$ to every frame; no feature-wise affine transform (e.g., FiLM) is applied, although such a substitution is possible.
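The two conditioning options can be contrasted in a few lines (shapes here are illustrative):

```python
import torch

def condition_concat(feats, emb):
    """Concatenative conditioning, as used in this protocol: tile the
    utterance-level embedding across time and append it to every frame.
    feats: (B, T, D), emb: (B, E) -> (B, T, D + E)."""
    tiled = emb.unsqueeze(1).expand(-1, feats.shape[1], -1)
    return torch.cat([feats, tiled], dim=-1)

def condition_film(feats, gamma, beta):
    """The FiLM alternative mentioned in the text (not used here): a
    feature-wise affine transform whose scale/shift would be predicted
    from the speaker embedding. gamma, beta: (B, D)."""
    return gamma.unsqueeze(1) * feats + beta.unsqueeze(1)
```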
4. Training Procedures, Datasets, and Evaluation
Dataset: WSJ0 + WHAM, encompassing 2-speaker (“2-mix”), 3-speaker (“3-mix”), and noisy 2-speaker (“noisy-mix”) mixtures. Interferer SNR is uniformly sampled from [0, 5] dB, with separate training, development, and test partitions per condition.
Training regime: the separator and speaker embedder are jointly trained using a multi-task loss:
- $\mathcal{L} = \sum_s \alpha_s\,\mathcal{L}^{(s)}_{\text{SI-SNR}} + \lambda\,\mathcal{L}_{\text{CE}}$
- The SI-SNR loss applies per-scale weights $\alpha_s$.
- Cross-entropy is used for speaker identification.
- 4-second input segments, Adam optimizer with a decaying learning-rate schedule, 150 epochs, early-stopping patience of 6.
Evaluation metric: Scale-invariant signal-to-distortion ratio (SI-SDR) measured in decibels.
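SI-SDR can be computed directly from the estimated and reference waveforms; a minimal NumPy implementation of the standard definition:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the (mean-removed) estimate onto
    the reference, then compare the projected-signal energy against the
    residual energy. Both inputs are 1-D waveforms of equal length."""
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # optimal scaling
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((target ** 2).sum() / ((noise ** 2).sum() + eps))
```

Because the optimal scaling factor is solved for analytically, rescaling the estimate leaves the score essentially unchanged, which is what makes the metric "scale-invariant".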
5. Comparative Results and Significance
Experimental results using the SI-SDR metric highlight the superior performance of the TCN-Conformer separator over the Conformer-FFN and pure-TCN baselines. For models trained only on the 2-mix condition:
| Separator | 2-mix SI-SDR | 3-mix SI-SDR | Noisy-mix SI-SDR |
|---|---|---|---|
| Baseline TCN | 16.15 dB | 4.18 dB | –2.30 dB |
| Conformer-FFN | 15.60 dB | 4.08 dB | –3.64 dB |
| TCN-Conformer | 16.85 dB | 4.56 dB | –0.24 dB |
When trained on the joint dataset:
| Separator | 2-mix SI-SDR | 3-mix SI-SDR | Noisy-mix SI-SDR |
|---|---|---|---|
| Baseline TCN | 14.87 dB | 8.43 dB | 7.92 dB |
| Conformer-FFN | 14.07 dB | 7.67 dB | 7.56 dB |
| TCN-Conformer | 17.51 dB | 10.70 dB | 9.32 dB |
When trained on the joint dataset, the TCN-Conformer yields roughly +2–3 dB over both alternatives across all three test conditions; under 2-mix-only training its margin is smaller but still consistent (Sinha et al., 2022).
6. Block-Level Implementation and Pseudocode
The forward pass of the TCN-Conformer separator alternates TCN and Conformer stages over the speaker-conditioned encoder features, then estimates masks and decodes the target waveform.
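A hypothetical sketch of that forward pass, with all sub-modules assumed to be constructed elsewhere (names and signatures here are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

def tcn_conformer_forward(mix_feats, spk_emb, tcn_blocks, conformer_blocks,
                          reprojs, proj_in, mask_head):
    """Sketch of the TCN-Conformer separator's forward pass.

    mix_feats: (B, T, D) encoder features; spk_emb: (B, E) speaker embedding.
    tcn_blocks / conformer_blocks / reprojs are parallel lists of K modules;
    proj_in maps (D + E) -> model width; mask_head maps model width -> D.
    """
    e = spk_emb.unsqueeze(1).expand(-1, mix_feats.shape[1], -1)  # tile over time
    z = proj_in(torch.cat([mix_feats, e], dim=-1))               # initial projection
    for tcn, conf, reproj in zip(tcn_blocks, conformer_blocks, reprojs):
        z = tcn(z.transpose(1, 2)).transpose(1, 2)   # TCN operates on (B, C, T)
        z = conf(reproj(torch.cat([z, e], dim=-1)))  # re-concat embedding, conformer
    mask = torch.relu(mask_head(z))                  # nonnegative mask over features
    return mask * mix_feats                          # masked features -> decoder
```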
Network depths, kernel sizes, activation functions, and dropout rates of the full system follow the configuration reported in (Sinha et al., 2022).
7. Architectural Implications and Research Context
These separator designs exemplify the integration of conformer architectures and speaker conditioning mechanisms in end-to-end TSE systems. The demonstrated efficacy of the TCN-Conformer underscores the merit of combining temporal convolutional receptive field expansion with self-attention and convolutional feature transformations. The use of multi-scale encoding, residual pre-norm layers, and concatenative speaker conditioning offers a robust template for future ASR protocols targeting challenging acoustic scenarios. Substituting speaker conditioning mechanisms (e.g., using FiLM) remains a viable direction for further study, as noted in (Sinha et al., 2022).