
Frame-Level Diarization-Dependent Transformation (FDDT)

Updated 6 April 2026
  • FDDT is a conditioning mechanism that applies a frame-wise, differentiable transformation using diarization-derived probability vectors to modulate ASR encoder representations.
  • It integrates with models like Whisper by inserting learnable biases before self-attention blocks, reducing the need for speaker embedding vectors while enhancing target-speaker accuracy.
  • Empirical results demonstrate significant improvements, with absolute WER gains up to 39.1%, showcasing its robustness in processing overlapping and multi-speaker audio.

Frame-level Diarization-Dependent Transformation (FDDT) is a conditioning mechanism for automatic speech recognition (ASR) models that leverages frame-level diarization outputs to modulate the internal representations of each frame in the encoder. FDDT has been introduced and systematically developed to enable large pre-trained single-speaker ASR models, particularly Whisper, to perform robust target-speaker and speaker-attributed ASR in multi-speaker settings, eliminating the need for speaker embedding vectors and extensive speaker-specific training (Polok et al., 2024, Polok et al., 2024).

1. Formulation and Theoretical Foundation

FDDT is defined as a frame-wise, differentiable transformation applied to acoustic feature representations based on external diarization information. At each frame $t$, one first defines a diarization-derived “STNO” probability vector:

$$\mathbf{M}_t = [p_S^t,\ p_T^t,\ p_N^t,\ p_O^t]^\top$$

where:

  • $p_S^t = P(\text{silence at } t)$
  • $p_T^t = P(\text{target speaker only at } t)$
  • $p_N^t = P(\text{one or more non-target speakers only at } t)$
  • $p_O^t = P(\text{overlap: target speaker plus} \geq 1 \text{ non-target at } t)$

The four probabilities sum to 1 for each frame. Each class $c \in \{S, T, N, O\}$ is associated with a learnable bias $\mathbf{b}_c^l$ and, in the most general form, a linear transform $\mathbf{W}_c^l$ per encoder layer $l$.

Given an input representation $\mathbf{h}_t^l$ at layer $l$ and frame $t$, FDDT computes a convex combination:

$$\hat{\mathbf{h}}_t^l = \sum_{c \in \{S,T,N,O\}} p_c^t \left( \mathbf{W}_c^l \mathbf{h}_t^l + \mathbf{b}_c^l \right)$$

This formulation enables soft, probabilistic modulation of frame representations. If diarization produces hard labels (one-hot), the sum collapses to selection of a single affine transform per frame.
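The convex combination above can be sketched in a few lines of NumPy (a minimal illustration with made-up shapes and variable names; the suppressive initialisation follows Section 2):

```python
import numpy as np

def fddt(h, stno, W, b):
    """Frame-level Diarization-Dependent Transformation (sketch).

    h    : (T, d)    frame representations at one encoder layer
    stno : (T, 4)    per-frame [p_S, p_T, p_N, p_O]; rows sum to 1
    W    : (4, d, d) per-class linear transforms
    b    : (4, d)    per-class biases
    Returns the convex combination sum_c p_c^t (W_c h_t + b_c).
    """
    # (T, 4, d): every class's affine map applied to every frame
    transformed = np.einsum('cij,tj->tci', W, h) + b[None, :, :]
    # mix the four class outputs per frame with the STNO probabilities
    return np.einsum('tc,tcd->td', stno, transformed)

rng = np.random.default_rng(0)
T, d = 5, 8
h = rng.normal(size=(T, d))
# Suppressive initialisation described in the text: zero biases,
# identity weights for target/overlap, zero weights for silence/non-target.
W = np.stack([np.zeros((d, d)), np.eye(d), np.zeros((d, d)), np.eye(d)])
b = np.zeros((4, d))
stno = np.tile([0.0, 1.0, 0.0, 0.0], (T, 1))  # hard "target only" labels
out = fddt(h, stno, W, b)
assert np.allclose(out, h)  # at init, target-only frames pass through unchanged
```

With hard one-hot labels the sum collapses to a single affine transform per frame, matching the statement above; with soft probabilities the four transforms are interpolated.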

2. Integration in Whisper and Encoder Modifications

In implemented systems, FDDT is inserted immediately before the first self-attention block of the Whisper encoder, optionally before each encoder block. For Whisper-medium or -large models, this requires four bias vectors per insertion point (or, when using the full affine variant, four weight matrices and four bias vectors per encoder layer). The default and most parameter-efficient instantiation learns only the biases, not the full affine transforms. Initialization is performed to minimally disturb pre-trained weights: bias vectors are zero, weight matrices are either identities (for target and overlap) or zero (for silence and non-target), thereby initially suppressing non-target representations (Polok et al., 2024, Polok et al., 2024).

The diarization front-end generates the required per-speaker activity probabilities $s_k^t$ at a frame rate aligned with Whisper’s hop size (e.g., 10 ms). Treating speakers as independent, the STNO probabilities for target speaker $k$ are calculated as follows:

  • $p_S^t = (1 - s_k^t) \prod_{j \neq k} (1 - s_j^t)$
  • $p_T^t = s_k^t \prod_{j \neq k} (1 - s_j^t)$
  • $p_N^t = (1 - s_k^t) \left(1 - \prod_{j \neq k} (1 - s_j^t)\right)$
  • $p_O^t = s_k^t \left(1 - \prod_{j \neq k} (1 - s_j^t)\right)$
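This STNO construction can be sketched as follows (a minimal NumPy illustration; variable names are mine, and per-speaker activities are treated as independent):

```python
import numpy as np

def stno_mask(activities, target):
    """Per-frame STNO probabilities from per-speaker activities (sketch).

    activities : (K, T) array, activities[k, t] = P(speaker k active at t)
    target     : index of the target speaker
    Returns a (T, 4) array of [p_S, p_T, p_N, p_O] per frame.
    """
    s_t = activities[target]                        # target activity
    others = np.delete(activities, target, axis=0)  # non-target speakers
    none_other = np.prod(1.0 - others, axis=0)      # P(no non-target active)
    p_s = (1.0 - s_t) * none_other
    p_t = s_t * none_other
    p_n = (1.0 - s_t) * (1.0 - none_other)
    p_o = s_t * (1.0 - none_other)
    return np.stack([p_s, p_t, p_n, p_o], axis=-1)

acts = np.array([[0.9, 0.1],   # target speaker activity over two frames
                 [0.2, 0.8]])  # one non-target speaker
m = stno_mask(acts, target=0)
assert np.allclose(m.sum(axis=-1), 1.0)  # each row is a proper distribution
```

The four terms partition the event space per frame, so each row of the mask sums to one by construction.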

During training, ground-truth diarization can be used; at inference, external diarization models are employed.

3. Training Procedure and Optimization

FDDT-based ASR systems use a hybrid CTC and attention loss, with a fixed interpolation weight on the CTC term. The optimizer is AdamW, employing bf16 precision and linear decay scheduling (with warm-up). FDDT parameters are trained with a larger learning rate than the pre-trained backbone.
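The hybrid objective is a simple interpolation of the two losses; a sketch (the 0.3 default below is a common choice in hybrid CTC/attention systems, not necessarily the papers' value):

```python
def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Hybrid CTC/attention objective: a convex combination of the
    CTC loss and the attention (cross-entropy) loss."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss

# Equal weighting splits the difference between the two losses.
assert hybrid_loss(2.0, 1.0, ctc_weight=0.5) == 1.5
```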

A multi-phase training schedule is adopted:

  1. CTC preheating: Freeze all but the CTC head for pre-training (e.g., 10k steps).
  2. Amplification: Train FDDT+CTC on meeting data for one epoch.
  3. Joint fine-tuning: Unfreeze all weights, using early stopping based on validation.
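The phase schedule can be expressed as a trainability table plus per-module learning rates; a plain-Python sketch (module names and the concrete learning-rate values are illustrative, not taken from the papers):

```python
# Which module groups are unfrozen in each training phase.
PHASES = {
    "ctc_preheat":   {"ctc_head"},                       # 1. CTC preheating
    "amplification": {"fddt", "ctc_head"},               # 2. Amplification
    "joint":         {"backbone", "fddt", "ctc_head"},   # 3. Joint fine-tuning
}

def param_groups(phase, base_lr=1e-6, fddt_lr=1e-4):
    """Return (module, lr) pairs for the modules trained in this phase.
    FDDT parameters (and the new CTC head) get a larger learning rate
    than the pre-trained backbone, as described in the text."""
    lrs = {"backbone": base_lr, "fddt": fddt_lr, "ctc_head": fddt_lr}
    return [(name, lrs[name]) for name in sorted(PHASES[phase])]

# In phase 1, only the CTC head receives gradient updates.
assert param_groups("ctc_preheat") == [("ctc_head", 1e-4)]
```

In practice these pairs would feed an AdamW optimizer's parameter groups, with frozen modules excluded entirely.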

Regularization is accomplished via weight decay and zero initialization of biases, which mitigates disruptive updates early in training (Polok et al., 2024).

4. Inference Algorithm

At inference, for each utterance and target speaker:

  1. Compute Mel-spectrogram and convolutional frontend features.
  2. Obtain frame-level diarization outputs (per-speaker activity probabilities).
  3. Calculate the STNO mask for each frame.
  4. Apply FDDT to form conditioned embeddings before or at each encoder block.
  5. Pass through the encoder and decoder (using CTC + attention decoding as needed).

The FDDT module ensures that the model focuses on target or target+overlap regions, smoothly interpolating when diarization probabilities are ambiguous. For speaker-attributed ASR, the process is repeated for each diarized speaker; transcripts are then collated to yield joint meeting transcriptions without explicit source separation (Polok et al., 2024).
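The per-speaker inference loop above can be sketched as follows (the transcription function is a hypothetical stand-in; a real system would run the FDDT-conditioned Whisper encoder/decoder there):

```python
import numpy as np

def stno_mask(activities, target):
    """STNO probabilities for one target speaker, as defined in the text
    (independence across speakers assumed)."""
    s_t = activities[target]
    none_other = np.prod(1.0 - np.delete(activities, target, axis=0), axis=0)
    return np.stack([(1 - s_t) * none_other, s_t * none_other,
                     (1 - s_t) * (1 - none_other),
                     s_t * (1 - none_other)], axis=-1)

def transcribe_target(features, stno):
    """Stand-in for the FDDT-conditioned encoder/decoder pass."""
    return f"{int(stno[:, 1].round().sum())} target-only frames"

def speaker_attributed_asr(features, activities):
    # One FDDT-conditioned decoding pass per diarized speaker; the
    # per-speaker transcripts are then collated -- no source separation.
    return {k: transcribe_target(features, stno_mask(activities, k))
            for k in range(activities.shape[0])}

acts = np.array([[1.0, 1.0, 0.0, 0.0],   # speaker 0 activity per frame
                 [0.0, 0.0, 1.0, 1.0]])  # speaker 1 activity per frame
result = speaker_attributed_asr(None, acts)
assert result == {0: "2 target-only frames", 1: "2 target-only frames"}
```

Note that the encoder features are computed once per utterance; only the STNO mask and the conditioned forward pass change between speakers.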

5. Empirical Performance and Ablation Studies

Experiments in both "Target Speaker ASR with Whisper" (Polok et al., 2024) and "DiCoW" (Polok et al., 2024) report substantial improvements over strong baselines:

| Dataset | Baseline (Input Masking) | FDDT (Whisper-large-v3) | Absolute Gain |
| --- | --- | --- | --- |
| NOTSOFAR-1 | 35.5% ORC-WER | 24.5% | 11.0% |
| AMI-sdm | 79.1% | 48.5% | 30.6% |
| Libri2Mix | 56.7% | 17.6% | 39.1% |

Ablations demonstrate:

  • Bias-only vs. affine: Bias-only achieves near-parity with affine transforms (28.0% vs. 26.7% WER).
  • Layer depth: Single-layer insertion matches multi-layer within 0.7% WER.
  • Mask complexity: The full 4-class STNO mask outperforms reduced (2-class) masks.
  • Initialization: Suppressive initialization improves WER by 0.7% absolute compared to random initialization.

Combining multiple datasets in training slightly improves robustness and accuracy (e.g., training on NOTSOFAR-1+AMI+Libri2Mix yields 24.8% WER vs. 26.7% with NOTSOFAR-1 alone).

6. Motivations, Architectural Implications, and Robustness

The rationale underlying FDDT is the relative ease of learning frame-level conditions (whether a given frame contains the target speaker) compared to global speaker representations. FDDT avoids the embedding generalization collapse observed when targeting unseen speakers with classical approaches. By conditioning on already-available diarization probabilities, FDDT achieves efficient generalization, minimal parameter overhead, and fast convergence when fine-tuning Whisper encoders (Polok et al., 2024).

FDDT’s convex mixing at the frame level allows for differentiable handling of overlaps and uncertainties from diarization. Empirical findings show the mechanism to be robust to diarization noise, and direct bias modulation at the embedding level suffices to re-orient the model’s focus on target speech, obviating the need for source separation or embedding-to-encoder mappings.

7. Extensions and Comparative Merits

FDDT supports sequential per-speaker inference, facilitating full speaker-attributed transcriptions by leveraging repeated runs of a single-speaker model. This eliminates the need for costly source separation or specialized multi-speaker outputs. Additionally, FDDT is not restricted to Whisper: it has demonstrated benefits when applied to other architectures such as Branchformer (Polok et al., 2024).

A summary of design motivations and comparative analysis:

| Aspect | FDDT | Speaker Embedding Conditioning |
| --- | --- | --- |
| Conditioning granularity | Frame-level, STNO soft mask | Global embedding |
| Generalization (unseen speakers) | Strong; uses diarization only | Weak; requires speaker diversity |
| Training speed | Fast (few fine-tuning steps) | Slow |
| Parameter cost | Minimal (4×d per insertion site) | Variable (embedding/pooling) |
| Robustness to diarization errors | High (soft mixing) | Lower |

In conclusion, frame-level diarization-dependent transformation provides a lightweight, empirically validated, and highly performant approach for extending single-speaker ASR models to robust target-speaker and speaker-attributed transcription in multi-speaker environments (Polok et al., 2024, Polok et al., 2024).

References (2)

  1. Polok et al. (2024). Target Speaker ASR with Whisper.
  2. Polok et al. (2024). DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition.
