Frame-Level Diarization-Dependent Transformation (FDDT)
- FDDT is a conditioning mechanism that applies a frame-wise, differentiable transformation using diarization-derived probability vectors to modulate ASR encoder representations.
- It integrates with models like Whisper by inserting learnable biases before self-attention blocks, reducing the need for speaker embedding vectors while enhancing target-speaker accuracy.
- Empirical results demonstrate significant improvements, with absolute WER gains up to 39.1%, showcasing its robustness in processing overlapping and multi-speaker audio.
Frame-level Diarization-Dependent Transformation (FDDT) is a conditioning mechanism for automatic speech recognition (ASR) models that leverages frame-level diarization outputs to modulate the internal representations of each frame in the encoder. FDDT has been introduced and systematically developed to enable large pre-trained single-speaker ASR models, particularly Whisper, to perform robust target-speaker and speaker-attributed ASR in multi-speaker settings, eliminating the need for speaker embedding vectors and extensive speaker-specific training (Polok et al., 2024, Polok et al., 2024).
1. Formulation and Theoretical Foundation
FDDT is defined as a frame-wise, differentiable transformation applied to acoustic feature representations based on external diarization information. At each frame $t$, one first defines a diarization-derived “STNO” probability vector

$$\mathbf{p}_t = \left[\, p^{S}_t,\; p^{T}_t,\; p^{N}_t,\; p^{O}_t \,\right],$$

where:
- $p^{S}_t$ = P(silence at $t$)
- $p^{T}_t$ = P(target speaker only at $t$)
- $p^{N}_t$ = P(one or more non-target speakers only at $t$)
- $p^{O}_t$ = P(overlap: target speaker plus $\geq 1$ non-target speakers at $t$)
The four probabilities sum to 1 for each frame. Each class $c \in \{S, T, N, O\}$ is associated with a learnable bias $\mathbf{b}^{l}_{c}$ and, in the most general form, a linear transform $\mathbf{W}^{l}_{c}$ per encoder layer $l$.
Given an input representation $\mathbf{h}^{l}_{t}$ at layer $l$ and frame $t$, FDDT computes a convex combination:

$$\hat{\mathbf{h}}^{l}_{t} = \sum_{c \in \{S,T,N,O\}} p^{c}_{t} \left( \mathbf{W}^{l}_{c}\, \mathbf{h}^{l}_{t} + \mathbf{b}^{l}_{c} \right).$$
This formulation enables soft, probabilistic modulation of frame representations. If diarization produces hard labels (one-hot), the sum collapses to selection of a single affine transform per frame.
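Below is a minimal PyTorch sketch of this frame-wise mixing. The module name `FDDT`, its constructor arguments, and the bias-only/affine switch are illustrative choices rather than the exact implementation released with the papers.

```python
import torch
import torch.nn as nn

class FDDT(nn.Module):
    """Frame-level Diarization-Dependent Transformation (bias-only or full affine)."""

    def __init__(self, d_model: int, num_classes: int = 4, use_affine: bool = False):
        super().__init__()
        # One learnable bias per STNO class, initialized to zero.
        self.biases = nn.Parameter(torch.zeros(num_classes, d_model))
        self.use_affine = use_affine
        if use_affine:
            # Full affine variant: one weight matrix per class (identity by default here).
            self.weights = nn.Parameter(torch.stack([torch.eye(d_model)] * num_classes))

    def forward(self, h: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        # h:    (batch, frames, d_model) encoder representations
        # stno: (batch, frames, num_classes) per-frame STNO probabilities (rows sum to 1)
        if self.use_affine:
            # Class-specific affine transforms, then convex mix: sum_c p_c * (W_c h + b_c)
            transformed = torch.einsum('btd,cde->btce', h, self.weights) + self.biases
            return torch.einsum('btc,btce->bte', stno, transformed)
        # Bias-only variant: h + sum_c p_c * b_c
        return h + torch.einsum('btc,cd->btd', stno, self.biases)
```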
2. Integration in Whisper and Encoder Modifications
In implemented systems, FDDT is inserted immediately before the first self-attention block of the Whisper encoder, optionally before each encoder block. For Whisper-medium or -large models, this requires four bias vectors per insertion point (or, when using the full affine variant, four weight matrices and four bias vectors per encoder layer). The default and most parameter-efficient instantiation learns only the biases, not the full affine transforms. Initialization is performed to minimally disturb pre-trained weights: bias vectors are zero, weight matrices are either identities (for target and overlap) or zero (for silence and non-target), thereby initially suppressing non-target representations (Polok et al., 2024, Polok et al., 2024).
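A short sketch of this suppressive initialization, assuming the affine `FDDT` module from the previous sketch and the class ordering [silence, target, non-target, overlap]:

```python
import torch

def init_suppressive(fddt: "FDDT") -> None:
    """Zero biases; identity weights for target/overlap, zero weights for silence/non-target."""
    d = fddt.biases.shape[-1]
    with torch.no_grad():
        fddt.biases.zero_()                      # all bias vectors start at zero
        if fddt.use_affine:
            fddt.weights[0].zero_()              # silence: suppress these frames
            fddt.weights[1].copy_(torch.eye(d))  # target-only: pass through unchanged
            fddt.weights[2].zero_()              # non-target-only: suppress these frames
            fddt.weights[3].copy_(torch.eye(d))  # overlap: pass through unchanged
```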
The diarization front-end generates the required per-speaker activity probabilities $p_{s}(t)$ at a frame rate aligned with Whisper’s hop size (e.g., 10 ms). Assuming speaker activities are independent, the STNO probabilities for target speaker $k$ are calculated as:
- $p^{S}_{t} = \big(1 - p_{k}(t)\big) \prod_{s \neq k} \big(1 - p_{s}(t)\big)$
- $p^{T}_{t} = p_{k}(t) \prod_{s \neq k} \big(1 - p_{s}(t)\big)$
- $p^{N}_{t} = \big(1 - p_{k}(t)\big) \Big(1 - \prod_{s \neq k} \big(1 - p_{s}(t)\big)\Big)$
- $p^{O}_{t} = p_{k}(t) \Big(1 - \prod_{s \neq k} \big(1 - p_{s}(t)\big)\Big)$
During training, ground-truth diarization can be used; at inference, external diarization models are employed.
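A sketch of this computation under the independence assumption stated above; `stno_probs` and its argument layout are hypothetical names, not an API from the papers.

```python
import torch

def stno_probs(activity: torch.Tensor, target: int) -> torch.Tensor:
    """Derive frame-wise STNO probabilities from per-speaker activity probabilities.

    activity: (num_speakers, frames) diarization probabilities in [0, 1]
    target:   index of the target speaker
    returns:  (frames, 4) with columns [silence, target-only, non-target-only, overlap]
    """
    p_tgt = activity[target]
    others = torch.cat([activity[:target], activity[target + 1:]], dim=0)
    # Probability that at least one non-target speaker is active (independence assumed).
    p_any_other = 1.0 - torch.prod(1.0 - others, dim=0)

    silence     = (1.0 - p_tgt) * (1.0 - p_any_other)
    target_only = p_tgt * (1.0 - p_any_other)
    non_target  = (1.0 - p_tgt) * p_any_other
    overlap     = p_tgt * p_any_other
    return torch.stack([silence, target_only, non_target, overlap], dim=-1)
```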
3. Training Procedure and Optimization
FDDT-based ASR systems use a hybrid CTC and attention loss, with a fixed interpolation weight on the CTC term. The optimizer is AdamW, employing bf16 precision and linear decay scheduling (with warm-up). FDDT parameters are trained with a larger learning rate than the pre-trained backbone.
A multi-phase training schedule is adopted:
- CTC preheating: Freeze all but the CTC head for pre-training (e.g., 10k steps).
- Amplification: Train FDDT+CTC on meeting data for one epoch.
- Joint fine-tuning: Unfreeze all weights, using early stopping based on validation.
Regularization is accomplished via weight decay and zero initialization of biases, which mitigates disruptive updates early in training (Polok et al., 2024).
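As an illustration of the two-learning-rate setup, the following sketch builds AdamW parameter groups. The concrete learning-rate values and the convention that FDDT parameters carry "fddt" in their names are assumptions for this sketch, not values from the papers.

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    fddt_lr: float = 1e-4,       # placeholder: larger LR for FDDT parameters
                    backbone_lr: float = 1e-6,   # placeholder: smaller LR for pre-trained weights
                    weight_decay: float = 1e-2) -> torch.optim.AdamW:
    fddt_params, backbone_params = [], []
    for name, param in model.named_parameters():
        (fddt_params if "fddt" in name.lower() else backbone_params).append(param)
    return torch.optim.AdamW(
        [{"params": fddt_params, "lr": fddt_lr},
         {"params": backbone_params, "lr": backbone_lr}],
        weight_decay=weight_decay,
    )
```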
4. Inference Algorithm
At inference, for each utterance and target speaker:
- Compute Mel-spectrogram and convolutional frontend features.
- Obtain frame-level diarization outputs (per-speaker activity probabilities).
- Calculate the STNO mask for each frame.
- Apply FDDT to form conditioned embeddings before or at each encoder block.
- Pass through the encoder and decoder (using CTC + attention decoding as needed).
The FDDT module ensures that the model focuses on target or target+overlap regions, smoothly interpolating when diarization probabilities are ambiguous. For speaker-attributed ASR, the process is repeated for each diarized speaker; transcripts are then collated to yield joint meeting transcriptions without explicit source separation (Polok et al., 2024).
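A high-level sketch of this per-speaker inference loop follows; `diarizer`, `asr_model.transcribe`, and the `stno_mask` keyword are hypothetical interfaces standing in for whatever diarization and conditioned-Whisper wrappers are actually used.

```python
def transcribe_meeting(audio, asr_model, diarizer):
    """Speaker-attributed transcription by repeated target-speaker decoding."""
    activity = diarizer(audio)                    # (num_speakers, frames) activity probabilities
    transcripts = {}
    for spk in range(activity.shape[0]):
        stno = stno_probs(activity, target=spk)   # STNO mask for this speaker (see sketch above)
        # FDDT conditions the encoder on `stno`; decoding proceeds as in single-speaker ASR.
        transcripts[spk] = asr_model.transcribe(audio, stno_mask=stno)
    return transcripts                            # collate per-speaker outputs downstream
```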
5. Empirical Performance and Ablation Studies
Experiments in both "Target Speaker ASR with Whisper" (Polok et al., 2024) and "DiCoW" (Polok et al., 2024) report substantial improvements over strong baselines:
| Dataset | Baseline (Input Masking) | FDDT (Whisper-large-v3) | Absolute Gain |
|---|---|---|---|
| NOTSOFAR-1 | 35.5% ORC-WER | 24.5% | 11.0% |
| AMI-sdm | 79.1% | 48.5% | 30.6% |
| Libri2Mix | 56.7% | 17.6% | 39.1% |
Ablations demonstrate:
- Bias-only vs. affine: Bias-only achieves near-parity with affine transforms (28.0% vs. 26.7% WER).
- Layer depth: Single-layer insertion matches multi-layer within 0.7% WER.
- Mask complexity: The full 4-class STNO mask outperforms reduced (2-class) masks.
- Initialization: Suppressive initialization improves WER by 0.7% absolute compared to random initialization.
Combining multiple datasets in training slightly improves robustness and accuracy (e.g., training on NOTSOFAR-1+AMI+Libri2Mix yields 24.8% WER vs. 26.7% with NOTSOFAR-1 alone).
6. Motivations, Architectural Implications, and Robustness
The rationale underlying FDDT is the relative ease of learning frame-level conditions (“is this frame target speech or not?”) compared to global speaker representations. FDDT avoids the embedding generalization collapse observed when targeting unseen speakers with classical approaches. By conditioning on already-available diarization probabilities, FDDT achieves efficient generalization, minimal parameter overhead, and fast convergence during fine-tuning of Whisper encoders (Polok et al., 2024).
FDDT’s convex mixing at the frame level allows for differentiable handling of overlaps and uncertainties from diarization. Empirical findings show the mechanism to be robust to diarization noise, and direct bias modulation at the embedding level suffices to re-orient the model’s focus on target speech, obviating the need for source separation or embedding-to-encoder mappings.
7. Extensions and Comparative Merits
FDDT supports sequential per-speaker inference, facilitating full speaker-attributed transcriptions by leveraging repeated runs of a single-speaker model. This eliminates the need for costly source separation or specialized multi-speaker outputs. Additionally, FDDT is not restricted to Whisper: it has demonstrated benefits when applied to other architectures such as Branchformer (Polok et al., 2024).
A summary of design motivations and comparative analysis:
| Aspect | FDDT | Speaker Embedding Conditioning |
|---|---|---|
| Conditioning granularity | Frame-level, STNO soft mask | Global embedding |
| Generalization (unseen spk) | Strong; uses diarization only | Weak; requires speaker diversity |
| Training speed | Fast fine-tuning | Slower |
| Parameter cost | Minimal (4×d per site) | Variable (embedding/pooling) |
| Robustness to diar. errors | High (soft mixing) | Lower |
In conclusion, frame-level diarization-dependent transformation provides a lightweight, empirically validated, and highly performant approach for extending single-speaker ASR models to robust target-speaker and speaker-attributed transcription in multi-speaker environments (Polok et al., 2024, Polok et al., 2024).