Frame-Level Diarization-Dependent Transformations
- FDDT is a conditioning mechanism that applies per-frame affine transformations using diarization masks to enable robust ASR in multi-speaker settings.
- It integrates with transformer encoders by modifying hidden states at each layer without altering sequence length or positional encoding.
- Empirical results demonstrate that FDDT significantly reduces word error rates compared to baseline methods, even in the presence of diarization imperfections.
Frame-Level Diarization-Dependent Transformations (FDDT) are a class of conditioning mechanisms for neural sequence models, particularly in speech processing, that modulate internal representations on a per-frame basis using externally provided diarization annotations. The primary context is target-speaker automatic speech recognition (ASR) and neural diarization. FDDT enables large, pre-trained, single-speaker models—such as Whisper—to function effectively in settings with overlapping or multi-speaker speech by leveraging fine-grained, frame-synchronous side information about speaker activity, rather than relying on global speaker embeddings or explicit source separation. The method is characterized by lightweight, learnable, typically affine, frame-level transformations controlled by diarization masks, supporting scenarios like target-speaker ASR, speaker-attributed ASR, and neural diarization conditioning (Polok et al., 2024, Polok et al., 2024, Fujita et al., 2023).
1. Mathematical Formulation and Conditioning Principle
FDDT operates by conditioning each frame's hidden state on the corresponding diarization label. In ASR settings, labels are commonly encoded as a four-way STNO (Silence, Target, Non-target, Overlap) indicator or mask per frame. The general FDDT affine conditioning at layer $l$ is:

$$\hat{z}_t^{(l)} = \sum_{C \in \{\mathrm{S},\mathrm{T},\mathrm{N},\mathrm{O}\}} m_{C,t}\left(W_C\, z_t^{(l)} + b_C\right)$$

Here:
- $z_t^{(l)}$: frame-$t$ input at layer $l$;
- $W_C$, $b_C$: class-specific learnable affine transform and bias;
- $m_{C,t}$: probability (from diarization) that frame $t$ is in class $C$;
- $C \in \{\mathrm{S},\mathrm{T},\mathrm{N},\mathrm{O}\}$; for soft masks the weights satisfy $\sum_C m_{C,t} = 1$.
For bias-only variants, $W_C$ may be restricted to the identity or a diagonal matrix, so that FDDT reduces to a per-class bias shift (optionally with a per-dimension scale).
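Because the conditioning is linear in the mask weights, the diagonal/bias-only restriction collapses to a single scale-and-shift per frame. A minimal sketch of this restricted variant (the function name `fddt_diag` and the array shapes are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def fddt_diag(z_t, m_t, gamma, beta):
    """Diagonal + bias FDDT for a single frame.

    z_t:   (d,)   frame hidden state.
    m_t:   (4,)   STNO weights for this frame (one-hot or soft, summing to 1).
    gamma: (4, d) per-class diagonal scales, i.e. W_C = diag(gamma[C]).
    beta:  (4, d) per-class biases b_C.
    """
    scale = m_t @ gamma   # (d,) mask-weighted scale
    shift = m_t @ beta    # (d,) mask-weighted shift
    return scale * z_t + shift
```

With a hard mask, this simply selects the scale and shift of the frame's STNO class; with a soft mask, the per-class parameters are convexly mixed before being applied.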
In neural diarization, an analogous mechanism appears: at every transformer layer, a set of attractors and their corresponding intermediate speaker labels are estimated via cross-attention, then injected additively (with learned fusion) into the hidden state as self-conditioning (Fujita et al., 2023).
2. Diarization Signal Representation and STNO Mask Computation
A central element is the transformation of diarization output into a mask suitable for conditioning:
- For each frame $t$ and a fixed "target" speaker $k$, the frame's status is mapped to the four mutually exclusive classes via thresholded speaker activities:
- Silence: no speaker is active ($p_{s,t} \le \tau$ for all $s$)
- Target only: only speaker $k$ is active ($p_{k,t} > \tau$ and $p_{s,t} \le \tau$ for all $s \ne k$)
- Non-target only: speaker $k$ is inactive while at least one other speaker is active
- Overlap: speaker $k$ and at least one other speaker are active simultaneously

Here $p_{s,t}$ is the diarization system's posterior estimate for speaker $s \in \{1,\dots,S\}$ at frame $t$, $\tau$ is the activity threshold, and $S$ is the number of speakers. In most FDDT systems, a hard (one-hot) mask is produced per frame using thresholded diarization (Polok et al., 2024).
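A minimal illustration of this mapping from per-speaker posteriors to a hard STNO mask (the function name `stno_mask`, the default threshold, and the class ordering are assumptions for the sketch):

```python
import numpy as np

def stno_mask(p, k, tau=0.5):
    """Build a hard (one-hot) STNO mask for target speaker k.

    p: (S, T) diarization posteriors, p[s, t] = P(speaker s active at frame t).
    k: index of the target speaker.
    Returns M: (4, T), rows ordered [silence, target, non-target, overlap].
    """
    active = p > tau                                    # (S, T) thresholded activity
    target = active[k]                                  # (T,) target speaker active?
    others = np.delete(active, k, axis=0).any(axis=0)   # (T,) any non-target active?

    M = np.zeros((4, p.shape[1]))
    M[0] = (~target) & (~others)   # silence
    M[1] = target & (~others)      # target only
    M[2] = (~target) & others      # non-target only
    M[3] = target & others         # overlap
    return M

# Example: 2 speakers, 4 frames, speaker 0 as target
p = np.array([[0.9, 0.8, 0.1, 0.0],
              [0.1, 0.7, 0.9, 0.0]])
M = stno_mask(p, k=0)   # frames: target, overlap, non-target, silence
```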
3. Integration into Encoder Architectures
FDDT modules are inserted at the start of each transformer encoder layer. The generic sequence is:
- Input features → Positional Encoding → FDDT → LayerNorm → Self-Attention → LayerNorm → Feed-Forward.
Pseudocode for an encoder layer with FDDT (Polok et al., 2024, Polok et al., 2024):
```python
def EncoderLayerWithFDDT(Z_l, M):
    # Z_l: (d, T) hidden states entering the layer; M: (4, T) STNO mask.
    # W[C], b[C] (per-class affine parameters), STNO (class indices), and the
    # sublayers LayerNorm / MultiHeadSelfAttention / FeedForward are assumed defined.
    Z_hat = []
    for t in range(Z_l.shape[1]):
        # Mask-weighted combination of the class-specific affine transforms
        z_hat = sum((W[C] @ Z_l[:, t] + b[C]) * M[C, t] for C in STNO)
        Z_hat.append(z_hat)
    Z_hat = np.stack(Z_hat, axis=1)

    H_l = LayerNorm(Z_hat)
    A_l = MultiHeadSelfAttention(H_l) + Z_hat   # residual connection
    F_l = LayerNorm(A_l)
    Z_next = FeedForward(F_l) + A_l             # residual connection
    return Z_next
```
Suppressive initialization is employed to avoid disrupting the pretrained weights: the target and overlap transforms are initialized as identities, the silence and non-target transforms as zero, and all biases as zero (Polok et al., 2024). This causes non-target and silence frames to be suppressed initially, while target speech passes through the pretrained model unchanged.
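A short sketch of this initialization scheme, following the description above (the parameter layout and STNO class ordering are assumptions):

```python
import numpy as np

def init_fddt_params(d, suppressive=True):
    """Initialize FDDT parameters for one layer.

    Returns W: (4, d, d) and b: (4, d), ordered [silence, target, non-target, overlap].
    With suppressive init, the target/overlap transforms start as identity and the
    silence/non-target transforms as zero, so non-target and silent frames are
    suppressed while target frames pass through unchanged.
    """
    W = np.zeros((4, d, d))
    b = np.zeros((4, d))
    if suppressive:
        W[1] = np.eye(d)   # target   -> identity
        W[3] = np.eye(d)   # overlap  -> identity
    else:
        W[:] = np.eye(d)   # all classes start as identity
    return W, b
```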
The FDDT conditioning step is parallel across frames and does not alter sequence length, attention computation, or positional encoding. This architectural approach allows fine-grained, frame-wise modulation based on diarization status without recurrent or sequential dependencies.
4. Training Regimes and Optimization
FDDT is trained atop a largely frozen or pretrained encoder-decoder ASR model, usually with a joint CTC-attention hybrid loss:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{att}},$$

where $\lambda \in [0, 1]$ weights the CTC branch against the attention (decoder) branch.
Typical training proceeds in several phases:
- CTC-only "preheating" on single-speaker data with only the CTC head learnable.
- FDDT preheating, in which only the FDDT and CTC-head parameters are updated on multi-speaker mixtures; the base model remains frozen and the new modules use a higher learning rate.
- Final joint fine-tuning of all parameters on multi-speaker datasets until convergence.
Suppressive initialization and high learning rates for FDDT parameters prevent disruption of the main model. Standard AdamW, learning rate warm-up/decay, and weight decay are used (Polok et al., 2024, Polok et al., 2024).
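A schematic sketch of this staged optimization using PyTorch-style parameter groups (the attribute names `fddt` and `ctc_head`, the learning rates, and the phase names are illustrative assumptions, not the published recipe):

```python
import torch

def build_optimizer(model, phase, lr_base=1e-5, lr_new=1e-3, weight_decay=0.01):
    """Return an AdamW optimizer for one training phase.

    phase "ctc_preheat":  only the CTC head trains.
    phase "fddt_preheat": FDDT + CTC head train with a higher learning rate.
    phase "joint":        everything trains; new modules keep the higher rate.
    """
    fddt_params = [p for n, p in model.named_parameters() if "fddt" in n]
    ctc_params = [p for n, p in model.named_parameters() if "ctc_head" in n]
    base_params = [p for n, p in model.named_parameters()
                   if "fddt" not in n and "ctc_head" not in n]

    if phase == "ctc_preheat":
        groups = [{"params": ctc_params, "lr": lr_new}]
    elif phase == "fddt_preheat":
        groups = [{"params": fddt_params + ctc_params, "lr": lr_new}]
    else:  # joint fine-tuning
        groups = [{"params": base_params, "lr": lr_base},
                  {"params": fddt_params + ctc_params, "lr": lr_new}]

    # Parameters not assigned to any group stay frozen for this phase.
    for p in model.parameters():
        p.requires_grad = False
    for g in groups:
        for p in g["params"]:
            p.requires_grad = True

    return torch.optim.AdamW(groups, weight_decay=weight_decay)
```

In such a setup, a fresh optimizer would be built at the start of each phase, with the usual warm-up/decay schedules applied on top.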
5. Empirical Results and Comparative Performance
Evaluation across multiple datasets demonstrates that FDDT outperforms input masking and embedding-based baselines for target-speaker and speaker-attributed ASR (error rates in percent; lower is better):
| Method | AMI-SDM (%) | NOTSOFAR-1 (%) | LibriCSS (%) |
|---|---|---|---|
| Vanilla Whisper | 220.0 | 260.1 | 588.2 |
| Input Masking | 52.8 | 61.6 | 56.2 |
| QK-Bias w/o shift | 47.8 | 28.2 | 16.4 |
| FDDT init. | 78.3 | 89.7 | 102.0 |
| FDDT (domain-tuned) | 17.8 | 20.9 | — |
| FDDT (multi-domain) | 17.6 | 19.7 | 8.8 |
Results indicate that FDDT provides stronger word error rate (WER) and overlap-resolved WER (ORC-WER) reductions compared to masking or query-key biasing, particularly in overlapped and multi-speaker conditions (Polok et al., 2024, Polok et al., 2024). Moreover, FDDT is robust to diarization imperfections (20% DER yielding only moderate degradation).
Ablations show diagonal + bias initialization outperforms full affine matrices, and four-class (STNO) masks yield better results than coarser alternatives. Gains are additive with more training data, CTC pretraining, and multi-domain adaptation.
6. Extensions, Limitations, and Generalization
FDDT is model-agnostic with respect to backbone: it has been successfully applied to Whisper, Branchformer, and transformer-based diarization (EEND-NA-deep+SelfCond). Limitations include reliance on external frame-level diarization—performance and robustness directly depend on diarization quality and alignment. Current approaches fix the mask to four STNO classes; more granular or multi-party masks remain an open area (Polok et al., 2024, Polok et al., 2024).
Suppressive initialization prevents catastrophic forgetting when adding FDDT atop pretrained networks but may limit expressivity relative to fully learned affine transforms. Synthetic mixture pretraining may yield further generalization to rare overlap patterns.
7. FDDT in Neural Diarization and Broader Context
In end-to-end neural diarization, FDDT appears as self-conditioning on intermediate attractor-derived speaker posteriors at every transformer layer (Fujita et al., 2023). This mechanism injects frame-level speaker label information (an attractor-weighted residual) into each hidden representation, improving diarization error rate (DER) and training throughput over traditional EEND-EDA baselines. The generic form is:

$$\tilde{e}_t^{(l)} = e_t^{(l)} + W_{\mathrm{fuse}} \sum_{s=1}^{S} \hat{y}_{s,t}^{(l)}\, a_s^{(l)},$$

where $e_t^{(l)}$ is the frame-$t$ embedding at layer $l$, $a_s^{(l)}$ the intermediate attractor for speaker $s$, $\hat{y}_{s,t}^{(l)}$ the intermediate speaker posterior, and $W_{\mathrm{fuse}}$ a learned fusion transform.
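Based on the description above, the attractor-weighted residual injection can be sketched as follows (the variable names and the single fusion matrix are illustrative assumptions about the published mechanism):

```python
import numpy as np

def self_condition(E, A, Y_hat, W_fuse):
    """Inject intermediate speaker information into one layer's hidden states.

    E:      (T, d) frame embeddings at the current layer.
    A:      (S, d) intermediate attractors (one per speaker).
    Y_hat:  (T, S) intermediate speaker posteriors for each frame.
    W_fuse: (d, d) learned fusion matrix.
    Returns the self-conditioned embeddings, shape (T, d).
    """
    speaker_context = Y_hat @ A          # attractor-weighted speaker summary per frame
    return E + speaker_context @ W_fuse  # additive (residual) injection with learned fusion
```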
A plausible implication is that FDDT unifies techniques in neural diarization and ASR by providing an efficient interface for integrating external or self-predicted frame-level speaker information, via lightweight and interpretable layerwise modulation.
References:
- "Target Speaker ASR with Whisper" (Polok et al., 2024)
- "DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition" (Polok et al., 2024)
- "Neural Diarization with Non-autoregressive Intermediate Attractors" (Fujita et al., 2023)