FDDT: Frame-Level Diarization Dependent Transformations
- FDDT applies diarization-conditioned transformations to each frame, improving speaker boundary detection and the handling of overlapping speech.
- Architectures built on FDDT use attractor-based strategies and affine transformations to modulate frame-wise embeddings for robust, real-time diarization and ASR.
- Empirical results demonstrate FDDT's scalability and effectiveness in multi-speaker scenarios, paving the way for enhanced streaming and target-speaker applications.
Frame-Level Diarization Dependent Transformations (FDDT) refer to techniques and neural architectures that perform speaker diarization or related conditioning at the frame-wise level, modulating hidden representations or predictions for each time frame based on diarization cues, embedding context, or explicit target activity signals. FDDT enables robust handling of overlapping speech, fine-grained segment attribution, and flexible, speaker-aware conditioning in both diarization and target-speaker ASR settings. Below, key dimensions of FDDT are surveyed as they appear in the recent literature.
1. Concept and Foundational Principles
Frame-Level Diarization Dependent Transformations are defined by their application of diarization-conditioned operations (such as affine transformations, label conditioning, or attractor-based attention) to frame-wise representations within neural models. Unlike segment-level approaches, FDDT exploits frame-level temporal granularity to improve speaker boundary detection, speaker activity inference, and conditioning for downstream tasks such as ASR.
A hallmark of FDDT is the integration of diarization signals or probabilistic masks (such as STNO masks: silence, target, non-target, overlap) directly into model layers, enabling the system to process clean, overlapped, and interfering speech frames distinctly within the network (Polok et al., 30 Dec 2024). Architectures exploiting FDDT often depart from conventional block- or segment-based pooling in favor of local pooling, causal encoders, and cross-attention with dynamic attractors (Liang et al., 2023, Chen et al., 2023, Fujita et al., 2023).
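To make the STNO masks concrete, the following is a minimal sketch (not the DiCoW implementation) of one plausible construction: given per-speaker frame activity probabilities, and assuming speakers act independently, it derives silence/target/non-target/overlap probabilities for a chosen target speaker. The function name and tensor layout are illustrative.

```python
import torch

def stno_masks(activity: torch.Tensor, target: int) -> torch.Tensor:
    """Derive STNO class probabilities for one target speaker.

    activity: (num_speakers, num_frames) speech probabilities per frame.
    Returns a (4, num_frames) tensor [silence, target, non-target, overlap]
    whose columns sum to one (under a speaker-independence assumption).
    """
    p_t = activity[target]                                # target speaks
    others = torch.cat([activity[:target], activity[target + 1:]])
    p_other = 1.0 - torch.prod(1.0 - others, dim=0)       # any other speaks

    silence = (1.0 - p_t) * (1.0 - p_other)
    target_only = p_t * (1.0 - p_other)
    non_target = (1.0 - p_t) * p_other
    overlap = p_t * p_other
    return torch.stack([silence, target_only, non_target, overlap])
```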
2. Architectures and Transformation Strategies
The majority of FDDT methods fall into two architectural categories: attractor-based (speaker/activity-conditioned) neural diarization and explicit masking/transformation in encoder stacks.
- Attractor-Based Frame-Level Modeling: These models encode frame-level speaker embeddings and then generate speaker, activity, or attribute attractors, typically via cross-attention, self-attention, or autoregressive mechanisms. The attractors guide the assignment of speaker activity probabilities to each frame (Chen et al., 2023, Fujita et al., 2023, Palzer et al., 5 Jun 2025).
- Affine Frame-Level Transformations: FDDT can be implemented as class-specific affine transformations applied at each encoder layer, guided by diarization outputs on a per-frame basis. For frame $t$ at encoder layer $l$, the transformed hidden state is the convex combination

$$\hat{\mathbf{h}}_t^{(l)} = \sum_{c \in \{\mathcal{S},\,\mathcal{T},\,\mathcal{N},\,\mathcal{O}\}} p_{c,t} \left( \mathbf{W}_c^{(l)} \mathbf{h}_t^{(l)} + \mathbf{b}_c^{(l)} \right),$$

where each term is weighted by the diarization class probability $p_{c,t}$ of frame $t$ for the STNO classes (Polok et al., 30 Dec 2024); a minimal code sketch appears after this list.
- Contextualization Modules: Dual-path contextualizers alternate across time and speaker dimensions to enable frame-by-frame sharing of temporal and speaker context, improving speaker activity detection and disambiguation in conversations with unknown speaker count (Cheng et al., 2022).
- Causal Frame-Wise Processing: For streaming or online diarization, causal encoders (e.g., masked Transformers with lower triangular masks) output embeddings that rely only on past and current frames, thus enabling minimal-latency frame-wise predictions and attractor updates (Liang et al., 2023).
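As noted above, the class-conditioned affine transformation admits a compact implementation. The sketch below is an illustration under assumed tensor shapes (batch, frames, features), not the reference DiCoW code; `FDDTLayer` is a hypothetical name.

```python
import torch
import torch.nn as nn

class FDDTLayer(nn.Module):
    """One diarization-dependent affine map per STNO class, blended
    per frame by the class probabilities (a convex combination)."""

    def __init__(self, dim: int, num_classes: int = 4):
        super().__init__()
        self.transforms = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_classes)])

    def forward(self, h: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        """h: (batch, frames, dim) hidden states.
        stno: (batch, frames, num_classes) per-frame probabilities summing to one."""
        # Apply every class-specific affine map, then weight by its probability.
        out = torch.stack([f(h) for f in self.transforms], dim=-1)  # (B, T, D, C)
        return torch.einsum("btdc,btc->btd", out, stno)
```

A natural design choice is to initialize each affine map near the identity, so that a pretrained encoder's behavior is preserved at the start of fine-tuning.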
3. Embedding Extraction and Supervision
Frame-wise embedding extraction is crucial to FDDT frameworks and is approached through teacher-student schemes and loss functions optimized to preserve latent speaker geometry and overlap robustness.
- Local Pooling for Frame-Level Embedding: Instead of global segment-level pooling, “local pooling” over a small moving frame window produces high-resolution embeddings suitable for FDDT architectures (Cord-Landwehr et al., 2023, Cord-Landwehr et al., 8 Jan 2024).
- Geodesic Distance Loss: For overlapping speech, the "geodesic distance loss" enforces that the embedding for a frame with multiple active speakers lies on the geodesic between the corresponding single-speaker d-vectors on the hypersphere. In the two-speaker case, the target embedding is the spherical interpolation

$$\mathbf{e}(\alpha) = \frac{\sin\left((1-\alpha)\,\omega\right)}{\sin\omega}\,\mathbf{d}_1 + \frac{\sin(\alpha\,\omega)}{\sin\omega}\,\mathbf{d}_2, \qquad \alpha \in [0, 1],$$

where $\omega$ is the angle between the d-vectors $\mathbf{d}_1$ and $\mathbf{d}_2$, with the optimal $\alpha$ chosen per frame to minimize the MSE against the network estimate, then re-scaled to the norm of the hypersphere (Cord-Landwehr et al., 8 Jan 2024); a code sketch follows this list.
- Attentive Statistics Pooling and Residual Networks: Attentive pooling after a residual network backbone increases speaker representation quality for each frame and improves subsequent contextualization and activity detection (Cheng et al., 2022, Wang et al., 2022, Wang et al., 2023).
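Returning to the geodesic distance loss, the sketch below constructs the per-frame training target by spherical interpolation between two unit-norm d-vectors, with the optimal interpolation weight found by a simple grid search (a simplification; the original work may solve for it differently). It assumes unit-sphere embeddings and non-parallel d-vectors.

```python
import torch

def geodesic_target(est: torch.Tensor, d1: torch.Tensor, d2: torch.Tensor,
                    steps: int = 101) -> torch.Tensor:
    """For each frame embedding in est (frames, dim), return the closest
    point (in MSE) on the unit-sphere geodesic between d-vectors d1 and d2."""
    d1, d2 = d1 / d1.norm(), d2 / d2.norm()
    omega = torch.arccos(torch.clamp(d1 @ d2, -1.0, 1.0))  # angle between d-vectors

    alphas = torch.linspace(0.0, 1.0, steps)                # candidate weights
    w1 = torch.sin((1.0 - alphas) * omega) / torch.sin(omega)
    w2 = torch.sin(alphas * omega) / torch.sin(omega)
    candidates = w1[:, None] * d1 + w2[:, None] * d2        # (steps, dim) slerp points

    best = torch.cdist(est, candidates).argmin(dim=1)       # per-frame optimal alpha
    return candidates[best]                                 # (frames, dim)

def geodesic_loss(est: torch.Tensor, d1: torch.Tensor, d2: torch.Tensor) -> torch.Tensor:
    # Detach the target so gradients do not flow through the alpha search.
    target = geodesic_target(est.detach(), d1, d2)
    return torch.mean((est - target) ** 2)
```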
4. Overlap Handling and Assignment
FDDT is closely linked to enhanced overlap detection and multi-label assignment. Key mechanisms include:
- Multi-Class or Multi-Label Output: End-to-end diarization models classify each frame into potentially multiple speaker activations, rather than making hard segment assignments. This is facilitated through attractor-based mechanisms and mixture models (e.g., of von Mises–Fisher distributions) yielding soft posteriors for overlapping frames (Fujita et al., 2023, Cord-Landwehr et al., 8 Jan 2024, Palzer et al., 5 Jun 2025).
- Directed Assignment and Posterior Manipulation: Diarization posteriors or secondary assignments are adjusted per detected overlap region based on frame-level speaker probability matrices, reducing missed detection and enabling more accurate diarization (Bullock et al., 2019).
- Permutation-Invariant Training (PIT): Permutation ambiguity is addressed by PIT, especially for systems that deal with unknown or flexible speaker counts, aligning attractor outputs to best match true speaker identities (Fujita et al., 2023, Palzer et al., 5 Jun 2025).
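The PIT objective for frame-level diarization can be sketched as follows; this brute-force version enumerates all speaker permutations, which is fine for a handful of speakers (the function name is illustrative).

```python
import torch
from itertools import permutations

def pit_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant BCE for frame-level speaker activities.

    logits: (frames, speakers) raw activity scores from the model.
    labels: (frames, speakers) binary activity targets.
    Returns the BCE under the best-matching speaker permutation."""
    num_speakers = logits.shape[1]
    losses = [
        torch.nn.functional.binary_cross_entropy_with_logits(
            logits, labels[:, list(perm)].float())
        for perm in permutations(range(num_speakers))
    ]
    return torch.stack(losses).min()
```

For larger speaker counts, the factorial enumeration is typically replaced with a Hungarian-style optimal assignment over pairwise losses.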
5. Conditioning for Downstream ASR and Multi-Modal Tasks
The FDDT paradigm is leveraged not only for diarization but also as a conditioning signal for models tasked with target-speaker ASR in multi-speaker recordings.
- Diarization-Conditioned ASR (DiCoW): Frame-level diarization masks condition the ASR encoder via class-specific transformations; this architecture improves generalization to unseen speakers and enhances robustness to overlap, outperforming masking and query-key attention biasing (Polok et al., 30 Dec 2024).
- Dynamic Target Speaker Tracking: Online TS-VAD approaches utilize buffer-based or accumulation-based updates to target speaker embeddings on a block-by-block, frame-level basis, synchronizing model attention to evolving speaker activity (Wang et al., 2022, Wang et al., 2023).
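One way to realize accumulation-based target-embedding updates is a running activity-weighted mean, sketched below under assumed shapes and a hypothetical class name; real TS-VAD systems may use bounded buffers or decay schemes instead.

```python
import torch

class TargetEmbeddingAccumulator:
    """Running activity-weighted mean of frame embeddings per speaker,
    refreshed block by block for online target-speaker tracking."""

    def __init__(self, num_speakers: int, dim: int):
        self.sums = torch.zeros(num_speakers, dim)
        self.weights = torch.zeros(num_speakers)

    def update(self, frame_emb: torch.Tensor, activity: torch.Tensor) -> None:
        """frame_emb: (frames, dim) embeddings of the current block.
        activity: (frames, speakers) per-frame speaker posteriors."""
        # Frames where a speaker is confidently active contribute more.
        self.sums += activity.T @ frame_emb      # (speakers, dim)
        self.weights += activity.sum(dim=0)      # (speakers,)

    def embeddings(self) -> torch.Tensor:
        """Current target-speaker embeddings, shape (speakers, dim)."""
        return self.sums / self.weights.clamp(min=1e-8).unsqueeze(1)
```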
6. Comparative Impact and Empirical Results
Across recent evaluations, FDDT and related frame-level techniques have shown strong and consistent improvements over segment-level and clustering-based diarization approaches:
| Approach | Dataset(s) | DER Improvement / Outcome |
|---|---|---|
| Overlap-aware LSTM resegmentation | AMI | 20% relative DER reduction over baseline (Bullock et al., 2019) |
| EEND post-processing | CALLHOME | ~28% DER reduction in best system (Horiguchi et al., 2020) |
| Block-wise Student-EEND | LibriSpeech, AMI | Mitigated DER degradation as speaker count increases (Cord-Landwehr et al., 2023) |
| Geodesic frame-wise embeddings | Meeting scenarios | Outperformed VBx and spectral clustering; mixture-model clustering approached supervised overlap handling (Cord-Landwehr et al., 8 Jan 2024) |
| DiCoW (FDDT in ASR) | AMI, NOTSOFAR-1 | Lower time-constrained WER and tcORC-WER than baselines (Polok et al., 30 Dec 2024) |
These results highlight the superior overlap handling, flexible speaker attribution, and overall diarization accuracy enabled by frame-level techniques. Most approaches confirm improvements in both diarization error and false alarm/confusion components, with further gains realized when combining FDDT with advanced attractor, conformer, or clustering modules (Palzer et al., 5 Jun 2025).
7. Future Directions and Practical Implications
Several areas for further research and application of FDDT include:
- Extension to Multi-Channel and Multi-Modal Diarization: Integration of cross-channel attention and joint conditioning for audio-visual multi-speaker scenarios (Wang et al., 2023).
- Scaling to Large Numbers of Speakers: Current attractor and clustering mechanisms often assume two or a few simultaneous speakers. Extending to arbitrary speaker counts requires adaptive attractor generation and robust multi-label modeling (Fujita et al., 2023, Horiguchi et al., 2020).
- Real-Time and Edge Deployment: Efficient causal encoders, low-latency attractor updating, and compact model designs (e.g., conformers with ~15M parameters) indicate suitability for streaming applications such as live transcription, meeting analysis, and smart device interaction (Liang et al., 2023, Palzer et al., 5 Jun 2025).
- Unified Frameworks for Overlap, Attribution, and ASR: Combining frame-level diarization, activity attribution, and downstream transcription within a single end-to-end system is a natural next step, as evidenced by DiCoW and related approaches (Polok et al., 30 Dec 2024).
- Advanced Training Objectives: Angle-based deep clustering losses, geodesic losses, and label-attractor vector constraints exemplify emerging supervisory signals for FDDT, suggesting further refinement in embedding structure and discriminative capacity.
In summary, FDDT embodies a suite of frame-wise neural strategies that transform or condition hidden representations, prediction heads, and attractor modules based on detailed diarization context. These techniques provide substantial gains in multi-speaker labeling precision, overlap management, and downstream ASR robustness, marking FDDT as a central mechanism in modern and future diarization architectures.