
Diarization-Conditioned Transformations

Updated 9 January 2026
  • Diarization-conditioned transformations are neural techniques that inject explicit 'who-spoke-when' information into ASR and diarization models.
  • They employ methods like mask-based conditioning, embedding adaptation, and attention biasing to improve speaker separation and attribution.
  • These approaches enable robust, joint ASR-diarization systems that achieve lower error rates in multi-speaker and overlapping speech scenarios.

Diarization-conditioned transformations are a class of neural mechanisms in which representations within a deep network—typically for speech processing tasks such as automatic speech recognition (ASR), speaker diarization, or joint ASR-diarization—are explicitly modulated using the outputs or intermediate structures of a diarization system. Unlike approaches that treat speaker identity as a latent variable inferred solely by the network, these methods inject external or learned “who-spoke-when” information directly into neural transformations (via attention mechanisms, affine operations, masking, or sequence conditioning), leading to improved separation, attribution, and robustness in multi-speaker and overlapping speech scenarios.

1. Foundational Concepts and Mechanisms

Diarization-conditioned transformations (DCTs, Editor's term) formalize the use of diarization outputs—such as frame-level speaker activity masks, embeddings, or segment boundaries—as conditioning signals for neural modules in speech systems. The essence of these approaches is to guide or alter the forward computation of neural blocks (attention, convolution, or affine projection) so that the model’s predictions reflect both the acoustic content and the diarization structure.

There are several canonical forms:

  • Mask-based Conditioning: Per-frame, per-speaker activity or class masks are used to modulate input or hidden representations. A salient example is the use of the STNO mask (Silence, Target, Non-target, Overlap) in both ASR and diarization modules (Polok et al., 2024, Polok et al., 2024); a construction sketch for the mask- and embedding-based forms follows this list.
  • Embedding-based Conditioning: Speaker activity sequences are used to pool frame-level features, resulting in speaker-specific embeddings that condition subsequent senone/posterior estimation in hybrid ASR systems (Chetupalli et al., 2021).
  • Bias/Shifting Operations: Class-specific bias vectors (e.g., one per STNO class) are added to hidden states, steering the attention or output of Transformer blocks towards or away from the target speaker (Polok et al., 2024).
  • Masked/Cross Attention: Attention weights are modulated by binary or probabilistic masks, allowing each query (e.g., a “speaker slot” in diarization) to attend only to relevant frames (Härkönen et al., 2024, Palzer et al., 5 Jun 2025).
  • Conditional Multitask and Sequential Transformations: The diarization module is trained to predict speaker labels conditioned on explicit sub-tasks (e.g., SAD, overlap detection) or on prior speaker decisions via sequential models (Takashima et al., 2021).
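
As a concrete illustration of the first two forms, the sketch below derives per-frame STNO probabilities for a chosen target speaker from generic frame-level activity probabilities (assuming speakers are conditionally independent given the diarization output) and pools frame features into a speaker-specific embedding. All function and variable names are illustrative rather than taken from the cited systems.

```python
import torch

def stno_probs(activity: torch.Tensor, target: int) -> torch.Tensor:
    """Map per-speaker activity probabilities (T, S) to per-frame STNO
    probabilities (T, 4) for one target speaker, assuming conditional
    independence of speakers given the diarization output.
    Class order: [Silence, Target-only, Non-target-only, Overlap]."""
    p_tgt = activity[:, target]                                   # (T,)
    others = torch.cat([activity[:, :target], activity[:, target + 1:]], dim=1)
    p_any_other = 1.0 - torch.prod(1.0 - others, dim=1)           # (T,)
    p_sil = (1.0 - p_tgt) * (1.0 - p_any_other)
    p_t_only = p_tgt * (1.0 - p_any_other)
    p_n_only = (1.0 - p_tgt) * p_any_other
    p_ovl = p_tgt * p_any_other
    return torch.stack([p_sil, p_t_only, p_n_only, p_ovl], dim=1)

def activity_pooled_embedding(feats: torch.Tensor, activity: torch.Tensor) -> torch.Tensor:
    """Embedding-based conditioning: pool frame features (T, D) with one
    speaker's activity sequence (T,) into a speaker-specific summary (D,)."""
    w = activity / activity.sum().clamp_min(1e-8)
    return (w.unsqueeze(1) * feats).sum(dim=0)

# Toy usage: 100 frames, 3 speakers, 64-dim features.
act = torch.rand(100, 3)
stno = stno_probs(act, target=0)                                  # (100, 4), rows sum to 1
emb = activity_pooled_embedding(torch.randn(100, 64), act[:, 0])  # (64,)
```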

The resulting transformations are not merely architectural; they encode structural inductive biases that exploit diarization as privileged information for improved attribution, error correction, and disentanglement of overlapping sources.

2. Diarization-Conditioned ASR Architectures

State-of-the-art target-speaker and multi-speaker ASR systems increasingly rely on diarization-conditioned transformations to achieve accurate, speaker-attributed transcripts without explicit source separation.

  • Frame-level Diarization-Dependent Transformations (FDDT): In DiCoW (Polok et al., 2024), per-frame STNO probabilities are computed from diarization logits and used to drive a convex combination of four affine transformations for each encoder layer. For frame t in layer l:

\hat{z}^l_t = (W_S^l z^l_t + b_S^l)\, p_S^t + (W_T^l z^l_t + b_T^l)\, p_T^t + (W_N^l z^l_t + b_N^l)\, p_N^t + (W_O^l z^l_t + b_O^l)\, p_O^t

This allows the encoder to learn speaker-selective representation flow, suppressing irrelevant frames; a minimal PyTorch sketch appears after this list.

  • Query-Key Biasing (QKb): Attention heads are modified so that keys from non-target frames incur a large negative bias, ensuring that attention is focused on frames relevant to the target speaker. The bias is trainable and can be annealed during fine-tuning (a sketch appears at the end of this section).
  • Serialized Output Training with Diarization-Conditioned Encoders: In SA-DiCoW (Kocour et al., 4 Oct 2025), outputs for all speakers are produced jointly in a streaming, serialized token sequence with timestamp and speaker tags, leveraging FDDT-modified encoders for each speaker and aggregating their representations.
  • Input Masking and Bias Injection: Conditioning can be implemented in a simple form by adding class-specific biases to each frame’s input representation before any encoder block (Polok et al., 2024). Input masking, where spectrogram frames are zeroed out or scaled by activity probability, is a less effective but related baseline.
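
The FDDT combination above maps directly onto a small PyTorch module: one affine transform per STNO class, mixed per frame by the class probabilities. This is a minimal sketch rather than DiCoW's implementation; the near-identity initialization is an assumption intended to leave the pre-trained encoder unchanged at the start of fine-tuning.

```python
import torch
import torch.nn as nn

class FDDTLayer(nn.Module):
    """Frame-level diarization-dependent transformation for one encoder layer:
    z_hat_t = sum_c p_c(t) * (W_c z_t + b_c), with c in {S, T, N, O}."""

    def __init__(self, dim: int, n_classes: int = 4):
        super().__init__()
        # One affine transform per STNO class, initialized near identity
        # (an illustrative choice, not necessarily the paper's).
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_classes)])
        for lin in self.transforms:
            nn.init.eye_(lin.weight)
            nn.init.zeros_(lin.bias)

    def forward(self, z: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        # z: (B, T, D) hidden states; stno: (B, T, 4) per-frame class probabilities.
        outs = torch.stack([lin(z) for lin in self.transforms], dim=-1)  # (B, T, D, 4)
        return (outs * stno.unsqueeze(2)).sum(dim=-1)                    # (B, T, D)

# Toy usage: apply before (or inside) each encoder block.
layer = FDDTLayer(dim=512)
z_hat = layer(torch.randn(2, 100, 512), torch.softmax(torch.randn(2, 100, 4), dim=-1))
```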

These mechanisms allow pre-trained ASR models such as Whisper to be rapidly adapted for target-speaker and fully speaker-attributed ASR via the injection of diarization structure, outperforming cascades that couple separation, diarization, and recognition system components (Polok et al., 2024, Polok et al., 2024, Kocour et al., 4 Oct 2025).
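
Query-key biasing can be illustrated with a single-head attention layer in which every non-target key position receives a trainable negative additive term on its logits. The module below is a hedged sketch under that reading; the exact placement of the bias inside Whisper's multi-head attention follows the cited papers rather than this code.

```python
import torch
import torch.nn as nn

class QueryKeyBiasedAttention(nn.Module):
    """Single-head attention where non-target key positions receive a
    trainable negative additive bias on the attention logits."""

    def __init__(self, dim: int, init_bias: float = -8.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Trainable scalar bias; can be annealed toward zero during fine-tuning.
        self.nontarget_bias = nn.Parameter(torch.tensor(init_bias))

    def forward(self, x: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); target_mask: (B, T) with 1 on target-speaker frames.
        d = x.shape[-1]
        logits = self.q(x) @ self.k(x).transpose(1, 2) / d ** 0.5     # (B, T, T)
        # Non-target key positions (mask == 0) get the negative bias.
        logits = logits + (1.0 - target_mask).unsqueeze(1) * self.nontarget_bias
        return torch.softmax(logits, dim=-1) @ self.v(x)

# Toy usage.
attn = QueryKeyBiasedAttention(dim=256)
out = attn(torch.randn(2, 50, 256), (torch.rand(2, 50) > 0.5).float())
```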

3. End-to-End Neural Diarization: Masked Attention and Attractor Conditioning

DCTs are central to recent advances in end-to-end neural diarization (EEND) models, facilitating robust performance even with significant overlapping speech.

  • Masked Cross-Attention in Transformers: In EEND-M2F (Härkönen et al., 2024), each “speaker query” passes through a stack of decoder layers, where at every layer, the query’s cross-attention is explicitly masked according to its own predicted mask from the previous layer. For decoder query i, attention to frame j is computed with:

\mathrm{Attention}_{i,j} \propto e^{(q_i W_Q)(k_j W_K)^\top/\sqrt{d} + \Delta_{i,j}}, \quad \Delta_{i,j} = \begin{cases} 0 & \text{if } \mathrm{mask}_{i,j}=1 \\ -\infty & \text{otherwise} \end{cases}

This enforces that, at each stage, queries update only with acoustically or contextually relevant frames, driving effective speaker separation and assignment. A minimal sketch follows this list.

  • Diarization-Conditioned Attractor Decoders: Modern attractor-based EEND approaches such as EEND-TA and attractor deep clustering (Samarakoon et al., 2023, Palzer et al., 5 Jun 2025) refine their speaker prototypes (attractors) using cross- and self-attention that is dynamically conditioned on the global conversation summary and current audio embeddings. These attractors determine speaker assignment for every frame, with explicit regularizers enforcing separation, suppression of inactive attractors, and geometric alignment with audio representations. A generic attractor-decoder sketch closes this section.
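
The masking rule above amounts to adding 0 or -inf to the attention logits before the softmax. The sketch below is a minimal single-layer illustration (projection matrices omitted for brevity); names and shapes are not taken from the EEND-M2F code.

```python
import torch

def masked_cross_attention(queries, frames, mask):
    """queries: (S, D) speaker queries; frames: (T, D) frame features;
    mask: (S, T) binary, 1 where query i may attend to frame j.
    Implements logits_{i,j} = q_i k_j^T / sqrt(d) + Delta_{i,j},
    with Delta = 0 where mask == 1 and -inf elsewhere."""
    d = frames.shape[-1]
    logits = queries @ frames.T / d ** 0.5                      # (S, T)
    logits = logits.masked_fill(mask == 0, float("-inf"))
    attn = torch.softmax(logits, dim=-1)
    # Guard against fully-masked queries (all -inf rows give NaNs).
    attn = torch.nan_to_num(attn, nan=0.0)
    return attn @ frames                                        # (S, D) updated queries

# Toy usage: 4 speaker queries over 200 frames, masked by the previous layer's prediction.
q, f = torch.randn(4, 256), torch.randn(200, 256)
prev_mask = (torch.rand(4, 200) > 0.5).long()
updated = masked_cross_attention(q, f, prev_mask)
```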

The explicit use of diarization-driven cross-attention and attractor updates distinguishes these models from earlier, purely sequential or LSTM-based EEND systems, yielding improved diarization error rates and scalability to longer, denser meeting data.
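
To make the attractor-conditioning pattern concrete, the sketch below refines initial attractors (e.g. already conditioned on a conversation summary) with a standard transformer decoder over frame embeddings and scores per-frame speaker activity with a sigmoid over attractor-frame dot products. This is a generic EEND-style attractor decoder under stated assumptions, not the EEND-TA or attractor deep clustering implementation.

```python
import torch
import torch.nn as nn

class AttractorDecoder(nn.Module):
    """Refine speaker attractors with cross-attention over frame embeddings,
    then score per-frame speaker activity by attractor-frame dot products."""

    def __init__(self, dim: int, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)

    def forward(self, attractors: torch.Tensor, frames: torch.Tensor):
        # attractors: (B, S, D) initial prototypes; frames: (B, T, D) encoder outputs.
        # No positional encoding is added to the attractor queries.
        refined = self.decoder(tgt=attractors, memory=frames)        # (B, S, D)
        activity = torch.sigmoid(frames @ refined.transpose(1, 2))   # (B, T, S)
        return refined, activity

# Toy usage: 3 candidate speakers over 500 frames.
dec = AttractorDecoder(dim=256)
refined, act = dec(torch.randn(1, 3, 256), torch.randn(1, 500, 256))
```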

4. Multi-Stage, Joint, and Correction Frameworks

Recent research deploys DCTs as back-end correction modules or in joint learning frameworks, further improving diarization and speaker-attributed ASR.

  • Diarization Correction as Sequence-to-Sequence Transformation: DiaCorrect (Han et al., 2023) applies parallel convolutional encoders to the initial diarization (speaker activity logits) and input acoustics, merges them, and feeds the concatenated representation into a Transformer decoder, which emits refined, corrected diarization logits. Cross-attention layers enable the model to learn corrections based on the joint patterns of audio and the first-pass diarization, improving boundary placement and speaker switches without retraining upstream modules. A simplified sketch follows this list.
  • Sequential Conditional Multitask Learning: EEND extensions (Takashima et al., 2021) introduce an explicit chain-rule model, in which outputs for sub-tasks like speech activity detection (SAD) and overlap detection (OD) are predicted first, and these are passed as conditioning features for subsequent speaker attribution steps. This coarse-to-fine, conditioned sequence structure both improves DER and makes training more stable compared to flat multitask networks.
  • Joint Diarization and Source Separation: In models such as TS-SEP (Boeddeker et al., 2023), diarization and time-frequency mask estimation are unified by extending TS-VAD architectures to operate on TF bins. Embedding-driven, per-speaker TF masks are conditioned on i-vectors and initial diarization, and are used for downstream masking or MVDR beamforming extraction. This enables joint optimization and decouples diarization-induced attribution errors from those of the ASR backend.
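
A heavily simplified sketch of a DiaCorrect-style correction module as described above: parallel Conv1d encoders for the acoustics and the first-pass activity logits, with cross-attention letting the diarization stream consult the audio stream before emitting refined logits. Layer sizes, the single merge-by-cross-attention step, and all names are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class DiarizationCorrector(nn.Module):
    """Refine first-pass speaker-activity logits (B, T, S) using acoustic
    features (B, T, F): two parallel Conv1d encoders, then a transformer
    decoder whose cross-attention mixes the two streams."""

    def __init__(self, n_feats: int, n_spk: int, dim: int = 128):
        super().__init__()
        self.audio_enc = nn.Conv1d(n_feats, dim, kernel_size=5, padding=2)
        self.diar_enc = nn.Conv1d(n_spk, dim, kernel_size=5, padding=2)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, n_spk)

    def forward(self, feats: torch.Tensor, init_logits: torch.Tensor) -> torch.Tensor:
        a = self.audio_enc(feats.transpose(1, 2)).transpose(1, 2)        # (B, T, dim)
        d = self.diar_enc(init_logits.transpose(1, 2)).transpose(1, 2)   # (B, T, dim)
        # Queries come from the first-pass diarization; cross-attention consults
        # the acoustics to decide where boundaries and labels need fixing.
        refined = self.decoder(tgt=d, memory=a)
        return self.out(refined)  # corrected per-frame, per-speaker logits

# Toy usage: 80-dim filterbank features, 2 speakers, 300 frames.
corr = DiarizationCorrector(n_feats=80, n_spk=2)
new_logits = corr(torch.randn(1, 300, 80), torch.randn(1, 300, 2))
```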

5. Transformer-based Speaker Conditioning and Cross-Speaker Modeling

Adoption of speaker-axis Transformer blocks and attractor-based self- and cross-attention allows DCTs to operate in settings with a variable number of speakers, complex overlap, and minimal prior constraints.

  • Speaker-Axis Multihead Attention: Transformers are applied along the speaker dimension, enabling the model to be invariant to speaker order and to capture inter-speaker dependencies directly. This is exemplified in the TS-VAD transformer (Wang et al., 2022), where alternating blocks of speaker-axis Transformer layers and time-axis BLSTMs integrate both inter-speaker and temporal interactions:

\mathrm{MHSA}_\mathrm{spk}(X_t) = [\mathrm{head}^{(1)}_t, \ldots, \mathrm{head}^{(H)}_t]\, W^O, \quad \text{where each head attends over } S \text{ speakers' features.}

Integration of this structure into EEND (EDA-TS-VAD) yields state-of-the-art DER, especially in multi-overlap scenarios. A reshaping sketch appears after this list.

  • Combiner-based Attractor Transformers: In EEND-TA (Samarakoon et al., 2023), conversation summary vectors condition the initial attractor embeddings, which are then refined by stacks of Transformer decoder blocks without positional encoding. The attractors operate as queries over frame-level features, forming an efficient and robust mechanism for determining speaker activity.
  • Deep Clustering and Orthogonality Regularization: End-to-end diarization models with attractor deep clustering (Palzer et al., 5 Jun 2025) impose additional geometric structure via angle-based losses between label vectors and learned attractors, as well as orthogonality/suppression constraints to ensure only relevant attractors are active and orthogonal in embedding space.
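
The speaker-axis attention in the first item can be sketched by folding batch and time together so that an ordinary self-attention layer operates across the S speaker slots of each frame; the reshaping convention and module names below are assumptions, not the TS-VAD transformer code.

```python
import torch
import torch.nn as nn

class SpeakerAxisMHSA(nn.Module):
    """Self-attention across the speaker dimension of (B, T, S, D) features,
    making the block order-invariant across speakers at every frame."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, s, d = x.shape
        flat = x.reshape(b * t, s, d)          # each frame becomes one "batch" item
        out, _ = self.mhsa(flat, flat, flat)   # attend over the S speaker slots
        return out.reshape(b, t, s, d)

# Toy usage: 2 utterances, 100 frames, 4 speaker slots, 256-dim features.
spk_attn = SpeakerAxisMHSA(dim=256)
y = spk_attn(torch.randn(2, 100, 4, 256))
```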

6. Post-processing and Improvement of Diarization Outputs

DCTs are not limited to model-internal transformations; they can operate as modular, post-hoc corrections using LLMs.

  • DiarizationLM: This framework (Wang et al., 2024) accepts diarization–ASR output pairs, encodes them in a compact textual prompt, and feeds this to a finetuned LLM that is trained to correct speaker assignments without changing the recognized transcript. Through careful prompt engineering and transcript-preserving re-alignment (TPST), the LLM operates as a high-level text-conditioned DCT, yielding WDER reductions of up to 55.5% (Fisher) and 44.9% (Callhome) without retraining ASR or diarization front-ends. An illustrative prompt construction follows this list.
  • Correction and Calibration: Bias calibration and targeted fine-tuning of correction modules on “hard” example subsets further improve performance, even when initial diarization systems are frozen or sub-optimal (Han et al., 2023).
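
To illustrate the kind of compact textual prompt such a framework consumes, the sketch below serializes word-level ASR output and first-pass speaker labels into tagged text; the '<spk:N>' tag syntax and the helper name are hypothetical and not necessarily DiarizationLM's exact format.

```python
def build_prompt(words, speakers):
    """Serialize an ASR + diarization hypothesis into compact tagged text,
    emitting a speaker tag only when the speaker changes."""
    parts, prev = [], None
    for word, spk in zip(words, speakers):
        if spk != prev:
            parts.append(f"<spk:{spk}>")  # illustrative tag syntax
            prev = spk
        parts.append(word)
    return " ".join(parts)

# Hypothetical first-pass output with a likely attribution error on "good".
words = ["hello", "how", "are", "you", "good", "thanks"]
speakers = [1, 1, 1, 1, 1, 2]
prompt = build_prompt(words, speakers)
# -> '<spk:1> hello how are you good <spk:2> thanks'
# The finetuned LLM would return the same words with corrected tags, e.g.
# '<spk:1> hello how are you <spk:2> good thanks', which is then re-aligned
# to the original transcript (the TPST step mentioned above).
```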

The generality of text-based and modular DCT frameworks enables rapid adaptation to new domains and supports practical deployment in heterogeneous recognition pipelines.

7. Empirical Impact and Research Trajectory

The proliferation of diarization-conditioned transformation architectures across diarization, ASR, and joint modeling pipelines has driven systematic improvements in diarization error rate (DER), word error rate (WER), and speaker-attributed metrics across single- and multi-speaker test sets.

  • Performance Highlights:
    • FDDT-based conditioning in DiCoW achieves AMI cpWER of 17.2% (oracle diarization), outperforming prior systems that also rely on oracle diarization (Polok et al., 2024).
    • Masked cross-attention yields DIHARD-III DER of 16.07%, the first major improvement upon challenge-winning systems (Härkönen et al., 2024).
    • SOT with concatenated speaker channels halves cpWER on fully-overlapped 3-speaker LibriMix relative to per-speaker decoding (Kocour et al., 4 Oct 2025).
    • Modular correction with DiaCorrect and DiarizationLM reduces WDER and DER by up to 55% in challenging telephony and meeting data (Han et al., 2023, Wang et al., 2024).
    • TS-VAD with Transformer conditioning achieves a new SOTA DER of 4.57% on VoxConverse with >10 speakers and reduces DER on fixed data CALLHOME from 12.01% to 11.18% (Wang et al., 2022).

These advances illustrate the practical utility of DCTs in both end-to-end and modular systems, bridging the gap between diarization and downstream tasks and supporting robust speaker-attributed recognition under real-world conditions. The DCT paradigm continues to underpin state-of-the-art systems in meetings transcription, conversational ASR, and joint separation-diarization pipelines.
