
Diarization-Conditioned Whisper (DiCoW)

Updated 3 February 2026
  • Diarization-Conditioned Whisper (DiCoW) is a methodology that uses frame-level STNO masks to condition single-speaker ASR models for robust speaker attribution.
  • It injects rich temporal diarization information via the Frame-level Diarization-Dependent Transform (FDDT), minimizing complexity while improving accuracy in overlapping speech.
  • The SE-DiCoW variant employs self-enrollment and cross-attention fusion to resolve overlap ambiguities, yielding significant performance gains across diverse languages and domains.

Diarization-Conditioned Whisper (DiCoW) is a methodology for enabling robust speaker-attributed and target-speaker automatic speech recognition (ASR) with large, pretrained single-speaker ASR models such as Whisper. DiCoW capitalizes on frame-level outputs from a diarization front end, bypassing explicit speaker embeddings and instead providing the ASR model with rich temporal conditioning that encodes which speakers are active, alone, in overlap, or silent within each time frame. This approach is extended in SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which resolves key ambiguities arising in heavily overlapping multi-speaker scenarios by incorporating an enrollment segment as explicit speaker reference, fused via cross-attention. These frameworks have demonstrated substantial improvements in speaker-attributed recognition across multilingual and multi-domain contexts, without domain-dependent reengineering or cascaded separation stages (Polok et al., 2024, Polok et al., 27 Jan 2026).

1. Mathematical Foundations: The STNO Mask Formalism

At the core of DiCoW is the frame-level computation of so-called STNO (Silence, Target, Non-target, Overlap) masks, derived from probabilistic diarization outputs. For an audio segment with $S$ speakers and $T$ frames, the diarization module provides $d(s,t)\in[0,1]$ representing the activity probability of speaker $s$ at frame $t$. With the target speaker indexed as $s_k$, the following mutually exclusive mask classes are determined for each frame $t$:

\begin{aligned} p_S^t &= \prod_{s=1}^S (1 - d(s, t)), \\ p_T^t &= d(s_k, t) \prod_{s=1,\, s\neq s_k}^S (1 - d(s, t)), \\ p_O^t &= d(s_k, t) - p_T^t, \\ p_N^t &= (1 - p_S^t) - d(s_k, t). \end{aligned}

This produces a four-dimensional vector $\mathbf{M}^t = [p_S^t,\, p_T^t,\, p_N^t,\, p_O^t]^\top$ per frame, where all entries are non-negative and sum to 1. The mask is soft, reflecting the diarizer’s probabilistic confidence, and encodes:

  • $S$: silence
  • $T$: target speaker alone
  • $N$: non-target speaker(s) active, target off
  • $O$: target jointly active with at least one other speaker

This formulation provides a fine-grained acoustic context at the frame level and underpins both target-speaker and speaker-attributed recognition without requiring explicit embedding extraction (Polok et al., 2024, Polok et al., 27 Jan 2026).
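The mask computation above can be sketched in a few lines of NumPy. The function name `stno_masks` and the array layout are illustrative choices, not taken from the papers:

```python
import numpy as np

def stno_masks(d, target):
    """Compute soft STNO masks from frame-level diarization outputs.

    d      : (S, T) array of speaker-activity probabilities d(s, t) in [0, 1]
    target : index of the target speaker s_k
    Returns a (4, T) array with rows [p_S, p_T, p_N, p_O] summing to 1 per frame.
    """
    p_silence = np.prod(1.0 - d, axis=0)                        # no speaker active
    others_off = np.prod(np.delete(1.0 - d, target, axis=0), axis=0)
    p_target = d[target] * others_off                           # target speaks alone
    p_overlap = d[target] - p_target                            # target plus someone else
    p_nontarget = (1.0 - p_silence) - d[target]                 # only non-target speakers
    return np.stack([p_silence, p_target, p_nontarget, p_overlap])
```

By construction the four rows sum to one per frame, so the mask remains a valid soft partition for any diarizer output.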

2. Model Conditioning Architecture: Frame-Level Diarization-Dependent Transform (FDDT)

Instead of modifying raw inputs or appending embeddings, DiCoW injects the STNO mask into the Whisper encoder via the Frame-level Diarization-Dependent Transform (FDDT). For each encoder layer $n$ and incoming hidden-state matrix $\mathbf{Z}^n \in \mathbb{R}^{d_{\rm model}\times T}$, four independent affine transforms $(\mathbf{W}_c^n,\, \mathbf{b}_c^n)$ are learned for $c \in \{S,T,N,O\}$. The diarization-conditioned output at each frame is a convex combination of these transforms:

\hat{\mathbf{Z}}^n(:,t) = \sum_{c\in\{S,T,N,O\}} \left( \mathbf{W}_c^n\,\mathbf{Z}^n(:,t) + \mathbf{b}_c^n \right) p_c^t

Typically, conditioning is applied before the first encoder block ($n=1$), but ablations validate efficacy with deeper injection (up to all 24 layers). The scalars $p_c^t$ serve as mixture weights; no separate mask-embedding network is introduced. This minimal projection avoids substantially increasing model complexity while effectively informing the model of frame-level speaker context (Polok et al., 2024).
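A minimal NumPy sketch of the FDDT mixture, assuming full affine transforms stored as stacked arrays (variable names are illustrative, not from the papers):

```python
import numpy as np

def fddt(Z, masks, W, b):
    """Frame-level Diarization-Dependent Transform.

    Z     : (d_model, T) hidden states entering one encoder layer
    masks : (4, T) STNO probabilities per frame (rows ordered S, T, N, O)
    W     : (4, d_model, d_model) one weight matrix per STNO class
    b     : (4, d_model) one bias vector per STNO class
    Returns the conditioned hidden states, same shape as Z.
    """
    out = np.zeros_like(Z)
    for c in range(4):
        # Each class transform is weighted by its frame-level probability p_c^t,
        # so the output is a convex combination across the four STNO classes.
        out += (W[c] @ Z + b[c][:, None]) * masks[c][None, :]
    return out
```

A useful sanity check: with identity weights and zero biases, the transform is a no-op, since the STNO probabilities sum to one at every frame.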

3. Empirical Evaluation and Ablations

Empirical results demonstrate that naive masking approaches, such as zeroing non-target frames, lead to poor ASR performance (ORC-WER $\approx 76.6\%$ on NOTSOFAR-1). In contrast, Whisper-large-v3 fine-tuned with STNO-FDDT achieves an ORC-WER of $24.5\%$, surpassing the diarization-cascade baseline ($35.5\%$) by 11 percentage points absolute. Further ablations analyze:

  • Parameterization of FDDT: bias-only ($\{\mathbf{b}_c\}$), diagonal weights plus bias, and full affine variants show comparable performance (bias-only: $28.0\%$).
  • Layer injection: a single-layer bias-only FDDT suffices, delivering $28.7\%$ ORC-WER, comparable to full-depth conditioning.
  • Mask information: removing classes (e.g., omitting Silence, yielding a TNO mask) degrades performance from $26.7\%$ (STNO) to $30.0\%$ (TNO).

The model’s design allows speaker-attributed ASR by iterating over hypothesized speaker indices, recomputing the STNO mask with each as the target, and sequentially generating per-speaker transcripts (Polok et al., 2024).
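The per-speaker iteration described above can be sketched as follows; `decode_with_masks` is a hypothetical stand-in for a DiCoW-conditioned Whisper decoding call, not an actual API:

```python
import numpy as np

def speaker_attributed_asr(audio, d, decode_with_masks):
    """Decode one transcript per speaker by recomputing the STNO mask
    with each hypothesized speaker as the target.

    d                 : (S, T) frame-level diarization probabilities
    decode_with_masks : stand-in for a conditioned ASR decode (hypothetical)
    Returns a dict mapping speaker index -> transcript.
    """
    transcripts = {}
    for s in range(d.shape[0]):
        # Recompute the soft STNO mask treating speaker s as the target.
        p_sil = np.prod(1.0 - d, axis=0)
        p_tgt = d[s] * np.prod(np.delete(1.0 - d, s, axis=0), axis=0)
        masks = np.stack([p_sil, p_tgt, (1.0 - p_sil) - d[s], d[s] - p_tgt])
        transcripts[s] = decode_with_masks(audio, masks)
    return transcripts
```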

4. Ambiguity in Overlap and Limitations

A critical limitation emerges when multiple speakers are fully overlapped over contiguous frames. In such scenarios, the STNO masks become indistinguishable for all involved speakers (e.g., when $d(A,t)=d(B,t)=1.0$ and all other speakers are inactive, both $s_k=A$ and $s_k=B$ yield $\mathbf{M}^t = (0,0,0,1)^\top$). Consequently, the ASR model cannot resolve which speaker’s words to attribute in the overlapped segment, leading to transcript swaps, merges, or omissions. This ambiguity is inherent to any method relying solely on such frame-level event partitioning (Polok et al., 27 Jan 2026).
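The ambiguity is easy to verify numerically from the mask formulas in Section 1 (function name illustrative):

```python
import numpy as np

def stno(d, k):
    """STNO mask [p_S, p_T, p_N, p_O] per frame for target speaker k."""
    p_sil = np.prod(1.0 - d, axis=0)
    p_tgt = d[k] * np.prod(np.delete(1.0 - d, k, axis=0), axis=0)
    return np.stack([p_sil, p_tgt, (1.0 - p_sil) - d[k], d[k] - p_tgt])

# Two speakers fully overlapped across every frame: both targets receive
# the identical mask (0, 0, 0, 1), so the conditioning cannot tell them apart.
d = np.ones((2, 3))
print(np.array_equal(stno(d, 0), stno(d, 1)))  # prints True
```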

5. SE-DiCoW: Resolving Overlap Ambiguity with Self-Enrollment

The SE-DiCoW approach circumvents this overlap-induced ambiguity by incorporating an enrollment segment as explicit conditioning input. The procedure:

  1. Enrollment Segment Selection: Slides a fixed-length window (e.g., 5 or 10 seconds) across the recording to find the region maximizing the sum of $p_T^t$, i.e., where the target speaker is alone and most active.
  2. Dual Input Streams: Feeds both the main audio and the enrollment segment (plus their corresponding STNO masks) into parallel branches in each encoder layer.
  3. Cross-Attention Fusion: After encoding, the main stream attends to the enrollment representations via cross-attention. The outputs are fused (MLP plus residual) before standard FDDT reapplication.

Formally, for layer $l$:

\begin{aligned} \mathbf{Z}_{\rm se}^{(l)} &= \mathrm{EncoderLayer}(\mathbf{Z}_{\rm se}^{(l-1)},\ \mathrm{STNO}_{\rm se}) \\ \mathbf{C}^{(l)} &= \mathrm{CrossAttention}(\mathbf{Q}=\mathbf{Z}^{(l-1)},\ \mathbf{K},\mathbf{V}=\mathbf{Z}_{\rm se}^{(l)}) \\ \mathbf{Z}_{\rm aug}^{(l)} &= \mathrm{MLP}([\mathbf{Z}^{(l-1)};\ \mathbf{C}^{(l)}]) + \mathbf{Z}^{(l-1)} \\ \mathbf{Z}^{(l)} &= \mathrm{EncoderLayer}(\mathbf{Z}_{\rm aug}^{(l)},\ \mathrm{STNO}) \end{aligned}

Only the outputs from the main stream are used for ASR loss. This cross-attentive fusion allows the model to correlate acoustic patterns in the mixture with the signature of the target speaker, resolving cases where STNO-based conditioning is fundamentally ambiguous due to identical mask vectors (Polok et al., 27 Jan 2026).
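Step 1 of the procedure (enrollment segment selection) can be sketched as a sliding-window maximization over the target-alone probabilities; the prefix-sum implementation below is an efficiency assumption on my part, not a detail from the paper:

```python
import numpy as np

def select_enrollment(p_target, window_frames):
    """Pick the window maximizing the summed target-alone probability p_T^t.

    p_target      : (T,) per-frame probability that the target speaks alone
    window_frames : fixed enrollment length in frames (e.g., 5-10 s of audio)
    Returns (start, end) frame indices of the best window.
    """
    # Sliding-window sums via a cumulative sum: sum over [i, i+w) equals
    # csum[i+w] - csum[i], computed for every valid window start at once.
    csum = np.concatenate([[0.0], np.cumsum(p_target)])
    window_sums = csum[window_frames:] - csum[:-window_frames]
    start = int(np.argmax(window_sums))
    return start, start + window_frames
```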

6. Impact and Performance Across Domains

DiCoW and SE-DiCoW establish a new paradigm for speaker-attributed ASR, yielding strong generalization across languages and domains with minimal task-specific tuning. SE-DiCoW, in particular, achieves a $52.4\%$ relative reduction in macro-averaged tcpWER over the original DiCoW on the EMMA MT-ASR benchmark, with $>75\%$ improvement in fully overlapped synthetic mixtures. The empirical findings highlight that speaker discrimination via frame-level event structure (STNO masks) is highly effective in typical conversational and meeting settings, with the self-enrollment enhancement enabling robust attribution even in pathological overlap scenarios (Polok et al., 27 Jan 2026).

7. Practical Methodology

The DiCoW workflow proceeds as follows:

  • Diarization: Produce frame-level activity probabilities $d(s, t)$ for all speakers.
  • Mask Computation: For each frame, compute the four STNO probabilities as above, producing a $(4, T)$ mask matrix.
  • Conditioning and Inference: For a given target speaker, inject the STNO mask (and, for SE-DiCoW, enrollment representations) into the Whisper encoder via FDDT. Repeat for all speakers present, yielding segmented, speaker-attributed transcripts.
  • Training Augmentations: Robustness is achieved by augmenting both spectrograms and STNO masks—e.g., via SpecAugment and adversarial mask flipping.
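The mask-side augmentation can be illustrated with a simplified stand-in: randomly permuting the STNO rows on a fraction of frames so the model cannot fully trust the diarizer. The exact adversarial mask-flipping scheme used in the papers may differ; this sketch only conveys the idea:

```python
import numpy as np

def flip_mask_augment(masks, flip_prob=0.1, rng=None):
    """Randomly permute the STNO class rows on a fraction of frames.

    A simplified stand-in for diarization-mask augmentation: corrupted
    frames simulate diarizer errors, encouraging robustness to them.
    masks : (4, T) STNO probabilities; returned copy still sums to 1 per frame.
    """
    if rng is None:
        rng = np.random.default_rng()
    masks = masks.copy()
    flip = rng.random(masks.shape[1]) < flip_prob
    for t in np.where(flip)[0]:
        masks[:, t] = masks[rng.permutation(4), t]  # shuffle class assignment
    return masks
```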

This pipeline eliminates reliance on pretrained speaker embeddings while outperforming cascaded speech separation and diarization–ASR pipelines. The design principles underlying DiCoW and SE-DiCoW are extensible to other large ASR backbones and may inform future research on overlap-resilient speaker-attributed recognition architectures (Polok et al., 2024, Polok et al., 27 Jan 2026).
