
DiCoW: Diarization-Conditioned Whisper Encoder

Updated 7 October 2025
  • DiCoW is a diarization-conditioned extension of Whisper that integrates per-frame speaker activity to enhance multi-speaker ASR performance.
  • It introduces FDDT and QKb modules that adapt internal representations and attention biases to focus on target-speaker and overlapping speech.
  • Extensive validation across meeting and noisy far-field scenarios shows significant transcription error reductions and runtime improvements.

A Diarization-Conditioned Whisper (DiCoW) Encoder refers to an architectural extension of the Whisper automatic speech recognition (ASR) model where frame-level speaker activity signals—derived from automatic speaker diarization—are directly integrated into the encoder transformer stack as conditioning information. DiCoW eliminates reliance on explicit speaker embeddings and instead enables robust target-speaker and speaker-attributed ASR by modulating internal representations according to diarization labels. The approach is validated for multi-speaker transcription and attribution in meetings, far-field recordings, and noisy overlapped speech, and can generalize to unseen speakers without extensive speaker-specific training data (Polok et al., 30 Dec 2024).

1. Motivation and Background

Traditional multi-speaker ASR systems typically depend on speaker embeddings, learned from enrollment utterances, to condition the recognizer, but this limits generalization to new or unseen speakers and often necessitates complex separation or clustering pipelines. DiCoW circumvents these constraints by exploiting frame-wise diarization scores, such as time-aligned speaker probabilities, directly in the transformer-based encoder.

Diarization outputs, typically in the form of soft activity matrices for each hypothesized speaker over time, offer a rich conditioning signal for distinguishing speech, non-speech, single-speaker, and overlapping regimes.

This paradigm shift is motivated by findings that relative differences among speakers, as inferred from diarization, are easier for large ASR models to exploit than absolute speaker embeddings (Polok et al., 14 Sep 2024), and that integrating these signals into the encoder can robustly focus recognition on the intended speaker stream.

2. Frame-Level Diarization-Dependent Transformations (FDDT) and Query-Key Biasing (QKb)

DiCoW contains two principal technical modules:

Frame-Level Diarization-Dependent Transformations (FDDT)

At every transformer layer $l$, the encoder's per-frame hidden state $\mathbf{z}_t^l$ is augmented via a convex combination of four class-specific affine transformations corresponding to Silence ($\mathcal{S}$), Target only ($\mathcal{T}$), Non-Target only ($\mathcal{N}$), and Overlap ($\mathcal{O}$):

$$\hat{\mathbf{z}}_t^l = (W_{\mathcal{S}}^l \mathbf{z}_t^l + b_{\mathcal{S}}^l)\,p_{\mathcal{S}}^t + (W_{\mathcal{T}}^l \mathbf{z}_t^l + b_{\mathcal{T}}^l)\,p_{\mathcal{T}}^t + (W_{\mathcal{N}}^l \mathbf{z}_t^l + b_{\mathcal{N}}^l)\,p_{\mathcal{N}}^t + (W_{\mathcal{O}}^l \mathbf{z}_t^l + b_{\mathcal{O}}^l)\,p_{\mathcal{O}}^t$$

where $p_{\mathcal{C}}^t$ is the probability assigned by diarization to class $\mathcal{C} \in \{\mathcal{S}, \mathcal{T}, \mathcal{N}, \mathcal{O}\}$ at frame $t$. The four probability masses are derived from the diarization outputs $d(s, t)$ for $S$ speakers:

  • $p_{\mathcal{S}}^t = \prod_{s=1}^S [1 - d(s, t)]$
  • $p_{\mathcal{T}}^t = d(s_k, t) \prod_{s \neq s_k} [1 - d(s, t)]$
  • $p_{\mathcal{N}}^t = (1 - p_{\mathcal{S}}^t) - d(s_k, t)$
  • $p_{\mathcal{O}}^t = d(s_k, t) - p_{\mathcal{T}}^t$

where $s_k$ is the target speaker. By construction, the four masses sum to one at every frame, so the combination above is indeed convex.

This conditioning mechanism is repeated at each encoder layer, adapting the internal representations according to speaker context.
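As a concrete illustration, the following PyTorch sketch computes the STNO masses and applies one FDDT layer; the names and shapes are illustrative rather than taken from the released DiCoW code.

```python
import torch
import torch.nn as nn

def stno_probs(d: torch.Tensor, k: int) -> torch.Tensor:
    """Map diarization outputs d[t, s] in [0, 1] to per-frame STNO masses."""
    p_sil = torch.prod(1.0 - d, dim=-1)                        # silence: nobody speaks
    others = [s for s in range(d.shape[-1]) if s != k]
    p_tgt = d[:, k] * torch.prod(1.0 - d[:, others], dim=-1)   # target speaks alone
    p_non = (1.0 - p_sil) - d[:, k]                            # others speak, target silent
    p_ovl = d[:, k] - p_tgt                                    # target overlapped by others
    return torch.stack([p_sil, p_tgt, p_non, p_ovl], dim=-1)   # (T, 4); rows sum to 1

class FDDT(nn.Module):
    """One encoder layer's frame-level diarization-dependent transformation."""
    def __init__(self, dim: int):
        super().__init__()
        # Identity weights and zero biases keep the pretrained encoder intact at init.
        self.W = nn.Parameter(torch.eye(dim).repeat(4, 1, 1))  # (4, dim, dim): S, T, N, O
        self.b = nn.Parameter(torch.zeros(4, dim))

    def forward(self, z: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # z: (T, dim) hidden states; p: (T, 4) STNO masses from stno_probs.
        zc = torch.einsum("cij,tj->tci", self.W, z) + self.b   # four affine views of z
        return (p.unsqueeze(-1) * zc).sum(dim=1)               # convex combination
```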

Query-Key Biasing (QKb)

QKb injects speaker-activity bias directly into the attention mechanism. For the attention query $\mathbf{q}_i$ and key $\mathbf{k}_j$, an extra dimension encodes speaker focus:

  • $\hat{\mathbf{q}}_i = [\mathbf{q}_i; 1]$
  • $\hat{\mathbf{k}}_j = [\mathbf{k}_j; -c]$

with $c$ a tunable positive constant.

Attention scores are then computed as

$$(\mathbf{W}_q \mathbf{q}_i)^T (\mathbf{W}_k \mathbf{k}_j) - c$$

for non-target frames, suppressing attention weights for irrelevant speaker regimes. The mechanism allows the ASR system to finely control how much overlapping and non-target speech is integrated or down-weighted during encoding.
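A minimal sketch of this biasing, assuming the appended key entry is $-c$ on non-target frames and $0$ elsewhere (a soft weight works equally well); the function name is illustrative:

```python
import torch

def qkb_scores(q: torch.Tensor, k: torch.Tensor,
               non_target: torch.Tensor, c: float) -> torch.Tensor:
    """Attention logits with a -c penalty on non-target key frames.

    q: (T_q, d) projected queries; k: (T_k, d) projected keys;
    non_target: (T_k,) per-frame non-target weight in [0, 1].
    Appending 1 to each query and -c * non_target[j] to key j is equivalent
    to subtracting c * non_target[j] from column j of the score matrix.
    """
    q_hat = torch.cat([q, torch.ones(q.shape[0], 1)], dim=-1)      # [q_i; 1]
    k_hat = torch.cat([k, -c * non_target.unsqueeze(-1)], dim=-1)  # [k_j; -c]
    return q_hat @ k_hat.T                                         # (T_q, T_k) logits
```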

3. Integration in DiCoW and Extension to Other Architectures

The DiCoW encoder is implemented by extending the Whisper backbone with FDDT and/or QKb conditioning. The conditioning is driven by the output of a diarization pipeline such as DiariZen (Polok et al., 16 Jun 2025) or another external diarizer capable of producing per-frame probabilities.

At initialization, the extra affine transformation matrices and biases are set to the identity and zero, respectively, so that conditioning is initially a no-op and the pre-trained Whisper encoder's original behavior is exactly preserved; the model becomes target-speaker aware only through subsequent adaptation.
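Since the STNO masses sum to one, identity weights and zero biases make the convex combination return the input unchanged, as the following check against the illustrative FDDT class from Section 2 demonstrates:

```python
import torch

fddt = FDDT(dim=512)                             # illustrative FDDT class from Section 2
z = torch.randn(100, 512)                        # 100 frames of encoder hidden states
p = torch.softmax(torch.randn(100, 4), dim=-1)   # arbitrary masses summing to one
assert torch.allclose(fddt(z, p), z, atol=1e-5)  # conditioning is a no-op at init
```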

The approach generalizes to other single-speaker ASR architectures, e.g., Branchformer, by adding equivalent FDDT modules post-encoder (Polok et al., 30 Dec 2024).

4. Speaker-Attributed and Target-Speaker ASR

DiCoW supports both target-speaker ASR (transcribing the intended speaker in multi-talker mixtures) and speaker-attributed ASR (sequentially generating transcripts for each speaker present, as indicated by the diarization mask).

Unlike traditional separation-based systems, which require cascaded separation and recognition, the DiCoW approach directly conditions the recognition model on diarization and produces time-aligned, speaker-attributed transcripts.
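Schematically, speaker-attributed decoding reduces to one conditioned pass per speaker. In the sketch below, `diarize` and `model.transcribe` are hypothetical stand-ins for an external diarizer and a diarization-conditioned Whisper wrapper, and `stno_probs` is the helper from Section 2:

```python
import torch

def speaker_attributed_asr(audio, model, diarize):
    """Run one diarization-conditioned decoding pass per hypothesized speaker."""
    d = torch.as_tensor(diarize(audio))          # (T, S) per-frame activity matrix
    transcripts = {}
    for k in range(d.shape[-1]):                 # each speaker in turn as the target
        p = stno_probs(d, k)                     # STNO masses with s_k as target
        transcripts[f"speaker_{k}"] = model.transcribe(audio, p)
    return transcripts
```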

Additionally, DiCoW can be extended to joint multi-talker decoding via serialized output training (SOT), where embeddings for all speakers are concatenated and a shared decoder generates interleaved transcriptions with speaker tags and timestamps (Kocour et al., 4 Oct 2025).

5. Experimental Validation and Performance

DiCoW is validated across multiple corpora:

  • Real-world meeting data: AMI, NOTSOFAR-1 (CHiME 8), with ground-truth and automatic diarization signals.
  • Synthetic overlapped speech: Libri2Mix, LibriCSS, LibriMix.

Performance benchmarks demonstrate that DiCoW yields substantial improvements in target-speaker and speaker-attributed transcription error rates:

  • On NOTSOFAR-1, DiCoW reduces ORC-WER by approximately 11% absolute compared to baseline separation/diarization cascades (Polok et al., 14 Sep 2024).
  • With ground-truth diarization, cpWER, ORC-WER, and time-constrained metrics (tcWER) are consistently lower than standard Whisper and speaker-embedding–conditioned systems (Polok et al., 30 Dec 2024).

Domain adaptation and fine-tuning further improve robustness and accuracy without eroding multilingual generalization (Polok et al., 16 Jun 2025).

The integration with efficient speaker-agnostic activity-stream heuristics decouples inference runtime from the number of speakers, leading to runtime gains exceeding 100% relative on multi-party meeting benchmarks while maintaining recognition accuracy (He et al., 4 Oct 2025).

6. Practical Deployment and Implications

DiCoW is suited for meeting transcription, distant-mic conversational analysis, court or broadcast interview transcription, and any environment requiring reliable multi-speaker attribution.

Practical deployment leverages frame-wise conditioning by external diarization, such as EEND, WavLM+Conformer, or Pyannote-based pipelines. Voice Activity Detector augmentation and label-mass redistribution strategies mitigate annotation inconsistencies, further improving robustness.
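For illustration, a per-frame activity matrix $d(s, t)$ can be derived from a Pyannote pipeline roughly as follows; the model name, the 50 Hz frame rate (Whisper's encoder output rate), and the hard 0/1 activities are assumptions, and soft per-frame posteriors can be substituted:

```python
# Sketch: building a frame-level activity matrix d(s, t) from a Pyannote run.
import numpy as np
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")  # may need an HF auth token
diarization = pipeline("meeting.wav")        # pyannote Annotation of speaker turns

frame_hz = 50                                # assumed encoder frame rate
speakers = sorted(diarization.labels())
num_frames = int(diarization.get_timeline().extent().end * frame_hz) + 1
d = np.zeros((num_frames, len(speakers)))    # hard 0/1 activities per frame
for segment, _, label in diarization.itertracks(yield_label=True):
    s = speakers.index(label)
    d[int(segment.start * frame_hz): int(segment.end * frame_hz) + 1, s] = 1.0
```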

DiCoW’s conditioning architecture opens new research directions for integrating external supervision signals (such as diarization, separation masks, or conversational context) directly into large, pretrained ASR models, reducing the need for end-to-end retraining or complex pipeline coordination. The method generalizes to new speakers and domains with minimal adaptation effort and preserves the underlying model’s capabilities, including multilingual recognition.

Extensions include:

  • Joint training for diarization, separation, and ASR under shared encoders, as in multi-task frameworks using residual weighted-sum encoding (RWSE) to fuse semantic levels (Shakeel et al., 28 Aug 2025).
  • Unified encoder-decoder architectures with serialized output training and joint decoding for global conversational context modeling (Kocour et al., 4 Oct 2025).
  • Integration of deep clustering and transformer-updated attractors for more robust speaker representation (Palzer et al., 5 Jun 2025).
  • Incorporation of speech activity vectors (SAV) and explicit disentanglement of nuisance noise for improved clustering and segment attribution (Kim et al., 2021).
  • Multilingual, unsupervised clustering using Whisper embeddings and a mixture of sparse autoencoders for domain-robust telephone diarization (Lam et al., 2 Jul 2024).

These directions suggest the convergence of diarization, separation, and transcription tasks via external supervision and model conditioning, underpinned by the adaptable architecture of DiCoW and its variants.


DiCoW represents a diarization-conditioned extension to Whisper ASR, employing frame-level conditioning transformations and attention biasing informed by external diarization, validated in real-world multi-speaker environments, and generalizable to new speakers and domains. It serves as the foundation for multi-talker, speaker-attributed, and joint ASR pipelines in current speech recognition research.
