Word Diarization Error Rate (WDER)
- WDER is a metric that quantifies the fraction of words with incorrect speaker labels in ASR outputs by focusing solely on aligned words.
- It complements traditional WER and DER by isolating speaker-attribution mistakes through dynamic-programming word alignment.
- Practical approaches, including joint ASR+SD models and LLM-based post-processing, have demonstrated significant reductions in WDER across diverse datasets.
Word Diarization Error Rate (WDER) quantifies the fraction of words in an automatic speech recognition (ASR) output that are assigned incorrect speaker labels. Unlike conventional Diarization Error Rate (DER), which is computed over time intervals, or Word Error Rate (WER), which ignores speaker information, WDER isolates speaker-attribution mistakes at the word level. This provides a tailored metric for evaluating joint ASR and speaker diarization pipelines, as well as post-processing correction systems.
1. Formal Definitions and Computation of WDER
WDER is mathematically defined as the proportion of aligned words (i.e., words that are either correctly recognized or are substitutions) that receive incorrect speaker tags. Insertions and deletions are excluded from both the numerator and denominator because speaker attribution for those word positions is ambiguous. Formally (Shafey et al., 2019; Kirakosyan et al., 30 Aug 2024; Wang et al., 7 Jan 2024):

$$\mathrm{WDER} = \frac{S_{\mathrm{IS}} + C_{\mathrm{IS}}}{S + C}$$

where:
- $S$ is the number of ASR substitution errors (reference word aligned to an incorrect hypothesis word),
- $C$ is the number of correctly recognized (matched) words,
- $S_{\mathrm{IS}}$ is the number of substitutions assigned the incorrect speaker,
- $C_{\mathrm{IS}}$ is the number of correct matches assigned the incorrect speaker.
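The ratio can be sanity-checked directly from the four counts. The helper below is a minimal sketch (not code from the cited papers), with argument names mirroring the counts defined above:

```python
def wder_from_counts(s_is: int, c_is: int, s: int, c: int) -> float:
    """WDER = (S_IS + C_IS) / (S + C).

    s_is: substitutions carrying an incorrect speaker tag
    c_is: correct word matches carrying an incorrect speaker tag
    s:    total substitutions; c: total correct matches
    Insertions and deletions are excluded by construction.
    """
    aligned = s + c
    if aligned == 0:
        raise ValueError("no aligned words; WDER is undefined")
    return (s_is + c_is) / aligned

# One substitution with a wrong speaker tag plus one correct match:
print(wder_from_counts(s_is=1, c_is=0, s=1, c=1))  # 0.5
```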
Alternatively, implementations that summarize all word-level mistakes (including deletions/insertions) as diarization errors use:

$$\mathrm{WDER} = \frac{S + D + I}{N}$$

where $S$, $D$, and $I$ are the substitution, deletion, and insertion counts (from asclite alignments) and $N$ is the reference word count (Paturi et al., 2023).
A further estimate arises in the context of long-form multi-talker evaluation as the difference between concatenated-permutation WER (cpWER), in which streams are aligned under an optimal speaker permutation, and the diarization-invariant version (DI-cpWER), which assigns hypothesis words to reference speakers optimally:

$$\mathrm{WDER} = \mathrm{cpWER} - \mathrm{DI\text{-}cpWER}$$
This difference reflects the penalty specifically attributable to speaker-label confusion (Neumann et al., 4 Aug 2025).
WDER is always computed after aligning reference and ASR outputs using standard word-level dynamic programming techniques (e.g., Levenshtein distance, asclite). Only aligned word pairs contribute to the metric.
2. Relationship to WER, DER, and Related Metrics
WDER exists alongside two canonical metrics:
- Word Error Rate (WER): $\mathrm{WER} = \frac{S + D + I}{N}$, reporting lexical (recognition) errors independent of speaker attribution.
- Diarization Error Rate (DER): measures time-proportion of missed speech, false alarms, and speaker confusion; grounded on frame- or segment-level alignment.
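For reference, DER is conventionally computed over time durations rather than words; the standard NIST-style formulation (stated here for context, not taken from the cited papers) is:

```latex
\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{fa}} + T_{\mathrm{conf}}}{T_{\mathrm{speech}}}
```

where $T_{\mathrm{miss}}$ is missed speech time, $T_{\mathrm{fa}}$ is false-alarm (non-speech labeled as speech) time, $T_{\mathrm{conf}}$ is speaker-confusion time, and $T_{\mathrm{speech}}$ is the total reference speech time.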
Key distinctions:
| Metric | Domain | Penalizes Speaker Errors | Handles Word-Level Errors | Handles Time-Misalignment |
|---|---|---|---|---|
| WER | Lexical, word-level | No | Yes | No |
| DER | Signal, frame-level | Yes | No | Yes |
| WDER | Lexical + diarization, word-level | Yes (on aligned words) | Yes (ASR+SD jointly) | No |
WDER quantifies the word-level impact of speaker attribution errors, abstracting away from temporal segmentation and from word insertions/deletions (Shafey et al., 2019; Wang et al., 7 Jan 2024). In long-form multi-talker settings, related metrics such as cpWER, tcpWER, and DI-cpWER capture different tradeoffs:
- cpWER: penalizes both lexical and speaker-label errors, without temporal constraints.
- tcpWER: adds a temporal collar to force plausible word alignments.
- DI-cpWER: optimal assignment of hypothesis speaker labels to references, ignoring speaker confusions.
- WDER = cpWER − DI-cpWER: isolates speaker confusion cost (Neumann et al., 4 Aug 2025).
3. Practical Calculation: Workflow and Error Decomposition
The canonical workflow for WDER calculation is:
- Alignment: Align reference transcript (with word-level speaker tags) to the ASR+SD hypothesis using dynamic-programming or asclite (handles overlaps and multi-speaker alignment) (Paturi et al., 2023; Paturi et al., 25 Jun 2024).
- Edit Classification:
- For each reference word, note if it is a correct match, substitution, deletion, or insertion.
- For each aligned (matched or substituted) word pair, check if the hypothesized speaker tag matches the reference.
- Increment $S_{\mathrm{IS}}$ and $C_{\mathrm{IS}}$ accordingly.
- Compute WDER: Divide the total number of aligned words with incorrect speaker attribution by the total number of aligned words (excluding insertions and deletions).
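The steps above can be sketched end-to-end with a plain word-level Levenshtein alignment (a minimal illustration; production pipelines typically rely on asclite or similar tooling for overlap-aware, multi-speaker alignment):

```python
def align(ref, hyp):
    """Word-level Levenshtein alignment with backtrace.

    Returns a list of (ref_idx, hyp_idx) pairs; a None on the hyp side
    marks a deletion, a None on the ref side marks an insertion.
    """
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = ref[i - 1] != hyp[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((i - 1, None)); i -= 1     # deletion
        else:
            ops.append((None, j - 1)); j -= 1     # insertion
    ops.reverse()
    return ops

def wder(ref_words, ref_spk, hyp_words, hyp_spk):
    """WDER over aligned (matched or substituted) words only."""
    aligned = speaker_errors = 0
    for ri, hi in align(ref_words, hyp_words):
        if ri is not None and hi is not None:   # skip insertions/deletions
            aligned += 1
            speaker_errors += ref_spk[ri] != hyp_spk[hi]
    return speaker_errors / aligned if aligned else 0.0

# "you" is deleted by the hypothesis; "how" carries a wrong speaker tag:
print(wder(["hello", "how", "are", "you"], ["A", "A", "B", "B"],
           ["hello", "how", "are"], ["A", "B", "B"]))  # 0.333...
```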
A toy example (Shafey et al., 2019) demonstrates:
| REF | HYP | Error Type |
|---|---|---|
| <dr> hello | <dr> hello | correct |
| <pt> thanks | <dr> tanks | substitution + spkr |
| <pt> bye | — | deletion |
| — | <pt> see | insertion |
Only "hello" and "thanks/tanks" are counted in WDER, and only "thanks/tanks" is a speaker error ($S_{\mathrm{IS}} = 1$, $C_{\mathrm{IS}} = 0$), so WDER = 1/2 = 50%.
Some approaches compute WDER over all errors (including deletions/insertions), though the predominant convention is to exclude these due to unassignable speaker tags (Paturi et al., 2023).
4. System Architectures Targeting WDER Reduction
End-to-End Joint ASR+SD
A single sequence-to-sequence model—typically an RNN-Transducer—can emit word and speaker tokens jointly (Shafey et al., 2019). This architecture leverages:
- Acoustic Encoder: Extracts features from raw log-Mel frames.
- Prediction Network: Encodes previous output tokens.
- Joint Network: Produces scores over vocabulary (including special speaker tokens).
This joint modeling exploits both acoustic and linguistic cues, eliminating the need for word-to-segment post-hoc reconciliation, and is empirically shown to reduce WDER from 15.8% (two-stage baseline) to 2.2% (joint model) in medical conversations.
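Conceptually, such a model emits a single token stream interleaving words and speaker tokens, so pairing each word with its speaker is a simple parse. The sketch below assumes the `<dr>`/`<pt>` role-token format from the toy example in Section 3; the actual token inventory is model-specific:

```python
def attach_speakers(tokens):
    """Pair each word with the most recent speaker token in the stream.

    Speaker tokens are assumed to look like '<dr>' or '<pt>'; any word
    emitted before the first speaker token gets speaker None.
    """
    pairs, speaker = [], None
    for tok in tokens:
        if tok.startswith("<") and tok.endswith(">"):
            speaker = tok.strip("<>")
        else:
            pairs.append((tok, speaker))
    return pairs

print(attach_speakers(["<dr>", "hello", "<pt>", "thanks", "bye"]))
# [('hello', 'dr'), ('thanks', 'pt'), ('bye', 'pt')]
```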
Lexical and LLM-Based Post-Processing
LLM-driven second-pass correction ("lexical speaker error correction," LSEC) passes a sliding window of contextual word embeddings (e.g., RoBERTa-base), together with first-pass speaker tags, through a lightweight transformer network that predicts corrected speaker IDs (Paturi et al., 2023; Paturi et al., 25 Jun 2024). Extensions include:
- Audio-Grounded LSEC (AG-LSEC): fuses frame- or word-level speaker posteriors from neural diarization (EEND) into the lexical transformer via early or late fusion: early fusion concatenates acoustic and lexical embeddings, while late fusion merges model outputs. Early fusion achieves up to 39–41% relative WDER reduction over the baseline (Paturi et al., 25 Jun 2024).
- LLM-Based Correction Systems: Finetuned LLMs (e.g., PaLM 2-S) can post-process diarized transcripts, transferring only speaker labels while preserving word content, yielding up to 55.5% WDER reduction on Fisher (baseline 5.32% to 2.37%), and 44.9% on Callhome (7.72% to 4.25%) (Wang et al., 7 Jan 2024). Zero-shot and non-finetuned LLMs typically worsen WDER due to paraphrases and deletions.
Non-Autoregressive Correction
Models such as ALBERT-base with a stacked transformer encoder (parameters initially frozen, then unfrozen), operating over windows around speaker-change points, further reduce WDER by modeling contextual error patterns, especially boundary misassignments (Kirakosyan et al., 30 Aug 2024).
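To illustrate the windowing idea, here is a hypothetical helper that extracts word-index windows around first-pass speaker changes (an assumption for illustration, not the paper's implementation):

```python
def change_point_windows(speakers, width=4):
    """Yield (start, end) word-index windows centered on each speaker change.

    speakers: per-word speaker labels from the first diarization pass.
    width: number of words kept on each side of the change point.
    """
    n = len(speakers)
    for i in range(1, n):
        if speakers[i] != speakers[i - 1]:
            yield (max(0, i - width), min(n, i + width))

spk = ["A"] * 6 + ["B"] * 6
print(list(change_point_windows(spk, width=3)))  # [(3, 9)]
```

Each window is then rescored by the correction model, which is why gains concentrate at boundary misassignments.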
5. Empirical Results Across Systems and Datasets
| Model/Approach | Dataset | Baseline WDER | Improved WDER | Relative Reduction |
|---|---|---|---|---|
| Joint RNN-T (ASR+SD) (Shafey et al., 2019) | Medical conv. | 15.8% | 2.2% | 86% |
| LSEC (Paturi et al., 2023) | Fisher | 2.26% | 1.58% | 30% |
| AG-LSEC (Early) (Paturi et al., 25 Jun 2024) | Fisher | 2.56% | 1.56% | 39.1% |
| LLM Post-Processing (Wang et al., 7 Jan 2024) | Fisher | 5.32% | 2.37% | 55.5% |
| SEC (ALBERT+Transformer) (Kirakosyan et al., 30 Aug 2024) | Fisher | 2.80% | 2.42% | 13.6% |
Empirical findings highlight that most WDER reductions stem from correcting errors at speaker-change boundaries or in short segments. Models combining lexical with acoustic features are particularly effective in regions of overlapping speech or rapid speaker turns.
6. Analysis of WDER in Multi-Talker and Overlap Conditions
In long-form multi-talker evaluation, WDER can be interpreted as the difference in error rates between cpWER (speaker-attributed) and DI-cpWER (speaker-agnostic). This quantifies the excess word errors that can be eliminated with perfect speaker attribution (Neumann et al., 4 Aug 2025). The greedy approximation algorithm for DI-cpWER/ORC-WER enables tractable computation in meeting-scale data with many speakers, yielding <0.1% error compared to exact assignment.
WDER responds sensitively to speaker–label boundary mismatches, which are frequent at short turns, overlap regions, and around abrupt speaker switches. Systems that jointly process ASR and speaker labeling, as well as multimodal correction models, are empirically favored here.
7. Practical Implications, Limitations, and Usage Guidance
WDER serves as a focused metric for evaluating "who spoke which word"—a crucial output for downstream conversational analytics, transcription search, and dialogue system integration. Unlike DER, which may obfuscate the precise word-level attribution, and WER, which ignores speaker identity, WDER provides an actionable figure of merit for optimizing end-to-end diarization and correction models.
Key insights:
- Lexical correction models are effective even when trained on modest amounts of annotated data (Paturi et al., 2023).
- Audio grounding further boosts resilience in challenging speaker turn/overlap scenarios (Paturi et al., 25 Jun 2024).
- Finetuned LLM post-processing can yield dramatic WDER reductions, though robustness to paraphrasing and transcript format must be ensured (Wang et al., 7 Jan 2024).
Limitations include:
- Current lexical correctors are largely limited to two-speaker windows; extension to multi-speaker contexts is not fully resolved (Paturi et al., 2023).
- Some definitions of WDER include insertions/deletions, but the dominant practice excludes them to avoid unassignable speaker roles (Shafey et al., 2019; Neumann et al., 4 Aug 2025).
- All leading results are from English conversational domains; generalization to multilingual or highly overlapped speech remains an open area (Paturi et al., 25 Jun 2024).
WDER is now widely reported for benchmarking diarization-enhanced ASR systems, especially in settings prioritizing transcript usability and searchability over frame-aligned segmentation. It is recommended to report WDER alongside WER and DER or cpWER/tcpWER to comprehensively characterize system behavior, especially as architectures for joint recognition and diarization mature (Neumann et al., 4 Aug 2025).