
LLM-Based Speaker Diarization Correction

Updated 9 November 2025
  • The paper demonstrates that LLM-based correction significantly reduces speaker-attribution errors by leveraging semantic and discourse context.
  • It outlines diverse architectures including two-pass lexical correction, audio-lexical fusion, and integrated end-to-end models for improved diarization.
  • The study reports empirical gains, discusses training regimes, and highlights challenges like computational cost and scalability in real-world settings.

LLM-based speaker diarization correction encompasses a set of approaches that leverage large language models (LLMs) to improve the assignment of speaker labels to transcribed words in multi-speaker audio. These techniques augment or refine the output of traditional acoustic diarization and automatic speech recognition (ASR) systems by exploiting the semantic, syntactic, and dialog-structural knowledge embedded in LLMs. The principal objectives are reductions in speaker-attribution error rates, improved robustness across domains, and greater usability in complex conversational scenarios. This article surveys the major system architectures, integration methods, training regimes, empirical results, and open research directions in this rapidly evolving area.

1. Motivation and Background

Speaker diarization (SD), the process of determining "who spoke when" in an audio stream, is typically addressed by combining ASR for transcription with a separate SD system that segments and labels speech by speaker. Speaker errors—misattributions of words or segments to the wrong speaker—remain a significant bottleneck, particularly around turn boundaries, in overlapped speech, or when the number of speakers is misestimated. Performance issues arise from inaccurate segmentation, imperfect timestamp alignment, background noise, and errors in clustering algorithms.

Traditional SD systems rely heavily on acoustic features and operate independently of language content. However, conversational language intrinsically encodes speaker-change cues, dialog roles, and pragmatic structure, which are accessible only through modeling of the transcript. The rapid progress in LLMs offers an opportunity to inject such lexical and discourse context, either as a post-processing correction mechanism or via joint modeling, to address the shortcomings of purely acoustic diarization.

2. System Architectures and Integration Strategies

LLM-based correction systems fall into several major paradigms:

2.1 Two-Pass Lexical Correction

This approach inserts an LLM-driven post-processing step after the baseline cascade of ASR and SD, leaving existing acoustics modules untouched. Notable instantiations include:

  • Lexical Speaker Error Correction: The transcript and initial word-level speaker labels from ASR and SD are fed to a Transformer-based LLM (e.g., RoBERTa-base), possibly followed by a lightweight front-end Transformer, which predicts revised speaker label probabilities for each word. Correction is typically local to a sliding context window (e.g., 30 words) (Paturi et al., 2023).
  • DiarizationLM Framework: The transcript is serialized as a sequence of "<spk:k> ...words..." tokens and provided (with or without instructions/examples) as prompt to a finetuned LLM (e.g., PaLM 2-S), which re-emits the same words interleaved with corrected speaker tags. Output sequences are aligned to the original via transcript-preserving speaker transfer (TPST) to prevent content drift (Wang et al., 7 Jan 2024).
  • Seq2Seq Text-Only Correction: An off-the-shelf LLM (e.g., Mistral 7B with QLoRA adapters) is finetuned to map ASR/SD transcripts to reference-labeled versions. The architecture does not require changes to the transformer backbone; only adapters are updated (Efstathiadis et al., 7 Jun 2024).
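The serialized prompt format used in DiarizationLM-style correction can be sketched as follows. This is a minimal illustration; the exact token conventions and delimiters vary across the cited systems.

```python
def serialize_transcript(words, speakers):
    """Serialize word-level diarization output into a compact
    "<spk:k> ...words..." string for LLM prompting (illustrative;
    exact token format differs across systems)."""
    parts, prev = [], None
    for word, spk in zip(words, speakers):
        if spk != prev:  # emit a speaker tag only at turn changes
            parts.append(f"<spk:{spk}>")
            prev = spk
        parts.append(word)
    return " ".join(parts)

words = ["hi", "how", "are", "you", "good", "thanks"]
speakers = [1, 1, 1, 1, 2, 2]
print(serialize_transcript(words, speakers))
# → <spk:1> hi how are you <spk:2> good thanks
```

The finetuned LLM re-emits this same word stream with speaker tags moved to the corrected turn boundaries, after which the tags are transferred back onto the original transcript.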

2.2 Audio-Lexical Fusion Methods

To mitigate hallucinations or miscorrections due to ambiguous text, several works enhance the LLM's input with acoustic scores:

  • Acoustic-Grounded LSEC: The AG-LSEC architecture fuses lexical speaker scores (from LSEC) with normalized word-level posterior probabilities directly derived from the SD pipeline (e.g., EEND). Fusion occurs either early (at the token-embedding level via concatenation) or late (via a small feed-forward network over separate modality outputs), followed by multi-word LLM context modeling (Paturi et al., 25 Jun 2024).
  • SEAL: SEAL uses discretized confidence labels (low/med/high) for each word, interleaved with the transcript, and applies a constrained decoding scheme during inference to restrict the LLM to only permissible speaker-label outputs, substantially reducing hallucinations (Kumar et al., 14 Jan 2025).
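As a simplified stand-in for AG-LSEC's learned fusion network, the late-fusion idea can be illustrated as a log-linear interpolation of per-word speaker posteriors. The weight `w` and the toy posteriors below are assumptions for illustration, not values from the paper.

```python
import numpy as np

def late_fuse(lexical_post, acoustic_post, w=0.5):
    """Late fusion of per-word speaker posteriors: a log-linear
    interpolation standing in for AG-LSEC's small learned
    feed-forward combiner (simplified sketch; weight is assumed)."""
    log_p = w * np.log(lexical_post) + (1 - w) * np.log(acoustic_post)
    fused = np.exp(log_p)
    fused /= fused.sum(axis=-1, keepdims=True)
    return fused.argmax(axis=-1)

# Two words, two speakers: the LLM is unsure about word 0,
# but the acoustic posterior is confident it belongs to speaker 1.
lex = np.array([[0.55, 0.45], [0.9, 0.1]])
aco = np.array([[0.1, 0.9], [0.6, 0.4]])
print(late_fuse(lex, aco))  # → [1 0]  (acoustic evidence flips word 0)
```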

2.3 LLM-Integrated End-to-End Models

Emerging architectures collapse diarization correction into the main speech-to-text sequence modeling:

  • Unified Speech LLMs: End-to-end models fuse frame-level acoustic representations (e.g., Whisper-large-v3) with LLM backbones (e.g., Llama-3.2-3B-instruct) via a sub-sampling projector. Speaker tokens are interleaved with text and timestamp tokens in the output, and the entire sequence is generated autoregressively and trained with a joint ASR+diarization loss (Saengthong et al., 26 Jun 2025, Yin et al., 8 Aug 2025).
  • Diarization-Aware SOT-LLM: Conditioning the LLM both on semantic and speaker embeddings (produced by separate encoders and fused by gated cross-attention), as well as on explicit diarization triplet prompts, enables the model to ground segment-level transcription in speaker identity and temporal boundaries (Lin et al., 6 Jun 2025).
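The interleaved output format these end-to-end models generate can be sketched as follows; the token naming scheme and timestamp formatting here are illustrative assumptions, not any single paper's convention.

```python
def build_target_sequence(segments):
    """Build an interleaved target sequence of speaker, timestamp, and
    text tokens for joint ASR+diarization training (token names are
    illustrative assumptions)."""
    tokens = []
    for seg in segments:
        tokens.append(f"<spk_{seg['speaker']}>")
        tokens.append(f"<t_{seg['start']:.2f}>")
        tokens.extend(seg["text"].split())
        tokens.append(f"<t_{seg['end']:.2f}>")
    return tokens

segments = [
    {"speaker": 0, "start": 0.0, "end": 1.4, "text": "hello there"},
    {"speaker": 1, "start": 1.5, "end": 2.1, "text": "hi"},
]
print(" ".join(build_target_sequence(segments)))
# → <spk_0> <t_0.00> hello there <t_1.40> <spk_1> <t_1.50> hi <t_2.10>
```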

2.4 Modular, Training-Free Pipelines

For highly dynamic domains or scenarios with unknown speaker counts, modular correction pipelines apply off-the-shelf ASR, SD, and a prompted LLM for assignment of speaker roles, merging of split clusters, and low-confidence relabeling, without further model training. Correction is based on semantic continuity and role identification via well-crafted prompts (zero-shot) (Chen et al., 18 Sep 2025).
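A training-free relabeling step of this kind reduces to careful prompt construction. The sketch below is purely illustrative; the role names, wording, and segment format are assumptions, not the prompts used in the cited work.

```python
def build_relabel_prompt(segments, roles=("Clinician", "Patient")):
    """Build a zero-shot prompt asking an off-the-shelf LLM to map
    cluster labels onto conversational roles and merge split clusters.
    Role names and wording are illustrative assumptions."""
    lines = [f"[{s['speaker']}] {s['text']}" for s in segments]
    return (
        "The following transcript carries automatic speaker labels that "
        f"may be wrong. Map each label to one of the roles {list(roles)}, "
        "merge labels that belong to the same person, and output the "
        "corrected transcript.\n\n" + "\n".join(lines)
    )

segments = [
    {"speaker": "SPK_0", "text": "What brings you in today?"},
    {"speaker": "SPK_2", "text": "I've had a headache for three days."},
]
print(build_relabel_prompt(segments))
```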

2.5 Interactive and Human-in-the-Loop Correction

LLMs are also used for real-time, interactive correction, incorporating user feedback into the diarization loop. Summaries presented by the LLM enable users to provide corrections (e.g., "Hey COBI: Predicted A saying ‘…’ was actually B"). These corrections are parsed into structured updates and applied on-the-fly, with speaker enrollments refined by adding new embeddings for corrected labels. Additional segmentation refinement, such as Split-When-Merged (SWM), addresses error propagation from under-segmented regions (He et al., 22 Sep 2025).
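The parsing of a spoken correction into a structured update can be approximated with a simple pattern match. This is a sketch: the quoted utterance and field names are hypothetical, and the cited system may well delegate this parsing to the LLM itself rather than a regex.

```python
import re

def parse_correction(utterance):
    """Parse a spoken correction like
    'Predicted A saying "good morning" was actually B' into a
    structured update (pattern and field names are illustrative)."""
    m = re.search(
        r'predicted\s+(\w+)\s+saying\s+"(.+?)"\s+was\s+actually\s+(\w+)',
        utterance, re.IGNORECASE)
    if not m:
        return None
    return {"old_speaker": m.group(1),
            "quote": m.group(2),
            "new_speaker": m.group(3)}

print(parse_correction('Predicted A saying "good morning" was actually B'))
# → {'old_speaker': 'A', 'quote': 'good morning', 'new_speaker': 'B'}
```

The resulting update can then drive on-the-fly relabeling and enrollment refinement as described above.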

3. Training Regimes, Prompt Engineering, and Losses

3.1 Data Preparation and Alignment

LLM-based correction frameworks generally require paired ASR/SD outputs and reference speaker labels. Alignment for supervised finetuning employs Levenshtein-based mapping or more specialized algorithms (TPST), ensuring speaker-label sequences are consistent with word order after possible content drifts by the LLM (Wang et al., 7 Jan 2024, Efstathiadis et al., 7 Jun 2024).
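The label-transfer step can be sketched with a standard edit alignment. The function below is a simplified stand-in for TPST, using Python's difflib rather than the papers' algorithms, and it backfills unmatched words from the preceding label.

```python
from difflib import SequenceMatcher

def transfer_speaker_labels(src_words, src_labels, tgt_words):
    """Transfer word-level speaker labels between two word sequences
    via edit alignment: a simplified stand-in for TPST. Unmatched
    target words (insertions/substitutions) inherit the most recent
    transferred label."""
    tgt_labels = [None] * len(tgt_words)
    sm = SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":  # copy labels across matched words
            for i, j in zip(range(i1, i2), range(j1, j2)):
                tgt_labels[j] = src_labels[i]
    last = src_labels[0]   # backfill gaps from the preceding label
    for j, lab in enumerate(tgt_labels):
        if lab is None:
            tgt_labels[j] = last
        else:
            last = lab
    return tgt_labels

print(transfer_speaker_labels(
    ["hello", "there", "yes", "sir"], [1, 1, 2, 2],
    ["hello", "there", "yeah", "sir"]))
# → [1, 1, 1, 2]
```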

3.2 Loss Functions and Optimization

Typically, cross-entropy loss over the entire output stream—encompassing both transcript tokens and speaker tokens—is minimized. In end-to-end models, the joint loss decomposes as:

$$L_{\text{total}} = L_{\text{ASR}} + L_{\text{diarization}}$$

where each term is a sum over the relevant token classes (text vs. speaker/timestamp tokens) (Saengthong et al., 26 Jun 2025, Yin et al., 8 Aug 2025). No explicit re-weighting is typically applied.

For signal-level fusion models, loss terms over the fused speaker posteriors (either concatenated embeddings or late decision-level scores) are computed with reference labels (Paturi et al., 25 Jun 2024).
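The loss decomposition above can be made concrete as masked cross-entropy over token classes. This is a numpy sketch with toy logits over a two-token vocabulary; real systems compute the same quantity over the LLM's full subword vocabulary.

```python
import numpy as np

def joint_loss(logits, targets, is_speaker_token):
    """Token-level cross-entropy split into ASR and diarization terms
    (sketch; toy logits stand in for the LLM's subword distribution)."""
    logits = np.asarray(logits, dtype=float)
    # Log-softmax per position, then negative log-likelihood of targets.
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    mask = np.asarray(is_speaker_token, dtype=bool)
    l_diar = nll[mask].mean() if mask.any() else 0.0
    l_asr = nll[~mask].mean() if (~mask).any() else 0.0
    return l_asr + l_diar, l_asr, l_diar

# One text token and one speaker token, each with its target index.
logits = np.log([[0.9, 0.1], [0.2, 0.8]])
total, l_asr, l_diar = joint_loss(logits, [0, 1], [False, True])
print(total, l_asr, l_diar)
```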

3.3 Prompt Engineering

Prompt formats range from minimal (just tagged text) for finetuned runs to explicit instructions for zero- and one-shot scenarios. Modular pipelines rely on structured, role-based, or low-confidence segment prompts, directing the LLM to perform mapping or relabeling actions. Length constraints are addressed via chunking, and multi-turn or role-annotated windows are used for enhanced disambiguation (Wang et al., 7 Jan 2024, Chen et al., 18 Sep 2025).
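The chunking step for handling length constraints can be sketched as overlapping windows over the word stream; the window and overlap sizes below are illustrative, not values from the cited papers.

```python
def chunk_words(words, window=100, overlap=20):
    """Split a long word sequence into overlapping windows so that each
    prompt fits the LLM context limit (sizes are illustrative)."""
    step = window - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(words[start:start + window])
    return chunks

chunks = chunk_words(list(range(250)), window=100, overlap=20)
print([len(c) for c in chunks])  # → [100, 100, 90]
```

The overlap gives each chunk context from its neighbor, so corrections near chunk boundaries can be reconciled when the outputs are stitched back together.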

4. Evaluation Metrics and Empirical Results

4.1 Metrics

Key diarization correction metrics include:

  • Word Diarization Error Rate (WDER): Fraction of words with incorrect speaker labels. Often computed with asclite alignment to account for word insertions/deletions and overlaps.
  • cpWER (Concatenated-Permutation WER): WER after speaker-attributed segmentation, then minimizing error over speaker permutations.
  • SA-WER: Speaker-attributed WER; similar computation as cpWER.
  • DER (Diarization Error Rate): Aggregate of False Alarm, Missed Speech, and Speaker Error over the total reference time. Collars of 0.25–5 seconds are standard.
  • Relative Error Reduction (%): Relative change versus baseline, typically reported in tables alongside absolute error rates.
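For pre-aligned word sequences, WDER reduces to a label-agreement count. The sketch below assumes a one-to-one word alignment and speaker identities already mapped to the reference; real evaluations perform the alignment first, e.g. with asclite.

```python
def wder(ref_labels, hyp_labels):
    """Word diarization error rate: fraction of words whose hypothesis
    speaker label disagrees with the reference (assumes the word
    sequences are already aligned one-to-one)."""
    assert len(ref_labels) == len(hyp_labels)
    errors = sum(r != h for r, h in zip(ref_labels, hyp_labels))
    return errors / len(ref_labels)

print(wder([1, 1, 2, 2], [1, 2, 2, 2]))  # → 0.25
```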

4.2 Results Summary

| System/Method | Domain | Metric | Baseline | LLM-Based | Relative Gain |
|---|---|---|---|---|---|
| Lexical SEC (Paturi et al., 2023) | Fisher/RT03/CHAE/CH-109 | WDER | 2.26–4.28 | 1.59–3.56 | 15–30% |
| DiarizationLM (PaLM 2-S) (Wang et al., 7 Jan 2024) | Fisher/Callhome | WDER | 5.32/7.72 | 2.37/4.25 | 44.9–55.5% |
| LLM post-processing ensemble (Efstathiadis et al., 7 Jun 2024) | Fisher/Azure/WhisperX/GCP | deltaCP/SA | – | see Tables 1/2 | up to 57% deltaCP RE |
| AG-LSEC, early fusion (Paturi et al., 25 Jun 2024) | Fisher/RT03-CTS/CHAE | WDER | 2.56–3.45 | 1.56–2.48 | 28–41% |
| SEAL (AC-label + CD) (Kumar et al., 14 Jan 2025) | Fisher/CHAE/RT03-CTS | Δcp | 3.72–5.41 | 2.12–2.23 | 24–43% |
| Contextual beam search (all LLM) (Park et al., 2023) | AMI/CHAE | ΔSA-WER | 5.05–8.45 | 3.24–3.84 | up to 39.8% |
| Modular LLM refinement (Chen et al., 18 Sep 2025) | Clinician–Patient | DER | 23.05 | 16.19 | 29.7% |
| Unified end-to-end speech LLM (Saengthong et al., 26 Jun 2025) | Multilingual | tcpWER | 76.12 | 28.23 | 54.9% |
| SpeakerLM (Yin et al., 8 Aug 2025) | SDR benchmarks | cpCER | 23.20 | 16.05 | −6.60 pts absolute |

Across diverse datasets and languages, LLM-based correction modules deliver relative error reductions ranging from roughly 15% to over 55% compared to strong acoustic baselines. The error types most improved are speaker confusion during turn-taking and around overlaps, as well as "split" and "merge" clustering failures.

5. Error Analysis and Operational Considerations

LLM-based diarization correction addresses systematic errors that are inaccessible to acoustics-only models by leveraging:

  • Lexical and semantic disambiguation: The LLM parses the structure of natural dialog, e.g., assigning response utterances to the correct interlocutor based on prior context or idiomatic question–answer structure.
  • Multi-modal anchoring: When augmented with acoustic scores, LLMs arbitrate between plausible lexical assignments and hard evidence derived from speaker posteriors.
  • Role mapping and identity resolution: Prompted LLMs can resolve “split speaker” errors by mapping multiple cluster labels to the same participant identity if semantic continuity is clear.

However, certain limitations persist: (i) LLMs without acoustic input may hallucinate corrections in lexically ambiguous regions, (ii) finetuning on the output of a specific ASR system can reduce generalizability unless ensembles of experts are employed, and (iii) computational overhead, especially for large-scale inference or per-word correction, can be high.

6. Generalizability, Adaptation, and Applications

  • ASR-Agnostic Correction: Ensembles of LLMs finetuned on outputs from different ASR systems yield robust error reduction on previously unseen ASR domains, outperforming single-expert models (Efstathiadis et al., 7 Jun 2024).
  • Multilingual Extension: Joint models trained on multiple languages (via subword tokenization, data mixing, and language-specific augmentation) achieve substantial improvement in tcpWER and DER across eleven languages and diverse accents (Saengthong et al., 26 Jun 2025).
  • Training-Free Adaptation: Modular prompting architectures provide a route for deployment in clinical, legal, and open-domain conversational settings with little or no retraining—role ontologies and prompt translation address domain specificity (Chen et al., 18 Sep 2025).
  • Human-in-the-Loop Correction: Systems supporting real-time correction by users, coupled with interactive LLM summarization and automatic speaker enrollment refinements, show large reductions in DER, especially in complex or dynamic dialog settings (He et al., 22 Sep 2025).
  • End-to-End Paradigms: Fully end-to-end, diarization-aware LLMs unify transcription and speaker labeling, eliminating separate post-processing and enabling direct optimization of joint ASR+diarization objectives (Yin et al., 8 Aug 2025, Saengthong et al., 26 Jun 2025).

7. Limitations and Future Research Directions

Ongoing challenges for LLM-based diarization correction include:

  • Handling of Overlapped Speech: While state-of-the-art architectures (e.g., SOT-LLMs, SpeakerLM) can separate overlap to a degree, performance may still degrade in heavy multi-speaker crosstalk.
  • Scaling to Many Speakers: Most LLM-correction frameworks are validated on 2-speaker telephony; extension to >4 speakers (as in meetings or courtroom data) requires scalable alignment and tokenization schemes.
  • Latency and Cost: Computational cost of LLM-inference, especially with multi-modal prompts or in streaming settings, remains high. Beam pruning, adapter-based distillation, or quantization may be required for production scenarios.
  • Cross-lingual Generalization: Although multilingual models are emerging, coverage of code-mixed conversations and acoustic-textual correlation in low-resource languages demands further research.
  • Integration of Paralinguistics: Incorporating prosodic features, speaker role cues, and dialog act signals into the correction pipeline is an open direction for improving context-sensitive diarization.

LLM-based speaker diarization correction, through both posthoc and integrated approaches, has thus reshaped diarization accuracy benchmarks and enabled robust, context-aware speaker labeling in real conversations. The landscape is characterized by modularity, strong empirical gains particularly when applied with domain-appropriate training or prompting, and rapid advances in end-to-end and multimodal LLM architectures.
