DiarizationLM: Speaker Diarization Post-Processing with LLMs
The paper introduces DiarizationLM, a framework that leverages large language models (LLMs) to post-process the outputs of speaker diarization systems. The approach serves two goals: improving the readability of diarized transcripts and reducing the Word Diarization Error Rate (WDER).
DiarizationLM takes the outputs of off-the-shelf automatic speech recognition (ASR) and speaker diarization systems, serializes them into a compact textual format, and passes that text, optionally framed by an instruction prefix and suffix, to a finetuned LLM. The LLM's completion is then parsed back into an improved diarization result. Because no retraining of the underlying ASR or speaker diarization components is required, the framework can be attached to existing systems with little integration effort.
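To make the data flow concrete, here is a minimal sketch of the serialization step, assuming per-word speaker labels from the diarization system. The `<spk:N>` token style follows the paper; the function name and details are illustrative, not the paper's reference implementation.

```python
def create_diarized_text(words, speaker_labels):
    """Serialize ASR words and per-word speaker labels into compact text.

    Emits a speaker token such as "<spk:1>" only when the speaker changes,
    e.g. "<spk:1> hello how are you <spk:2> i am fine".
    """
    tokens = []
    previous = None
    for word, speaker in zip(words, speaker_labels):
        if speaker != previous:
            tokens.append(f"<spk:{speaker}>")
            previous = speaker
        tokens.append(word)
    return " ".join(tokens)


words = ["hello", "how", "are", "you", "i", "am", "fine"]
speakers = ["1", "1", "1", "1", "2", "2", "2"]
print(create_diarized_text(words, speakers))
# <spk:1> hello how are you <spk:2> i am fine
```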
Key Findings
The framework was evaluated with a finetuned PaLM 2-S model on the Fisher telephone conversation corpus and the Callhome English dataset, yielding large relative WDER reductions: 55.5% on Fisher and 44.9% on Callhome.
Methodological Insights
- Prompt Construction: The diarized transcript is serialized into text with embedded speaker tokens (as in the sketch above) and segmented so that each prompt fits within the LLM's input length limit; each segment is framed by an instruction prefix and an optional suffix or contextual hints.
- Completion Parsing: After the LLM responds, the completion is parsed back into word and speaker sequences, and a Transcript-Preserving Speaker Transfer (TPST) algorithm transfers the refined speaker labels onto the original ASR words, so the recognized transcript itself is never altered (see the alignment sketch after this list).
- LLM Finetuning: Three data preparation flavors were explored for finetuning the LLM:
  - Hypothesis-to-oracle (hyp2ora)
  - Degraded-to-reference (deg2ref)
  - A mixed flavor combining both

  Notably, the hyp2ora flavor yielded the largest error reduction (a sketch of assembling such a training pair follows this list).
- Experimental Validation: The approach was tested under zero-shot, one-shot, and fully finetuned conditions. The finetuned DiarizationLM model clearly outperformed the zero-shot and one-shot LLMs, both of which produced high error rates, underscoring that task-specific finetuning is necessary for this task.
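The paper defines TPST in terms of an edit-distance alignment between source and target word sequences. The sketch below is a rough approximation rather than the paper's algorithm: it uses Python's `difflib.SequenceMatcher` for the alignment and fills unaligned words from the nearest preceding label.

```python
import difflib


def transcript_preserving_speaker_transfer(src_words, src_speakers, tgt_words):
    """Transfer per-word speaker labels from a source transcript onto a
    target transcript, leaving the target words untouched.

    Words matched by the alignment inherit the label of their source
    counterpart; unmatched target words (insertions/substitutions) fall
    back to the nearest preceding label.
    """
    labels = [None] * len(tgt_words)
    matcher = difflib.SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            labels[block.b + offset] = src_speakers[block.a + offset]
    # Fill the gaps left by unmatched words.
    last = src_speakers[0] if src_speakers else "1"
    for i, label in enumerate(labels):
        if label is None:
            labels[i] = last
        else:
            last = label
    return labels
```

In completion parsing, the source is the parsed LLM completion and the target is the original ASR hypothesis, so the refined labels are applied without touching the recognized words.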
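Under the same assumptions, and reusing `create_diarized_text` and `transcript_preserving_speaker_transfer` from the sketches above, a hyp2ora training pair might be assembled as follows. The prefix and suffix strings here are placeholders, not the exact strings used in the paper.

```python
PROMPT_PREFIX = "Refine the speaker labels in the transcript below.\n"  # placeholder wording
PROMPT_SUFFIX = " --> "  # placeholder; the paper uses a short suffix to mark the prompt's end


def build_hyp2ora_example(hyp_words, hyp_speakers, ref_words, ref_speakers):
    """Assemble one (prompt, completion) finetuning pair in the hyp2ora flavor.

    The prompt carries the hypothesis words with the hypothesis speaker
    labels; the completion carries the same hypothesis words with oracle
    labels transferred from the reference transcript via TPST.
    """
    oracle_speakers = transcript_preserving_speaker_transfer(
        ref_words, ref_speakers, hyp_words)
    prompt = (PROMPT_PREFIX
              + create_diarized_text(hyp_words, hyp_speakers)
              + PROMPT_SUFFIX)
    completion = create_diarized_text(hyp_words, oracle_speakers)
    return prompt, completion
```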
Implications and Future Directions
The findings underscore the value of the semantic information LLMs bring to refining speaker diarization results. The authors suggest natural extensions, such as handling domains beyond telephone conversations and evaluating performance multilingually.
Further exploration of other LLM capabilities, such as automatically inferring speaker roles or exploiting wider semantic context, offers intriguing avenues for future research. Because the framework is agnostic to the underlying ASR and speaker diarization systems, it also holds promise for broader applications in dynamic environments.
In summary, DiarizationLM makes a compelling case for integrating LLMs into the speaker diarization pipeline, delivering substantial gains in both accuracy and readability. As LLM technology matures, such applications are likely to expand, driving both research advances and practical deployments in AI-driven communication.