DiarizationLM: Speaker Diarization Post-Processing with Large Language Models (2401.03506v9)

Published 7 Jan 2024 in eess.AS, cs.LG, and cs.SD

Abstract: In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLM) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.

DiarizationLM: Speaker Diarization Post-Processing with LLMs

The paper introduces DiarizationLM, a framework that uses large language models (LLMs) to post-process the output of a speaker diarization system. The goals include improving the readability of diarized transcripts and reducing the word diarization error rate (WDER).

DiarizationLM takes the outputs of the automatic speech recognition (ASR) and speaker diarization systems, converts them into a compact textual format, and includes that text in the prompt to an optionally finetuned LLM. The LLM's output is then used as the refined diarization result. Because this is a post-processing step, the underlying ASR and speaker diarization components need no retraining, so the framework can be added to off-the-shelf systems.
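The compact textual format can be illustrated with a minimal sketch. It assumes a `<speaker:k>` token convention and that the ASR words and per-word speaker labels arrive as two parallel lists; the function names `to_diarized_text` and `build_prompt`, and the `prefix` and `suffix` parameters, are illustrative assumptions rather than the paper's own code.

```python
from typing import List


def to_diarized_text(words: List[str], speakers: List[int]) -> str:
    """Serialize parallel word/speaker lists into a single string.

    A speaker token is emitted only when the speaker changes, which keeps
    the text compact. The "<speaker:k>" format is an assumption.
    """
    if len(words) != len(speakers):
        raise ValueError("words and speakers must have the same length")
    parts = []
    previous = None
    for word, speaker in zip(words, speakers):
        if speaker != previous:
            parts.append(f"<speaker:{speaker}>")
            previous = speaker
        parts.append(word)
    return " ".join(parts)


def build_prompt(diarized_text: str, prefix: str = "", suffix: str = "") -> str:
    """Wrap the serialized transcript with an optional prefix and suffix.

    Both parameters are hypothetical placeholders for instructions or hints.
    """
    return f"{prefix}{diarized_text}{suffix}"


if __name__ == "__main__":
    text = to_diarized_text(
        ["hello", "how", "are", "you", "good"],
        [1, 1, 1, 1, 2],
    )
    print(build_prompt(text, prefix="Fix the speaker labels: ", suffix=" -->"))
```

Emitting a token only at speaker changes keeps the prompt short, which matters when the transcript has to share the LLM's input budget with instructions.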

Key Findings

The framework was evaluated with a finetuned PaLM 2-S model on two datasets: the Fisher telephone conversation corpus and the Callhome English dataset. The finetuned model reduced WDER by a relative 55.5% on Fisher and a relative 44.9% on Callhome.
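To make the metric concrete, here is a simplified sketch of WDER and of a relative reduction. It assumes the hypothesis and reference words have already been aligned, that inserted and deleted words have been dropped, and that hypothesis speaker labels have already been mapped onto the reference labels. The function names and the numbers in the usage lines are illustrative and are not results from the paper.

```python
from typing import Sequence, Tuple


def wder(aligned_speakers: Sequence[Tuple[int, int]]) -> float:
    """Fraction of aligned words that carry the wrong speaker label.

    Each pair is (reference_speaker, hypothesis_speaker) for one word that
    the ASR system got correct or substituted.
    """
    if not aligned_speakers:
        raise ValueError("need at least one aligned word")
    wrong = sum(1 for ref, hyp in aligned_speakers if ref != hyp)
    return wrong / len(aligned_speakers)


def relative_reduction(baseline: float, refined: float) -> float:
    """Relative error reduction: (baseline - refined) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - refined) / baseline


if __name__ == "__main__":
    # Made-up toy alignment: one of four words has the wrong speaker.
    print(wder([(1, 1), (1, 2), (2, 2), (2, 2)]))
    # Made-up error rates, in percent, to show the arithmetic only.
    print(relative_reduction(4.0, 2.0))
```

A relative reduction of 55.5% therefore means the refined WDER is 44.5% of the baseline WDER, whatever the absolute values are.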

Methodological Insights

Prompt Construction: The ASR and diarization outputs are serialized as text with embedded speaker tokens and split into segments. Each segment is fed to the LLM together with an instruction prefix and an optional suffix or contextual hints.
Completion Parsing: After the LLM responds, the completion text is parsed back into word and speaker sequences. A Transcript-Preserving Speaker Transfer (TPST) algorithm then transfers the speaker labels onto the original ASR words, so that labels are applied consistently and the original ASR transcript is preserved.
LLM Finetuning: Three data preparation flavors were explored for finetuning the LLM:
- Hypothesis-to-oracle (hyp2ora)
- Degraded-to-reference (deg2ref)
- A mixed approach combining both.

Of the three, the hyp2ora flavor yielded the largest error reduction.
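The completion parsing step described above can be sketched as follows. This is a simplified speaker transfer in the spirit of TPST, not the paper's algorithm: it assumes a `<speaker:k>` token format, uses Python's `difflib.SequenceMatcher` in place of whatever alignment the authors use, and omits any remapping of speaker labels between the two sequences. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

SPEAKER_TOKEN = re.compile(r"<speaker:(\d+)>")  # assumed token format


def parse_diarized_text(text: str, default_speaker: int = 1) -> Tuple[List[str], List[int]]:
    """Split a completion into parallel word and speaker lists."""
    words, speakers = [], []
    current = default_speaker
    for token in text.split():
        match = SPEAKER_TOKEN.fullmatch(token)
        if match:
            current = int(match.group(1))  # speaker change, not a word
        else:
            words.append(token)
            speakers.append(current)
    return words, speakers


def transfer_speakers(
    src_words: List[str],
    src_speakers: List[int],
    tgt_words: List[str],
    tgt_speakers: List[int],
) -> List[int]:
    """Copy speakers from source to target wherever the words match.

    Target words are never modified. Target words with no matching source
    word keep their original speaker label.
    """
    result = list(tgt_speakers)
    matcher = SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result[j1:j2] = src_speakers[i1:i2]
    return result


if __name__ == "__main__":
    asr_words = "good morning how are you doing fine thanks".split()
    asr_speakers = [1, 1, 1, 1, 1, 2, 2, 2]
    completion = "<speaker:1> good morning how are you doing <speaker:2> fine thank you"
    llm_words, llm_speakers = parse_diarized_text(completion)
    print(transfer_speakers(llm_words, llm_speakers, asr_words, asr_speakers))
```

In the usage example the LLM rewrote "thanks" as "thank you"; the transfer still moves "doing" to speaker 1 while leaving the ASR words untouched, which is the property the transcript-preserving step is meant to guarantee.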

Experimental Validation: Experiments covered zero-shot, one-shot, and finetuned LLM setups. The finetuned DiarizationLM model outperformed the zero-shot and one-shot LLMs, both of which had high error rates, indicating that task-specific finetuning is needed for this task.
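The difference between the zero-shot and one-shot conditions lies in the prompt: a one-shot prompt prepends a single worked example before the actual input. The sketch below shows one way to build both. The `Input:`/`Output:` template, the function names, and the example strings are assumptions for illustration, not the prompts used in the paper.

```python
def zero_shot_prompt(instruction: str, diarized_text: str) -> str:
    """Instruction followed directly by the transcript to correct."""
    return f"{instruction}\n\nInput: {diarized_text}\nOutput:"


def one_shot_prompt(
    instruction: str,
    example_input: str,
    example_output: str,
    diarized_text: str,
) -> str:
    """Same as zero-shot, with one solved example placed before the input."""
    return (
        f"{instruction}\n\n"
        f"Input: {example_input}\nOutput: {example_output}\n\n"
        f"Input: {diarized_text}\nOutput:"
    )


if __name__ == "__main__":
    instruction = "Correct the speaker labels in the transcript."
    print(one_shot_prompt(
        instruction,
        "<speaker:1> hi how are <speaker:2> you fine",
        "<speaker:1> hi how are you <speaker:2> fine",
        "<speaker:1> good morning <speaker:2> morning",
    ))
```

A finetuned model needs neither the instruction nor the example at inference time, since the task is learned from the training pairs.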

Implications and Future Directions

The results indicate that the semantic information captured by an LLM can be used to correct speaker diarization output. The paper identifies extensions such as covering domains beyond telephone conversations and evaluating on multilingual data.

Other LLM capabilities, such as automatically filling in speaker roles or making further use of semantic context, are suggested as directions for further research. Because the framework is a post-processing step, it can be paired with different ASR and speaker diarization systems, which broadens where it can be applied.

In summary, DiarizationLM makes a case for using LLMs in the speaker diarization pipeline: it improves diarization accuracy, as measured by WDER, and transcript readability without changes to the existing systems.

Authors (6)

Quan Wang (130 papers)
Yiling Huang (16 papers)
Guanlong Zhao (10 papers)
Evan Clark (2 papers)
Wei Xia (147 papers)
Hank Liao (13 papers)

GitHub

GitHub - google/speaker-id: This repository contains audio samples and supplementary materials accompanying publications by the "Speaker, Voice and Language" team at Google. (421 stars)
