
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models (2401.03506v9)

Published 7 Jan 2024 in eess.AS, cs.LG, and cs.SD

Abstract: In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLM) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.

DiarizationLM: Speaker Diarization Post-Processing with LLMs

The paper introduces a novel framework, DiarizationLM, leveraging LLMs for post-processing speaker diarization outputs. This approach aims to enhance the readability of diarized transcripts and significantly reduce the Word Diarization Error Rate (WDER).

DiarizationLM takes the outputs of the Automatic Speech Recognition (ASR) and speaker diarization systems, converts them into a compact textual format, and passes that text to an optionally finetuned LLM for refinement. The prompt may carry additional information alongside the transcript, and the LLM's output serves as the improved diarization result. Because this is a post-processing step, the underlying ASR and speaker diarization components need no retraining, which makes the framework easy to add to existing systems.
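To make the compact textual format concrete, here is a minimal Python sketch of how word-level ASR output and speaker labels might be serialized into prompts. The `<spk:N>` token format, the ` --> ` suffix, the function names, and the word-count segmentation are all assumptions made for illustration; they are not taken from the paper, which may format and segment its prompts differently.

```python
from typing import Iterator, List, Tuple

# Hypothetical speaker-token format, chosen for illustration only.
SPEAKER_TOKEN = "<spk:{}>"


def to_compact_text(words: List[str], speakers: List[int]) -> str:
    """Serialize words, emitting a speaker token only where the speaker changes."""
    if len(words) != len(speakers):
        raise ValueError("words and speakers must have the same length")
    parts: List[str] = []
    previous = None
    for word, speaker in zip(words, speakers):
        if speaker != previous:
            parts.append(SPEAKER_TOKEN.format(speaker))
            previous = speaker
        parts.append(word)
    return " ".join(parts)


def segment(
    words: List[str], speakers: List[int], max_words: int
) -> Iterator[Tuple[List[str], List[int]]]:
    """Split a long transcript into chunks of at most max_words words.

    Splitting by word count is a simplification; a real system would
    more likely split by the LLM's token budget.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    for start in range(0, len(words), max_words):
        yield words[start:start + max_words], speakers[start:start + max_words]


def build_prompts(
    words: List[str],
    speakers: List[int],
    max_words: int = 200,
    prefix: str = "",
    suffix: str = " --> ",
) -> List[str]:
    """Wrap each segment with an instruction prefix and a suffix (both assumed)."""
    return [
        prefix + to_compact_text(w, s) + suffix
        for w, s in segment(words, speakers, max_words)
    ]


if __name__ == "__main__":
    asr_words = ["hello", "how", "are", "you", "good", "thanks"]
    hyp_speakers = [1, 1, 1, 2, 2, 2]  # "you" is attributed to the wrong speaker
    for prompt in build_prompts(asr_words, hyp_speakers, max_words=4):
        print(repr(prompt))
```

Emitting a token only at speaker changes keeps the prompt short, which matters when the transcript has to fit within the LLM's input limit.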

Key Findings

The framework was evaluated with a finetuned PaLM 2-S model on the Fisher telephone conversation corpus and the Callhome English dataset. The finetuned model reduced WDER by a relative 55.5% on Fisher and a relative 44.9% on Callhome.
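The reported numbers are relative reductions, not absolute percentage-point drops. The Python sketch below shows the arithmetic. The `wder` function follows the definition as I understand it from the joint ASR and diarization literature (wrongly attributed words divided by all aligned words); treat it as an assumption rather than the paper's exact formula. The counts are invented for the example and are not results from the paper.

```python
def wder(s_is: int, c_is: int, s: int, c: int) -> float:
    """Word diarization error rate (assumed definition).

    s_is: ASR substitutions carrying the wrong speaker label
    c_is: correctly recognized words carrying the wrong speaker label
    s:    all ASR substitutions
    c:    all correctly recognized words
    """
    total = s + c
    if total == 0:
        raise ValueError("no aligned words to score")
    return (s_is + c_is) / total


def relative_reduction(baseline: float, refined: float) -> float:
    """Relative error reduction in percent: (baseline - refined) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - refined) / baseline * 100.0


if __name__ == "__main__":
    # Invented counts for illustration only.
    before = wder(s_is=20, c_is=80, s=100, c=900)
    after = wder(s_is=10, c_is=35, s=100, c=900)
    print(f"WDER before: {before:.3f}, after: {after:.3f}")
    print(f"relative reduction: {relative_reduction(before, after):.1f}%")
```

With these invented counts the absolute drop is about 5.5 percentage points, while the relative reduction is roughly 55 percent.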

Methodological Insights

  1. Prompt Construction: The framework constructs prompts by segmenting the diarization outputs into text with embedded speaker tokens. This is then fed into an LLM alongside an instruction prefix and optional suffix or contextual hints.
  2. Completion Parsing: After the LLM generates its completion, the text is parsed back into word and speaker sequences. A Transcript-Preserving Speaker Transfer (TPST) algorithm then transfers the speaker labels from the completion onto the original ASR transcript, so the original words are preserved even if the LLM changed them.
  3. LLM Finetuning: Three data preparation flavors were explored for finetuning the LLM:
    • Hypothesis-to-oracle (hyp2ora)
    • Degraded-to-reference (deg2ref)
    • A mixed approach combining both.

Remarkably, the hyp2ora flavor yielded the most significant error reduction.

  4. Experimental Validation: The LLM was tested in zero-shot, one-shot, and finetuned settings. The finetuned DiarizationLM model outperformed the zero-shot and one-shot LLMs, which suffered from high error rates, indicating that task-specific finetuning is necessary for this task.
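The completion-parsing step can be sketched in Python as follows. This is a simplified stand-in for TPST, not the paper's algorithm: it aligns the two word sequences with `difflib.SequenceMatcher` from the standard library and assumes the LLM keeps speaker IDs consistent with the input. The `<spk:N>` token format and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical speaker-token format, chosen for illustration only.
SPEAKER_TOKEN = re.compile(r"<spk:(\d+)>")


def parse_completion(
    text: str, default_speaker: int = 1
) -> Tuple[List[str], List[int]]:
    """Turn an LLM completion back into parallel word and speaker sequences."""
    words: List[str] = []
    speakers: List[int] = []
    current = default_speaker
    for token in text.split():
        match = SPEAKER_TOKEN.fullmatch(token)
        if match:
            current = int(match.group(1))
        else:
            words.append(token)
            speakers.append(current)
    return words, speakers


def transfer_speakers(
    src_words: List[str],
    src_speakers: List[int],
    tgt_words: List[str],
    tgt_speakers: List[int],
) -> List[int]:
    """Copy speaker labels from source (LLM) words onto target (ASR) words.

    The target words are never modified. Target words with no aligned
    source word keep their original speaker label.
    """
    result = list(tgt_speakers)
    matcher = SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            result[block.b + offset] = src_speakers[block.a + offset]
    return result


if __name__ == "__main__":
    asr_words = ["hello", "how", "are", "you", "good", "thanks"]
    hyp_speakers = [1, 1, 1, 2, 2, 2]
    # Imagined completion in which the LLM also dropped the word "are".
    completion = "<spk:1> hello how you <spk:2> good thanks"
    llm_words, llm_speakers = parse_completion(completion)
    print(transfer_speakers(llm_words, llm_speakers, asr_words, hyp_speakers))
```

Because labels are written onto a copy of the original label sequence, words the LLM inserted, deleted, or rewrote never leak into the final transcript; only the speaker attribution changes.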

Implications and Future Directions

The findings show that semantic information supplied by an LLM can improve speaker diarization results. Possible extensions include domains beyond telephone conversations and evaluation on multilingual data.

Additional exploration into other capabilities of LLMs, such as autofilling speaker roles or integrating semantic context for improved orchestration, presents intriguing avenues for further research. The framework's adaptability to different ASR and speaker diarization systems also opens prospects for broader applications in dynamic environments.

In summary, DiarizationLM presents a compelling case for the integration of LLMs in speaker diarization processes, offering substantial improvements in accuracy and usability. As LLM technologies progress, their application in such domains is poised to expand, driving both theoretical advancements and practical solutions in AI-driven communication technologies.

Authors (6)
  1. Quan Wang (130 papers)
  2. Yiling Huang (16 papers)
  3. Guanlong Zhao (10 papers)
  4. Evan Clark (2 papers)
  5. Wei Xia (147 papers)
  6. Hank Liao (13 papers)