
Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System (2407.09817v2)

Published 13 Jul 2024 in cs.SD, cs.CL, and eess.AS

Abstract: Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach that empowers Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only a three-second enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
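The core idea of step (ii) can be sketched as a nearest-embedding selection: among the per-talker embedding streams produced by the Sidecar separator, pick the one most similar to an embedding of the enrollment speech. The sketch below is an illustration only, not the authors' implementation; the function names, the use of plain cosine similarity, and the flat-vector embeddings are all assumptions.

```python
# Minimal sketch of a Target Talker Identifier: given several separated
# per-talker embedding streams and an enrollment embedding, return the
# index of the stream belonging to the target talker. All names and the
# cosine-similarity criterion are illustrative assumptions.
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identify_target(streams, enrollment):
    """Pick the separated stream closest to the enrollment embedding.

    streams    : list of per-talker embedding vectors (one per talker),
                 standing in for Sidecar-separated encoder embeddings.
    enrollment : embedding vector derived from the ~3 s enrollment cue.
    Returns the index of the most similar stream.
    """
    sims = [cosine_similarity(s, enrollment) for s in streams]
    return max(range(len(streams)), key=lambda i: sims[i])
```

In the paper's actual system the identification operates on Whisper encoder embeddings separated by the Sidecar, and the selected stream is then decoded as the target talker's transcript; this sketch only shows the selection criterion.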

Authors (7)
  1. Lingwei Meng (31 papers)
  2. Jiawen Kang (204 papers)
  3. Yuejiao Wang (10 papers)
  4. Zengrui Jin (30 papers)
  5. Xixin Wu (85 papers)
  6. Xunying Liu (92 papers)
  7. Helen Meng (204 papers)
Citations (2)
