Extending Whisper with prompt tuning to target-speaker ASR (2312.08079v2)

Published 13 Dec 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from multi-talker overlapped utterances. Most existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model, which incurs significant training costs and is inapplicable to large foundation models. This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. Variants of prompt tuning approaches along with their configurations are explored and optimized for TS-ASR. Experimental results show that prompt tuning can achieve performance comparable to state-of-the-art full training approaches while requiring only about 1% of task-specific model parameters. Notably, the original Whisper features, such as inverse text normalization and timestamp tagging, are retained in target-speaker ASR, keeping the generated transcriptions natural and informative.
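
The core idea, training a small set of continuous soft prompts while the foundation ASR model stays frozen, can be illustrated with a minimal PyTorch sketch. This is not the paper's exact architecture: the class name SoftPromptTuner, the prompt length of 16, the stand-in Transformer encoder, and the way the target-speaker embedding is prepended alongside the learned prompt are all illustrative assumptions; the paper explores several prompt-tuning variants and configurations on Whisper itself.

```python
import torch
import torch.nn as nn

class SoftPromptTuner(nn.Module):
    """Freeze a pre-trained encoder; train only a small set of soft prompts."""

    def __init__(self, frozen_encoder: nn.Module, d_model: int, prompt_len: int = 16):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the foundation model stays intact
        # Trainable continuous prompt: one d_model vector per prompt position.
        self.soft_prompt = nn.Parameter(0.02 * torch.randn(prompt_len, d_model))

    def forward(self, feats: torch.Tensor, speaker_embed: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, d_model) acoustic features projected to the model width
        # speaker_embed: (batch, d_model), e.g. an x-vector mapped to d_model
        batch = feats.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Condition on the target speaker by prepending its embedding together
        # with the learned prompt (one of several possible prompt designs).
        prefix = torch.cat([speaker_embed.unsqueeze(1), prompt], dim=1)
        return self.encoder(torch.cat([prefix, feats], dim=1))

# Usage with a stand-in Transformer encoder playing the role of the frozen model.
d_model = 256
stand_in = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
tuner = SoftPromptTuner(stand_in, d_model=d_model, prompt_len=16)
out = tuner(torch.randn(2, 100, d_model), torch.randn(2, d_model))
print(out.shape)  # torch.Size([2, 117, 256]): 1 speaker + 16 prompt + 100 frames

trainable = sum(p.numel() for p in tuner.parameters() if p.requires_grad)
total = sum(p.numel() for p in tuner.parameters())
print(f"trainable: {trainable} / {total}")  # only the prompt vectors are updated
```

Because only soft_prompt receives gradients, the task-specific footprint is a tiny fraction of the full model, which is the mechanism behind the roughly 1% figure reported in the abstract.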

Authors (5)
  1. Hao Ma (116 papers)
  2. Zhiyuan Peng (33 papers)
  3. Mingjie Shao (27 papers)
  4. Jing Li (621 papers)
  5. Ju Liu (36 papers)
Citations (9)
