Extending Whisper with prompt tuning to target-speaker ASR (2312.08079v2)
Abstract: Target-speaker automatic speech recognition (TS-ASR) aims to transcribe the speech of a desired speaker from multi-talker overlapped utterances. Most existing TS-ASR methods either train a model from scratch or fully fine-tune a pre-trained one, incurring significant training costs and becoming impractical for large foundation models. This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. Variants of prompt tuning and their configurations are explored and optimized for TS-ASR. Experimental results show that prompt tuning can achieve performance comparable to state-of-the-art full-training approaches while requiring only about 1% of task-specific model parameters. Notably, Whisper's original features, such as inverse text normalization and timestamp tagging, are retained in target-speaker ASR, keeping the generated transcriptions natural and informative.
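To make the parameter-efficiency claim concrete, below is a minimal sketch of prompt tuning on top of a frozen Whisper backbone, written against the Hugging Face `transformers` implementation. The prompt length, the speaker-embedding dimension, and the injection point (soft prompts plus a projected target-speaker embedding prepended to the decoder input embeddings) are illustrative assumptions; the paper itself explores several prompt-tuning variants, including deep prompting, so this should not be read as its exact configuration.

```python
# Sketch: prompt tuning a frozen Whisper model for target-speaker ASR.
# Assumes the Hugging Face `transformers` Whisper implementation; the prompt
# length, speaker-embedding dimension, and injection point are assumptions
# for illustration, not the paper's reported setup.
import torch
import torch.nn as nn
from transformers import WhisperForConditionalGeneration


class PromptTunedWhisper(nn.Module):
    def __init__(self, model_name="openai/whisper-small",
                 prompt_len=16, spk_dim=512):
        super().__init__()
        self.whisper = WhisperForConditionalGeneration.from_pretrained(model_name)
        for p in self.whisper.parameters():  # freeze the entire backbone
            p.requires_grad = False
        d_model = self.whisper.config.d_model
        # Trainable soft prompts prepended to the decoder input embeddings.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        # Projects a target-speaker embedding (e.g., an x-vector) into the
        # model dimension so it can be injected as one extra prompt token.
        self.spk_proj = nn.Linear(spk_dim, d_model)

    def forward(self, input_features, decoder_input_ids, labels, spk_emb):
        # decoder_input_ids: teacher-forcing inputs (targets shifted right);
        # labels: target tokens, with padding positions set to -100.
        embed = self.whisper.get_input_embeddings()  # decoder token embeddings
        tok = embed(decoder_input_ids)               # (B, T, d_model)
        batch = tok.size(0)
        prompts = torch.cat(
            [self.spk_proj(spk_emb).unsqueeze(1),                 # (B, 1, d)
             self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)],  # (B, P, d)
            dim=1)
        dec_in = torch.cat([prompts, tok], dim=1)
        out = self.whisper(input_features=input_features,
                           decoder_inputs_embeds=dec_in)
        # Drop logits at prompt positions so predictions align with labels.
        logits = out.logits[:, prompts.size(1):, :]
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
            ignore_index=-100)
```

Only `soft_prompt` and `spk_proj` receive gradients (checkable via `sum(p.numel() for p in model.parameters() if p.requires_grad)`), a small fraction of the frozen backbone, which is what makes the approach in the abstract tractable for large foundation models.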
Authors: Hao Ma, Zhiyuan Peng, Mingjie Shao, Jing Li, Ju Liu