Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting (2309.15649v2)
Abstract: We explore the ability of LLMs to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task-activating prompting method that combines causal instructions and demonstrations to increase its context windows. Next, we show that rescoring only by in-context learning with frozen LLMs achieves results that are competitive with rescoring by domain-tuned LMs, using a pretrained first-pass recognition system and rescoring output on two out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with fine-tuning we achieve error rates below the N-best oracle level, showcasing the generalization power of the LLMs.
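The abstract describes the prompting setup only at a high level. As a purely illustrative sketch (not the paper's implementation), the Python snippet below shows one plausible way to assemble a zero- or few-shot error-correction prompt over an ASR N-best list for a frozen LLM; the `llm_generate` callable, the prompt wording, and the demonstration format are all hypothetical placeholders.

```python
# Minimal sketch of few-shot prompting for ASR N-best error correction.
# `llm_generate` is a hypothetical stand-in for any frozen-LLM text-completion call;
# no fine-tuning of the LLM is assumed.

from typing import Callable, List, Sequence, Tuple


def build_correction_prompt(
    demos: Sequence[Tuple[List[str], str]],  # (N-best hypotheses, reference transcript) pairs
    nbest: List[str],                        # hypotheses for the utterance to be corrected
) -> str:
    """Compose instruction + in-context demonstrations + query into one prompt string."""
    lines = [
        "You are an ASR post-processor. Given the N-best hypotheses of a speech",
        "recognizer, output the most likely correct transcript.",
        "",
    ]
    # Few-shot demonstrations; pass demos=[] for the zero-shot setting.
    for hyps, ref in demos:
        lines += [f"hypothesis {i + 1}: {h}" for i, h in enumerate(hyps)]
        lines += [f"corrected transcript: {ref}", ""]
    # Query utterance whose transcript the LLM should produce.
    lines += [f"hypothesis {i + 1}: {h}" for i, h in enumerate(nbest)]
    lines.append("corrected transcript:")
    return "\n".join(lines)


def correct_utterance(
    llm_generate: Callable[[str], str],
    demos: Sequence[Tuple[List[str], str]],
    nbest: List[str],
) -> str:
    """Run one utterance through the frozen LLM and return its corrected transcript."""
    prompt = build_correction_prompt(demos, nbest)
    return llm_generate(prompt).strip()
```

In the task-activating variant, the fixed instruction block above would be replaced by a short preparatory exchange that elicits the model's knowledge of the task before the demonstrations and query are appended; the exact dialogue is described in the paper and is not reproduced here.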
Authors: Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke