
Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting (2309.15649v2)

Published 27 Sep 2023 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS

Abstract: We explore the ability of LLMs to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task-activating prompting method that combines causal instructions and demonstrations to increase its context window. Next, we show that rescoring only by in-context learning with frozen LLMs achieves results that are competitive with rescoring by domain-tuned LMs, using a pretrained first-pass recognition system and rescoring output on two out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with fine-tuning, we achieve error rates below the N-best oracle level, showcasing the generalization power of the LLMs.
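
The abstract describes prompting a frozen LLM to correct ASR output from an N-best hypothesis list, either zero-shot or with few-shot in-context examples. The sketch below illustrates how such a prompt might be assembled; the hypotheses are invented ATIS-style examples, and `query_llm` is a hypothetical placeholder for whichever frozen LLM is queried. This is not the paper's actual prompt template or API.

```python
def build_correction_prompt(nbest, examples=None):
    """Wrap N-best ASR hypotheses in an instruction prompt.

    `examples` holds optional (nbest, reference) pairs used as few-shot
    in-context demonstrations; when omitted the prompt is zero-shot.
    """
    lines = [
        "You are an ASR post-processor. Given the N-best hypotheses from a "
        "speech recognizer, output the corrected transcript."
    ]
    # Optional few-shot demonstrations precede the query.
    for ex_nbest, ex_ref in (examples or []):
        lines.append("Hypotheses:")
        lines += [f"{i + 1}. {h}" for i, h in enumerate(ex_nbest)]
        lines.append(f"Corrected transcript: {ex_ref}")
    # The query: hypotheses to be corrected, with the answer left blank.
    lines.append("Hypotheses:")
    lines += [f"{i + 1}. {h}" for i, h in enumerate(nbest)]
    lines.append("Corrected transcript:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented ATIS-style N-best list, for illustration only.
    nbest = [
        "show me flights from boston to dallas fort worth",
        "show me flights from austin to dallas fort worth",
        "show me flights from boston to dallas for worth",
    ]
    prompt = build_correction_prompt(nbest)
    print(prompt)
    # corrected = query_llm(prompt)  # hypothetical call to a frozen LLM
```

Passing a few (nbest, reference) pairs via `examples` turns the same template into the few-shot variant; the task-activating prompting scheme in the paper layers additional causal instructions and demonstrations on top, which this sketch does not reproduce.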

Authors (6)
  1. Chao-Han Huck Yang (89 papers)
  2. Yile Gu (25 papers)
  3. Yi-Chieh Liu (10 papers)
  4. Shalini Ghosh (34 papers)
  5. Ivan Bulyko (23 papers)
  6. Andreas Stolcke (57 papers)
Citations (33)
