Towards interfacing large language models with ASR systems using confidence measures and prompting (2407.21414v1)
Published 31 Jul 2024 in eess.AS and cs.CL
Abstract: As LLMs grow in parameter count and gain capabilities such as interaction through prompting, they open up new ways of interfacing with automatic speech recognition (ASR) systems beyond rescoring n-best lists. This work investigates post-hoc correction of ASR transcripts with LLMs. To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
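The confidence-based filtering described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact method: the threshold value, the `correct_with_llm` callable, and the use of a mean word-level confidence per utterance are all illustrative assumptions.

```python
# Hedged sketch of confidence-based filtering for LLM post-hoc ASR correction.
# Only low-confidence hypotheses are sent to the LLM corrector; likely accurate
# transcripts are kept unchanged so the LLM cannot introduce new errors.
# `correct_with_llm` and the threshold are hypothetical, not from the paper.

def filter_and_correct(hypotheses, correct_with_llm, threshold=0.9):
    """Correct only transcripts whose confidence falls below `threshold`.

    hypotheses: list of (transcript, confidence) pairs, where confidence is
    e.g. a mean word-level score from the ASR decoder, in [0, 1].
    """
    corrected = []
    for transcript, confidence in hypotheses:
        if confidence >= threshold:
            # Likely accurate: pass through unchanged.
            corrected.append(transcript)
        else:
            # Low confidence: delegate to the LLM-based corrector.
            corrected.append(correct_with_llm(transcript))
    return corrected

# Example with a trivial stand-in "corrector" that upper-cases its input:
hyps = [("hello world", 0.97), ("helo wrld", 0.42)]
print(filter_and_correct(hyps, correct_with_llm=str.upper))
# → ['hello world', 'HELO WRLD']
```

In a real system, `correct_with_llm` would wrap a prompted LLM call; the design choice is that the filter, not the LLM, decides which utterances are touched at all.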
- Maryam Naderi
- Enno Hermann
- Alexandre Nanchen
- Sevada Hovsepyan
- Mathew Magimai-Doss