
Towards interfacing large language models with ASR systems using confidence measures and prompting (2407.21414v1)

Published 31 Jul 2024 in eess.AS and cs.CL

Abstract: As LLMs grow in parameter size and capabilities, such as interaction through prompting, they open up new ways of interfacing with automatic speech recognition (ASR) systems beyond rescoring n-best lists. This work investigates post-hoc correction of ASR transcripts with LLMs. To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
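The core idea of the abstract (send only ASR output that is likely wrong to the LLM, and leave high-confidence text untouched) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the word/confidence pairs are made up, the threshold value is arbitrary, and `llm_correct` is a hypothetical stand-in for an actual LLM prompt call (here it just echoes its input).

```python
def select_for_correction(words, threshold=0.8):
    """Return indices of words whose ASR confidence falls below threshold.

    `words` is a list of (word, confidence) pairs, as produced by an ASR
    system with word-level confidence scores.
    """
    return [i for i, (_, conf) in enumerate(words) if conf < threshold]


def correct_transcript(words, threshold=0.8, llm_correct=lambda w: w):
    """Apply (mock) LLM correction only to low-confidence words.

    High-confidence words are passed through unchanged, which is the
    point of confidence-based filtering: avoid introducing new errors
    into transcript regions that are probably already correct.
    """
    low = set(select_for_correction(words, threshold))
    return [llm_correct(w) if i in low else w
            for i, (w, _) in enumerate(words)]


# Toy hypothesis with one low-confidence (likely misrecognized) word.
hyp = [("the", 0.98), ("quick", 0.95), ("braun", 0.42), ("fox", 0.91)]
print(select_for_correction(hyp))  # only index 2 falls below 0.8
```

In practice the filtering can operate at the word, segment, or utterance level, and the selected spans would be placed into an LLM prompt together with surrounding context; the sketch above only captures the selection step.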

Authors (5)
  1. Maryam Naderi
  2. Enno Hermann
  3. Alexandre Nanchen
  4. Sevada Hovsepyan
  5. Mathew Magimai.-Doss