
Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models (2405.10025v1)

Published 16 May 2024 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS

Abstract: Recent advances in LLMs have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content; 2) N-best hypotheses usually differ in only a few tokens, making it redundant to send all of them for GER, which can confuse the LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) that receives the source speech as extra input to improve the fidelity of the correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
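The cloze reformulation rests on the observation that N-best hypotheses share most of their tokens and disagree at only a few positions. A minimal sketch of that idea, under the simplifying assumption that all hypotheses tokenize to the same length (the paper's actual pipeline would need an alignment step, and the `[BLANK]` marker and function name here are illustrative, not from the paper):

```python
def build_cloze(hypotheses):
    """Turn N-best ASR hypotheses into a cloze template plus per-blank options.

    Illustrative sketch only: assumes whitespace tokenization and that all
    hypotheses have the same number of tokens. Positions where all hypotheses
    agree stay as-is; disagreeing positions become numbered blanks, each with
    its candidate tokens (deduplicated, first occurrence kept).
    """
    template, options = [], []
    for tokens in zip(*[h.split() for h in hypotheses]):
        uniq = list(dict.fromkeys(tokens))  # dedupe, preserve order
        if len(uniq) == 1:
            template.append(uniq[0])
        else:
            options.append(uniq)
            template.append(f"[BLANK{len(options)}]")
    return " ".join(template), options


hyps = [
    "i want to fly from boston",
    "i want to fly form boston",
    "i want too fly from boston",
]
template, opts = build_cloze(hyps)
# template keeps the shared skeleton; opts lists candidates per blank,
# which is what the LLM would then choose among instead of rewriting
# the full redundant N-best list.
```

The LLM's task then shrinks from regenerating whole sentences to selecting one option per blank, which is where the paper's logits calibration over the answer choices would apply.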

Authors (6)
  1. Yuchen Hu (60 papers)
  2. Chen Chen (752 papers)
  3. Chengwei Qin (28 papers)
  4. Qiushi Zhu (11 papers)
  5. Eng Siong Chng (112 papers)
  6. Ruizhe Li (40 papers)
Citations (3)