Multi-stage Large Language Model Correction for Speech Recognition (2310.11532v2)
Abstract: In this paper, we investigate the use of large language models (LLMs) to improve the performance of competitive speech recognition systems. Unlike previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that combines uncertainty estimation of ASR outputs with the reasoning capability of LLMs. The proposed approach has two stages: the first performs ASR uncertainty estimation, exploiting N-best list hypotheses to identify less reliable transcriptions; the second applies LLM-based correction to these identified transcriptions. The correction task is formulated as a multi-step, rule-based LLM reasoning process, in which explicitly written rules in the prompt decompose the task into concrete reasoning steps. Our experimental results demonstrate the effectiveness of the proposed method, showing 10-20% relative WER improvement over competitive ASR systems across multiple test domains and in zero-shot settings.
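The abstract does not give implementation details, so the following is only a minimal Python sketch of the two-stage idea under stated assumptions: the uncertainty score is taken here as the average pairwise word-level disagreement among N-best hypotheses, and the threshold value, the exact prompt rules, and the `query_llm` helper are hypothetical placeholders rather than the paper's actual method.

```python
# Sketch of the two-stage pipeline described in the abstract.
# Assumptions (not from the paper): pairwise N-best disagreement as the
# uncertainty proxy, the 0.15 threshold, the prompt wording, and `query_llm`.

from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List


def pairwise_disagreement(hyp_a: str, hyp_b: str) -> float:
    """Word-level disagreement between two hypotheses (1 - similarity ratio)."""
    matcher = SequenceMatcher(None, hyp_a.split(), hyp_b.split())
    return 1.0 - matcher.ratio()


def nbest_uncertainty(nbest: List[str]) -> float:
    """Stage 1: score an utterance by how much its N-best hypotheses disagree."""
    if len(nbest) < 2:
        return 0.0
    pairs = list(combinations(nbest, 2))
    return sum(pairwise_disagreement(a, b) for a, b in pairs) / len(pairs)


def build_correction_prompt(nbest: List[str]) -> str:
    """Stage 2: a rule-based prompt that decomposes correction into explicit steps."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "You are correcting ASR output. Follow these rules step by step:\n"
        "Rule 1: Compare the hypotheses and list the words where they differ.\n"
        "Rule 2: For each differing word, choose the variant most plausible in context.\n"
        "Rule 3: Output only the corrected transcription, nothing else.\n"
        f"Hypotheses:\n{hyps}\n"
        "Corrected transcription:"
    )


def correct_if_unreliable(
    nbest: List[str], query_llm: Callable[[str], str], threshold: float = 0.15
) -> str:
    """Run the LLM correction only when the N-best list signals low reliability."""
    if nbest_uncertainty(nbest) < threshold:
        return nbest[0]  # reliable enough: keep the top ASR hypothesis
    return query_llm(build_correction_prompt(nbest))


if __name__ == "__main__":
    nbest = [
        "the whether today is nice",
        "the weather today is nice",
        "the weather to day is nice",
    ]
    # `query_llm` stands in for any chat/completion API call.
    fake_llm = lambda prompt: "the weather today is nice"
    print(correct_if_unreliable(nbest, fake_llm))
```

In this sketch, utterances whose N-best hypotheses largely agree are left untouched, which keeps the LLM in the loop only where the ASR system itself signals uncertainty; the actual uncertainty measure and prompt rules used in the paper may differ.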
- Jie Pu
- Thai-Son Nguyen
- Sebastian Stüker