
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition (2401.10446v1)

Published 19 Jan 2024 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS

Abstract: Recent advances in LLMs have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER just like what robust ASR does, where one solution is introducing noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm the LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings to our language embedding. Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate while with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show strong ability of language-space denoising.

Introduction

LLMs have demonstrated significant capabilities across numerous natural language processing tasks. This advancement has spurred research into leveraging LLMs for automatic speech recognition (ASR), particularly recognition error correction via generative error correction (GER). While GER has shown promise in improving recognition results by finetuning LLMs on transcribed N-best hypotheses from ASR decoding, performance in noisy environments, a common real-world challenge, has received little attention. The authors address this gap by extending the GER benchmark to noisy conditions, introducing the novel Robust HyPoradise (RobustHP) dataset.

Methodology

The authors address the challenge of noise-robust GER by extracting a noise embedding in language space from the N-best hypotheses. Their key insight is that more adverse noise conditions yield greater diversity within the N-best list, so this diversity can itself serve as a noise representation that conditions the denoising process. To enhance its ability to represent audio noise, the paper further proposes a knowledge distillation (KD) strategy based on mutual information estimation (MIE), which distills the real noise information in audio embeddings into the language-space noise embedding.
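To make the diversity intuition concrete, here is a minimal sketch of an utterance-level, language-space noise embedding built from differences across the N-best list. The toy `embed` function is a hashed bag-of-words stand-in for the sentence encoder the paper actually uses (it relies on SBERT), and aggregating top-1-versus-competitor differences follows the spirit of the method rather than its exact formulation.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy sentence embedding: hashed bag-of-words, L2-normalized.
    A stand-in for a real sentence encoder such as SBERT."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def language_noise_embedding(nbest: list[str]) -> np.ndarray:
    """Utterance-level noise embedding: average the embedding
    differences between the top-1 hypothesis and each competitor.
    More adverse noise -> more diverse N-best list -> larger diffs."""
    top1 = embed(nbest[0])
    diffs = [top1 - embed(h) for h in nbest[1:]]
    return np.mean(diffs, axis=0)

# Invented N-best lists for illustration only.
clean_nbest = ["the cat sat on the mat",
               "the cat sat on a mat",
               "the cat sat on the mat"]
noisy_nbest = ["the cat sat on the mat",
               "a bat sat on that map",
               "the hat sat under rat"]

# The noisy list's hypotheses disagree more, so its noise
# embedding has a larger magnitude than the clean list's.
print(np.linalg.norm(language_noise_embedding(clean_nbest)))
print(np.linalg.norm(language_noise_embedding(noisy_nbest)))
```

Under heavier noise the hypotheses disagree more, so the aggregated difference vector grows in magnitude, which is what makes it usable as a noise conditioner for the LLM.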

Experimental Results

Using recent LLMs, including LLaMA-2, LLaMA, and Falcon, the proposed approach, termed RobustGER, achieves significant performance improvements: up to a 53.9% reduction in word error rate (WER) on the RobustHP test sets. Ablation studies examine the relative contributions of the utterance-level and token-level information contained in the noise embedding, corroborating the essential role of the latter for denoising in GER.
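As a reminder of how such a headline number is computed, the sketch below implements WER as word-level edit distance and the relative WER reduction used to compare a baseline transcript against corrected output. The transcripts here are invented for illustration; the 53.9% figure comes from the paper's own test sets.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein dynamic program over word tokens.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(r)][len(h)] / len(r)

ref = "the cat sat on the mat"
baseline_wer = wer(ref, "the bat sat on that mat")  # 2 errors / 6 words
corrected_wer = wer(ref, "the cat sat on a mat")    # 1 error / 6 words
relative_reduction = (baseline_wer - corrected_wer) / baseline_wer
print(f"{relative_reduction:.1%} relative WER reduction")  # 50.0%
```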

Analysis

A closer examination reveals that while the language-space embedding represents certain noise types adequately, others remain entangled with clean speech representations. The KD technique improves noise distinguishability, leading to better noise representativeness and lower WER. Data efficiency is also demonstrated: GER performance is sustained despite substantial reductions in training data volume, highlighting the robustness and generalizability of RobustGER. Lastly, case studies illustrate the method's proficiency in correcting transcription errors that carry significant semantic implications.
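The KD objective rests on the Donsker-Varadhan lower bound that mutual information neural estimation (MINE) maximizes. The sketch below evaluates that bound with a fixed, hand-picked statistics function on toy correlated vectors standing in for paired audio/language embeddings; in the paper, the statistics network T is learned and the inputs are the actual embeddings, so this is only an illustration of the estimator, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dv_bound(T, x, y):
    """Donsker-Varadhan lower bound on I(X;Y), the quantity MINE
    maximizes over T: E_p(x,y)[T] - log E_p(x)p(y)[exp(T)].
    Joint samples are the paired rows; shuffling y simulates
    draws from the product of marginals."""
    joint = T(x, y).mean()
    y_shuf = y[rng.permutation(len(y))]
    marginal = np.log(np.exp(T(x, y_shuf)).mean())
    return joint - marginal

# Toy correlated "audio" and "language" embeddings: y = x + noise.
x = rng.normal(size=(1000, 1))
y = x + 0.1 * rng.normal(size=(1000, 1))

# A fixed statistics function; MINE would learn T as a neural net.
T = lambda a, b: np.tanh((a * b).sum(axis=1))

print(dv_bound(T, x, y))   # clearly positive: dependence detected
print(dv_bound(T, x, rng.normal(size=(1000, 1))))  # near zero
```

Maximizing this bound over T tightens the estimate of the mutual information between audio and language embeddings, which is what drives the distillation of real noise information into the language-space embedding.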

Conclusion

This paper extends the utility of GER for ASR under noisy conditions through a noise-aware correction method. By constructing a language-space noise embedding and refining it via KD from audio embeddings, the proposed method represents audio noise more effectively and teaches LLMs to denoise efficiently, advancing the state of GER in noisy environments without heavy training-data requirements. The work paves the way for practical ASR systems robust to real-world acoustic disturbances, and its open-sourced release invites further enhancements and adaptations within the speech processing community.

Authors (7)
  1. Yuchen Hu
  2. Chen Chen
  3. Chao-Han Huck Yang
  4. Ruizhe Li
  5. Chao Zhang
  6. Pin-Yu Chen
  7. Eng Siong Chng