
Robust ASR Error Correction with Conservative Data Filtering (2407.13300v2)

Published 18 Jul 2024 in cs.CL and eess.AS

Abstract: Error correction (EC) based on LLMs is an emerging technology for enhancing the performance of automatic speech recognition (ASR) systems. Generally, training data for EC are collected by automatically pairing a large set of ASR hypotheses (as sources) with their gold references (as targets). However, the quality of such pairs is not guaranteed, and we observed various types of noise that can make EC models brittle, e.g., by inducing overcorrection in out-of-domain (OOD) settings. In this work, we propose two fundamental criteria that EC training data should satisfy: EC targets should (1) improve linguistic acceptability over their sources and (2) be inferable from the available context (e.g., source phonemes). Using these criteria, we identify low-quality EC pairs and train the models to make no correction in such cases, a process we refer to as conservative data filtering. In our experiments, we focus on Japanese ASR, using a strong Conformer-CTC model as the baseline and fine-tuning Japanese LLMs for EC. Through evaluation on a suite of 21 internal benchmarks, we demonstrate that our approach significantly reduces overcorrection and improves both the accuracy and quality of ASR results in challenging OOD settings.
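The two filtering criteria translate directly into a data-preparation step. Below is a minimal Python sketch, not the authors' implementation: `score_acceptability` and `to_phonemes` are assumed, injected helpers (in practice one might use, e.g., a masked-LM pseudo-log-likelihood scorer and a grapheme-to-phoneme converter), and the phoneme-overlap ratio with an illustrative threshold stands in for the paper's "inferable from the available context" criterion.

```python
from typing import Callable, Iterable, List, Tuple

def conservative_filter(
    pairs: Iterable[Tuple[str, str]],             # (ASR hypothesis, gold reference)
    score_acceptability: Callable[[str], float],  # assumed helper: higher = more acceptable
    to_phonemes: Callable[[str], List[str]],      # assumed helper: grapheme-to-phoneme
    overlap_threshold: float = 0.8,               # illustrative value, not from the paper
) -> List[Tuple[str, str]]:
    """Keep the reference as the EC target only if it (1) improves
    linguistic acceptability over the source and (2) appears inferable
    from the source phonemes; otherwise set target = source so the EC
    model learns to make no correction on such inputs."""
    filtered = []
    for src, tgt in pairs:
        improves = score_acceptability(tgt) > score_acceptability(src)
        src_ph, tgt_ph = set(to_phonemes(src)), set(to_phonemes(tgt))
        overlap = len(tgt_ph & src_ph) / len(tgt_ph) if tgt_ph else 1.0
        if improves and overlap >= overlap_threshold:
            filtered.append((src, tgt))   # high-quality pair: train to correct
        else:
            filtered.append((src, src))   # low-quality pair: train to copy
    return filtered
```

Note that this is deliberately conservative: questionable pairs are not discarded but rewritten as identity pairs, which is what teaches the model to leave uncertain hypotheses untouched.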


