Crossmodal ASR Error Correction with Discrete Speech Units (2405.16677v2)

Published 26 May 2024 in eess.AS, cs.CL, and cs.SD

Abstract: Automatic Speech Recognition (ASR) remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue, the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with only 1-best hypothesis transcriptions. We explore pre-training and fine-tuning strategies and uncover an ASR domain-discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose incorporating discrete speech units to align with and enhance the word embeddings, improving AEC quality. Results on multiple corpora under several evaluation metrics demonstrate the feasibility and efficacy of the proposed AEC approach on LROOD data, as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.
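The abstract's key mechanism is pairing discrete speech units (e.g. cluster IDs obtained by quantizing acoustic frames, in the spirit of HuBERT units) with the word embeddings of the ASR hypothesis. The sketch below is an illustrative toy version under stated assumptions, not the authors' implementation: the codebook, embedding tables, and the naive equal-split unit-to-word alignment are all hypothetical stand-ins for the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; a k-means codebook over acoustic frames stands in for the
# discrete speech unit inventory (hypothetical, not the paper's setup).
num_units, num_words, dim = 8, 5, 4
codebook = rng.normal(size=(num_units, dim))   # unit centroids
unit_emb = rng.normal(size=(num_units, dim))   # unit embedding table
word_emb = rng.normal(size=(num_words, dim))   # ASR-hypothesis word embeddings

def speech_to_units(frames, codebook):
    """Assign each acoustic frame to its nearest centroid (a discrete unit ID)."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def fuse(word_ids, unit_ids, word_emb, unit_emb):
    """Align unit IDs to words (here: a naive equal split over the utterance)
    and add each word's mean unit embedding to its word embedding."""
    spans = np.array_split(unit_ids, len(word_ids))
    unit_vecs = np.stack([unit_emb[s].mean(axis=0) for s in spans])
    return word_emb[word_ids] + unit_vecs

frames = rng.normal(size=(20, dim))            # 20 acoustic frames
units = speech_to_units(frames, codebook)      # discretize the speech
fused = fuse(np.arange(num_words), units, word_emb, unit_emb)
print(fused.shape)
```

In the paper, the aligned unit representations enhance the word embeddings fed to the correction model; here the equal-split alignment merely shows where a forced-alignment or attention-based mapping would slot in.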

Authors (4)
  1. Yuanchao Li
  2. Pinzhen Chen
  3. Peter Bell
  4. Catherine Lai
Citations (6)
