Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview (2403.00370v1)

Published 1 Mar 2024 in cs.CL, cs.SD, and eess.AS

Abstract: The end-to-end (E2E) approach is gradually replacing hybrid models in automatic speech recognition (ASR). However, the optimization of E2E models lacks an intuitive method for handling decoding shifts, especially in scenarios with many domain-specific rare words that carry specific, important meanings. Furthermore, the absence of knowledge-intensive speech datasets in academia has been a significant limiting factor, and commonly used speech corpora differ markedly from realistic conversation. To address these challenges, we present Medical Interview (MED-IT), a multi-turn consultation speech dataset that contains a substantial number of knowledge-intensive named entities. We also explore methods to enhance the recognition performance of E2E models on rare words. We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of the training transcriptions; this guides the model to prioritize recognizing words in the biasing list. In our experiments, on subsets of rare words appearing in the training speech between 10 and 20 times, and between 1 and 5 times, the proposed method achieves relative improvements of 9.3% and 5.1%, respectively.
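The core idea of post-decoder biasing, as described in the abstract, is to re-weight the decoder's output distribution through a transform probability matrix so that tokens on a biasing list become more likely. The sketch below is a minimal, hypothetical illustration of that mechanism, not the paper's actual implementation: it assumes a simple diagonal transform matrix with a hand-picked `boost` factor, whereas the paper derives the matrix from the distribution of the training transcriptions.

```python
import numpy as np

def build_bias_matrix(vocab_size, biasing_ids, boost=2.0):
    # Hypothetical simplification: an identity transform whose diagonal
    # entries are boosted for tokens on the biasing list. The paper
    # instead builds this matrix from training-transcription statistics.
    m = np.eye(vocab_size)
    for tid in biasing_ids:
        m[tid, tid] = boost
    return m

def post_decoder_bias(log_probs, bias_matrix):
    # Apply the transform to the decoder posteriors after decoding,
    # then renormalize so each row is again a valid distribution.
    probs = np.exp(log_probs)
    biased = probs @ bias_matrix
    biased /= biased.sum(axis=-1, keepdims=True)
    return np.log(biased)

# Toy usage: token 3 (a rare biased word) overtakes token 0 after biasing.
bias = build_bias_matrix(vocab_size=5, biasing_ids=[3], boost=3.0)
decoder_log_probs = np.log(np.array([[0.4, 0.3, 0.1, 0.15, 0.05]]))
biased_log_probs = post_decoder_bias(decoder_log_probs, bias)
```

In this toy example the unbiased argmax is token 0, but after boosting, token 3's mass (0.15 × 3.0 = 0.45) exceeds token 0's, so the biased argmax shifts to the rare word while the distribution stays properly normalized.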

