
Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search (2401.10449v1)

Published 19 Jan 2024 in eess.AS, cs.CL, and cs.SD

Abstract: End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve the contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on the Librispeech-960 (English) and our in-house (Japanese) datasets, respectively.
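The abstract describes the BPB beam search only at a high level. As a rough illustration, the sketch below shows one way a bias-phrase boost could be folded into beam-search scoring: hypotheses whose suffix matches a prefix of a bias phrase receive an extra score term scaled by the model's bias phrase index probability. The `decoder.step` interface, the `boost_weight` parameter, and the `matches_prefix` helper are hypothetical stand-ins for illustration; the paper's exact formulation differs.

```python
def bpb_beam_search(decoder, bias_list, beam_size=5, boost_weight=1.0, max_len=50):
    """Minimal sketch of bias-phrase-boosted (BPB) beam search.

    Assumption: `decoder.step(tokens)` returns, for a partial hypothesis,
    (log-probs over the vocabulary, log-probability that the next token
    belongs to a bias phrase). This interface is hypothetical, not the
    paper's actual API.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log score)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            token_logp, bias_index_logp = decoder.step(tokens)
            for tok, logp in enumerate(token_logp):
                new_tokens = tokens + [tok]
                new_score = score + logp
                # Boost hypotheses currently inside a (partial) bias phrase,
                # weighted by the bias phrase index probability.
                if any(matches_prefix(new_tokens, p) for p in bias_list):
                    new_score += boost_weight * bias_index_logp
                candidates.append((new_tokens, new_score))
        # Prune to the top-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

def matches_prefix(tokens, phrase):
    """True if the hypothesis ends with a prefix of `phrase` (token IDs)."""
    for n in range(1, min(len(tokens), len(phrase)) + 1):
        if tokens[-n:] == phrase[:n]:
            return True
    return False
```

For brevity the sketch omits end-of-sentence handling and length normalization, both of which a practical beam search would need; it is meant only to make the scoring idea concrete.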

