Improving ASR Contextual Biasing with Guided Attention (2401.08835v1)

Published 16 Jan 2024 in cs.CL and eess.AS

Abstract: In this paper, we propose a Guided Attention (GA) auxiliary training loss that improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in the previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To address this challenge, we employ the GA loss as an additional training objective alongside the Transducer loss. The proposed GA loss teaches the cross-attention how to align bias phrases with text tokens or audio frames. Compared to studies with similar motivations, the proposed loss operates directly on the cross-attention weights and is easier to implement. Through extensive experiments based on a Conformer Transducer with Contextual Adapter, we demonstrate that the proposed method not only leads to a lower WER but also retains its effectiveness as the number of bias phrases increases. Specifically, the GA loss decreases the WER of rare words by up to 19.2% on LibriSpeech compared to the contextual biasing baseline, and by up to 49.3% compared to a vanilla Transducer.
