
Locality enhanced dynamic biasing and sampling strategies for contextual ASR (2401.13146v1)

Published 23 Jan 2024 in eess.AS, cs.CL, and cs.SD

Abstract: Automatic Speech Recognition (ASR) still faces challenges when recognizing time-variant rare phrases. Contextual biasing (CB) modules bias the ASR model towards such contextually relevant phrases. During training, a list of biasing phrases is selected from a large pool of phrases following a sampling strategy. In this work we first analyse different sampling strategies to provide insights into the training of CB for ASR, using correlation plots between the bias embeddings at various training stages. Second, we introduce a neighbourhood attention (NA) mechanism that localizes self-attention (SA) to the nearest neighbouring frames to further refine the CB output. The results show that the proposed approach provides, on average, a 25.84% relative WER improvement on LibriSpeech sets and rare-word evaluation compared to the baseline.
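To make the neighbourhood attention idea concrete, here is a minimal single-head sketch that restricts each frame's attention to a fixed window of its nearest neighbours, which is the localization the abstract describes. The window size, projection shapes, and function names below are illustrative assumptions for this sketch, not the paper's actual configuration.

```python
# Minimal sketch of neighbourhood attention (NA): self-attention localized
# to the nearest neighbouring frames. All names and sizes are assumptions
# for illustration; the paper's exact setup may differ.
import torch
import torch.nn.functional as F


def neighbourhood_attention(x: torch.Tensor, w_q: torch.Tensor,
                            w_k: torch.Tensor, w_v: torch.Tensor,
                            window: int = 7) -> torch.Tensor:
    """Single-head neighbourhood attention over a frame sequence.

    x: (T, d) frame embeddings (e.g. the CB module output to be refined).
    w_q, w_k, w_v: (d, d) query/key/value projection matrices.
    window: number of neighbouring frames each query may attend to.
    """
    T, d = x.shape
    q = x @ w_q  # (T, d)
    k = x @ w_k  # (T, d)
    v = x @ w_v  # (T, d)

    half = window // 2
    out = torch.empty_like(x)
    for t in range(T):
        # Clamp the window at the sequence edges so every query still
        # attends to exactly min(window, T) frames.
        lo = max(0, min(t - half, T - window))
        hi = lo + min(window, T)
        scores = (q[t] @ k[lo:hi].T) / d ** 0.5  # (window,)
        attn = F.softmax(scores, dim=-1)
        out[t] = attn @ v[lo:hi]
    return out
```

Unlike full self-attention, whose cost grows quadratically with sequence length, each query here scores only `window` keys, so the refinement of the CB output stays local in time.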
