Phoneme-aware Encoding for Prefix-tree-based Contextual ASR (2312.09582v1)

Published 15 Dec 2023 in cs.CL, cs.SD, and eess.AS

Abstract: In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. The Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose: it efficiently biases recognition toward such words using a prefix tree. While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words with unusual pronunciations. Because TCPGen handles biasing words as subword units, we propose obtaining subword-level phoneme-aware encodings by using the alignment between phonemes and subwords. Furthermore, we propose injecting phoneme-level predictions from CTC into the queries of TCPGen so that the model better interprets the phoneme-aware encodings. We conducted ASR experiments with TCPGen for the RNN transducer. The proposed phoneme-aware encoding outperformed ordinary grapheme-based encoding on both the English LibriSpeech and Japanese CSJ datasets, demonstrating the robustness of our approach across linguistically different languages.
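
To make the prefix-tree idea in the abstract concrete, here is a minimal sketch of how a list of biasing words, each split into subword units, can be stored in a prefix tree and queried for the subwords that may follow a partial hypothesis. This is an illustration only, not the authors' implementation: the class `PrefixTree`, the method names, and the example subword segmentations are all hypothetical, and the neural components of TCPGen (encodings, attention queries, CTC phoneme predictions) are not modeled.

```python
from typing import Dict, Sequence, Set

# Hypothetical marker stored in a node to flag "a biasing word ends here".
END = "<end>"


class PrefixTree:
    """Toy prefix tree over subword units (illustrative, not the paper's code)."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add(self, subwords: Sequence[str]) -> None:
        """Insert one biasing word given as a sequence of subword units."""
        node = self.root
        for unit in subwords:
            node = node.setdefault(unit, {})
        node[END] = {}

    def valid_next(self, prefix: Sequence[str]) -> Set[str]:
        """Return the subword units that can extend `prefix` within the tree."""
        node = self.root
        for unit in prefix:
            if unit not in node:
                return set()  # prefix is not on any biasing word
            node = node[unit]
        return {unit for unit in node if unit != END}


# Assumed example segmentations; a real system would use its own subword model.
tree = PrefixTree()
tree.add(["tur", "ner"])
tree.add(["tur", "ing"])
tree.add(["vin", "cent"])

candidates_at_start = tree.valid_next([])
candidates_after_tur = tree.valid_next(["tur"])
```

In this sketch, the set returned by `valid_next` plays the role of the restricted candidate set that a tree-constrained biasing module would consider at each decoding step; how those candidates are scored is outside the scope of the example.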
