Phoneme-aware Encoding for Prefix-tree-based Contextual ASR (2312.09582v1)
Abstract: In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. The Tree-constrained Pointer Generator (TCPGen), which efficiently biases such words with a prefix tree, has shown promise for this purpose. While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words with unusual pronunciations. Since TCPGen handles biasing words as subword units, we propose obtaining subword-level phoneme-aware encodings by using an alignment between phonemes and subwords. Furthermore, we propose injecting phoneme-level predictions from CTC into the queries of TCPGen so that the model better interprets the phoneme-aware encodings. We conducted ASR experiments with TCPGen for an RNN transducer. The proposed phoneme-aware encoding outperformed ordinary grapheme-based encoding on both the English LibriSpeech and Japanese CSJ datasets, demonstrating the robustness of our approach across linguistically diverse languages.
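The core idea of deriving subword-level encodings from phonemes can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: it assumes a phoneme-to-subword alignment is already available (e.g., from a G2P aligner) and simply mean-pools phoneme embeddings over each subword; the function name and the toy alignment are illustrative.

```python
# Hypothetical sketch: build subword-level phoneme-aware encodings by
# mean-pooling phoneme embeddings according to a phoneme-to-subword alignment.
# The alignment and names here are illustrative, not the paper's exact method.
import numpy as np

def subword_phoneme_encoding(phoneme_emb, alignment, n_subwords):
    """phoneme_emb: (n_phonemes, d) array of phoneme embeddings.
    alignment: list mapping each phoneme index to a subword index.
    Returns an (n_subwords, d) array: the mean of the embeddings of the
    phonemes aligned to each subword."""
    d = phoneme_emb.shape[1]
    out = np.zeros((n_subwords, d))
    counts = np.zeros(n_subwords)
    for p_idx, s_idx in enumerate(alignment):
        out[s_idx] += phoneme_emb[p_idx]
        counts[s_idx] += 1
    counts = np.maximum(counts, 1)  # guard against subwords with no aligned phoneme
    return out / counts[:, None]

# Toy example: a word split into 2 subwords with 5 phonemes,
# where phoneme 0 aligns to subword 0 and phonemes 1-4 to subword 1.
emb = np.arange(10, dtype=float).reshape(5, 2)  # dummy 2-dim phoneme embeddings
align = [0, 1, 1, 1, 1]
enc = subword_phoneme_encoding(emb, align, 2)
```

In the paper's setting, such subword-level phoneme-aware encodings would replace or augment the grapheme-based node encodings in the TCPGen prefix tree; the pooling shown here is one simple way to map a variable number of phonemes onto each subword node.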